Introduction

Biological invasions are major ecological phenomena that influence the worldwide distribution of species. They can be a major driver of ecological change, affecting conservation (loss of biodiversity and species extinction1,2), human health (e.g., the Zika virus transmitted by the invasive Aedes aegypti3) and agriculture (e.g., the spread of the cotton boll weevil, Anthonomus grandis through the Americas4; the introduction of fall army worm Spodoptera frugiperda into Africa from the New World5). In Brazil alone, 35 new pests have been detected6 or confirmed7,8 in the last 10 years.

Invasive insects cost a minimum of US$70 billion per year globally for goods and services9. The Old World cotton bollworm, Helicoverpa armigera is considered a major agricultural pest with an estimated annual global cost to agriculture of over USD$5 billion10, and Kriticos et al.11 estimated that the arrival of H. armigera into North America would put at risk an extra USD$78 billion of agricultural output. A number of life history traits predispose H. armigera to be a highly successful insect pest12. These include: (i) high polyphagy, where larvae of this insect pest are known to feed on over 180 plant hosts from at least 68 plant families13; (ii) its long distance migratory ability with migrations of up to 2,000 km14,15,16,17; (iii) the ability to enter a facultative diapause as pupae under unfavourable environmental conditions such as extreme high or low temperatures12; and (iv) high fecundity and a short generation time, capable of completing up to 10 to 11 generations per year12,18,19. Such fast generation times could aid in building population size, and thus contribute to a successful invasion20.

In Brazil, H. armigera was confirmed in January-February 201321,22, and the incursion has resulted in over USD$800 million in losses and control costs since 20126,23,24. The presence of H. armigera was also reported in rapid succession in neighbouring western and south-western countries, including Paraguay in October 201325,26, Argentina in August-October 201327, and in 2014/2015 in Uruguay7,28. H. armigera was also confirmed in the Caribbean countries of Puerto Rico29 and Dominican Republic30,31 and on the mainland of United States of America29. The detection of H. armigera from southern Florida down to Argentina over just 3 years demonstrates the speed at which this species has established in the New World. It clearly has the ability to spread through a connected landscape and via an island hopping, stepping-stone dispersal model. The pathway of H. armigera into the South American continent may be linked with commodity movements involving importations of agricultural and horticultural products from multiple Old World destinations31. Whilst multiple origins of H. armigera in Brazil are likely, it remained unclear whether these founder individuals arrived as a single or as multiple event(s). By examining the genetic signature of the populations found in the South American continent using mtDNA markers, it may be possible to utilise spatial genetic signatures of the insect to infer frequencies of introductions.

It is almost certain that H. armigera had been present in South America for a period of time [e.g22,32,33,34.] prior to its first identification in Brazil21, remaining undetected due to its close morphology with the New World sister species H. zea, and the difficulty of detecting invasive pests at the early stages of incursions35. It is unknown whether the H. armigera populations detected across South America arose from a single, or multiple, original introductions, and where these were. A detailed population genetic study is needed to test the hypothesis that there were multiple unrelated introduction events into South America, which would have significant implications for biosecurity preparedness for the South American continent and the potential reintroduction of novel adaptive ecotypes36 into the Old World33,34. To test this hypothesis, we undertook the present study, in which we show that gene flow and spatial distribution patterns of H. armigera mtDNA haplotypes support multiple introductions of the H. armigera into South America, with the incursion(s) in the southern regions likely independent from the northern/central Brazilian incursions.

Results

PCR amplification and sequence analysis

All specimens from the southern/south-western regions of South America were successfully sequenced for the mtDNA COI fragment (GenBank accession numbers MG230495 -MG230526; KU255535-KU255543 from7) using the Noc-COI-F/R primer pairs. Sequence identity searches against the NCBI GenBank database confirmed that all suspected moths matched (i.e., 99–100% nucleotide identity) published H. armigera sequences, and did not contain premature stop codons.

Unique amino acid substitutions were detected in three haplotypes (i.e., Harm_BC47, Harm_BC42, Harm_BC43). For Harm_BC47, this involved an L/V change, and in Harm_BC42 a I/M change. Both substitutions involved amino acids with hydrophobic side chain. In Harm_BC43 the unique amino acid involved an A to G change where both amino acids belonged to the ‘small’ category. All remaining unique haplotypes (e.g., Harm_BC06, BC23, BC24. BC34, BC37, BC39, BC44, BC45, BC46) had nucleotide transition substitutions at 3rd codon positions. The most significant unique non-Brazilian haplotypes detected were Harm_BC13, Harm_BC16, and Harm_BC17. All three unique haplotypes shared SNPs with other haplotypes, indicating that they did not have unexpected base changes. Furthermore, these haplotypes were also detected multiple times in separate sequencing efforts.

The range of genetic distances (i.e., measures of genetic divergence/degree of differentiation) of H. armigera within Asia (China, India and Pakistan) and within Australia were both 0.00–0.04%, while within Europe (Germany and unknown sites) and within South America (Brazil, Argentina, Uruguay and Paraguay) were both 0.00–0.02%. Estimates of evolutionary divergence between all H. armigera sequences from Australia, Asia, Europe and South America were therefore likewise low and ranged between 0.00–0.04%. Observed nucleotide diversity between countries/continents in H. armigera ranged from 0.0024 ± 0.0004 (s.e.) to 0.0040 ± 0.0006 (s.e.) (Table 1).

Table 1 Comparison of Helicoverpa armigera partial mtDNA COI gene nucleotide diversity (π ± s.e.) and haplotype diversity (h ± s.e.) between different countries/continents.

Haplotypes

Haplotypes were collated from previous works7,37,38 and from this study. A total of 47 haplotypes were identified from 314 individuals that consisted of 44 individuals from Asia, 18 from Australia, 28 from Europe and 226 from South America (Suppl. Table 3). Four most prevalent mtDNA COI haplotypes identified in this study were designated Harm_BC01, Harm_BC02, Harm_BC03, and Harm_BC04.

Brazil shares 5 mtDNA COI haplotypes with Asia (Harm_BC01, Harm_BC02, Harm_BC04, Harm_BC06 and Harm_BC07) and also 5 haplotypes with Europe (Harm_BC01, Harm_BC02, Harm_BC03, Harm_BC04 and Harm_BC06). Haplotypes Harm_BC01 and Harm_BC02 were present in all Brazilian states, and the Harm_BC01 haplotype was shared with all other locations, except Argentina, Paraguay and Uruguay, which had unique haplotypes (Harm_BC43, Harm_BC44, Harm_BC46 and Harm_BC47). The haplotype Harm_BC02 in Brazil was shared with Asia, Europe and Uruguay. Brazil also had 12 unique haplotypes (Harm_BC05, Harm_BC14, Harm_BC23, Harm_BC24, Harm_BC34, Harm_BC35, Harm_BC36, Harm_BC37, Harm_BC38, Harm_BC39, Harm_BC42 and Harm_BC45), and 80% of Brazilian individuals belonged to haplotypes Harm_BC01, Harm_BC02, Harm_BC03, Harm_BC04 and Harm_BC05.

Helicoverpa armigera haplotypes distribution and validation of the null model

Results from validation of the null model indicated no tendency for the test to generate either Type I or Type II errors (i.e., having an almost perfect rectangular distribution of P-values across 10,000 statistical tests, with each test based on 10,000 random realisations of pseudo-observed data). At the 5% confidence level, almost exactly 5% of tests yielded statistically significant results with randomly generated data for the full dataset (501/10,000 = 0.0501 (Suppl. Fig. 1a); and when testing for Brazil vs. non-Brazil 496/10,000 = 0.0496 (Suppl. Fig. 1b)).

Whole table analysis

Figure 1 shows the distribution of \({\chi }_{Ran}^{\,2}\) under the null model. No random realisations were found to have a value of the test statistic greater than or equal to that observed (\({\chi }_{Obs}^{\,2}\) = 719.2), thereby indicating highly significant non-randomness within the observations (P = 0.000), and strong support that at least one haplotype/location observation has observations that are either more or less than expected by chance alone.

Figure 1
figure 1

Distribution of \({\chi }_{Ran}^{2}\) under the null model, and the location of \({\chi }_{Obs}^{2}\).

Sub-table analysis

Evidence from above (Whole Table Analysis) strongly supported some mtDNA haplotypes were differentially distributed across sampling sites, and the \(T{S}_{DIFF}\) analysis was therefore used to further identify those haplotypes that were unduly rare or common across the locations. The results indicated haplotype Harm_BC01 and Harm_BC02 were both simultaneously under-represented in Argentina (ARG), Paraguay (PRY) and one Brazilian state (BA), and overly represented in another Brazilian state (PI) (Fig. 2). The analysis also found evidence to support haplotypes Harm_BC03 and Harm_BC04 as being sporadically overly represented in three Brazilian states (BA, MT and RS) (dark blue cells, Fig. 2).

Figure 2
figure 2

TSDIFF values and indication of statistically significant deviations from the null model based of a false discovery rate (FDR) of 0.0553. Pale red cells are cells with a non-significantly lower number of observations than expected at random, and pale blue cells non-significantly higher numbers of observations. Dark red cells indicate significantly lower numbers of observations. Dark blue cells indicate significantly higher numbers of observations.

Brazil vs. Non-Brazil

In the ‘Brazil vs. non-Brazil’ treatment of haplotype distribution data (Suppl. Table 4; Fig. 3), non-randomness of haplotype distribution within the matrix was again confirmed by the \({\chi }_{Obs}^{2}\) analysis (\({\chi }_{Obs}^{2}\)= 164.0, P-value < 0.000; Fig. 3). The irregular distribution of the test statistic in Fig. 2 reflected a smaller dataset and, therefore, fewer possible combinations of allowable observations to fulfil the row and column constraints.

Figure 3
figure 3

TSDIFF for the Brazil vs Non-Brazil comparison of associated P-values.

Consistent with the Whole Table Analysis (i.e., Fig. 2), at the scale of the individual haplotypes, Harm_BC01 and Harm_BC02 were simultaneously under-represented in non-Brazil and over-represented in Brazil. For haplotypes Harm_BC13, Harm_BC16 and Harm_BC17 the opposite was true. A trend for over-representation of unique haplotypes Harm_BC43, Harm_BC44, Harm_BC46 and Harm_BC47 at Non-Brazil locations was also detected in this analysis, but deemed statistically insignificant under the Benjamin and Hochberg (1995) false discovery rate correction. The difference in the strength of statistical significance between the two analyses (Figs. 2 and 3) is due to a smaller allowable number of permutations when allocating observations at random to the two aggregated spatial categories.

AMOVA and FST analysis

Overall FST estimates based on the partial mtDNA COI gene, when treating our data set as Brazilian, non-Brazilian (Uruguay, Argentina and Paraguay), and Old World samples, showed that a significant FST value (0.2879) was detected between Brazilian and non-Brazilian samples, indicating significant population sub-structure (i.e., low gene flow) between these populations (Suppl. Table 5). In contrast, the low FST estimate (0.0742, see also Tay et al.31) suggested consistent gene flow between Brazilian and Old World populations. Between the Non-Brazilian and the Old World populations, the significant pairwise FST estimate (0.2022) obtained suggested limited gene flow between these populations as inferred from the maternally inherited mtDNA COI gene.

At the fine-scale level, population pairwise FST (Suppl. Table 6) indicated limited gene flow between the Paraguay population and all of Brazilian populations, but not between the Paraguay, Argentina and Uruguay populations. Interestingly, limited gene flow also occurred between the Brazilian state of Rio Grande do Sul and the northern states of Piauí and Bahia, as well as with central Brazilian state of Mato Grosso. Similarly, the Argentinian population showed limited gene flow with Maranhão, Piauí and Bahia. Significant population substructure exhibited by Paraguay analysed with the whole Brazilian population would suggest the H. armigera population from Paraguay is unlikely to be recently established from Brazilian populations.

Analysis of molecular variance (AMOVA) among populations of H. armigera indicated that low levels of genetic variation existed between populations. Approximately 7% of variation detected could be attributed to that between individuals within groups, while the majority ( > 90%) of variation could be explained by the heterogeneous populations present across different geographic regions in the South American continent (Suppl. Table 7).

Discussion

The rapid spread and establishment of H. armigera across much of the South American continent has generated a very large population with significant impacts on agricultural production. In contrast to usual invasions such as the African incursion of the New World fall armyworm S. frugiperda16,34,39,40,41,42, the invasive H. armigera population in the New World appears to be very diverse. Furthermore, the spatial distribution of this diversity strongly suggests that the population has spread from two different regions of introduction. Statistical analyses of the haplotype distribution patterns show that some of the H. armigera haplotypes most commonly found in Brazil appeared to be uncommon in Argentina (i.e., Harm_BC01) and Paraguay (Harm_BC02), and conversely, that some of the less common and/or unique haplotypes found in the non-Brazilian countries (e.g., Harm_BC13, Harm_BC16, Harm_BC17) appeared disproportionately uncommon in Brazil. Furthermore, FST analyses suggest reduced gene flow between populations from the Cone Sul region (Southern Brazil, Argentina, Paraguay and Uruguay) and populations from either northern/central Brazil or from the Old World. Within individual countries, disproportionally over-represented haplotypes were identified (e.g., see Suppl. Table 2 and Fig. 2), and within Brazil, the rare Harm_BC13 haplotype was detected in the State of Santa Catarina at sites located only approximately 350 km from Paraguay where this rare haplotype was also identified. Two other unique haplotypes (Harm_BC42, Harm_BC45) were also only identified in the southern state of Rio Grande do Sul in Brazil.

It is necessary to keep in mind that the northern and central Brazilian H. armigera may contain the same rare haplotypes as our southern region populations, although studies involving greater sample sizes from similar sampling periods37,38 did not detect these rare haplotypes. Nevertheless, rare haplotypes detected in Argentina, Paraguay and Uruguay have also not been detected in other Brazilian samples to-date. We were unable to include the mtDNA COI sequences of Tay et al.31 due to different mtDNA COI regions being characterised. The study of Tay et al.31 analysed substantial Brazilian populations from similar period and had identified similar haplotype frequency patterns as this study. Furthermore, Tay et al.31 showed that within Brazil, the H. armigera population contained both globally common and globally rare haplotypes. At the national level, there were 13 rare haplotypes identified in Brazil reported, and is similar to the total number of 12 rare haplotypes identified in this study. Despite using different parts of the mtDNA COI gene, both studies have detected similar patterns of rare and common partial mtDNA COI haplotypes, suggesting that the results of our analyses are representative of the diversity present in Brazil at sampling time.

Population structure studies in H. armigera based on the mtDNA genes have found a general lack of substructure even for populations separated by considerable geographic distances (e.g43,44.), and this finding is supported by studies based on limited nuclear markers (e.g.15,17,45,46,47,48,49). Anderson et al.36 demonstrated differences between sub-species of H. armigera present in Australia/New Zealand (i.e., H. armigera conferta) and the Old World sub-species (i.e., H. armigera armigera), but not between global H. armigera armigera populations using genome-wide SNP markers. In northern/central Brazil, gene flow patterns of H. armigera showed non-significant levels of population substructure24,37 but exhibited reduced gene flow with southern South American populations of H. armigera. These southern South American populations, and particularly that of Paraguay and Argentina, also exhibited significant population substructure with Old World populations, suggesting that their origins differed from the origins of the founding populations present in northern/central Brazil.

Given that sufficient gene flow to prevent population structure had previously been detected in Brazilian H. armigera populations24,37,38, and that H. armigera in Argentina, Paraguay and Uruguay7,25,27 was confirmed shortly after the Brazilian detections, it would be a logical assumption that this likely represented natural migration, and/or movements of contaminated agricultural commodities from Brazil to the southern regions of South America. It was therefore unexpected to find unique mtDNA haplotypes in the southern South American populations that were as yet unreported in Brazil. Equally as unexpected was the lack of the most common haplotypes (e.g., Harm_BC01, Harm_BC02) in these countries given that they represented the major Brazil, and in fact, global, haplotypes.

To further explain the observed heterogeneous haplotype patterns across Paraguay, Uruguay, Argentina, and the southern Brazilian states of Rio Grande do Sul, Santa Catarina and Paraná (i.e., the Cone Sul region), two hypotheses may be put forward: (I) intrinsic factors associated with new biological incursions (e.g., stochastic lineage sorting, survival/reproductive variability, etc.) in a new environment, and (II) independent incursion pathways of H. armigera into South America.

In Brazil where the incursion of H. armigera was first reported, stochastic lineage sorting of founding populations (i.e., hypothesis I) could lead to the observed heterogeneous haplotype distribution patterns. This could involve factors such as lower population density and variability at population introduction phase and/or the lag-phase, high variability of reproductive success rates (e.g., see Gaither et al.50), variable adaptation success rates (e.g., differential response to attacks by parasitoids and/or predation rates51), susceptibility to viral/bacterial/fungal pathogen attacks, climatic stress, etc. to the novel New World environments at the early incursion stages.

In Brazil, across a number of studies, unexpectedly high genetic diversity of H. armigera has been detected22,24,31,37 as represented by multiple maternal lineages (Suppl. Table 2 e.g., Suppl. Fig. 2). This would likely indicate a complex incursion history of H. armigera into Brazil, and may be associated with agricultural commodity movements from Old World regions into Brazil31. For example, repeated incursions/releases at different Brazilian sites could function as population source increasing propagule pressure, raising and maintaining diversity which is important in sustaining an incipient population52,53,54 (i.e. hypothesis II). These suggested scenarios involving the highly volatile and variable periods of an exotic organism’s biology contrast the scenarios offered by Leite et al.37, where repeated bottleneck effects such as potentially associated with differential pest control/management strategies were deemed likely factors that underpinned the rapid population expansion signatures in both H. armigera and the New World endemic and closely related H. zea (but see31,44). In fact, H. zea in the New World was hypothesised as the outcome of an earlier incursion and the subsequent divergence (ca. 1.5 million years ago) from its common ancestor with H. armigera44,55, and involved a founder population with limited genetic diversity56. The H. zea genome, sequenced prior to the recent arrival of H. armigera in the New World, showed no evidence for subsequent introgression with H. armigera, and no evidence for the gain of additional genes affecting host use, but rather for the loss of genes already present in H. armigera57.

Repeated introductions and high propagule pressure are increasingly being recognised as important factors that underpin the establishment of an alien species2,54. With repeated introduction events, the likelihood of diverse maternal lineages that ultimately contribute to propagule pressure is high. Together with lineage sorting and stochastic processes (e.g., demographic, environmental53) experienced by the invasive species in the new environment, sampling of the mtDNA COI gene and the construction of a haplotype network will likely appear similar to one of a rapid population expansion (i.e., a ‘star-shaped’ haplotype network). This scenario of a ‘star-shaped’ haplotype network, as detected in H. armigera populations in the South Americas (see Fig. 2 of37), differed fundamentally to that reported for H. zea (i.e., Fig. 1 of44; Fig. 2 of37). Multiple introductions of an invasive pest insect that resulted in a mtDNA genetic signature similar to a rapid population expansion signature, has also been previously reported in Brazil (e.g. the Asian citrus psyllid Diaphorina citri, see58).

With the migration and dispersal ability of H. armigera in mind, the high frequency (i.e., 68%) of the two most common Harm_BC01 and Harm_BC02 haplotypes in Brazil populations, and a lack of population structures in northern/central Brazil (e.g.37) and the rest of the world, it was perhaps unexpected to observe statistically significant spatial mtDNA COI haplotype patterns and FST estimates in the Cone Sul region. The significant over- and underrepresented haplotypes in the Cone Sul region suggest that this population likely originated from somewhere outside the extensively sampled areas of central and northern Brazilian populations (i.e., hypothesis II). For example, haplotypes Harm_BC13 and Harm_BC17 for Paraguay and Harm_BC44 and Harm_BC47 for Uruguay and Argentina, respectively, were over-represented in these countries and to a lesser extent, also the over-representation of haplotypes Harm_BC43, Harm_BC44, Harm_BC46 and Harm_BC47,although the P-values (0.100–0.114) lie outside of the range usually considered significant, thereby adding support that these maternal lineages likely originated from non-Brazil source populations (i.e., hypothesis II).

The detection of the H. armigera genetic spatial signatures identified in this work provide the first insights into potential transnational patterns of the incursion of this species into the New World. This reflects opportunities to improve biosecurity protocols relating to phytosanitary practices of agricultural and horticultural commodity movements in the Cone Sul region. Although the haplotype distribution pattern from Rio Grande do Sul was excluded by our spatial analysis, pairwise FST estimates indicated a limited gene flow of the Rio Grande do Sul population with populations from three Brazilian states (see Suppl. Table 5), and coupled with the statistically significant over-representation of unique haplotypes in the Cone Sul region, our results therefore added support to the hypothesis of multiple introduction pathways of H. armigera into South America.

Similarly, the presence of limited gene flow between Rio Grande do Sul and Argentina populations and the various Brazilian populations might suggest different degrees of admixture between H. armigera from non-Brazilian and Brazilian populations. This observation should be further explored and should consider utilising genome-wide SNP markers to increase detection efficiency of population admixture59. Although the low sample size (n = 2) in Uruguay has prevented meaningful interpretations of gene flow patterns based on the mtDNA COI marker, future genome-wide SNP markers studies on these individuals may enable migration patterns and the introduction history to be interpreted with more confidence. Taken as a whole, these FST results suggest that non-Brazilian H. armigera populations, and particularly those from Paraguay, followed by Argentina, and to a small extend that of the Rio Grande do Sul population, likely represent populations that were wholly or partially derived from alternative incursions event(s) from that detected in northern/central Brazil.

Single locus markers have clear limitations, however recent studies from multiple mtDNA markers31 and from genome-wide SNP markers59 have also demonstrated multiple introductions from globally diverse H. armigera populations in the invasive Brazilian populations. Taken as a whole, these findings provided evidence to support multiple independent introductions of H. armigera into the South American continent over the effect of stochastic lineage sorting. Our finding is not without precedent, with the globally invasive hemipteran whitefly Bemisia tabaci MED species (i.e., the real B. tabaci60) also being shown to have been independently introduced into this southern region of South America57,61 in addition to an earlier introduction elsewhere in Brazil62.

Findings that populations of H. armigera in the South American continent likely also involved multiple introduction pathways was fortuitous, because the pattern identified in this study is unlikely to be maintained over time as mixing with other populations in South America and further incursions are likely. That incursions of H. armigera from the Old World potentially involved multiple pathways will have significant implications to pest and resistance management strategies in the New World. For example, populations of H. armigera around the world have developed resistance to conventional pesticides (e.g.31,63,64,65,66,67,68, and the South American populations have also arrived at least with resistance to pyrethroids67 but also potentially with various enhanced allelochemical detoxification traits. Increasing genetic diversity is a key factor that underpins increasing invasion success69,70. While significant levels of genetic diversity now exist in Brazil, and propagule pressure has also therefore decreased, the genetic make-up of these populations could be further bolstered by likely unrelated source populations from other parts of the Old World and this will further complicate and challenge management strategies. As pointed out by De Barro et al.71, measures to restrict the recruitment of additional genetic diversity should be maintained even after establishment and spread have occurred, so as to avoid increasing the genetic diversity of damaging invasive pests.

Material and Methods

Sample collection and DNA extraction

Suspected H. armigera adults were collected using delta traps baited with the female sexual pheromone Iscalure armigera® (ISCA Tecnologias LTDA, Ijuí, RS, Brazil) randomly installed in soybean fields in the Brazilian states Rio Grande do Sul, Santa Catarina and Paraná, and also in Uruguay, Argentina and Paraguay, in the 2014/15 cropping season (specifically, early April 2014; Table 2). Preservation of specimens, gDNA extraction procedures and PCR amplification and sequencing of partial mitochondrial DNA COI gene were done following PCR conditions as detailed in Arnemann et al.7 using the primers Noc-COI-F (5′-GCGAAAATGACTTTATTCAAC-3′) and Noc-COI-R (5′- CCAAAAAATCAAAATAAATGTTG-3′).

Table 2 Collection sites, dates, and mtDNA COI GenBank accession numbers of Helicoverpa armigera specimens from Brazil (BRA), Uruguay (URY), Argentina (ARG) and Paraguay (PRY). Brazilian States are Rio Grande do Sul (RS), Santa Catarina (SC) and Paraná (PR).

Selection of published mtDNA COI haplotype dataset

For our analysis, the following criteria underpinned our selection of mtDNA COI from publicly available sources: (A) only include published sequences from studies that had undergone critical review processes to avoid inclusion of untested haplotypes; (B) the sequences must originate from samples collected at a similar sampling time period as our material, (C) the published sequences must match our characterised partial mtDNA COI gene region, and (D) the populations must include northern/central Brazil to enable spatial comparisons to our Southern populations. Based on these criteria, sequences from three studies were chosen7,37,38, with the published data from Arnemann et al.7 also here included as part of the southern populations of H. armigera, while the studies of Leite et al.37 and Mastrangelo et al.38 represented the most comprehensive population diversity surveys at the mtDNA COI region in Brazil and fulfilled all four criteria.

Sequence analysis of partial mtDNA COI gene

The programs Pregap and Gap4 within the Staden sequence analysis package46 were used for editing DNA trace files and to assemble sequence contigs (i.e., haplotypes). Assembled mtDNA COI haplotypes were checked for premature stop codons that may be indicative of pseudogenes. Categorisation of global mtDNA COI haplotypes at the 5′ gene region, (Suppl. Table 1) and estimates of evolutionary divergence between all H. armigera individuals (i.e., between South America vs. Europe vs. Asia vs. Australia; n = 314) involved 548 bp of the mtDNA COI partial gene in the final dataset. The evolutionary divergence between haplotypes was estimated using the maximum composite likelihood model72, and included all (i.e., 1st + 2nd + 3rd) codon positions using MEGA673. Estimates of haplotype diversity (h ± SE) and nucleotide diversity (π ± SE) were carried out using the molecular evolution software package DNA sequence polymorphism (DnaSP) v. 5.10.0174. Assessments for presence of PCR errors, or nuclear mitochondrial sequences (NUMTs) involved (1) examination for presence of premature stop codons in the sequenced partial mtDNA COI gene region, (2) ascertaining for conservation of amino acid substitution patterns where presence of unique amino acid changes were further assessed for conservation of biochemical properties and/or their molecule sizes, and (3) sequence characterisation of rare haplotypes as confirmed by multiple independent PCR and sequencing efforts.

Analysis of Helicoverpa armigera haplotypes spatial distribution patterns

To better investigate the spatial distribution patterns of H. armigera haplotypes in the South American continent, a matrix table was prepared for the frequencies of the 25 H. armigera mtDNA COI haplotypes identified to-date from 11 South American locations (i.e., from Brazilian samples sites: Bahia (BA), Maranhão (MA), Mato Grosso (MT), Piauí (PI), Roraima (RR), Paraná (PR), Santa Catarina (SC) and Rio Grande do Sul (RS); and from Argentina, Uruguay, and Paraguay; Suppl. Table 2), prior to performing a contingency table analysis using the χ2 statistic to detect departures, as detailed below. The statistical test was based on the randomisation of haplotypes x locations generated according to an appropriate null model (see ‘null model’ below), similar to Gotelli’s75 analysis of species co-occurrence data.

Null and alternative hypotheses

Due to the perceived unevenness of mtDNA COI haplotype spatial patterns, a statistical test was applied to ascertain whether the diversity and frequencies of individual haplotypes had occurred independently and at random across South American sampling sites, or whether there was evidence for spatial segregation. For such a test the null hypothesis is therefore that mtDNA COI haplotypes are distributed randomly across the locations. The alternative hypothesis therefore considers at least one haplotype as being either more or less common, in at least one location, than expected due to chance alone (i.e. haplotypes were differentially distributed across locations).

Test statistics used

Equation (1) below was used for the χ2 statistic to detect departures from randomness:

$${\chi }_{Obs}^{2}=\sum _{i,j}\frac{{({Z}_{Ob{s}_{i,j}}-{\bar{Z}}_{Ra{n}_{i,j}})}^{2}}{{\bar{Z}}_{Ra{n}_{i,j}}}$$
(1)

where \({Z}_{Ob{s}_{i,j}}\)is the observation for row i and column j within the matrix (i.e. Suppl. Table 2), and \({\bar{Z}}_{Ra{n}_{i,j}}\) is the expected value for the same table entry calculated under the null hypothesis. If certain haplotypes and/or locations were disproportionately under or over-represented, then \({\chi }_{Obs}^{2}\) would be expected to fall within the extreme tail of the distribution of \({\chi }_{Ran}^{2}\) (i.e., the calculated value of the statistic when the null hypothesis is known to be true).

Although an analysis based on \({\chi }_{Obs}^{2}\) provides an overall test of haplotype randomness across the landscape, information on which location and/or haplotype combinations underpin any deviation from randomness will require an additional test statistic (i.e., Eq. (2)) to be applied at the scale of each haplotype/location combination:

$$T{S}_{DIFF}={Z}_{Ob{s}_{i,j}}-{\bar{Z}}_{Ra{n}_{i,j}}$$
(2)

Equation (2) implies that when \(T{S}_{DIFF}\) < 0.0 a haplotype would be less common than expected in location j due to chance alone, when \(T{S}_{DIFF}\) > 1.0 the haplotype would be more common than expected in location j by chance, and when \(T{S}_{VAR}=0.0\) the observed data conformed with the null model. Analysis of \(T{S}_{DIFF}\) can therefore be used to further explore primary circumstances (i.e., combinations of haplotype and location) and directionality (i.e., whether haplotypes were unexpectedly rare or common across the sites) if/when evidence of differentially distributed haplotypes across the landscapes was detected (i.e. to identify the genotypes that are unduly rare or common across the locations).

Determining the statistical significance of individual table entries is problematic because multiple comparisons are being simultaneously assessed. For example in Fig. 2, there are 11 locations × 25 haplotypes = 275 cell-level tests, of which approximately 14 would be expected to show significance simply by chance (i.e. at the 0.05 level of confidence). To control for this ‘familywise’ error rate the method of Benjamini and Hochberg76 was applied, using a false discovery rate of 0.1077. This identified seven cell entries of interest in the full analysis (Fig. 2), and five pairwise (Brazil /non-Brazil) comparisons of interest when data were pooled across locations (Fig. 4).

Figure 4
figure 4

Distribution of \({\chi }_{Ran}^{2}\) under the null model, and the location of \({\chi }_{Obs}^{2}\) for the Brazil haplotypes vs non-Brazil haplotypes comparison.

Null model

The null model generated for this study involved distributing all 226 observations (i.e., the complete mtDNA COI dataset of H. armigera in South America; Suppl. Table 2) across haplotypes and locations at random, with the constraint of fixed row and column totals (i.e. the same number of observations of each haplotype, and the same number of observations for each location, are both retained in the randomised matrices). This treatment is necessary to ensure that the null model explicitly accounts for both unequal survey efforts at different sites and the overall lower frequencies of some haplotypes. The algorithm AS159 of Patefield78 was used to generate random tables that constrained both row and column marginal totals.

This null model represents a non-parametric alternative to Fisher’s exact test, which is also based on fixed marginal totals, but which relies upon a chi-squared distribution to approximate the underlying exact distribution. This approximation can be poor if the data are sparse (see Suppl. Table 3), hence the decision to calculate P-values via random permutation.

Calculating P-values

P-values for \({\chi }_{Obs}^{2}\) were calculated by enumerating the number of times \({\chi }_{Ran}^{2}\) was less than or equal to \({\chi }_{Obs}^{2}\) (NLessEqual), and also the number of times \({\chi }_{Ran}^{2}\) was greater than or equal to \({\chi }_{Obs}^{2}\) (NgreaterEqual), based on 10,000 randomly generated tables. The two-tailed P-value79 is given by:

$$P-value=\frac{\min ({N}_{GreaterEqual},{N}_{LessEqual})}{10,\,000}\times 2$$
(3)

P-values for \(T{S}_{DIFF}\) were calculated similarly, but at the scale of individual table cells. Note that as the quantities \({N}_{GreaterEqual}\) and \({N}_{LessEqual}\,\)can potentially overlap in their included values (i.e., both include an equality term), P-values can potentially exceed 1.0; in such cases values are rounded down to 1.0.

Validating the null model

The statistical test was validated by creating n pseudo-observed data sets under the null model, followed by application of the test to each of the created data sets to confirm appropriate Type I and Type II error rates. At the 5% confidence level only 5% of pseudo random data sets should be statistically significant, and the expected P-value distribution across the n tests should be rectangular. If it is found that more than 5% of test results are significant at the 5% confidence level, then this indicates an elevated Type I error rate (i.e., a chance that the null hypothesis is incorrectly rejected when no real differences are actually present). Conversely, when less than the nominal number of pseudo random data sets yield a significant result it can lead to incorrect acceptance of the null hypothesis (i.e., concluding no significant difference even when one is actually present; (i.e., an elevated Type II error rate)).

The null model was validated through repeated analyses (10,000 times) using random pseudo-data reconstructed to match the n = 266 observations (Suppl. Table 2), but allowing observations to be randomly allocated to the matrices, and thereby ensuring conformity with the null hypothesis. Tables that had no observations for a given row or column during the constructing of pseudo-data were excluded. The code, the program and test data are provided as supplementary material (Suppl. Material 1).

Brazil vs. Non-Brazil

The above analyses were repeated, but combining the location data to consider just two categories of samples – the Brazilian and the non-Brazilian (Argentina, Paraguay, Uruguay) samples. Note that this analysis was carried out as a guide to assist interpretation for the observed within-table patterns of deviations, and have not impacted our main findings which was the result from the full dataset (i.e., Fig. 2) analysis.

AMOVA and FST analysis

The population pairwise FST and AMOVA estimates were carried out using Arlequin 3.5.2.239, and significance values were estimated with 10,000 permutations. For the AMOVA, populations were assigned into four groups to test molecular variation across geographical regions of Brazil (North: Roraima, Maranhão, Piauí; Central/Eastern: Bahia, Mato Grosso; South: Rio Grande do Sul, Santa Catarina, Paraná) and neighbouring countries (Argentina, Paraguay, Uruguay).