The reunion of two lineages of the Neotropical brown stink bug on soybean lands in the heart of Brazil

The rapid pace of conversion of natural areas to agricultural systems is highly concerning, and the consequences for conservation and pest management are not yet fully understood. We examined mitochondrial (COI and Cytb) and nuclear (ITS1) gene regions of 21 populations of the stink bug Euschistus heros, to investigate the genetic diversity, genetic structure, and demographic history of this emerging soybean pest in South America. Two deep lineages that diverged in the Pliocene (4.5 Myr) occur over wide areas of Brazil. Historical changes during the Plio-Pleistocene led to significant genetic differences between E. heros populations, which differentiated further in several biomes. The northern lineage is older, more diverse, and prevalent in the Amazon and Caatinga, while the southern lineage is younger, less diverse, and prevalent in the Atlantic Forest and Chaco biomes. Euschistus heros populations are expanding in size and range but at different rates, strongly affected by environmental variables. Secondary contact between the main lineages is now occurring, mainly in areas of intensive farming and particularly in the Cerrado, an important agricultural frontier. Individuals adapted to different environmental conditions and to large monocultures might currently be combining into a panmictic and hard-to-control pest population.


Results
Genealogical inferences. Genealogical relationships among 111 mitochondrial haplotypes indicate two well-supported E. heros lineages separated by 52 mutational steps, and an estimated genetic distance of D = 0.042 (Fig. 1a). The southern lineage (S) haplotypes occur mainly in southern regions of South America (Fig. 1b). Small percentages of lineage S haplotypes were also found in central and northeastern Brazil (Fig. 1b), in a wide range of habitats (Table 1). The northern lineage (N) occurs mainly in northern and northeastern South America and was not present in Paraguay or southern Brazil (Table 1). A total of 91 haplotypes were identified as private haplotypes; the most frequent variants were H2 and H38 (n = 8), both from lineage S ( Table 1).
Analyses of the ITS1 region revealed a single nucleotide polymorphism variation separating the haplotypes. The six haplotypes were separated by a one-step mutation ( Fig. 1c and Supplementary material S2). We created two alternative ITS1 sequences for all individuals that showed ambiguity in the polymorphic site, to recreate the heterozygotes. The ITS1 network had lower haplotype diversity than the mitochondrial network. Haplotype HA was the most frequent (70.16%) and was widely distributed across all regions (Table 1). Haplotype HC was found in southern and central South America, and associated individuals previously identified as mitochondrial lineage S (Table 1 and Supplementary Fig S2). Haplotype HD (21.77%) was found only in northern South America, and associated individuals previously identified as mitochondrial lineages N or S ( Table 1). The single haplotypes HB, HE and HF were found in populations RS1, MT1, and MT2, respectively ( Table 1). The sharing of nuclear haplotypes by specimens from both mitochondrial lineages may indicate that these lineages can interbreed (Fig. 1c).
Diversity statistics. These South American E. heros populations showed extensive mitochondrial diversity.
Of 159 concatenated mitochondrial sequences of COI and Cytb analyzed, 111 haplotypes were found; most (82%) of these haplotypes were private and only 18% were shared among individuals. The overall haplotype diversity, nucleotide diversity and mean number of nucleotide differences were h = 0.991, π = 0.03312 and K = 32.892, respectively ( Table 2). The haplotype diversity was similar in lineage N (h = 0.984) and lineage S (h = 0.982); however, lineage N (π = 0.0090) had higher nucleotide diversity than lineage S (π = 0.0062). For the different biomes, the haplotype diversity among biomes (groups) was low, ranging from 0.967 in the Amazon to 1.000 in the Chaco. The nucleotide diversity varied more widely among biomes, from 0.00627 in the Atlantic Forest to 0.03147 in the Amazon Forest (Table 2). Locations where both lineages were present had the highest levels of nucleotide diversity and mean numbers of nucleotide differences. Sequence analysis of the ITS1 region identified six haplotypes with a haplotype diversity of 0.461, nucleotide diversity of 0.0008 and mean number of nucleotide differences of 0.499 ( Table 2). The haplotype diversity and nucleotide diversity were higher in lineage N (h = 0.503; π = 0.0008) than lineage S (h = 0.355; π = 0.0006), previously defined by the mitochondrial network. Among biomes, the highest diversity was observed in the Chaco (h = 0.600; π = 0.0009) and the lowest diversity in the Atlantic Forest (h = 0.315; π = 0.0005) ( Table 2). The Amazon Forest, Caatinga and Cerrado have higher nucleotide diversity due to mixing of the two lineages in these areas.
Mitochondrial divergence dating. The estimated age of origin of the lineage S (southern lineage) clade was 4.5 Myr (95% C.I 2.801-6.453 Myr), during the Pliocene, with intense diversification in the Pleistocene and Holocene (Fig. 2b) Caatinga and Amazon Forest (Fig. 2a,c). Lineage S predominates in the Atlantic Forest and Chaco, with a lower frequency in the Cerrado, Caatinga and Amazon Forest (Fig. 2a,c). At the regional scale, the E. heros populations showed high genetic structure, as assessed by the Analysis of Molecular Variance (AMOVA). Differences among populations accounted for most of the genetic variances in mtDNA (56.57%, P < 0.001) and a high and significant value in ITS1 regions (20.18%, P < 0.001) ( Table 3). A test of the hypothesis that the genetic variation is structured by biomes showed that 45.07% of the mtDNA total variance was distributed among biomes (Φ CT = 0.450, P < 0.001). Furthermore, the larger portion of genetic variation within populations (39.61%, Φ ST = 0.603) indicates overall genetic differentiation in these populations ( Table 3). Analysis of the ITS1 region supported the mitochondrial data, showing a significant structuring by biome (Φ CT = 0.173, P < 0.001), in which most of the genetic variation was within populations (Table 3).

Demographic statistics inferred for mitochondrial genes.
Considering the two lineages, significant negative values were found in both the Tajima's D and Fu's Fs neutrality tests, indicating population expansion or purifying selection. Considering the biomes, the neutrality test statistics did not fully agree with one another. Fu's Fs statistic was significantly negative for all biomes, but only the Atlantic Forest biome had a significant negative Tajima's D value (Table 4).
For the lineages, the mismatch distribution analysis resulted in a nonsignificant SSD (P > 0.05), indicating a recent demographic expansion of lineage S but not lineage N (P = 0.04). For the biomes, a nonsignificant SSD (P > 0.05) was also found for the E. heros populations in all biomes but the Caatinga (P = 0.03) ( Table 4). The nonsignificant raggedness index (P > 0.17) supports the spatial-expansion model of populations of lineages, biomes, and the entire group (all populations combined) ( Table 4). The τ values were higher in the Cerrado and the Amazon Forest, τ = 57.5 and τ = 58.7, respectively, compared to the other three biomes, Atlantic Forest (τ = 6.4), Chaco (τ = 6.0) and Caatinga (τ = 7.3) ( Table 4).
The expansion of populations in the Amazon Forest, Chaco, Caatinga, Cerrado and Atlantic Forest occurred within the last 500 years, corresponding to a recent expansion during the Quaternary according to the Bayesian skyline plot analysis (Fig. 3). The Chaco and Atlantic Forest populations remained stable during the past 100 years, while the Caatinga and Cerrado populations are still expanding. According to the effective population size (Ne), the Atlantic Forest population is the largest, followed by the Caatinga. The Cerrado, Amazon Forest and Chaco populations have similar sizes, but the Cerrado population is still expanding very rapidly, while the Chaco population is expanding slowly and the Amazon population is now contracting (Fig. 3).
Environmental features and soybean expansion modelling the current mitochondrial lineage distribution. Three models passed the cutoff (i.e. models that were less than 2 units away from the "best" model) to explain the presence (%) of the southern lineage at a given location (i.e. the probability of finding an individual from the southern lineage). The best predictors were the 'max temperature of the warmest month' , 'latitude' , and 'annual mean temperature' (Fig. 4 and Table 5). The most important variables were the 'max temperature of the warmest month' and 'latitude' , which received the highest score in all top models. The model performance improved when 'annual mean temperature' was excluded, i.e. AIC c (36.22 and 35.35) and w i (0.09 and 0.14) (   Table 1) (map based on Ab'Saber 77 and modified with GIMP 2.8.22); (b) Bayesian phylogeny tree of 159 concatenated mitochondrial sequences (COI-Cytb). Gray bars at nodes indicate 95% highest probability density intervals (HPD) confidence intervals for nodal age. (c) Bayesian phylogeny tree showing posterior probability values (>75) and biome where individuals were collected (taxon names provided in Table 1).

Discussion
Our results revealed two divergent deep lineages of E. heros in South America. The two COI-Cytb haplotype groups are separated by 52 mutational steps and have an estimated genetic distance of 4.2% (K2P). The number of mutation steps separating the two E. heros lineages is exceptionally high, raising the question of the possible presence of cryptic species. Rolston 35 described a species in Middle America that is morphologically similar to E. heros, Euschistus atrox (Westwood) 1837, distributed in Colombia, Guiana, and Panama. However, examination of the external morphology, the sharing of ITS1 alleles by both lineages, and the admixture of lineages in the laboratory support the hypothesis of a single species that encompasses the two divergent lineages. Our results reveal the need for a thorough systematic review of the genus Euschistus.
The two E. heros lineages are geographically separated from one another, with one clade clustering the northernmost populations (i.e. northern and northeastern Brazil), and a second clade clustering the southernmost populations (i.e. southern and southeastern regions). Both mitochondrial lineages expanded to form a mixed zone upon secondary contact in the central and southwestern regions. It is not clear when the reunion occurred, but the formerly isolated populations seem to have come into contact before reproductive isolation was complete 36,37 . A related point to consider is that all but one of the northern haplotypes found in the Cerrado (CW) were phylogenetically grouped together in one clade, indicating a subgroup differentiation. The central-western (CW) subgroup likely occupied the region much earlier than the southern lineage arrived and before the first soybean fields were established. This is strong evidence against the hypothesis that the E. heros expansion was purely associated with the expansion of soybean cultivation during the 1970s.
The divergence time of the two main clades is estimated as occurring during the Pliocene (i.e. 4.5 Myr). This divergence seems to be associated with a cooling and drying of the global environment, which caused the  Table 3. Analysis of molecular variance (AMOVA) for genetic structure of Euschistus heros based on two concatenated mitochondrial genes (COI-Cytb) and ITS1 region.  separation of the Amazon Forest from the southern part of the Atlantic Forest and the consequent expansion of grasslands and savannas 38 . Temperature cycles were also associated with more recent diversification events during the Pleistocene (i.e. differentiation of the CW group). Deep sequence divergence dating to the Pliocene is also reported for other organisms in the Neotropics 39,40 , and phylogeographic structure has been found in amphibians in the Atlantic Forest 3 , reptiles in the Cerrado 41 , and plants 4,42 . Spatial genetic structuring by biomes was also found among subpopulations of E. heros. Separation into the Amazon Forest, Caatinga, Cerrado, Atlantic Forest and Chaco biomes seems to be the best way to explain the genetic variance hierarchically. Thus, separating insects by biomes can help us to understand the pattern of lineage mixing, diversity and demographic history. The haplotype diversity of E. heros was high and similar among biomes and lineages. This pattern is the result of the high number of private haplotypes found in E. heros populations in all biomes. The higher nucleotide diversity of lineage N compared to lineage S can be explained under the 'historical climate' stability models, where a stable environment such as the Amazon Forest can offer conditions for a population to persist, resulting in elevated intraspecific genetic diversity 43,44 . Unstable regions, on the other hand, would be associated with recent or multiple-event colonization, resulting in lower intraspecific genetic  diversity and signatures of expansion 4 . Therefore, the northern biomes (Amazon Forest and Caatinga) were the most stable environment, while the Atlantic Forest was the least stable environment. Another consideration is that lineage S is associated with areas that have undergone intense transformation due to agricultural practices, and has experienced population dynamics linked with farming cycles and control tactics.
Although E. heros' limited dispersal capacity likely helped to preserve the pattern formed during the late Tertiary and Quaternary as an outcome of the climate changes, the last 100 years were an important turning point for E. heros populations (i.e. soybean introduction and expansion of farming starting at the end of the 19 th century). It is plausible that farming and trade routes have increased the admixture process in certain areas, especially the Cerrado and connecting areas, even though there are still large areas where the two lineages have not yet encountered each other, showing that the pattern is still well preserved.
Recent signals of expansion were detected for E. heros lineages and in all biomes sampled. The inferences regarding population growth were supported by the neutrality tests, the unimodal mismatch distribution and the demographic expansion parameters (τ). Spatial expansion is also occurring, given that no significant Raggedness values were found. Apart from differences in test sensitivity, the lack of full agreement between tests for E. heros in each biome might indicate a more complex scenario. Multiple processes affecting local diversity and the noise from human intervention causing population reduction, population subdivision, bottlenecks, and facilitation of dispersal resulting in the secondary contact might affect the precise demographic estimates for a species [45][46][47] .
We also conducted a Bayesian Skyline analysis to test the hypothesis of recent expansion in all biomes and to determine how the effective sizes of the populations behaved over time. The period of E. heros population growth in all areas overlaps with the period of intense changes caused by the increase of urban occupation and agricultural area in South America 48 . It may be that the resulting habitat loss not only did not affect E. heros populations negatively, but has even been advantageous. One possible hypothesis to explain the success of E. heros is shifting hosts from natural areas to agricultural fields, especially soybeans but also cotton and bean fields 31 . A second hypothesis is that one or more traits occur in a latitudinal cline [49][50][51] . The species' association with environmental gradients should also be considered, given the possibility of differences in traits and adaptations such as reproductive diapause 27,52,53 .
We used environmental and soybean variables to make phylogeographic inferences to predict the predominance of lineage S over lineage N at a given location. Selected models had similar AICc scores and considerably reduced the number of variables, down to 4 for the percentage of lineage S models. The two most important variables were the maximum temperature of the warmest month and the latitude. Temperature and photoperiod both affect this species, and might induce quiescence behavior and other possible differences in physiological responses. Latitude, on the other hand, can be correlated with geographic distance, environmental gradients, and agricultural gradients, as in the case of the soybean expansion. Our data support the predictions of the latitudinal-gradient hypothesis, even though distinct demographic scenarios can be expected at different times of E. heros' evolutionary history. The time since the first soybean harvest correlates with latitude and other bioclimate variables, which likely decreased the importance of this variable in the model.
The reunion of the two long-separated lineages might have unforeseen consequences for one of the largest soybean-producing regions in the world. The two lineages are united again in central Brazil, where an agricultural revolution started in the 1970s and continues today, pushing soybean fields northward 54 . It is possible that the northern and southern populations of E. heros are exchanging adaptations in admixture zones. However, knowledge of the differences between the two lineages is limited, because their presence was unknown until this point 55,56 . The changing status of E. heros from a secondary to a primary pest in soybean crops and the reasons for this are poorly understood. In recent years, the increase of population densities in soybean fields, the shorter quiescence period, larger host range (i.e. damage in cotton crops) and pesticide tolerance/resistance have been frequently reported in E. heros populations 30,31,57 . These concerns increase in a scenario of GM soybean introduction, no-till management, and expansion to diversity-hotspot areas.

Material and Methods
Sample collection and DNA extraction. One hundred fifty-nine specimens of E. heros were collected between 12/2015 and 07/2016 from 21 different localities across five South American biomes. Twenty sampling sites were in Brazil and one site in Paraguay. Specimens were collected as adults, from the canopy of soybeans, using a beating cloth under the plants. Individuals were preserved in ethanol (>95%) at -20 °C until laboratory manipulation, after which the remaining tissue from all specimens was stored at -80 °C. DNA was extracted from the head tissue of an adult specimen, using the modified CTAB protocol 58 .

PCR amplification and DNA sequencing. Fragments of two mitochondrial and one nuclear region
were amplified by polymerase chain reaction (PCR), using specific mitochondrial primers developed for this project and previously developed ITS1 primers 59 . The Cytochrome c Oxidase Subunit 1 (COI) fragment was amplified using the forward primer (5′-ACCGCACATGCATTTGTAATAA-3′) and the reverse primer (5′-GTGGCTGATGTGAAGTATGCTC-3′), and the Cytochrome b (Cytb) fragment was amplified using the forward primer (5′-GGATATGTTTTACCTTGAGGACA-3′) and the reverse primer (5′-GGAATTGATCGTAAGATTGCGTA-3′). To amplify the ITS1 rnDNA region (18 S partial -ITS1 complete -5.8 S partial) we used the forward primer CAS18SF1 (5′-TACACACCGCCCGTCGTACTA-3′) and the reverse primer CAS5p8sB1d (5′-ATGTGCGTTCRAAATGTCGATGTTCA-3′  60 was inspected in the mitochondrial gene fragments, using the software MEGA v.5.2 61 . Three signatures of numts were searched: (i) indels that introduce frameshifts, (ii) out-of-place inframe stop codons that lead to premature termination of protein translation, and (iii) lack of codon position substitution bias toward the 3rd position, that lead to a higher rate of non-synonymous mutations. The presence of signatures (i) and (ii) is enough to consider a given sequence a numt. No numts were detected in the COI or CtyB sequences; therefore, we included all mitochondrial sequences in our analysis. The posterior analyses were performed using concatenated mitochondrial genes (COI-Cytb).

Genealogical inferences. The genealogical relationships between haplotypes of the mitochondrial and ITS1
regions were reconstructed by a network of median-joining haplotypes, using the PopArt software 62 . Preliminary analysis revealed two putative mitochondrial lineages associated with E. heros populations in Brazil and Paraguay. The genetic distance (D) between the two mitochondrial lineages was inferred by dividing the haplotypes in two groups and calculating the 2-parameter Kimura method (K2P) in MEGA v.5.2 software 61 .
Diversity statistics. The diversity analysis was performed by dividing individuals into two groups according to the mitochondrial lineages, or into five groups according to the biome to which the individuals belonged: Amazon Forest, Cerrado, Caatinga, Atlantic Forest or Chaco. Number of haplotypes, haplotype diversity (h), nucleotide diversity (π) and mean number of nucleotide differences (S) were estimated using the DNAsp v.5 63 .
Divergence dating. We estimated the relative age of divergence between the two mitochondrial lineages using the Bayesian relaxed phylogenetic approach implemented in BEAST v.1.8.4 64 , based on the combined mitochondrial genes. The substitution model was determined using the software PARTITIONFINDER version 1.1.1. 65 that selected the GTR + G + I model. A strict molecular-clock model to estimate the substitution rate and coalescent tree priors set to the constant size model were implemented. We used the insect molecular clock (mean = 0.0177, SD = 0.001) 66 that corresponds to 3.54% pairwise divergence per Myr. Three independent runs were performed for 150 million generations, sampling every 1000 steps and discarding 20% as burn-in. TRACER v.1.6 was used to determine convergence, measure the effective sample size (ESS), and calculate the mean and 95% highest posterior density interval (HPD) for divergence times. Effective sample size (ESS) for all parameters exceeded 200, and the three runs converged to similar distributions. Runs were then combined with LogCombiner v.1.4.7 64 .
Population Structure. Variance Analysis (AMOVA) was performed in Arlequin with parametric bootstrap (1000 replicates) using a 5% significance level 67 . The analyses were conducted to examine the presence of genetic structure among individuals, considering all sites sampled (non-hierarchical), among populations according to the sampling location (populations) and among biomes in three hierarchical levels.
Demographic statistics inferred for mitochondrial data. Tajima's D and Fu's Fs neutrality tests were calculated using Arlequin v.3.5 67 . Both tests used 1,000 permutations using coalescing simulations. Fu's Fs statistic was considered significant at the 95% confidence level when the P-value was less than 0.02. For each biome, we also estimated tau (τ) with its 95% confidence intervals, using a generalized least-squares approach and 1,000 coalescent simulations in Arlequin v.3.5. The parameter τ denotes the age of the expansion (t), so that t = τ/2 u; SCiEntifiC REPORTS | (2018) 8:2496 | DOI:10.1038/s41598-018-20187-6 u = μLg 68 . The parameter μ represents the estimated mutation rate, L is the length of the sequence, and g is the generation time. For E. heros, we did not estimate t directly, because the number of generations per year cannot be estimated straightforwardly. Thus, if we assume that the estimated mutation rate has not changed in E. heros (substitution rate = 1.345%), then u will be constant and we can consider that a smaller τ value indicates a newly established population, and a larger τ value an older one.
We conducted a mismatch distribution analysis using a spatial expansion model. The sum of square of deviations (SSD), raggedness index (r) statistics, and their associated P-value were calculated using Arlequin v.3.5. A nonsignificant SSD value means that the hypothesis of population expansion cannot be rejected, and a nonsignificant raggedness index indicates a good fit of the data to the spatial expansion model. We also used a Bayesian Skyline Plot (BSP) in Beast to reconstruct the demographic history, using TRACER v.1.6, based on the COI-Cytb data using 10 groups. We used the same substitution model and molecular-clock model that were used to estimate the divergence time.
Environmental features and soybean expansion modelling the current mitochondrial lineage distribution. We used a model selection approach to identify and select variables that could be influencing the presence of a lineage at a given location 69,70 . Therefore, our response variable was the proportion of the southern lineage S calculated at each location as a percentage of the total composition. As predictor variables, we used 'latitude' , 19 WorldClim variables based on all pixels of a CFR at 30 arc-second image 71 , and two soybean variables. The soybean variables consisted of the estimated time since the first harvest and the rate of increase of soybean production, given the cultivated area. We used linear regression to compile data from different sources and to estimate the two soybean variables, using the regression slope and the predicted year when the cultivated area was 100 hectares (Supplementary Table S1 and Fig. S1) 54,72,73 .
We evaluated the fit and plausibility of possible candidate models using glmulti 74 . We used a selection considering only the main effect, keeping the 200 best models. The criterion for selection was the corrected Akaike Information Criterion (AIC c ) 75 . We selected models with AIC c less than two units away from the best model. We also evaluated the Akaike weight of the best models, to assess the probability that a model is the best 76 . All variables were standardized by z-score, and the significance of each predictor was assessed by a GLM. We also assessed the importance of each variable by summing the Akaike weight for the models in which the variable appeared. Variables that appear many times in the top models, tend to be more important. We used the cutoff of 0.8 to separate the most important variables under the weight criterion 69,70 .