Ecological drivers of genetic connectivity for African malaria vectors Anopheles gambiae and An. arabiensis

Anopheles gambiae and An. arabiensis are major malaria vectors in sub-Saharan Africa. Knowledge of how geographical factors drive the dispersal and gene flow of malaria vectors can help in combatting insecticide resistance spread and planning new vector control interventions. Here, we used a landscape genetics approach to investigate population relatedness and genetic connectivity of An. gambiae and An. arabiensis across Kenya and determined the changes in mosquito population genetic diversity after 20 years of intensive malaria control efforts. We found a significant reduction in genetic diversity in An. gambiae, but not in An. arabiensis as compared to prior to the 20-year period in western Kenya. Significant population structure among populations was found for both species. The most important ecological driver for dispersal and gene flow of An. gambiae and An. arabiensis was tree cover and cropland, respectively. These findings highlight that human induced environmental modifications may enhance genetic connectivity of malaria vectors.


Results
Anopheles gambiae s.l. larvae were collected in 2014-2015 from thirteen sites across Kenya, which fall within four distinct geographical areas: western Kenya lowlands, western Kenya highlands, Great Rift Valley, and coastal Indian Ocean (Supplementary Table 1; Supplementary Fig. 1). An. arabiensis specimens included in analyses originated from ten populations in western Kenya, Great Rift Valley and coastal Indian Ocean (total individuals = 357); whereas, An. gambiae specimens included in analyses originated from six populations in western Kenya only (total individuals = 254). To assess genetic diversity and structure of vector populations, six microsatellite loci were genotyped in An. arabiensis and five microsatellites were genotyped in An. gambiae specimens (Supplementary Table 2).
Genetic diversity analysis. For An. arabiensis, a measure of genetic diversity, expected heterozygosity (H E ), ranged from 0.461 in an Indian Ocean coastal site to 0.723 in an Indian Ocean coastal study site (Supplementary Table 3). H E did not vary significantly among regions ( Supplementary Fig. 2). Mean allelic richness (A R ) ranged from 3.46 in an Indian Ocean coastal site to 6.24 in a western lowland site (Supplementary Table 3 Table 3). An overall heterozygote deficiency was detected across all populations (P < 0.05) for both species. There was no evidence indicating linkage disequilibrium at any locus across all population for An. arabiensis (P > 0.05). By population, there was significant linkage disequilibrium in An. arabiensis detected at two study sites in western Kenya (HB: 45C1/29C1 and AG2H143/AG3H577; KB: 29C1/AG3H577), one study site in the Great Rift Valley (MT: AG3H249/AG2H143), and two study sites in Indian Ocean coastal Kenya (JU: AG2H46/AG2H143; KK: AG2H46/29C1). For An. gambiae, linkage disequilibrium was detected in two locus pairs across all populations (AG2H143/29C1 and AG2H143/AG3H577; P < 0.05). By population, there was significant linkage disequilibrium in An. gambiae at three study sites in the western lowlands (PB: AG2H143/AG3H577, AG2H46/AG2H143, and AG2H46/AG3H577; KN: AG2H143/29C1; HB: AG2H143/29C1 and AG2H143/AG3H577) and one study site in the western highlands (EE: AG2H46/29C1). Since all of the locus pairs observed to be in linkage disequilibrium for both species do not occur in the same chromosomal region (Supplementary Table 2), this result suggests the possibility that random drift or non-random mating are leading to the observed linkage disequilibrium in An. arabiensis and An. gambiae. Compared to previously published H E values for An. arabiensis samples collected from nearby study sites within the Lake Victoria basin area of Western Kenya in 1996 12 , we observed no significant change in overall H E between 1996 and 2014 ( Fig. 1; P = 0.417; paired t-test). In comparison to previously published H E values for An. gambiae samples collected in 1994 26 , overall H E was significantly lower in 2014 ( Fig. 1; P = 0.002, paired t-test).
Population structure and migration rates. For An. arabiensis, we identified three clusters consistently across ten runs in Structure (Fig. 2). Predominant cluster membership varied between coastal Kenya and the other two regions. The AMOVA indicated that 12.77% of variation occurred among groups, which were populations grouped by their predominant cluster (F CT = 0.128; P < 0.001). Variation within populations (79.7%; F ST = 0.203; P < 0.001) and variation occurred among populations within groups (7.49% F SC = 0.086; P < 0.001) were also significant. For An. gambiae, we identified two clusters consistently across ten runs in Structure (Fig. 2). Predominant cluster membership varied between the western lowland and western highland regions. The AMOVA indicated that 6.22% of variation occurred among groups, which were populations grouped by their predominant cluster (F CT = 0.062; P = 0.058). Variation within populations (86.9%; F ST = 0.131; P < 0.001) and variation occurred among populations within groups (6.84% F SC = 0.073; P < 0.001) were significant. Based on a linear mixed effects model with maximum likelihood population effects (MLPE), populations did not conform to an isolation-by-distance model for either species (P > 0.05; Supplementary Fig. 3).
Historical migration rates between vector populations were evaluated by analysis in Migrate-N, and the highest directional migration rates (migration rate [M] > 18) were visualized (Fig. 3).Analysis of past migration rates for An. arabiensis revealed that frequent migrations (M > 18) occurred between populations across regions, as well as between populations within regions (Fig. 3) Landscape genetic analysis. We used landscape genetics to test for associations between landscape factors and gene flow between populations. We created landscape resistance raster files for the following variables, which we hypothesized to potentially influence gene flow of the studied malaria vector species: annual mean temperature, annual precipitation, percent tree cover data, presence or absence of croplands or cropland/natural vegetation mosaics, human population density, and road proximity (Table 1). Landscape genetic analysis was conducted using two measures of genetic distance, F ST and D PS . Since the results from both analyses did not vary substantially, F ST analyses are presented here, and D PS analyses are presented in the supplementary material (Supplementary Table 5; Supplementary Fig. 4). Notably, the genetic distance metrics were highly correlated (P < 0.0001 for both species). For An. arabiensis, cropland was the top single-surface model in explaining population genetic structure of An. arabiensis ( Table 2). The cropland model had the lowest ΔAIC (0.000), third lowest average rank (2.335), and highest percentage for being the top model across 1000 bootstrap iterations (34.8%). For An. gambiae, the tree cover model was the top model in explaining population genetic structure, which had the lowest ΔAIC (0.000), second lowest average rank (1.891), and highest percentage for being the top model across bootstrap iterations (40.3%).
To determine the relationship between model factors and landscape resistance to gene flow, response curves were generated (Fig. 4). In the highest performing model for An. arabiensis, cropland, areas with croplands or cropland/natural vegetation mosaics had a low resistance to gene flow. Average temperature was the next highest performing model. An average temperature of 15.9 °C had the lowest landscape resistance to gene flow. Areas with an average temperature exceeding 23.3 °C had the highest landscape resistance to gene flow (highest 30% resistance values). For the third highest performing model, landscape resistance to gene flow was lowest in areas with annual precipitation 1530. .8 mm (lowest 30% resistance values) and highest in areas with annual precipitation less than 1230.8 mm and greater than 1970.8 mm. In the highest performing model for An. gambiae, the lowest landscape resistance values occurred in areas with 9.2-38.2% tree cover and the highest landscape resistance values occurred in areas with less than 4.9% tree cover and greater than 61.2% tree cover. For the second highest performing model, the lowest landscape resistance occurred in areas with an average annual temperature between 18.9 and 22.6 °C and the highest landscape resistance values occurred in areas with an average temperature less than 16.1 °C and greater than 23.0 °C. For the third highest performing model, the lowest landscape resistance to gene flow occurred in areas with 1401.5-1890.8 mm annual precipitation and the highest landscape resistance to gene flow occurred in areas with less than 1017.5 mm and greater than 1963.0 mm annual precipitation. Our visualization of gene flow between study sites based on the highest performing models, cropland for An. arabiensis and tree cover for An. gambiae, identifies likely pathways for gene flow between study sites (Fig. 5).

Discussion
Using a landscape genetics framework to test hypotheses related to whether environmental, landscape, or social factors predominantly influence the population structures of An. gambiae and An. arabiensis, we found that landscape and environmental factors primarily influence the population structures of these species in western Kenya. For An. gambiae, tree cover was the most important factor shaping population structure. While, for An. arabiensis, presence of cropland was the highest performing explanatory factor. These ecological drivers of gene flow may help to explain broad-scale population structuring patterns previously observed.
In this study, we observed a significant reduction in diversity (H E ) in An. gambiae from 1994 26 to 2014 (this study), which were collected from neighboring populations within the Lake Victoria basin area of western Kenya. For An. arabiensis, we did not observe a comparable reduction in H E in between similar time points, 1996 12 to 2014 (this study). Notably, we do not report on diversity from within the 18 to 20-year period, so we cannot describe how diversity may have fluctuated within this time frame. With that said, these findings of reduced H E in An. gambiae are consistent with a population decline, which has been observed throughout much of East Africa in response to mass distribution and increased coverage of insecticide-treated bed nets (ITNs) [27][28][29][30][31][32] . Conversely, the maintenance of H E in An. arabiensis suggests that An. arabiensis populations have remained relatively more stable despite the high coverage of ITNs in Kenya. Anopheles gambiae are known to be more susceptible to malaria interventions than An. arabiensis due to their preference for feeding on humans and resting indoors 4,5 . Whereas An. arabiensis are more catholic in their feeding behavior, taking blood meals from both human and non-human hosts, especially cattle [33][34][35] , as well as more commonly feed and rest outdoors 27,28,36,37 . These findings highlight the role that An. arabiensis may play in maintaining residual malaria transmission in sub-Saharan Africa 28,38 and the need for alternative intervention strategies to target this species. Moreover, understanding how ecological features influence An. arabiensis dispersal will likely become increasingly important to interrupting residual malaria transmission.
Neither An. arabiensis nor An. gambiae conformed to an isolation-by-distance model in this study. These findings are consistent with previous studies [9][10][11][12][13] , which suggests the possibility that factors aside from geographic distance alone drive gene flow, such as climate, landscape, or social features. Additionally, for An. arabiensis, we found frequent past migration across regions of Kenya. This finding is also consistent with previous studies and may be in part due to a past range expansion, facilitated by the expansion of human settlement and agriculture, rather than large-scale, contemporary cross-country migrations [39][40][41] . For An. gambiae, most frequent migrations occurred from lowland to highland populations. Nonetheless, we did find significant population structuring, indicating the presence of gene flow barriers for An. arabiensis and An. gambiae in Kenya. Notably, F ST values for both species were on average higher than those observed in the 1990s 12,26 , which suggests that population connectivity has reduced since then. This pattern of reduced connectivity may be potentially confounded by the recent history of vector control. With that said, these results together suggest the possibility that factors aside from distance alone drive gene flow patterns of malaria vectors in Kenya.
We tested the hypothesis that low human population densities and scant agriculture between An. gambiae populations provide a barrier to gene flow 12,14 . This notion was postulated since populations of An. gambiae between western Kenya and eastern Kenya were found to be more distinct than between western Kenya and Senegal in western Africa, despite that the two countries are separated by more than 5000 km 12 , whereas western Kenya and coastal Kenya populations are only 700 km apart 12 . The lack of human settlement on the high plateaus of the eastern arm of the Great Rift Valley, which bisects Kenya, was thought to explain the population structuring pattern, since unlike the eastern arm of the rift, human agricultural activity occurs in a broad band across the area between Senegal and western Kenya 12 . However, we provide evidence that tree cover primarily explains gene www.nature.com/scientificreports/ flow patterns in western Kenya rather than human population densities or agriculture. Both low (< 5%) and high (> 61%) tree cover were found to be gene flow barriers for An. gambiae in western Kenya. Whereas an intermediate tree cover of 9-38% was associated with enhanced gene flow for An. gambiae. This finding is consistent with a previous study in Kenya, which found that a canopy cover of 26% was associated with the presence of the An. gambiae complex, while a canopy cover of 44% was associated with the absence of the An. gambiae complex 42 . Moreover, this finding may also be consistent with the broad-scale population structuring observed across Africa 12,14 , as unlike with the western arm of the Great Rift Valley, a wide swath of land to the East of the eastern arm of the Great Rift Valley is characterized by short vegetation and a tree cover < 5% ( Supplementary Fig. 5). While populations of An. gambiae across the eastern arm of the rift were previously found to be highly genetically differentiated, lower differentiation was found between populations of An. arabiensis in this and other studies 12 . This result suggested that different factors drive gene flow patterns for An. arabiensis than for An. gambiae despite that the species commonly occupy similar ecological niches 2,3 . We provide evidence in support of this conjecture, as unlike for An. gambiae, a low percent tree cover was not a barrier to gene flow for An. arabiensis, but rather, the absence of cropland was the primary barrier to gene flow. Notably, however, the presence and absence of cropland does not explain the lack of population structuring between western and eastern Kenya observed by Kamau et al. in the late 1990s 12 , as western and eastern Kenya are not connected by cropland areas (Supplementary Fig. 5). That this cropland model does not explain population structuring patterns outside of western Kenya may be explained by previous observations that An. arabiensis is more of a climate generalist than An. gambiae 38 . Anopheles arabiensis are found across a very diverse environmental range, which ultimately allows the survival of the species in many locations in Africa 38 . Moreover, An. arabiensis are known to be relatively resilient to drastic environmental changes through both natural causes and control measures 43 . Lastly, since An. arabiensis are able to withstand more arid conditions than An. gambiae 44 , it is possible that other environmental indicators which were not evaluated in this study explain gene flow patterns, such as maximum temperature. Since An. arabiensis may not be as constrained by environmental heterogeneity, as well as are less susceptible to traditional malaria interventions, controlling residual malaria transmission may prove more challenging and require additional, alternative interventions.
The importance of tree cover in shaping population structure of An. gambiae and cropland for An. arabiensis suggests the significant role that land cover and land use plays in providing suitable microhabitats for larvae to develop. This notion is underscored by the fact that the tree cover model outperformed models based on environmental and social factors in An. gambiae, and that cropland was the top model for An. arabiensis. Too much tree cover may lead to a lack of sunlight and lower temperatures, which slows or inhibits the growth of larvae 42 . Whereas a lack of tree cover may lead to or be indicative of conditions that are too dry, potentially leaving adult An. gambiae susceptible to desiccation 44,45 . Whereas An. arabiensis are more desiccation tolerant 44 , and as such, can thrive in drier environments. These findings of high tree cover hindering gene flow of An. gambiae highlight the effect that deforestation may have in promoting gene flow of malaria vectors. Further, deforestation for agricultural purposes may promote gene flow of An. arabiensis, as cropland areas may enhance gene flow. Thus, deforestation may increase invasion risk of novel malaria vector and parasite genotypes to surrounding areas, potentially also enhancing the spread of insecticide resistant vectors and drug resistant parasites.
This study has several limitations. First, the landscape genetic analysis included genetic data from a single time point. It is not clear whether these spatial trends would be consistent across time. Second, we only examined relationships between genetic distance and ecological variables at one spatial scale. Patterns driving gene flow for these species may vary across spatial scales, i.e. one factor may be more important at a finer scale than at a coarser scale 8 . Third, the genetic data used for this study was based on five to six microsatellite loci per species, which may provide a limited sample of the genome. Future studies capturing a larger proportion of the genome may improve the resolution of population structure assignment. Finally, this landscape genetics analysis only examined populations in western Kenya. It is likely that different factors drive gene flow patterns depending on the ecological setting, and thus the patterns identified here may differ from those in another region. Thus, the generalizability of these findings to different scales and regions is unknown and requires further investigation. With that said, this study has sought to extend our understanding of local malaria transmission settings in Kenya. Further, this approach may be broadly applicable for investigating factors driving gene flow of malaria vectors in other geographic regions, as well as for additional studies in Kenya.
To conclude, we found that corridors for An. gambiae in western Kenya are most likely to be areas of moderate tree cover (9-38%). Potential corridors for An. arabiensis are areas with cropland or cropland/natural vegetation mosaics. These findings underscore that human induced land cover and land use modifications may enhance connectivity of these species. Specifically, agricultural deforestation may promote dispersal and gene flow for both species. Understanding the factors that limit and promote movement (gene flow) of malaria vectors can help us to more effectively deploy interventions to maintain vector control by identifying areas susceptible to invasion from neighboring mosquito populations. Moreover, this knowledge can be leveraged to help limit the spread of malaria parasites by mosquitoes from nearby areas, as well as contain insecticide resistance 8 . Finally, knowledge of factors influencing malaria vector dispersal are likely to become increasingly important as malaria transmission becomes increasingly heterogeneous 46 .

Methods
Sample collection. An. gambiae s.l. larvae were collected between May 2014 and January 2015 from thirteen sites across Kenya (Supplementary Fig. 1). These sites fall within four distinct geographical areas: western Kenya lowlands, western Kenya highlands, Great Rift Valley, and coastal Indian Ocean (Supplementary Table 1. Larvae were collected using a standard mosquito dipper. No more than five larvae were collected per habitat (pool of water) to reduce potential bias from collecting mosquito full siblings 47 . Larvae in a given study site were Scientific Reports | (2020) 10:19946 | https://doi.org/10.1038/s41598-020-76248-2 www.nature.com/scientificreports/ collected from between 14 and 53 habitats within a 1.5 km diameter area. Collected larvae were stored in 100% ethanol until DNA purification. Genomic DNA was extracted using standard ethanol extraction procedures with phenol:chloroform 48 . DNA was eluted into 20 µl of TE buffer. Then, DNA was quantified using a NanoDrop 8000 Spectrophotomer and diluted to a concentration of 1 µg/1 µl sterile water. We identified An. arabiensis and An. gambiae species within the An. gambiae s.l. complex using a ribosomal DNA polymerase chain reaction (PCR) assay 49 .
Microsatellite genotyping. Six microsatellite loci were selected for genotyping An. gambiae and five loci were selected for An. arabiensis based on evidence of polymorphism in previous studies, reliable amplification, and distribution across chromosomes (Supplementary Table 2) 50 . We used the M13 tailed primer method to fluorescently label our primers 51 . Amplification was conducted in a total volume of 10 µl with 5 µl of 2 × DreamTaq Green PCR Master Mix (Thermo Fisher, USA), 0.5 µl of 10 µM primer (forward primer with M13 tail), and 1 µl of DNA template. Thermocycling conditions for An. gambiae and An. arabiensis were as follows: initial denature of 94 °C for 3 min, followed by 35 amplification cycles of 94 °C for 30 s, annealing temperature (Supplementary Table 3 Population structure and migration rates. We estimated population structure using STRU CTU RE v. 2.3.4, which uses a Bayesian algorithm to group samples into genetically distinct clusters 56 . We tested K = 1-7 for An. gambiae and K = 1-10 for An. arabiensis, with ten replicates for each K-level, an initial burn-in of 100,000, and then 500,000 Monte Carlo Markov Chain iterations. The program was run using an admixture model. ∆K was used to detect the number of K (clusters). The output data for the best estimate of K were analyzed using CLUMMP to calculate the mean cluster membership coefficients across multiple runs 57 . An analysis of molecular variance (AMOVA) was conducted in Arlequin to test for significance of genetic differentiation 52 . To test for isolation-by-distance, we used linear mixed effects models with the maximum likelihood population effects 58 to fit pairwise geographic distance to F ST measured in Genepop 4.2 53,54 . Gene flow frequency among populations was estimated for each species in Migrate-N v3.7.2 with the Brownian motion model and Bayesian inference search strategy 59 . Bidirectional gene flow between all pairwise population comparisons was considered. Four independent runs were conducted with a burn-in of 10 4 steps, sampling increment of 100 steps, and 5,000 recorded steps in each chain for a total of 2 × 106 visited parameter values.

Landscape genetic analysis.
To test for hypothesized associations between landscape factors and gene flow between populations, we used a landscape genetic analysis approach. When testing the effects of environmental factors on gene flow between populations, between-site characteristics are of the greatest concern 60 . Hence, landscape resistance surfaces were created based on factors hypothesized to prevent or promote gene flow between sites. Since mosquitoes primarily disperse to seek blood meals and oviposit, as well as have physiological constraints regarding development and desiccation, we selected variables which influence these phenomena, i.e., the availability of blood meals and oviposition sites (larval habitats) and environmental suitability. Further, we selected variables that did not closely correlate with each other, which can lead to erroneous model selection results. All pairs of variables had a Pearson correlation coefficient <| 0.6 | (Supplementary Table 6). We created landscape resistance raster files in ArcGIS 10 using climate data (annual average temperature and annual precipitation) from WorldClim (BIO1 and BIO12) 61 , percent tree cover data from NASA (MOD44B) 62,63 , cropland from NASA (MOD12Q) 62,63 , human population density from WorldPop 64 , and road proximity from OpenStreetMap (Table 1). Specifically, tree cover and cropland variables were chosen because these factors may alter the local ambient temperature and humidity as well as the availability and nutrient content of larval habitats. The cropland raster file was created by reclassifying croplands and cropland/natural vegetation mosaics (IGBP land cover classification system) to "presence of cropland" and all other land cover types to "absence of cropland. " The cropland raster was categorized into two classes to limit the number of parameters estimated and to ensure classes were adequately represented in the landscape. The road proximity raster file was created using the proximity analysis tool in ArcGIS 10. All raster files were resampled to a grain size (resolution) of 2 km to balance computational efficiency. Landscape resistance distance among all pairs of sites was measured using the commuteDistance function in PopGenReport, which is based upon electrical circuit theory 65 . Resistance distance, defined as the effective resistance between a pair of nodes where all edges are replaced by analogous Scientific Reports | (2020) 10:19946 | https://doi.org/10.1038/s41598-020-76248-2 www.nature.com/scientificreports/ resistors, reflects the movement cost, as well as availability of alternative pathways, providing an advantage over the commonly used least cost path method 65 . Genetic distance between populations was calculated as F ST measured in Genepop 4.2 53,54 and D PS measured in PopGenReport 66 (Supplementary Table 7; Supplementary Table 8). F ST is a commonly used metric in population and landscape genetic studies, but relies on equilibrium assumptions, which may not be met. Whereas D PS does not rely on any such theoretical assumptions. Evaluation of locus-specific effects to overall genetic distance (F ST ) values indicated that F ST values obtained from each individual locus positively correlates with overall F ST for both species, but loci contribution to overall F ST was not even, particularly with regard to the 29C1 locus for An. arabiensis (Supplementary Fig. 6). Since the 29C1 locus for An. arabiensis was an outlier in contributing to overall F ST, pairwise F ST was calculated both with and without the 29C1 locus. Notably, the omission of 29C1 in calculating overall F ST did not substantially alter the isolation-by-distance analysis (Supplemental Fig. 6). Therefore, we used pairwise F ST and D PS values inclusive of 29C1 for landscape genetic analyses.
We used the R package ResistanceGA to unbiasedly optimize resistance surfaces to our genetic data 67 . This method has been demonstrated to have a low type 1 error rate for continuous and categorical surfaces, as well as a high correlation between true and optimized resistance surfaces in simulations 68 . Prior to optimizing our terrestrial resistance surfaces of interest (Table 1), we optimized a surface with lakes and land (no lakes) only to assign a resistance value to lakes. We then masked lakes in subsequent optimizations with the optimized resistance assignment value from the prior analysis, so that continuous land factors were optimized independent of lake features. Linear mixed effects models with the maximum likelihood population effects (MLPE) were used to fit optimized resistance surfaces to genetic data 58 . The three optimized resistance surfaces with the highest MLPE were then bootstrapped over 1000 iterations by subsampling 80% of the sample locations and refitting the MLPE model. Akaike information criterion (AIC) was used to compare model fitness to genetic data. We excluded eastern Kenya sites from landscape genetic analysis for Anopheles arabiensis due to the biases uneven sampling can introduce in analysis, as population sampling was not evenly distributed across the country 69 .