Ecological divergence despite common mating sites: Genotypes and symbiotypes shed light on cryptic diversity in the black bean aphid species complex

Different host plants represent ecologically dissimilar environments for phytophagous insects. The resulting divergent selection can promote the evolution of specialized host races, provided that gene flow is reduced between populations feeding on different plants. In black bean aphids belonging to the Aphis fabae complex, several morphologically cryptic taxa have been described based on their distinct host plant preferences. However, host choice and mate choice are largely decoupled in these insects: they are host-alternating and migrate between specific summer host plants and shared winter hosts, with mating occurring on the shared hosts. This provides a yearly opportunity for gene flow among aphids using different summer hosts, and raises the question if and to what extent the ecologically defined taxa are reproductively isolated. Here, we analyzed a geographically and temporally structured dataset of microsatellite genotypes from A. fabae that were mostly collected from their main winter host Euonymus europaeus, and additionally from another winter host and fourteen summer hosts. The data reveals multiple, strongly differentiated genetic clusters, which differ in their association with different summer and winter hosts. The clusters also differ in the frequency of infection with two heritable, facultative endosymbionts, separately hinting at reproductive isolation and divergent ecological selection. Furthermore, we found evidence for occasional hybridization among genetic clusters, with putative hybrids collected more frequently in spring than in autumn. This suggests that similar to host races in other phytophagous insects, both prezygotic and postzygotic barriers including selection against hybrids maintain genetic differentiation among A. fabae taxa, despite a common mating habitat.

Table S1: A: The five members of the Aphis fabae group according to Blackman and Eastop (2017) and their main host plants.B: Four additional aphid species that are currently counted towards the Aphis fabae group sensu largo and that might be found on the same hosts as Aphis fabae.As far as known, they are morphologically distinct from member of the Aphis fabae group sensu stricto but genetically weakly resolved.Summarised from the following online databases: http://www.aphidsonworldsplants.info(accessed 25.08.2023, Blackman andEastop (2000)), https://influentialpoints.com (accessed 25.08.2023, B. Dransfield and B. Brightwell), http://aphid.speciesfile.org(accessed 24.09.2023, C. Favret and D. Eades) Table S3: Cycling conditions and primers for used for microsatellite PCR; as applied for and presented in the manuscript "Defensive symbiosis in the wild: seasonal dynamics of parasitism risk and symbiont-conferred resistance" (Gimmi et al., 2023).The primers were published by Coeur d'acier et al. (2004).between sampling sites within each of the four main groups found in the winter host data with STRUCTURE under K=6 and assigning samples to a cluster if they show an assignment probability >0.8.FST values were calculated using pairwise.WCfst, 95% CI were estimated using boot.ppfstwith nboot=1000 from the R package hierfstat (Goudet, 2005).Values whose confidence intervals do not include zero are printed in bold and red.Table S12: Pairwise FST values (Weir & Cockerham) and 95% confidence intervals (in brackets) between sampling time points within each of the four main groups found in the winter host data with STRUCTURE under K=6 and assigning samples to a cluster if they show an assignment probability >0.8.FST values were calculated using pairwise.WCfst, 95% CI were estimated using boot.ppfstwith nboot=1000 from the R package hierfstat (Goudet, 2005).
Values whose confidence intervals do not include zero are printed in bold and red.).Minimal values or "elbows" in the curves, i.e. trends that change from decreasing to increasing, may hint at the "optimal" number of clusters in the data, a such is most clearly visible in the BIC plot at K=6.  Clustering results from snapclust for the number of clusters K=6, K=7, or K=8.Each aphid individual is represented by a vertical bar, the proportion of this bar in a specific color represents the likelihood that the sample belongs to the respective cluster (membership probability).For each K, the wide boxes to the left show all 2099 samples used in the analysis.For all solutions the samples are ordered according to the cluster for which they show highest membership probability in the K=8 result.The two narrow boxes to the right zoom in on the reference samples known to represent A. f. fabae and A. f. cirsiiacanthoides, respectively.

snapclust K=6
A. f. fabae A. f. cirsii.Figure S4: Output from STRUCTURE HARVESTER (Evanno et al., 2005) used to determine the optimal number of clusters (K) in the results from running STRUCTURE on the full dataset using the settings suggested by Wang (2017).These settings should improve the detection of small clusters in datasets with uneven or unknown samples sizes, but they may also lead to overestimation of the "optimal" number of clusters.The "optimal" number of clusters might be derived from the plot on the left as the K (y-axis) for which the mean Ln of assignment probability (x-axis) is highest, or sometimes also where the curve flattens down (Pritchard et al., 2000), no such pattern can be seen here.From the plot on the right, the optimal number of K (y-axis) might be derived as the one where DeltaK is maximal (x-axis, Evanno method, Evanno et al., 2005), i.e.K = 2 is determined as the optimal number of clusters here.The summary table shows the values that are visualized in the plot.

Figure S5:
Output from STRUCTURE HARVESTER (Evanno et al., 2005) used to determine the optimal number of clusters (K) in the results from running STRUCTURE on the more balanced data subset with the settings suggested by Wang (2017).These settings should improve the detection of small clusters in datasets with uneven or unknown samples sizes, but they may also lead to overestimation of the "optimal" number of clusters.The "optimal" number of clusters might be derived from the plot on the left as the K (y-axis) for which the mean Ln of assignment probability (x-axis) is highest, or sometimes also where the curve flattens down (Pritchard et al., 2000), which is the case at K=6 here.From the plot on the right, the optimal number of K (y-axis) might be derived as the one where DeltaK is maximal (x-axis, Evanno method, Evanno et al., 2005).
The two peaks at K = 4 or K=6 visible in this plot are caused by the high uncertainty for the K=5 solution visible in the plot on the left.(b) on samples belonging to the presumed A. f. fabae cluster (cluster 1) only, no substructure is suggested here; (c) on the even data subset, K=6 is suggested here (the "optimal" number of clusters might be indicated by minimal values and/or an "elbow" in the curve).

Figure S10:
Clustering results from snapclust (sc), STRUCTURE (STR) and DAPC applied to the more balanced data subset.Each aphid individual is represented by a vertical bar, the proportion of this bar colored in a specific color corresponds to the likelihood that the sample belongs to a specific cluster (membership probability).For each K, the wide boxes to the left show all 1333 samples used in the analysis next to each other.For all solutions the samples are ordered by their most likely cluster in the snapclust K=6 solution in the full data analysis.The two narrow boxes to the right zoom in on the reference samples from A. f. fabae and A. f. cirsiiacanthoides, respectively.

Figure S11
: Discriminant analysis of principal components (DAPC) after dividing the more balanced data subset into six groups using the k-means algorithm.The axes represent the 1 st and 2 nd (left), the 3 rd and 4 th (middle) or the 1 st and 5 th (right) linear discriminants; the number in brackets shows the percentage of variance explained by the discriminant.Each dot represents an individual aphid, its color corresponds to its group assignment as used for DAPC (which is very similar but not identical to the group assignment resulting by the snapclust and STRUCTURE analyses).

Figure S2 :
Figure S2: AIC, BIC and KIC values for the clustering results obtained with snapclust applied to the more balanced data subsets (containing a subset of data from the largest cluster such as to have more similar numbers of samples belonging to the six clusters initially inferred with snapclust), for different numbers of clusters (K).
Figure S3:Clustering results from snapclust for the number of clusters K=6, K=7, or K=8.Each aphid individual is represented by a vertical bar, the proportion of this bar in a specific color represents the likelihood that the sample belongs to the respective cluster (membership probability).For each K, the wide boxes to the left show all 2099 samples used in the analysis.For all solutions the samples are ordered according to the cluster for which they show highest membership probability in the K=8 result.The two narrow boxes to the right zoom in on the reference samples known to represent A. f. fabae and A. f. cirsiiacanthoides, respectively.

Figure S6 :Figure S7 :
Figure S6: Clustering results from STRUCTURE under K=6.Each aphid individual is represented by a vertical bar, the proportion of this bar in a specific color represents the likelihood that the sample belongs to the respective cluster (membership probability).The plot shows the same results (bars) as the second plot in Figure 1, but ordered by sampling timepoint here (first four rows, host plant = Euonymus europaeus if not indicated differently) and/or host plant (last row).Host plant abbreviations: Am Achillea millefolium, Ap Aegopodium podagraria, As Anthriscus sylvestris, Al Arctium lappa, Bv Beta vulgaris, Cb Capsella bursa-pastoris, Ca Chenopodium album, C Cirsium spp., Ga Galium aparine, Gm Galium mollugo, Mc Matricaria chamomilla, Pr Papaver rhoeas, Ro Rumex obtusifolius, Tm Tropaeolum majus, Vo Viburnum opulus.

Figure S9 :
Figure S9: BIC values for k-means clustering solutions on (a) the full dataset, K=2 is suggested here;(b) on samples belonging to the presumed A. f. fabae cluster (cluster 1) only, no substructure is suggested here; (c) on the even data subset, K=6 is suggested here (the "optimal" number of clusters might be indicated by minimal values and/or an "elbow" in the curve).

Table S4 :
Cycling conditions and primers used for symbiont-diagnostic PCR.

Table S5 :
Number of samples assigned to each cluster with snapclust or STRUCTURE under K=6, using an assignment threshold of p>0.8 (p is the group membership probability).

Table S6 :
Summary of allele numbers, observed and expected heterozygosity, and p-value of an exact test for HWE per locus; for the full dataset and for each of the six genetic groups inferred with STRUCTURE.P-values from HWE tests <0.05 are printed in bold, p-values below the Bonferroni-corrected significance threshold of 0.05/64=0.0008are printed in red.

Table S7 :
Contingency table showing the number of samples per genetic group per winter host (assignments based on STRUCTURE under K=6).A Fisher's Exact Test (simulated p-value based on 2000 replicates) on a version of this table with the row for undetermined samples removed (these are the samples that are not assigned to any cluster with p>0.8) results in a pvalue <0.001.

Table S8 :
Contingency tableshowing the number of samples per genetic group per winter host (assignments based on snapclust under K=6).A Fisher's Exact Test (simulated p-value based on 2000 replicates) on a version of this table with the row for undetermined samples removed (these are the samples that are not assigned to any cluster with p>0.8) results in a p-value <0.001.

Table S9 :
Contingency table showing the number of samples per genetic group per summer host (assignments based on STRUCTURE under K=6).A Fisher's Exact Test (simulated p-value based on 2000 replicates) on a version of this table with the row for undetermined samples removed (these are the samples that are not assigned to any cluster with p>0.8) results in a pvalue <0.001.
group Am

Table S10 :
Contingency table showing the number of samples per genetic group per summer host (assignments based on snapclust under K=6).A Fisher's Exact Test (simulated p-value based on 2000 replicates) on a version of this table with the row for undetermined samples removed (these are the samples that are not assigned to any cluster with p>0.8) results in a pvalue <0.001.

Table S13 :
Results from the search for hybrid genotypes using NewHybrids in datasets containing 20 simulated hybrids and their parental populations (N=number of samples).Average values from 100 datasets with identical parents but newly simulated hybrids are shown.The input data consisted of those samples that were assigned to either of the parental cluster with p>0.8 in the STRUCTURE analysis under K=6 and for which data was complete for all eight markers.

Table S14 :
Results from the search for hybrid genotypes using NewHybrids.The number of "undetermined" samples for each subset corresponds to the number of samples that show p<0.8 for any cluster but highest probability to one of the considered parental clusters and second highest to the other.

Table S15 :
Aphid samples identified as putative hybrids, ordered by presumed parents, then sampling timepoint.

Table S16 :
Prevalence of the endosymbionts Buchnera aphidicola (an obligate symbiont, thus our positive control), Hamiltonella defensa and Regiella insecticola in the genetic clusters determined by STRUCTURE, and number of samples for which each symbiotypes was observed.H-R-: none of H. defensa or R. insecticola; H+ R-: only H. defensa, H-R+: only R. insecticola; H+ R+: both H. defensa and R. insecticola.Note: the number of samples per cluster (n) considered for the endosymbiont analyses is slightly lower than the total number of samples (cf.TableS4).

Table S17 :
Values and from pairwise Fisher's Exact tests to assess the statistical significance of differences in symbiotypes (see TableS15) between the genetic groups inferred from the STRUCTURE K=6 solution.The Bonferroni-corrected significance level is 0.05/18 = 0.00278.AIC, BIC and KIC values for the clustering results obtained with snapclust applied to the full dataset (2099 samples) for different numbers of clusters (K