Negative frequency-dependent selection and asymmetrical transformation stabilise multi-strain bacterial population structures

Streptococcus pneumoniae can be divided into many strains, each a distinct set of isolates sharing similar core and accessory genomes, which co-circulate within the same hosts. Previous analyses suggested the short-term vaccine-associated dynamics of S. pneumoniae strains may be mediated through multi-locus negative frequency-dependent selection (NFDS), which maintains accessory loci at equilibrium frequencies. Long-term simulations demonstrated NFDS stabilised clonally-evolving multi-strain populations through preventing the loss of variation through drift, based on polymorphism frequencies, pairwise genetic distances and phylogenies. However, allowing symmetrical recombination between isolates evolving under multi-locus NFDS generated unstructured populations of diverse genotypes. Replication of the observed data improved when multi-locus NFDS was combined with recombination that was instead asymmetrical, favouring deletion of accessory loci over insertion. This combination separated populations into strains through outbreeding depression, resulting from recombinants with reduced accessory genomes having lower fitness than their parental genotypes. Although simplistic modelling of recombination likely limited these simulations’ ability to maintain some properties of genomic data as accurately as those lacking recombination, the combination of asymmetrical recombination and multi-locus NFDS could restore multi-strain population structures from randomised initial populations. As many bacteria inhibit insertions into their chromosomes, this combination may commonly underlie the co-existence of strains within a niche.

-3 -at a mean rate, τ, of 0.21 y -1 (τ = 0.0175 month -1 ) and span 6.4 kb of the genome. This rate does not include within-strain transformations, which are unlikely to be detectable through sequence divergence. Additionally, as these values were inferred from isolate collections, they correspond to the post-selection rate, and therefore underestimate the actual transformation rate.
Across the 616 genomes in the Massachusetts dataset, isolates encoded a mean of 309 intermediate-frequency loci. As an S. pneumoniae genome contains ~2,000 genes [1], this implies intermediate-frequency loci comprise ~15% of the genome. Therefore the monthly whole genome transformation rate (τ = 0.0175 month -1 ) was scaled to represent the rate with which transformation would affect only the intermediate-frequency loci, assuming an homogeneous distribution of transformation events (τ = 0.002625 month -1 ).
The typical transformation event size (6.4 kb [8]) would be consistent with the acquisition or deletion of a 5 kb accessory locus, given that transformation events affecting accessory loci typically include two flanking homologous arms of 0.75-1 kb [9]. Such an homologous recombination would correspond to approximately five intermediate-frequency genes, given each S. pneumoniae gene has a length of ~1 kb [3]. As each isolate encoded approximately 300 accessory loci, the proportion of loci affected during an exchange through transformation (ϱ) was set to 0.0167, such that the expected number of loci present in the donor and recipient potentially affected by transformation was approximately five. Given this size of accessory locus and the experimentally-determined relationship between the efficacy of insertion relative to SNP transfer [9], the magnitude of transformational asymmetry (φ) was estimated to be 0.05.
The product τρ, representing the mean rate at which transformation affects each S. pneumoniae accessory locus, dictates the timescale over which the effects of recombination are detectable. The calculated value of 4.38x10 -5 month -1 indicated each locus would only be -4 -expected to be affected by transformation in a given isolate once every ~2,000 years. Therefore simulations were run for 60,000 generations (equivalent to ~5,000 years).
Transformation appears to be a saltational process, with substantial inter-strain exchanges occurring infrequently [8]. These rare, but extensive, recombination events may play an important role in the emergence of strains [10]. To model this, simulations were run in which τ was reduced five-fold (τ = 0.000525 month -1 ), and ρ increased five-fold (0.0835), such that the mean transformation rate per locus (τρ) remained constant.

Initialisation of simulated populations
All simulations were initialised with a population of size κ generated through sampling input genotypes with replacement. For analyses using permuted genotypes, each simulation was initialised with an independently-generated set of genotypes in which the alleles at each locus (i.e., the columns of the gi,l and ci,s matrices) had been separately shuffled. Hence the genotypes were expected to be in linkage equilibrium, but each allele remained at the same initial frequency as in the genomic data. For analyses using randomised genotypes, each simulation was initialised with an independently-generated set of genotypes in which all alleles in each individual (i.e., all elements in the gi,l and ci,s matrices) were selected to be zero or one, with equal probabilities. Hence the genotypes were again expected to be in linkage equilibrium, and each allele had an initial frequency of approximately 0.5.
For analyses using a reduced subset of the accessory loci, all simulations used the same ten loci to enable the results to be combined and plotted. These were selected to be uniformly spaced across the range of intermediate frequencies, to control against any observed effects being specific to loci that were near one extreme of the distribution.

Simulation of migration
The strain composition of S. pneumoniae populations sampled from different locations varies extensively [2,7]. Hence simulations were run with inward migration from ten external populations, to reflect the effects of genotypes moving within a geographically-structured meta-population. High rates of migration would be expected to cause all communities within a metapopulation to homogenise [11]. Hence m was set at 10 -5 , such that under neutral evolution it would be expected that the majority of isolates in a population at the end of the simulations would not have been imported from other sources (i.e., (1-m) 60000 > 0.5).
-6 -For each analysed combination of parameters, ten independent replicate source populations were generated prior to the reported simulations. Each of these was produced by running a series of simulations, each for k generations, in which the final generation of one simulation was used to generate the starting population of the next. These simulations were themselves of closed populations. For this analysis, k = 600, and the series of simulations was run for the same overall number of generations as the analysed simulations (60,000 generations). At the end of each k generations, 5,000 genotypes were randomly sampled from the final population, without consideration of its categorisation into strains. These were used to generate a population of size κ through sampling with replacement to initiate the next phase of k generations. Once these serial simulations were complete, ten isolates were randomly drawn from each sampling timepoint (after each k generations) in each of the ten replicates.
This process generated pools of 100 migrants, denoted jk, j2k… jnk, each of which corresponded to a particular timestep of simulations run with a specified parameter set.
These were supplemented with 100 randomly-selected genotypes from the genomic data used as the starting population, to provide a j0 pool representing isolates in the early timesteps. When the analysed simulations featuring migration were run, the pool of genotypes from which migrants were drawn changed over time. Hence at generation t, while nk ≤ t < (n+1)k, the migrant genotypes were selected from jnk. Synchronising the immigrating and resident bacteria ensured migration did not artefactually disrupt long-term evolutionary trends in the simulations, such as the decay of the accessory genome in neutral simulations featuring asymmetric transformation.
The analysed simulations were each run as the equivalent simulations of closed populations, except that at each generation t, Mt isolates were imported into the simulated population at a rate determined by the parameter m: Hence the reproduction function was changed to: Each isolate in the pool jnk was equally likely to be randomly drawn to contribute to the Mt imported isolates.
Strains were defined using the genetic distance threshold indicated in Fig. 4 by constructing networks using igraph [23,24]. Permutations of gene presence and absence were conducted with vegan [25].
-9 - Additionally, exchange of sequence through recombination was underestimated in these simulations. Interstrain transformation occurred at a single rate that was inferred from reconstructions of clinical isolates' evolutionary history [8], which is necessarily measured after selection. Hence the actual pre-selection rate of transformation is higher, although this should not qualitatively alter the results, unless it were high enough to be predicted to drive the elimination of some accessory loci.
The transformation rate was simulated as being uniform across genotypes, between loci and over time. This does not account for the variation in interstrain transformation rate observed across the species [1, 7], nor the apparent 'hotspots' of recombination within the chromosome [27, 28]. Additionally, this does not reflect the punctuate nature of interstrain recombination [8], which was approximated by simulations with 'saltational' transformation. It was assumed that these large, infrequent recombinations had the same properties as more common transformation events, but it is possible they would not exhibit the same deletional bias. Large, more symmetrical exchanges would increase the rate of new strain formation, based on the diverse, unstructured populations generated by simulations combining multilocus NFDS with symmetrical transformation. This would increase the necessity for a mechanism driving outbreeding depression to preserve MSPs. Correspondingly, the simulations initiated with permuted or randomised initial populations demonstrated that the combination of multi-locus NFDS and asymmetrical transformation can restore a unstructured population's division into strains.
In each exchange through transformation within the model, the number of loci affected by recombination was determined by a fixed parameter applied at a constant rate per site.
However, transformation events in the core genome have an approximately exponential length distribution [29], consistent with greater variance in the number of SNPs being affected by each exchange between divergent genotypes in the population. The effect of transformation on accessory loci depends on how they are arranged within the chromosome as genomic islands, as short recombinations can nevertheless delete many accessory loci if they are present in the recipient as a contiguous stretch of DNA absent from the donor [9].
However, in these simulations all loci evolved independently, without consideration of the chromosomal architecture. Hence it is likely that the model underestimates the variation in the amount of recipient sequence affected by each exchange between cells, as well as the heterogeneity in the rate of such exchanges within the population.
The assumption of independently-evolving loci, without considering the linkage of nonmobile accessory loci into genomic islands [2,26], had further implications for the analysis.
Genomic islands are found in many combinations across S. pneumoniae population, which is consistent with the "modular selection" model of local epistasis [30]. The coherence of these distinct islands would decrease the number of possible accessory locus genotypes, and increase the variance and heterogeneity of the locus frequency, genome size and pairwise distance distributions generated by simulations featuring transformation.
Additionally, as the extent of transformational asymmetry is determined by the size of a genomic island [9], and the selective pressure acting upon it will be determined by the functions of the encoded genes, this means the prevention of decay by NFDS will actually be a property of each island, rather than being uniform across the population.
Contrasting with the strong linkage between accessory loci within a genomic island, there is little species-wide linkage between these islands and the proximal core genome SNPs in S.
pneumoniae [2]. Hence there is limited evidence for localised "fronts" of diversification surrounding genomic islands, as was previously hypothesised to flank structural variation [31][32][33]. Therefore, the extent to which neutral SNP frequencies were preserved through selection on accessory loci should represent a minimum, that may be slightly higher in more realistic simulations that account for the limited linkage between some core and accessory variation.
Many genomic islands, and by implication the accessory loci within then, are autonomously mobile [2]. Therefore the net asymmetry of the recombination processes affecting their distribution will favour insertion over deletion. This could be simulated using a parameterisation of φ > 1. Assuming such elements to be parasitic, recombinant progeny would likely be outcompeted by the original genotype, which does not harbour the element.
This would be consistent with the long-standing hypothesis that the steady-state frequency of 'selfish' elements reflects a balance between their mobility and selection against infected As well as the genetic simplifications, the simulations also assumed ecological homogeneity.
The selection pressures on the population depended on the equilibrium frequencies of the accessory loci. These were identical for all genotypes, and were consistent over the duration of the simulations. This is in contrast to the observation that public health interventions can cause differences in these equilibrium frequencies, such as by suppressing loci through -12 -  (Fig. S49). -

-
A further limitation of the simulations was the implementation as an individual-based model using a Wright-Fisher framework. This enabled individual genotypes to be modified through transformation over the course of simulations, but made the model computationally intensive to run. Hence the number of generations it was feasible to analyse was not sufficient for neutral simulations to reach an equilibrium (Fig. S1). It is also likely that the simulations featuring transformation that were initialised with permuted or randomised genotypes had not yet reached a final equilibrium state, given the relatively slow rate of transformation. -18 -  (Table 1).
-21 - Figure S4: Plots describing the dynamics of simulated populations. Data are displayed as in Fig. S1. These simulations were initiated with populations in which the alleles at each accessory locus, and SNP site, had been permuted across genotypes (Table 1).
-22 - Figure S5: Scatterplots comparing the frequency of alleles at the initial timepoint in the genomic data to their frequency in the final simulation timepoint (N = 616 isolates sampled from each simulation). Data are displayed as in Fig. 1. These simulations were initiated with populations in which the alleles at each accessory locus, and SNP site, were randomly generated (Table 1).
-23 - Figure S6: Plots describing the dynamics of simulated populations. Data are displayed as in Fig. S1. These simulations were initiated with populations in which the alleles at each accessory locus, and SNP site, were randomly generated (Table 1).
-38 - 0.0043, skewness: -0.13). These simulations were initiated with populations in which the alleles at each accessory locus, and SNP site, were randomly generated (Table 1).
-40 - in which the alleles at each accessory locus, and SNP site, were randomly generated (Table   1).
-67 - -68 - Figure S51: Comparison of trees generated from the genomic data and simulation outputs.
Data are shown as in Fig. 6. a Scatterplot comparing the characteristics of the neighbourjoining trees b Representative trees from individual simulations from each parameter set.
These simulations were initiated with populations in which the alleles at each accessory locus, and SNP site, had been permuted across genotypes (Table 1).
-69 - Figure S52: Comparison of trees generated from the genomic data and simulation outputs.
Data are shown as in Fig. 6. a Scatterplot comparing the characteristics of the neighbourjoining trees b Representative trees from individual simulations from each parameter set.
These simulations were initiated with populations in which the alleles at each accessory locus, and SNP site, were randomly generated (Table 1).
-70 - Figure S53: Comparison of trees generated from the genomic data and simulation outputs.
Data are shown as in Fig. 6. a Scatterplot comparing the characteristics of the neighbourjoining trees b Representative trees from individual simulations from each parameter set.
-71 - Figure S54: Comparison of trees generated from the genomic data and simulation outputs.
Data are shown as in Fig. 6. a Scatterplot comparing the characteristics of the neighbourjoining trees b Representative trees from individual simulations from each parameter set.
-72 - Figure S55: Comparison of trees generated from the genomic data and simulation outputs.
Data are shown as in Fig. 6. a Scatterplot comparing the characteristics of the neighbourjoining trees b Representative trees from individual simulations from each parameter set.
-73 - Figure S56: Comparison of trees generated from the genomic data and simulation outputs.
Data are shown as in Fig. 6. a Scatterplot comparing the characteristics of the neighbourjoining trees b Representative trees from individual simulations from each parameter set.
These simulations included only a reduced subset of ten accessory loci being subject to multi-locus NFDS (Table 1).