Long-range linkage effects in adapting sexual populations

In sexual populations, closely-situated genes have linked evolutionary fates, while genes spaced far in genome are commonly thought to evolve independently due to recombination. In the case where evolution depends essentially on supply of new mutations, this assumption has been confirmed by mathematical modeling. Here I examine it in the case of pre-existing genetic variation, where mutation is not important. A haploid population with \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N$$\end{document}N genomes, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$L$$\end{document}L loci, a fixed selection coefficient, and a small initial frequency of beneficial alleles \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${f}_{0}$$\end{document}f0 is simulated by a Monte-Carlo algorithm. When the number of loci, L, is larger than a critical value of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\text{4log}}^{2} \left( {Nf_{0} } \right),$$\end{document}4log2Nf0, simulation demonstrates a host of linkage effects that decrease neither with the distance between loci nor the number of recombination crossovers. Due to clonal interference, the beneficial alleles become extinct at a fraction of loci \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$1-2\mathrm{log}\left(N{f}_{0}\right)/{L}^{0.5}$$\end{document}1-2logNf0/L0.5. Due to a genetic background effect, the substitution rate varies broadly between loci, with the fastest value exceeding the one-locus limit by the factor of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${[{L}^{0.5}/\mathrm{log}\left(Ns\right)]}^{0.75}.$$\end{document}[L0.5/logNs]0.75. Thus, the far-situated parts of a long genome in a sexual population do not evolve as independent blocks. A potential link between these findings and the emergence of new Variants of Concern of SARS-CoV-2 is discussed.


Scientific Reports
| (2023) 13:12492 | https://doi.org/10.1038/s41598-023-39392-z www.nature.com/scientificreports/ In sexually reproducing organisms and viruses with frequent recombination, linkage effects are partly compensated by recombination between parental genomes. A fundamental fact of genetics discovered by Morgan is that frequent recombination destroys allelic associations, so that alleles at far-spaced loci segregate independently. Conventional wisdom tells us that all the other linkage effects between far-situated loci must vanish as well. Models of long-term sexual evolution depending on new mutation events confirm this expectation [21][22][23] . Assuming that genome consists from independently-evolving blocks and applying the phylogenetic theory of asexual evolution to each block, these authors constructed a scaling argument expressing the length of each block, the lead of the traveling wave, and the average coalescent time in terms of the average adaptation rate. The analytic predictions have been confirmed numerically for two particular models of population in the presence of natural selection and mutation.
In the present work, I investigate linkage effects in a different biological scenario, when natural selection and recombination act on pre-existing beneficial alleles, and new mutations can be neglected. This model is appropriate, for example, when a population migrates to a new environment, or a virus was subjected to rapid mutation for a period of time. Then the fixation of pre-existing beneficial alleles does not depend on mutation de novo.

Results
Model. The evolutionary factors included in the model are directional natural selection, random genetic drift, linkage, and recombination. A sexually reproducing population is comprised of N individual genomes (or N/2 diploid genomes without allelic dominance), where each genome has L loci, and L is assumed to be much larger than unit. In the beginning, each locus is assumed to have a fraction f 0 of beneficial alleles, with a fixed fitness benefit s . The initial state of a population is obtained by the generation of random and uniform distribution of alleles across all sites and individual genomes. The value of f 0 is assumed to be in interval 1 Ns ≪ f 0 ≪ 1. The evolution is simulated in MATLAB™ using a Wright-Fisher process, in which the progeny genomes replace the parental genome. The logarithm of the average progeny number (log fitness) is given by the product of the number of beneficial alleles and the selection coefficient s (Eq. 9, Section "Materials and Methods"). The random number of progeny for each genome is generated using a random number generator and the broken-stick algorithm. The total number of genomes N does not change. With some probability r, which is an input parameter of the model, a genome undergoes a number of random crossovers with another, randomly chosen genome. The average crossover number is M. One of the two parents is replaced with the recombinant. Below I assume r = 1 , which corresponds to fully-sexual reproduction. Parameters r and M can be connected to the average number of crossovers between two sites r 2 , which enters 2-locus models, as given by r 2 = rM/L . Altogether, the model has 5 input parameters (N, L, s, r, M) and the initial value f 0 .
New mutation events are absent. Epistasis and allelic dominance are neglected, and a haploid population is considered. Indeed, as it is well-known in population genetics, a diploid population with N/2 genomes and without dominance is effectively haploid, with a double number of genomes N. The details are given in section "Materials and Methods".
Extinction of beneficial alleles depends on a single composite parameter. If the number of loci L is sufficiently large, beneficial alleles at most loci become extinct. The fraction of remaining polymorphous loci, denoted 1 − C loss (t) , decreases in time from 1 to at a low plateau (Fig. 1, red line).
This result differs strongly from the prediction of the single-locus model, in which multiple lineages per site are expected to reach fixation in the chosen parameter interval, and C loss is exponentially small. Indeed, in the single-locus model, the fixation probability of an allele is s , and the extinction probability is 1 − s 24 . The probability of the extinction of Nf 0 beneficial alleles present in the beginning is given by C loss (∞) = (1 − s) Nf 0 ≈ e −Nf 0 s , which is exponentially small. Thus, in the parameter regime investigated, Nf 0 s ≫ 1, the one-locus model predicts that many alleles are fixed per each locus, and the loss of polymorphous loci due to genetic drift 25 is negligible. Therefore, the observed loss is not a one-locus effect ocurring due to genetic drift, but is caused by competition between the clones of beneficial alleles expanding at different loci 3,15,16,18 .
Varying model parameters in simulation, I found out empirically that the fraction of loci with non-extinct alleles, 1 − C loss (∞), depends mostly on a single composite parameter ( Fig. 2A-C) Note the critical point, log(Nf 0 ) = 0.5 √ L . If the population size is too large or the number of loci is too small, no significant loss of polymorphism is predicted. Intuitively, the clonal interference effects are expected to increase with the number of intefering loci, i.e., the length of genome L , just as the linkage effects on the adaptation rate increase with L 4 . The reason for a seemingly sharp transition remains to be investigated by analytic methods.
The fastest adaptation rate among loci is much faster than in a single-locus model. Because most loci fail to complete adaptation, the average frequency of beneficial alleles per locus, f av (t) , saturates far below 1 (Fig. 1, blue line). The dependence of average heterozygocity on time, 2w 2 (t), is decreased accordingly (Fig. 1, green). The allele frequency averaged over remaining polymorphic sites, f pol (t), increases in the same general time range as the one-locus prediction. The time of half-fixation of polymorphous sites, t 50 , is very close to the deterministic one-locus prediction, t 50 ≈ t 1loc (Fig. 1) In the range of parameters s = 0.025 − 0.2, L = 200 − 2000, N = 1000 − 10, 000 , the relative difference between t 50 and t 1loc is between -0.11 and 0.14. Compared to the one-locus model prediction (blue dashed curve in Fig. 1), the dependence f pol (t), experiences a delay in the late phases of adaptation and has a noticeable random oscillation component (Fig. 1, blue ).
At some loci, alleles accumulate much faster than predicted by the one-locus model ( Fig. 1, black line). Indeed, the half-time of adaptation of the fastest locus, max(t loc ) , is much shorter than t 1loc and increases as power ¾ of composite parameter log(Ns) √ L (Fig. 2E). The ultrarapid evolution at some loci implies that the genome segments comprising these loci tend to have unusually high numbers of beneficial alleles. Indeed, the fitness of a genome is set to be proportional to that number. In other words, the broad variation in the evolution rate between loci with an identical selection coefficient demonstrates the existence of a strong background selection effect created by random recombination events. Recombination brings together different numbers of favorable alleles in different segments, and natural selection favors the fittest. The resulting distribution of genomes in fitness forms a traveling wave, well-known for both asexual and sexual populations 24 (Fig. 3A). The new genomes at the highfitness edge of the wave are born by recombination. They grow much faster in number than the average genome at the wave maximum, causing the wave to move forward 26 .
The population has a complex lineage structure that varies between loci. For a given locus, a lineage is determined as the set of individuals that have the same initial ancestor. The lineages all initially consists from a single individual, the founder (Fig. 3B), but their sizes grow in time, at different rates for different loci, and become distributed in a very broad range ( Fig. 3C-G). The lineage size distribution between loci shifts in time towards larger lineages eventually occupying almost the entire population. If we take into account only the largest lineage for each locus, their size distribution looks similar but has a low cutoff increasing in time ( Fig. 3B-G, column 2). The largest lineages grow to a half of the population at a much earlier time than t 1loc in Eq. (2).
Phylogenetic time scale depends only on the same composite parameter. Another quantity affected by linkage effects is the identity by descent, C , defined as the probability of a homologous locus pair to have the same initial ancestor. The average identity by descent averaged over all loci and over only polymorphous loci is almost the same (magenta line and magenta , Fig. 1). This result, again, differs from the single-locus prediction, where common ancestry is rare, C(t) < f 2 pol (t) , because each of the pair of loci must fall into the same growing lineage to have the same ancestor, and the size of each lineage relative to the population size is smaller than f pol (t) . In contrast, in the present simulation, C(t) is larger than f pol (t) , which is larger than f 2 pol (t). At the time point t = T 2 such that C(T 2 ) = 1 − C loss (T 2 ) , both quantities are close to a half, in a broad parameter range, as given by Observables depend weakly on the average number of recombination crossovers. The above results in Figs. 1, 2, and 3 are weakly sensitive to the average crossover number, M . In its entire range of between 1 and L , the fraction of loci that do not lose alleles, 1 − C loss (∞), varies only by the factor of ~ 2 (Fig. 4). The variation of selection coefficient s is rescaling units of time; otherwise, its effect is modest (Fig. 4).
The absence of long-range linkage disequilibrium. A measure of linkage disequilibrium (LD) is Pearson correlator between allelic frequencies at two loci which is averaged over pairs of sufficiently heterozygous loci, 2f loc 1 − f loc > 0.1. The value of r 2 defined by Eq. (4) depends on time, as follows. Initially, r 2 ≡ 0 due to the random initial distribution of alleles set in simulation. After some time has elapsed, r 2 becomes positive at any distance between loci (Fig. 5, t = 30) . After passing through a maximum, r 2 decreases in time again. At each next time point, r 2 depends more and more sharply   In the parameter interval of interest, this value falls far below the 1-locus prediction, P 1loc fix = s (Fig. S1). Probability P fix , Eq. Phylogenetic tree varies between loci. In addition to calculating the phylogeny time scale T 2 , we constructed the ancestral trajectory of a locus between individuals in real-time by recording the parentage of each individual locus and then tracing its ancestry back in time. Lineage of each locus jumps randomly between individuals due to recombination (Fig. S2A). If we straighten these trajectories and keep only the topology of coalescence and the coalescent times, we arrive at phylogenetic trees for different loci (Fig. S2B-D). As expected, the tree varies strongly between loci due to recombination, and the early branches are much shorter relative to late branches compared to Kingman's coalescent constructed in the absence of selection 28 . The average density of coalescent events averaged over 10 runs and normalized to the prediction of the selectively-neutral model (Methods) decreases exponentially with time (Fig. S2E, F). This is because the coalescent event density is proportional to the inverse effective population size 28 , which is the size of the growing variant subpopulation. Importantly, the coalescent density is much larger than in the one-locus limit and increases with number of loci L . Thus, in agreement with the previous studies, uncompensated linkage in the presence of selection makes phylogenetic trees denser and changes their shape by making early stems shorter 5,6,20,29 (Fig. S2E,F).

Alleles "surf" between lineages.
In addition to the ancestor number trajectory of a locus (Fig. S2A), one can also construct its fitness trajectory, by recording the fitness values of its ancestors (Fig. S2G). The fitness trajectory comprises alternating straight horizontal segments due to the clonal expansion connected to jumps caused by recombination. The jumps occur in both directions, but more often towards a genetic background with a higher fitness (Fig. S2G). This "allelic surfing" behavior was predicted for sexual populations analytically 27,30 .

Discussion
In the presence of standing variation, on moderate time scales, rapid evolution can occur in the absence of new mutations, and even much faster than due to mutation 26 . Such evolution based on standing variation and natural selection, with or without recombination, has been observed for poliovirus 31 and VSV 32 . The present study considers a process of adaptation after minute quantities of beneficial alleles are generated in the beginning, for www.nature.com/scientificreports/ example, due to the change of external environment, or due to early mutation. If an allele is not lost to random genetic drift, its further accumulation is dominated by natural selection and recombination working together. The process continues, until all alleles are either fixed or lost. Despite of the lack of observable LD for far-situated loci, simulation predicts the existence of strong longrange linkage effects encompassing the entire genome. The effects include the extinction of beneficial alleles at most loci, due not to random drift but to clonal interference, weak sensitivity of results to the number of crossovers, and ultrarapid evolution at some loci, even faster than in the independent-locus limit. The last observation implies the existence of genomic segments enriched in beneficial alleles over the average. Taken altogether, these results imply that far-situated genomic regions do not evolve independently, and recombination is not strong enough to break down linkage effects caused by selection acting, thus, at the level of a whole genome, as well as at the level of genomic segments.
If the locus number is decreased, or if the population size is increased, a transition to the independent-locus limit is predicted. The predicted dependence of all linkage effects on the population size N is logarithmic (Fig. 2). For a genome of 200 loci and f 0 = 0.02, s = 0.1 , the transition to the independent-locus regime can be observed already for 100,000 individuals. For a longer genome of 1000 loci, they would evolve independently only for populations of 10 12 individuals or larger, which is unrealistic for most species. A human or an animal population has millions of variable loci, of which a significant fraction is under natural selection, so that independent-locus models, probably, never work in most animals, except for rare mutations that are under very strong selection pressure.
The results of the present study carried out for the moderate-term evolution are in striking contrast to the previous findings for the long-term evolution driven by mutation, selection, and recombination, where genome was demonstrated to consist from quasi-independent blocks 21-23 . In my notation, the cited result for the average time to the most recent common ancestor has the form [ 21 ,Eq. 5] where v is the average rate of long-term adaptation, defined as the fitness gain per unit time, const is a number on the order of 1, and the logarithm is supposed to be much larger than 1. In the present model, the proxy of T MRCA , by the order of magnitude, is T 2 in Eq. (3), and the adaptation rate is As already mentioned, the average time to a half-fixation for the loci that do not lose alleles, t 50 , is always close to one-locus limit t 1loc . Substituting Eqs. (3) and (8) into Eq. (7), we get The last equality is clearly false, because M, N, f 0 and s are independent parameters. Hence, Eq. (7) does not apply in the present case with pre-existing variation.
Note that the analytic argument in 21 was developed and tested for the stationary long-term adapation. After the medium-term evolution ends with the loss or fixation of all initially-present alleles, further evolution requries a constant supply of new mutations. The derivation in 21 is based on two statements: the assumption that a genome evolves as quasi-independent asexual blocks, and an expression for the time to the most recent common ancestor in terms of the average adaptation rate. The expression was based on the basic concept that the time to most recent common ancestor is the lead of the wave divided by the adaptation rate and was confirmed for various multi-locus models, both sexual and asexual. Therefore, it is likely that the quasi-independence assumption is the cause of the discrepancy.
In other words, in the case of pre-existing variation and moderate-term evolution studied here, the genome does not evolve as a set of quasi-independent segments. That conclusion is indirectly confirmed by the results in Fig. 3 showing that beneficial alleles can form highly-fit genomes whose rapid growth outruns mixing of genomes due to recombination (Fig. 3). A recombinant that decreases fitness is not relevant for future generations. Furthermore, within one realization (Monte Carlo run), the fitness variance of a genomic segment normalized to the genome fitness is not linearly proportional to its length, but shows a complex step-like dependence (Fig. 6).
The results obtained in the present study might be potentially relevant for the viruses with frequent recombination, such as HIV, polio, or SARS-CoV-2. Similar to seasonal human coronaviruses or influenza virus, SARS-CoV-2 is constantly acquiring new mutations in its genome and has hundreds (if not thousands) of observably-diverse sites. Evolution is especially fast in receptor Spike protein, 5 replacements per year [33][34][35][36] . Two major reasons account for the high speed of evolution, as follows. Firstly, Spike has receptor-binding motives that affect transmission, and their evolution leads to the emergence of variants with enhanced transmissibility. Secondly, Spike contains epitopes, regions that are important for the immune response because of their involvement in binding of antibodies that can neutralize virus. Mutations in epitopes are a major factor that limits the virus recognition by the immune system and, hence, the durability of protection [37][38][39] .
A puzzle important for devising future vaccination strategies is the origin of the variants of concern (VOC) produced by large groups of new mutations that emerge all together at once [40][41][42][43][44] . Alternative theories of the emergence of VOCs 45 include reverse zoonosis, the evolution within immunocompromised patients 46,47 and the evolution in population pockets not covered by the genetic surveillance. Still another possibility is the fitness  12,48 and studied theoretically 49 . Based on the present study, I may add yet another possible explanation. SARS-CoV-2, with its single-chromosome genome, has observable crossover recombination [50][51][52] . Hence, the large packages of mutations may emerge due to the combined effects of recombination and natural selection and represent the sequences comprising the fastest loci ( Fig. 3H and J).
VOC are characterized by the sudden emergence of sister strains with large groups of mutations, occuring not on the background of the existing strain, but in parallel. This real-life observation can be compared to the tree predicted by the present simulation obtained by neighbor-joining analysis from a sample of 10 simulated sequences, at 3 time points each (Fig. 7). Note that genome samples obtained at different time points form separate subtrees (clades) growing on the same stem, the first two subtrees without apparent relation to each other. Each tree has an excess in the number of beneficial alleles compared to the previous tree. These features bear resemblance to VOC.
To understand the importance of recombination for SARS-CoV-2, we need to know the frequency of coinfected individuals among all the infected, which determines outcrossing probability r , an important input parameter entering the models of sexual populations 1,21,30 . For fully sexual reproduction considered in the present work, by the definition, r = 1. The outcrossing number for SARS-CoV-2 is presently unknown, but it could be larger than it seems due to the possibility of a co-infection during superspreading events 53,54 . Methods developed previously to quantify recombination from RNA sequence data for HIV could be re-applied to SARS-CoV-2 1,55 . The potential relevance of the present model for VOC of SARS-CoV-2, as well as a detailed analysis of SARS-CoV-2 in the light of this hypothesis, will be investigated elsewhere.

Assumptions and limitations.
(i) The model considered only variable sites under purifying selection. While a real population has many conserved sites and some selectively neutral sites as well, the existence of these sites has no effect on the evolutionary dynamics of the sites under purifying selection. Hence, these sites are not included into consideration explicitly. (ii) We have investigated the case of a constant selection coefficient, but the results are expected to apply also for a sufficiently fast decaying distribution of selection coefficients, such as a Gaussian distribution. Distributions with long tails may have different properties, where the traveling wave is replaced by pairwise clonal interference 7 . The border case is the exponential distribution often observed in experiments on pathogens, which fact has been explained 56 . In this case, the present scenario with a fixed effective value of selection coefficient applies at sufficiently large population sizes 7 . (iii) The model assumed homologous recombination for the two reasons, as follows. Firstly, recombination in viruses and organisms is similar in the sense that the vast majority of recombination events occur between homologous templates and do not include insertions or deletions. Secondly, when non-homologous recombination does occur, which effect is more frequent in viruses, the resulting progeny is often defective and, hence, not important for the evolutionary dynamics of a population.

Conclusion
In sexual populations with pre-existing beneficial alleles, in an exponentially broad range of population size, recombination cannot suppress long-range linkage effects, including the excessive loss of beneficial alleles due to clonal interference, independence of observables on the average number of crossovers, and superfast evolution at some loci due to a genetic background effect. A potential link of these findings to the emergence of VOC of SARS-CoV-2 will be investigated elsewhere.

Materials and methods
Consider a fully sexual population with L loci comprised of N individual genomes. Each locus has initially Nf 0 alleles, 1/Ns ≪ f 0 ≪ 1 , with fitness benefit s ≪ 1 . In each generation step, each genome undergoes random crossovers with another, randomly chosen genome, with average crossover number M, producing a recombinant genome. One of the two parents is replaced with the recombinant. Genome number j with k j favorable alleles is replaced with a random number of its copies distributed according to the polynomial distribution implemented by "broken stick" method, as follows. N random points are generated uniformly within the interval [0, N] broken into N segments. The length of segment j is proportional to the fitness of the corresponding genome w j The number of random values that fall into segment j are taken to be the number of his progeny in the next generation. Thus, the total number of genomes stays constant. New mutations are neglected, which is www.nature.com/scientificreports/ shown to be correct in the short-term in the presence of pre-existing genetic variation, both in simulation and experimentally 31,32 . Epistasis is absent; for epistatic analysis, see 12 and references therein. Input model parameters are the selection coefficient across loci, s = s 0 , population size N, outcrossing rate r = 1, number of loci L, initial beneficial allele frequency f 0 , total simulation time t , average number of recombination crossovers M, and the seed number of the generator of pseudorandom numbers.