## Introduction

The widespread deployment of biomedical interventions against SARS-CoV-2 has highlighted viral evolution as a significant potential risk in bringing the ongoing pandemic to an end. SARS-CoV-2 has a high mutational rate, similar to other RNA viruses, and a high mutational tolerance in key proteins, such as Spike, the molecular target for many biomedical interventions against the disease1,2,3.

The emergence and expansion of novel SARS-CoV-2 variants over the past few months threatens to undermine the promise of a return to normalcy as a result of the newly deployed vaccines. In fact, many of these new variants are more capable of infecting cells, spreading between hosts, and/or evading natural immunity or therapeutics4,5,6. On a population level, natural selection has acted to rapidly and deterministically increase the frequency of these more-fit SARS-CoV-2 variants, leading them to dominate the local viral population after emergence (Table S1). The emergence of fitter variants has also led in a number of cases to more severe disease outbreaks, and poses the threat of eventual reductions in vaccine efficacy4,5,7,8.

In addition to acting at the population level, natural selection has been shown to select for advantageous viral variants that are generated within individual patients infected with SARS-CoV-2. Longitudinal sequencing of SARS-CoV-2 from individual patients has revealed selection for multiple antibody-evading mutations, particularly in patients with long-term infections treated with convalescent plasma9. Additionally, studies have shown that individuals with impaired immune function can shed high levels of virus for weeks, creating an environment where SARS-CoV-2 is exposed to prolonged selection pressures favoring variants that can escape the immune response and/or are resistant to treatment10,11,12,13,14.

Understanding the factors driving the evolutionary process for SARS-CoV-2 could potentially allow us to design and use biomedical interventions in a way that hinders viral evolution, giving us the upper hand in managing the pandemic. In particular, there are several steps in the process of generating new advantageous variants that are stochastic, occurring largely by random chance.

First, sufficient genetic diversity must be created within infected individuals through stochastic events. SARS-CoV-2 mutations are initially generated by random errors in viral replication within individuals with COVID-19. Deep sequencing studies have revealed that the SARS-CoV-2 viral population exists within the host as a quasispecies15,16,17,18, a population structure with a large number of related sequences arising from de novo mutations that occur during the course of infection. Quasispecies genetic diversity has been shown to vary over time16,19, and deep sequencing studies have demonstrated a role for genetic drift20 and intrahost transmission bottlenecks15 as the virus moves from one region of the body to another. While genetic diversity provides opportunities for advantageous mutations to arise and expand due to natural selection within individuals infected with SARS-CoV-2, genetic drift may provide a barrier to the deterministic expansion of advantageous viral mutations at lower population sizes.

Next, viral variants generated within a COVID-19 patient are transmitted to new hosts. During this process, further stochasticity is introduced by the low numbers of viral particles required to start an infection in a new host, which creates a narrow transmission bottleneck21,22. Consistent with this bottleneck, sequencing studies have provided conflicting results with respect to the ability of intrahost variants to transmit and establish new SARS-CoV-2 lineages that have the potential to transmit widely. While a number of studies have shown this to be the case21,22,23,24, other studies have not been able to demonstrate transmission of viral lineages derived from intrahost evolution11,25,26. The stochasticity in inter-host transmission of specific viral lineages is compounded by the overdispersed nature of SARS-CoV-2 spread, where most onward transmission originates from a small number of individuals. Recent work also suggests an important role for stochastic extinction of variants, as at least five new infections are required in order for a newly emergent variant to establish itself in the population27.

These stochastic factors in the evolution of SARS-CoV-2 represent a potential weakness that can be exploited in the design of intervention strategies to slow viral evolution. To the extent that the stochastic contribution of drift can be increased, the deterministic contribution of natural selection to the improvement of viral fitness can be weakened. In this study, we have used evolutionary dynamics to better understand the process by which fitter SARS-CoV-2 variants arise during infections and to identify practical means by which viral evolution can be slowed.

## Results

### Single mutants that are fitter within patients are more likely to be transmitted to new hosts

To investigate mutation and selection dynamics of SARS-CoV-2 within hosts, we simulated stochastic viral evolution using a modified Wright-Fisher model (Fig. 1A, Methods). During each 12-h replication cycle, virions from the previous generation are randomly selected with probability proportional to their fitnesses to replicate and produce a burst of Nb new viral particles in the next generation. Each replicating virion has a constant probability of generating a new single point mutation that is passed to all of its Nb progeny. The total number of virions present in each generation is given by estimates from previously-published sputum RT-PCR measurements28 (Fig. 1B, Methods). Unless stated, the parameter values for the simulations are those given in Table 1.

Over the course of a typical-length COVID-19 infection (23 days), our simulations show that viral variants with point mutations that increase the replication probability by 20–50% (selection coefficients of 0.2–0.5) expand significantly more than variants with neutral or weakly deleterious fitness effects (Fig. 1C). This expansion of fitter variants increases the probability that at least one viral particle with a specific beneficial mutation will be transmitted to a new host (Fig. 1D, Methods). Individuals are more likely to pass on beneficial variants if transmission occurs later in the infection, since the frequency of these variants increases over time due to selection. Assuming these variants are also able to spread through the population with a moderate transmission advantage, new lineages with advantageous single mutations are rapidly created at the population level (Fig. 1E). These results suggest that selection for beneficial single point mutations within COVID-19 patients increases the rate at which fitter SARS-CoV-2 lineages establish at a population level.

### Longer infection duration and higher viral load increase probability of transmitting fitter variants

Viral load dynamics vary significantly between COVID-19 patients28. To assess the evolutionary consequences of this variation, we simulated viral replication and transmission of variants for patients with different viral load kinetics (Fig. 2). Patients with longer periods of peak viral load were able to transmit fitter SARS-CoV-2 variants more efficiently (Fig. 2A). On the other hand, decreasing viral load decreased the probability that fitter variants generated within a patient would be transmitted (Fig. 2B). Both of these observations indicate that the strength of selection increases as the number of replicating viruses over the course of an infection increases.

### Beneficial two-mutation combinations are readily generated within patients with long-term SARS-CoV-2 infections

Many of the reported new SARS-CoV-2 variants are defined by more than one point mutation. One of these variants arose during long-term SARS-CoV-2 infection within an immunocompromised individual and included a mutation that conferred resistance to neutralizing antibodies but reduced infectivity, which was offset by a mutation that increased infectivity9. These observations suggest that highly-fit mutation combinations that require transit through a deleterious intermediate state (i.e., crossing a fitness valley) may be generated within COVID-19 patients with longer infection durations.

To investigate the rate at which these variants are generated within hosts, we modeled a multistep mutation process where beneficial mutation combinations are created from deleterious intermediates that only have some of the mutations found in the beneficial combination (Fig. 3A). Beneficial two-mutation combinations exist at very low frequencies over the timescale of a typical-length SARS-CoV-2 infection but increase in frequency within hosts that have prolonged infections (Fig. 3B). This increase in variant frequency that occurs within hosts due to selection corresponds to an increase in the probability the beneficial two-mutation variant will be transmitted (Fig. 3C). Therefore, patients with longer SARS-CoV-2 infections are more likely to transmit a variant with multiple mutations. On a population level, these individuals who produce and shed virus for long periods (> 30 days after symptom onset) increase the production rate of new variants with multiple mutations. Increasing the proportion of COVID-19 patients with prolonged SARS-CoV-2 shedding increases the rate at which new, fitter variants with two mutations are produced (Fig. 3D).

### SARS-CoV-2 evolution can be impeded by targeting stochastic events required for the emergence of new variants

Several steps are required for a new SARS-CoV-2 variant to be generated and established in the population (Fig. 4A), and each of these steps represents a potential choke point for viral evolution that can be exploited in the design of interventions. First, a new variant must be generated through mutation and expand within a host. The efficiency of generating and selecting advantageous variants within COVID-19 patients can be reduced by biomedical interventions that reduce viral load within patients or reduce the frequency of long-term infections which lead to viral transmission (Fig. 4B). Variants then must be transmitted to additional hosts to establish within the population. Reducing the number of viral particles transmitted to new hosts (for example, through mask wearing or vaccination) will also slow the rate of emergence of advantageous variants (Fig. 4B). Finally, interventions specifically aimed at reducing the transmissibility of the new variant would also reduce their probability of spreading widely (Fig. 4B). This could be achieved by deliberately deploying a patchwork of vaccines and other prophylactics that target distinct epitopes, thereby increasing the diversity of biomedical interventions to provide a more challenging evasion landscape for the virus. Taken together, our simulation results suggest that there are several low-probability stochastic events that are important for SARS-CoV-2 variant emergence and that interventions targeting these events can slow SARS-CoV-2 evolution.

## Discussion

The recent emergence of variants of SARS-CoV-2 with multiple mutations poses a direct threat to the viability of vaccine-based suppression strategies for the pandemic29. A number of phenotypic changes that are capable of providing a fitness advantage have been associated with newly emerged variants. These include increased transmissibility, lethality, and viral replication rate, all characteristics of the B.1.1.7 lineage5,30,31. Additionally, some variants have immune evading phenotypes, including increased reinfection potential (e.g., B.1.351)32, and reduced neutralization for monoclonal antibodies (e.g., B.1.1.7, P.2 and B.1.351)33,34, convalescent plasma (e.g., B.1.351, B.1.429)32,35, and vaccine-induced sera (e.g., B.1.351, B.1.429, B.1.1.7, B1.298, P.2, and P.1)36,37 or complete resistance to a vaccine (B.1.351)37 (see Supplementary Note and Supplementary Table S1 for further details). Some viral variants have also been associated with a longer duration of infection38, which is particularly concerning in light of our finding that longer infections increase the probability of transmitting fitter SARS-CoV-2 variants.

In this work, we investigated the impact of within-host SARS-CoV-2 mutation and selection on the emergence of new viral variants in the population. Using a stochastic computational model of viral replication and selection, we found that mutations that increased the rate of viral replication increased in frequency within hosts during the course of a typical SARS-CoV-2 infection. This expansion led to more frequent transmission of the new variant and faster emergence of the variant on a population level. The effect of selection within hosts is more pronounced for longer SARS-CoV-2 infections with higher viral load, which can lead to the generation of variants with multiple mutations within a single host during prolonged infection. While these general findings are in line with the intuition provided by evolutionary theory and viral quasispecies modeling39, our findings suggest that the expected frequency of these variant-generation events (using real-world parameter estimates) is likely to be high enough to represent a tangible public health threat. The risk to public health from long-term infections is not currently appreciated, as public-health authorities continue to use symptom-based strategies for testing and discontinuing transmission-based precautions40 even if the patient tests positive for SARS-CoV-2 RNA in a nasal swab41. Our work also points to a strategy for controlling viral evolution—the stochastic events required for variant generation represent a vulnerability from the virus’ perspective which can be exploited in the design of potential biomedical interventions. Thus, our quantitative modeling points to real-world practical risks that need to be mitigated as the pandemic drags on, and points out a way in which this can be done.

Our modeling approach has several limitations and assumptions that could be further investigated. When modeling intrahost viral evolution, we assumed that patients were initially infected with only wild-type virus. We did not account for genetic variation in SARS-CoV-2 already present in the population due to neutral genetic drift, which is another important source of advantageous viral mutations3. Our model did not account for variation in some parts of the viral replication cycle, including replication cycle length and burst size, which may represent additional stochastic events that could influence viral evolution. We also only modeled point mutations, which does not account for other possible genetic alterations that affect viral fitness (indels or rearrangements). Our population-level transmission model also does not consider the effects of spatial or demographic structure on SARS-CoV-2 spread. Finally, we only investigated the case where mutations that increase viral fitness within hosts also increases transmissibility on a population level. This hypothesis is consistent with the observation that increased viral load is associated with increased transmissibility42.

This work, along with other recent findings on the stochastic nature of SARS-CoV-2 transmission27, suggests a number of real-world strategies that can be used to suppress the emergence of these fitter viral variants, thereby slowing the evolution of SARS-CoV-2. First, long-term SARS-CoV-2 infections (lasting longer than 30 days) should be treated as a serious public health concern, regardless of the presence of symptoms. Our work suggests that the low frequency of such long-term cases belies the threat that they pose to public health. In particular, patients with asymptomatic long-term infections still transmit10,11,12,13 and can potentially be efficient accelerators of viral evolution. Thus, patients testing positive for SARS-CoV-2 by RT-PCR after 30 days should be tested for viable virus, and if found to remain infectious, should isolate. Isolation measures for such long-term infections should be designed to minimize transmission in the interest of minimizing their disproportionate impact on the pace of viral evolution. This would represent a departure from current practice (the CDC, for example, does not require negative testing prior to ending isolation for mild and asymptomatic cases of COVID-1943). Second, treatments for long-term viral infections should be aimed at the suppression of viral load and should be initiated regardless of whether patients are symptomatic. Such treatment regimens should consist of multiple active agents that can suppress viral replication and that preferably target different viral proteins other than Spike to reduce the risk of generating viral mutants that are resistant to the treatment. This shift in medical practice may reduce the risk of intrahost evolution leading to emergence of immune-evading SARS-CoV-2 variants, as likely occurred in some documented cases with convalescent plasma treatment9. Finally, contact tracing of transmission events and genetic characterization of secondary infections resulting from long-term infections (including those from immunosuppressed patients) will be crucial.

We note that detecting long-term infections with SARS-CoV-2 represents a major logistical challenge in its own right—a widely deployable method for distinguishing long-term infections from spurious PCR results is an unmet need at present. While PCR positivity tracks closely with infectious viral particles at the beginning of infection, the presence of neutralized virus at later timepoints make PCR positivity difficult to interpret as a marker of infectiousness44. Although viral culture can be used to confirm the presence of replication-competent virus45, logistical constraints (such as turnaround time and the requirement for Biosafety Level 3 containment) limit its clinical use46. Subgenomic RNA (sgRNA) has been proposed as a biomarker for actively replicating virus47, but this remains controversial at present48. Therefore, there is an urgent need to develop a validated biomarker for viral replication in long-term infections. Such a biomarker will also enable us to better measure the frequency of transmissible long-term infections and differentiate them from long-term COVID-19 sequelae in patients without infectious virus.

Our work also suggests reasons for concern if intrahost evolution in long-term SARS-CoV-2 infections is not addressed directly. As intrahost evolution leads to faster generation of beneficial variants at a population level, new waves of transmission driven by fitter viral variants can be expected to arise as a result of untreated long-term infections. Our results also suggest another evolutionary incentive favoring increased infection duration, since long-term SARS-CoV-2 infections increase the rate at which the virus evolves. To the extent that this evolution is shaped by immune evasion (and immune evasion is expected to result in more severe infections), this study identifies an additional evolutionary route that is open to SARS-CoV-2, by which the virus can evolve to become progressively more lethal. Notably, while this potential risk conflicts with recent work that has conjectured that SARS-CoV-2 evolution may lead to progressive decreases in viral virulence49, it is consistent with available real-world evidence for this virus so far30. The evolution of increased virulence over time has been observed for several other viral pandemics, including for HIV over the years 1984–201050, the second wave of the 1918–1919 Influenza pandemic51, and myxomatosis in rabbits in the 1970s and 80 s52,53,54.

The deployment of multiple vaccines against SARS-CoV-2 in under a year is a remarkable triumph of modern medicine. These vaccines hold the promise of bringing the current pandemic to an end but are vulnerable to the rapid emergence and expansion of immune-evading viral variants. Understanding the mechanism of SARS-CoV-2 evolution allows us to design strategies that can tip the balance in this evolutionary arms race and ultimately allow us to control the spread of SARS-CoV-2.

## Methods

### Intrahost viral dynamics simulations

We used a modified Wright-Fisher evolutionary model to investigate viral mutation and selection dynamics within individuals with COVID-19. This model assumes that SARS-CoV-2 virions replicate in discrete generations of length tg and that each replicating virion produces a fixed burst size of new virions Nb. The total number of virions present at each generation was estimated from previously-published nasopharyngeal swab and sputum qRT-PCR measurements after diagnosis28. This estimated viral load curve over time (shown in Fig. 1B) was used for the simulations shown in Fig. 1. In Figs. 2, 3 and 4, we used the original viral load curve as a baseline and varied the length of the peak viral load period and/or scaled the total number of virions in each generation (Fig. 2).

During each replication cycle, Nt/Nb virions from the previous generation are chosen to reproduce, where Nt is the total number of virions in the current generation and Nb is the burst size. Each virion from the previous generation has a probability of reproducing and contributing Nb virions to the next generation that is proportional to its fitness. During each replication, there is a constant probability of generating a particular mutation that will be present in all Nb virions that are produced by that parent virion. Within each simulation condition, mutations were assumed to provide the same constant fitness benefit or penalty compared to the wild-type virion fitness, provided as the intrahost selection coefficient value in the figures and figure captions. Parameter values common to all simulations are given in Table 1, and additional parameter values specific to individual conditions are given in the figure captions. Each condition was simulated 1000 times.

### Probability of variant transmission from a single infected host

The probability that a transmission event that includes at least one mutant virion occurs within a particular time window during infection was estimated by assuming new infections are initiated by Ntrans virions sampled uniformly at random from the transmission period which go on to establish a new infection. Therefore, the probability that at least one virion with a specific mutation is passed on is $$1 - (1 - f)^{{N_{trans} }}$$, where f is the fraction of virions with the mutation that existed within the transmission window.

### Population-level de novo generation rate of SARS-CoV-2 variants

A newly-generated variant present in a single patient will initially spread stochastically to new hosts, the success of which will determine whether the variant will survive in the population. We modeled the stochastic spread of a new variant within the population as a branching process in which infected individuals infect on average R0 new hosts. More specifically, each infected individual transmits to X new hosts, where X is a random variable distributed as Poisson(R0). Probability theory results for Poisson branching processes states that if a new variant has R0 > 1, it will survive in the population with probability π, where π is the smallest solution to $$\pi = 1 - e^{{ - R_{0} \pi }}$$ in the interval [0, 1]55. This model assumes the population is well-mixed, without any population or geographic structure affecting COVID-19 transmission.

We combined this estimated survival probability of an existing variant in the population with the results from the intrahost simulations described above to calculate the overall rate at which new surviving variants arise. The rate at which new surviving variants are generated from patients with infections of length i is πσipi, where pi is the probability of a patient with infection length i generating and transmitting the variant and σi is the number of patients with infection length i that are infected each day. The total rate at which new surviving variants are generated in the population is just the sum over all infection lengths, $$\sum\nolimits_{i} \pi \sigma_{i} p_{i}$$. To estimate σi, we assumed that most infections were of the typical length and viral load kinetics shown in Fig. 1B. However, a small fraction pLT of infections were assumed to be longer than 30 days. The distribution of infection lengths in these long-term patients was assumed to be normal, and mean and standard deviation were estimated from longitudinal oral swab RT-PCR data56. The rate zLT at which new surviving variants are generated from long-term patients was estimated by simulating infections that are 1, 2, 3, 4, 5, 10, or 15 weeks longer than typical infections to estimate pi for each infection length and multiplying by the fraction of long-term infections within the length interval ending at i. That is, $$z_{LT} = \sum\nolimits_{i} \pi \sigma_{i} p_{i}$$ for i in {30, 37, 44, 51, 58, 95, 128} days, where σi = P(X < i) − σi-1 and X is normally distributed according to the distribution of infection lengths. The total rate at which new surviving variants are generated in the entire population is therefore zLT + zST, where zST is the rate at which standard length infections produce the variant.

Interventions that reduce the rate at which new variants are generated were modeled by reducing the fraction of long-term infections (reduced from 0.1% to 0.01%), number of virions transmitted during each infection event (reduced 10 to 1), overall viral load for each infection (reduced by 90% from the baseline curve given in Fig. 1B), or the transmission rate of the variant in the simulations (reduced from 1.5 to 1.05).