Main

The emergence of highly divergent variants of SARS-CoV-2 has been a defining feature of the COVID-19 pandemic. Although the evolutionary origins of these variants are still a matter of speculation, multiple pieces of evidence point to chronic persistent infections as their most likely source5,7,15. In particular, infections in immunocompromised patients who cannot clear the virus may lead to persistence for months6,7,16,17 or even years8,18 before potentially seeding new outbreaks in the community3. Persistence of SARS-CoV-2 during chronic infections exposes the viral population to host immune responses and other selective pressures as a result of treatments over prolonged periods of time. Persistent infections also release the virus from undergoing the tight population bottlenecks that are characteristic of SARS-CoV-2 transmission19,20, making the viral population less vulnerable to stochastic genetic drift and allowing it to acquire more evolutionary changes over a longer timescale. These adaptive intra-host changes can lead to elevated evolutionary rates, particularly in key regions of the spike protein (encoded by S) that are often associated with immune escape and increased rates of transmission13,14.

Despite the substantial public health implications of persistent infections, uncertainty still surrounds how common these infections are among the general population, how long they last, their potential for adaptive evolution and their contribution to long COVID.

In this work, we used genetic, symptom and epidemiological data from the Office for National Statistics COVID Infection Survey (ONS-CIS)21, a large-scale community-based surveillance study carried out in the UK. We identified individuals with high-titre SARS-CoV-2 samples spanning 1 month or more and representing the same viral population. We have provided several lines of evidence suggesting that these individuals are persistently infected with replicating virus, and hence refer to these as persistent infections; however, the presence of non-replicating SARS-CoV-2 RNA cannot be categorically ruled out in all cases. We characterized various aspects of viral dynamics during these persistent infections, including evolutionary changes in the virus, RNA viral titre kinetics (hereafter referred to as viral load), number of reported symptoms and prevalence of long COVID, the last in comparison with individuals without identified persistent SARS-CoV-2 infection.

Identifying persistent infections

We considered 93,927 high-quality sequenced samples from the ONS-CIS collected between 2 November 2020 and 15 August 2022, and representing 90,146 people living in 66,602 households across the UK (see Extended Data Fig. 1). Households representative of the UK population were recruited in the survey using a rolling recruitment strategy. Most participating individuals (approximately 98%) were sampled once a week for the first 4 weeks of their enrolment, and then approximately monthly thereafter, regardless of symptoms or testing history. To identify persistent infections, we first limited the dataset to individuals with two or more PCR with reverse transcription (RT–PCR)-positive samples in which sequencing was attempted, each with a cycle threshold (Ct) value ≤ 30 (Ct being a proxy for viral load), taken at least 26 days apart, and where the consensus sequences belonged to the same major lineage: B.1.1.7 (hereafter referred to as Alpha), B.1.617.2 (hereafter referred to as Delta), or one of the two Omicron lineages BA.1 or BA.2 (BA.4, BA.5 and XBB were not considered). This included a total of 500 individuals (18 Alpha, 122 Delta, 130 BA.1 and 230 BA.2) with two or more sequences of the same major lineage (including those with at least one undetermined lineage; see Extended Data Table 1). If sequences from the same individual also shared the same rare single-nucleotide polymorphisms (SNPs) at one or more sites relative to the major-lineage population-level consensus, we classified the individual as having a persistent infection. Because we used sequence data to identify persistent infections, we could only identify persistent infections with at least two high viral load (Ct ≤ 30) samples.

We defined a rare mutation for a given lineage as one observed in 400 or fewer samples of that lineage within the ONS-CIS dataset, giving a false-positive rate of identifying persistent infections of 0–3% depending on the major lineage (see Methods; Extended Data Fig. 2). We note that the rare SNP method provides a conservative estimate for the true number of persistent infections, as some persistent infections may not have rare mutations. To evaluate the robustness of our method for identifying persistent infections, we considered the phylogenetic relationship between the sequences from persistent infections relative to other sequences of the same major lineage that belonged to individuals with only a single sequence within the ONS-CIS dataset. The great majority of sets of sequences identified as belonging to the same persistent infection formed monophyletic groups with strong bootstrap support (Fig. 1a and Extended Data Fig. 3). However, seven sequences did not group with the other sequence (or sequences) from the same persistent infection. All of these had high Ct values (Ct ≈ 30) and low genome coverage, which could explain their lack of clustering on the phylogeny as lower-coverage sequences are more likely to lack information at lineage-defining sites and they may be more prone to errors when calling the consensus19. In particular, two of these sequences were collected at intermediate time points of two persistent infections where the first and last sequences of each persistent infection do cluster on the phylogeny, whereas the sequence at the intermediate time point does not.
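The rare-SNP rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the input format (each sample represented as a set of `(position, base)` differences from the major-lineage consensus) and the function names `rare_snps` and `share_rare_snp` are hypothetical; only the 400-sample threshold comes from the text.

```python
from collections import Counter

RARE_MAX_SAMPLES = 400  # threshold stated in the text


def rare_snps(lineage_samples, max_samples=RARE_MAX_SAMPLES):
    """Return the SNPs seen in at most `max_samples` samples of one lineage.

    `lineage_samples` maps a sample ID to the set of (position, base)
    differences from the major-lineage consensus (hypothetical format).
    """
    counts = Counter(snp for snps in lineage_samples.values() for snp in snps)
    return {snp for snp, n in counts.items() if n <= max_samples}


def share_rare_snp(snps_a, snps_b, rare):
    """True if two sequences share at least one rare SNP."""
    return bool(snps_a & snps_b & rare)
```

A pair of sequences from the same individual would then be a candidate persistent infection when `share_rare_snp` returns `True` for the two samples.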

Fig. 1: Individuals identified with persistent SARS-CoV-2 and reinfections with the same major lineage within the ONS-CIS.

a, Phylogenetic relationship between samples from individuals with persistent SARS-CoV-2 RNA (hereafter referred to as persistent infections) (top), and reinfections with a representative background population of Alpha (B.1.1.7; see Extended Data Fig. 3 for the analysis on the other three major lineages) (bottom). The dashed lines connect every pair of sequences from the same individual. Pairs from individuals with persistent infections cluster closely together, whereas reinfections do not. All sequences from the same individual are given the same colour. b, Number of days between the earliest and latest genomic samples from persistent infections and reinfections. Each point represents a single individual. The solid vertical lines show the 26-day and 56-day cut-offs. The numbers on the side of each bar show the total counts per category for each major lineage. c, Total number of sequences in the ONS-CIS per major lineage over time. d, Timing of persistent infections (black) during the UK epidemic. Some individuals with persistent infections can be identified up to weeks after the lineage has been replaced at the population level. The coloured boxes indicate the interquartile range, which spans from the 25th to the 75th percentile, with the centre being the median calendar date corresponding to each major lineage. The medians for Alpha, Delta, BA.1 and BA.2 are 13 January 2021, 16 October 2021, 20 January 2022 and 30 March 2022, respectively. The extremities (displayed as grey horizontal lines) denote the minimum and maximum values within each category. The coloured numbers on the side of each box show the total number of sequences within the ONS-CIS for each major lineage. The black numbers represent the total number of sequences from persistent infections corresponding to each major lineage.


We identified 381 persistent infections with sequences spanning at least 26 days (11 Alpha, 106 Delta, 97 BA.1 and 167 BA.2). The relatively low number of persistent infections that we identified for Alpha is probably because fewer individuals were infected with Alpha than the other major lineages, but also because a smaller proportion of positive samples with Ct ≤ 30 were sequenced before December 2020, which captures the beginning of the Alpha wave, than after this date (see supplementary figure S1 in ref. 22). Of all the persistent infections that we identified, 54 spanned at least 56 days (3 Alpha, 13 Delta, 15 BA.1 and 23 BA.2). This represents nearly 0.07% (54 of 77,561) of all individuals with one or more sequences (with Ct ≤ 30) of the four major lineages that we investigated in this study (Fig. 1b; see also Table 1). Of note, 2 Alpha, 19 Delta and 8 BA.1 persistent infections were sampled weeks after the corresponding major lineage had dropped to a frequency of 1% or less (Fig. 1c,d); the longest infection was with BA.1 and lasted for at least 193 days (see Fig. 1b).

The actual duration of persistent infections is likely to be at least 3–4 days longer than the time between when the first and last sequenced samples were collected, as it typically takes 3–4 days since the start of infection for viral loads to be sufficiently high to be sequenced (Ct ≤ 30)23,24 and, similarly, viral loads will be too low (Ct values too high) to sequence at the tail end of infection. As individuals were typically sampled weekly during the first 4 weeks of enrolment, followed by approximately monthly sampling thereafter, it is unsurprising that most persistent infections had observable durations clustering around 30 or 60 days (Extended Data Fig. 4).

Identifying reinfections

We considered a pair of sequences from the same individual to indicate a reinfection with the same major lineage if they were sampled at least 26 days apart, had at least one consensus nucleotide difference between the sequenced sampling time points and shared no rare SNPs (see Methods). This criterion may overestimate the true number of reinfections with the same major lineage as some persistent infections may not have a rare SNP, and within-host evolution can lead to the loss of a rare SNP and/or the gain of other mutations leading to differences in the consensus sequence between the samples. We cannot rule out samples being attributed to the wrong individuals, which would also overestimate the true number of reinfections, although we took several measures to control for sample mix-ups (see Methods). We identified three individuals for which pairs of sequences from different sampling time points had no identical rare SNPs and at least one consensus difference, but whose viral load trajectories were consistent with a persistent chronic infection. We therefore excluded these individuals from the reinfection group (Extended Data Fig. 5).
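The decision rule for a pair of same-lineage sequences from one individual can be written out as a short function. This is a sketch under assumptions rather than the study's implementation: the set-of-SNPs representation, the function name `classify_pair` and the labels for pairs that meet neither definition are hypothetical. The additional checks described in the text (viral load trajectories, controls for sample mix-ups) are not modelled.

```python
def classify_pair(days_apart, snps_first, snps_last, rare, min_days=26):
    """Classify two same-lineage sequences from one individual.

    `snps_first` and `snps_last` are sets of (position, base) differences
    from the major-lineage consensus; `rare` is the set of rare SNPs for
    that lineage (hypothetical input format).
    """
    if days_apart < min_days:
        return "not considered"          # sampled too close together
    if snps_first & snps_last & rare:
        return "persistent infection"    # at least one shared rare SNP
    if snps_first != snps_last:
        return "reinfection"             # no shared rare SNP, >= 1 consensus difference
    return "undetermined"                # identical consensus but no rare SNP
```

The final branch makes explicit why the rare-SNP method is conservative: a persistent infection without any rare SNP cannot be confirmed by this rule.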

Overall, we identified 60 reinfections with the same major lineage (7 Alpha, 11 Delta, 14 BA.1 and 28 BA.2; Table 1). Sequences from individuals identified as reinfected, collected at the point of primary infection and reinfection, did not form monophyletic groups and mostly belonged to distantly related subclades, which supports our method for identifying reinfections (Fig. 1a and Extended Data Fig. 3).

Table 1 Number of persistent infections and reinfections per major lineage

Of all the cases classed as either persistent infections or reinfections with the same major lineage, 9–39% were classed as reinfections (Table 1), rising to 12–50% if only samples collected at least 56 days apart were included (Fig. 1b). This suggests that for Delta, BA.1 and BA.2, the number of individuals reinfected with the same major lineage is low compared with the number of individuals with persistent infection. Alpha seems to be an exception with over one-third of cases classed as reinfection for samples that were 26 days or more and half of cases for samples that were 56 days or more apart. This may be because of the lower number of Alpha samples sequenced, but other factors such as the timing of vaccination roll out could also have contributed.
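The 9–39% range for samples at least 26 days apart can be reproduced from the per-lineage counts quoted in the text. The snippet below is only an arithmetic check; the dictionary and function names are illustrative, and the 56-day range is not recomputed because the per-lineage reinfection counts at that cut-off are not given in the text.

```python
# Per-lineage counts quoted in the text (samples at least 26 days apart)
persistent = {"Alpha": 11, "Delta": 106, "BA.1": 97, "BA.2": 167}
reinfection = {"Alpha": 7, "Delta": 11, "BA.1": 14, "BA.2": 28}


def reinfection_share(lineage):
    """Fraction of classified cases (persistent + reinfection) that are reinfections."""
    r = reinfection[lineage]
    return r / (r + persistent[lineage])


for lineage in persistent:
    print(f"{lineage}: {100 * reinfection_share(lineage):.1f}%")
```

Delta gives the lower end of the range (about 9%) and Alpha the upper end (about 39%).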

Periods of stasis at the consensus level

Of the 381 persistently infected individuals that we identified, 68% (259 of 381) displayed no nucleotide differences at the consensus level during infection. By contrast, when we determined the number of consensus nucleotide differences between 16,000 random pairs of sequences from the ONS-CIS, and with each pair from the same major lineage, only 6 pairs had no consensus differences (Extended Data Fig. 6). This provides further support that the sequences that we identified from persistent infections belong to the same infection.
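Counting consensus nucleotide differences between two aligned sequences is a simple site-by-site comparison. The sketch below assumes that both consensus sequences are already aligned to the same reference coordinates and that ambiguous or missing sites (`N` and `-`) are skipped; both choices, and the function name, are assumptions for illustration and may differ from the study's handling (deletions, for example, are ignored here).

```python
MISSING = set("N-")  # treated as missing data (an assumption of this sketch)


def consensus_differences(seq_a, seq_b):
    """Number of sites at which two aligned consensus sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in MISSING and b not in MISSING
    )
```

Applied to the first and last sequences of an infection, a result of zero corresponds to the stasis described above.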

The lack of consensus changes between many pairs of samples taken from the same infection, most of which are less than 2 months apart, is consistent with neutral evolution or weak selection, and indicates that there was limited within-host adaptation. In support of this, we identified 17 persistent infections with three or more sequences, of which the first two sequences (typically about 1 month apart) had zero consensus differences, but, crucially, 41% (7 of 17) gained a consensus change later in the infection. This suggests that the virus evolves measurably at the consensus level as time progresses since the onset of infection. However, shifting populations of RNA-producing cells (and sampling differences) could also potentially lead to differences in the consensus between different time points in the absence of ongoing replication. Among the remaining 59% (10 of 17) with no consensus change throughout infection, we often found substantial sub-consensus activity with intra-host single-nucleotide variant frequencies going up to high frequencies (approximately 40%) and returning to below 5% at a later time point, indicating that the virus population is probably replicating during infection despite acquiring no consensus change (Extended Data Fig. 7a,b).

A strong signal for positive selection

Despite long periods with little or weak positive selection, we also found evidence for positive selection. Among the 381 persistently infected individuals, we observed 317 changes in the consensus nucleotide representing 277 unique mutations and 31 deletions representing 18 unique deletions. Many of these mutations have previously been identified as either lineage-defining mutations for variants of concern or variants of interest25 (8 mutations and 2 deletions), recurrent mutations in immunocompromised individuals12,13,14 (15 mutations and 4 deletions) or key mutations with antibody escape properties and target sites for various different monoclonal antibodies11,26 (7 mutations) (see Source Data Fig. 2 and Table 2).

Table 2 Recurrent mutations and deletions identified during persistent SARS-CoV-2 infections

Several of the consensus changes that we observed were at the same genomic positions in multiple individuals. For example, three individuals infected with BA.2 from different households acquired a mutation at codon position 547 in the spike protein (Fig. 2), two of which were the T547K mutation, which is a lineage-defining mutation for BA.1, and one the K547T mutation (Table 2; also see Source Data Fig. 2). Twelve individuals acquired a deletion (open reading frame (ORF) 1ab (ORF1ab): Δ81–87) in the NSP1-coding region. A similar deletion has previously been observed during the chronic infection of an immunocompromised individual with cancer16 and has also been associated with lower type I interferon response in infected cells27.

Fig. 2: Distribution of SNPs and non-synonymous versus synonymous mutations detected in individuals with persistent SARS-CoV-2.

a, Number of mutations that resulted in a consensus change identified in one or more individuals with persistent SARS-CoV-2 RNA (hereafter referred to as persistent infections). E, envelope protein; M, membrane protein; N, nucleocapsid protein. b, Number of synonymous (blue) and non-synonymous (orange) mutations per site during persistent infections. The numbers above each column show the total counts of consensus changes in each category of mutations. c, Distribution of consensus differences per site between sequences from all persistent infections. Nearly 65% of all pairs of sequences from the same infection (corresponding to 70% of persistent infections) had zero consensus differences and most others had below 0.0004 differences per site. The inset shows the remaining pairs with a high number of consensus differences.


Overall, we observed a strong signal for positive selection in S, with nearly ninefold more non-synonymous than synonymous mutations (Fig. 2b). With a total of 13 non-synonymous mutations, ORF8 had the highest per base (0.036 per base) number of non-synonymous mutations, followed by S with 61 non-synonymous mutations (0.016 per base). The high number of non-synonymous mutations in ORF8 may be due to premature stop codons scattered along ORF8, meaning the downstream non-synonymous mutations are released from negative selection28.

Frequently observed mutations

We determined the number of times each of the consensus change mutations that we observed during persistent infections appeared on representative global and English phylogenies, and compared this with the number of times any mutations observed on the phylogenies occurred (see Methods; Extended Data Fig. 8a,b). In general, mutations that emerged during persistent infection appeared more frequently on the global and English phylogenies, and mutations that emerged multiple times during persistent infection appeared more frequently still (Extended Data Fig. 8c).

Mutations leading to consensus change during persistent infections also tended to be more beneficial at the population level, where here fitness is defined by their ability to spread among individuals29, than other mutations found in the global phylogenies of B.1.1.7, B.1.617.2, BA.1 and BA.2 (Extended Data Fig. 8d). Moreover, mutations observed to appear in multiple persistent infections tended to have a stronger positive fitness effect than those observed in only a single persistent infection (Extended Data Fig. 8e). This indicates that mutations that are selected during persistent infections also tend to be better at transmitting between individuals. Of note, however, are two mutations that emerged twice during persistent infections and were mildly deleterious based on the global phylogeny. These were T1638I (also known to be recurrent in immunocompromised individuals13) and T4311I in ORF1ab. This suggests that these mutations may be beneficial at the within-host level, at least in some individuals, but deleterious at the between-host level; however, it is important to recognize that the ability of immune escape mutations to spread among individuals could change through time due to the changing immune landscape of the population.

One BA.1 persistent infection particularly stood out. This infection lasted for at least 133 days during which 33 unique mutations (23 mutations in ORF1ab, 6 in S, 1 in ORF3a, 1 in M (encoding the membrane protein) and 2 in ORF7) were observed (Extended Data Figs. 3 and 7c); 11 of the ORF1ab mutations and all of the mutations in S, ORF3a and ORF7 were non-synonymous. Contamination could be ruled out because intra-host single-nucleotide variants were shared across multiple time points (Extended Data Fig. 7c), and co-infection is unlikely as we could not identify a likely co-infecting variant after examining all of the ONS-CIS sequences. Given the mutational signature from this persistent infection with 17 G-to-A mutations and very few C-to-A mutations, it is possible that these mutations are induced after a molnupiravir treatment30.

Persistence with rebounding viral load

Of the 381 persistent infections, 67 had three or more RT–PCR tests taken over the course of their infection. We classified these infections as ‘persistent rebounding’ if they had a negative RT–PCR test during the infection (n = 20) and the rest as ‘persistent chronic’ (n = 47) (Fig. 3a,b). Given the weekly or approximately monthly sampling of individuals enrolled in the ONS-CIS, infections classed as persistent chronic may have unsampled periods of very low viral load, meaning the persistent-rebounding category is likely to be an underestimate.
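The split into ‘persistent rebounding’ and ‘persistent chronic’ depends only on whether a negative RT–PCR result falls between the first and last positive tests. A minimal sketch of that rule follows; the input format (a chronological list of booleans, `True` for a positive test) and the function name are assumptions made for illustration.

```python
def classify_persistent(results):
    """Classify one persistent infection from its chronological RT-PCR results.

    `results` is a list of booleans in sampling order (True = positive).
    Only tests between the first and last positive count as being
    'during the infection'.
    """
    if True not in results:
        return "insufficient data"
    first = results.index(True)
    last = len(results) - 1 - results[::-1].index(True)
    window = results[first:last + 1]
    if len(window) < 3:
        return "insufficient data"       # fewer than three tests during infection
    return "persistent rebounding" if False in window else "persistent chronic"
```

Because sampling was weekly or approximately monthly, an infection labelled ‘persistent chronic’ by this rule may still have had an unsampled negative period.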

Fig. 3: Comparison of RNA viral load dynamics and the number of reported symptoms in individuals with persistent SARS-CoV-2 and reinfections with the same major lineage.

a–c, RNA viral load trajectories of individuals with persistent SARS-CoV-2 RNA (hereafter referred to as persistent infections) with rebounding (that is, a negative RT–PCR test during the infection) (purple; a) and chronic persistent viral load (purple; b) and reinfections with at least three PCR tests taken over the course of infection or until reinfection (cyan; c). For a–c, only individuals with three or more RT–PCR tests during the course of infection were included. d,e, Change in Ct value (d) and total number of symptoms reported between the first and last time points (e) with sequenced samples for all 381 persistent infections and 60 reinfections.


Nonetheless, the observation of rebounding viral load dynamics in nearly 30% of cases (20 of 67) is striking given that, in the absence of genetic information, they could have been misidentified as reinfections, depending on the definition used. Of the 27 cases identified as reinfections with three or more RT–PCR tests, all showed rebounding viral load dynamics (Fig. 3c). Also striking is that persistent-chronic infections often showed similar dynamics; of the 47 infections classed as persistent chronic, 35 had a low viral load (high Ct) test between two high viral load (low Ct) tests. Overall, 55 of the 67 persistent infections (82%) for which we had sufficient data showed a resurgence in viral load after an initial drop (Extended Data Fig. 5a). These rebounding viral load dynamics support the presence of replicating viruses during these infections. Several studies have also found a strong correlation between high viral load samples (similar to those that we observed here) and the presence of viable SARS-CoV-2 in viral cultures24,31,32,33, which further supports the interpretation that these samples contain replication-competent virus. However, variation in viral load between samples may also occur for reasons unrelated to the presence of replication-competent virus, such as variation in measured Ct values with respect to time and quality of sampling33.

As the sampling strategy of ONS-CIS is based on testing representative individuals across the UK regardless of symptoms, we can estimate the percentage of SARS-CoV-2 infections that are persistent and last for longer than 60 days in the general population. This requires making assumptions about how many persistent infections are missed among ONS-CIS participants due to the approximately monthly (and weekly) sampling. More precisely, estimating the proportion of infections that are persistent depends on the proportion of days the infection has sequenceable virus during the infection (would have Ct ≤ 30 if tested); the fewer the number of days the infection has sequenceable virus, the more likely it is that a persistent infection is missed. By taking two extreme scenarios for the proportion of days that the virus is sequenceable during persistent infection (0.7 and 0.14; see Methods), we estimate that approximately 0.7–3.5% and 0.1–0.5% of infections become persistent for more than 30 and 60 days, respectively.
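The quoted ranges are consistent with a simple scaling of the observed counts by the assumed proportion of sequenceable days. The sketch below makes that arithmetic explicit. It assumes that the correction is a plain division and that the observed categories of at least 26 and at least 56 days stand in for the 30-day and 60-day thresholds; the calculation in the Methods may be more detailed, and the variable and function names are illustrative.

```python
N_SEQUENCED = 77_561  # individuals with one or more sequences (Ct <= 30), from the text
OBSERVED = {"at least 26 days": 381, "at least 56 days": 54}
SEQUENCEABLE_FRACTIONS = (0.7, 0.14)  # the two extreme scenarios in the text


def corrected_percentage(n_observed, fraction_sequenceable, n_total=N_SEQUENCED):
    """Observed prevalence (%) scaled up by the assumed chance of detection."""
    return 100 * n_observed / n_total / fraction_sequenceable


for label, n in OBSERVED.items():
    low, high = (corrected_percentage(n, f) for f in SEQUENCEABLE_FRACTIONS)
    print(f"{label}: {low:.2f}% to {high:.2f}%")
```

Under these assumptions the results are roughly 0.7% to 3.5% for the shorter cut-off and 0.1% to 0.5% for the longer one, matching the ranges stated above.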

Difference in viral load and symptoms

For the majority of persistent infections, Ct values (inversely proportional to viral load34) were higher at the last sequenced time point than at the first sequenced time point (Fig. 3d), with the Ct value being more than +6.7 (interquartile range (IQR) +3.2 to +10.2) units higher at the last time point (two-sided paired Student’s t-test P = 2 × 10^−9). For reinfections with the same major lineage, the last sequenced sample also had higher Ct values than the first, but the magnitude of the difference was smaller than for persistent infections (Fig. 3d), with only +2.5 (IQR −1.1 to +7.4) units difference between primary infection and reinfections (two-sided paired Student’s t-test P = 0.0003). In both cases, the rise in Ct value (decrease in viral load) during infections or between reinfections could be a consequence of host immunity or within-host compartmentalization. In addition, the rise in Ct for reinfections could be due to the disproportionate sampling of individuals with older infections, which tend to have lower viral loads, towards the end of an epidemic wave35,36.

Individuals with persistent infections remained largely asymptomatic during the later stages of infection, reporting on average two fewer symptoms in the preceding 7 days at the last time of sampling (at which a sequence was obtained) than the first time of sampling, with a median of 1 (IQR 0–4) fewer reported symptoms (two-sided paired Wilcoxon P = 5 × 10^−30). They also consistently reported very few or no symptoms after the first positive sample (Fig. 3e). In comparison, individuals reinfected with the same major lineage reported on average only one fewer symptom at the reinfection sampling time point than at the primary sampling time point (Fig. 3e), with a median of 0 (IQR 0–3) fewer reported symptoms (two-sided paired Wilcoxon P = 0.005). In addition, the proportion of individuals reporting more symptoms at the last sampling is higher among the reinfections than among the persistent infections.

Prevalence of long COVID

From February 2021, as well as reporting symptoms, participants were asked whether they would describe themselves as having long COVID and whether they were still experiencing symptoms more than 4 weeks after they first had COVID-19 (see Methods). We estimated the prevalence of self-reported long COVID in individuals with persistent infection compared with individuals with non-persistent infection, accounting for several confounding variables (see Methods). In the persistent infection group, 9.0% of respondents (32 of 356) self-reported long COVID at their first visit 12 weeks or longer since the start of infection, and 5.8% (19 of 326) reported long COVID at 26 weeks or longer. By comparison, in the non-persistently infected group, only 5.4% (4,291 of 78,902) reported long COVID at their first visit 12 weeks or longer, and 4.1% (3,000 of 72,608) reported long COVID at 26 weeks or longer.

Correcting for confounders, we found strong evidence for a 55% higher odds of reporting long COVID at 12 weeks or more post-infection among individuals with persistent infection than individuals with non-persistent infection (P = 0.004 for the unadjusted model; P = 0.021 for the adjusted model), but no evidence of a difference for long COVID at 26 weeks or more post-infection (P = 0.127 for the unadjusted model; P = 0.367 for the adjusted model) (Table 3). The lower probability of reporting long COVID 26 weeks post-infection than at 12 weeks post-infection could be because the majority of the persistent infections that we identified lasted for less than 3 months, and hence persistence of an infection may no longer be a contributing factor to long COVID beyond 3 months.
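For orientation, the crude (unadjusted) odds ratios can be computed directly from the raw counts quoted in the text. This is an illustrative calculation only: the function names are hypothetical, and the 55% figure comes from the model that corrects for confounders, so the crude values below are not expected to match it.

```python
def odds(cases, total):
    """Odds of the outcome: cases divided by non-cases."""
    return cases / (total - cases)


def crude_odds_ratio(cases_exposed, n_exposed, cases_ref, n_ref):
    """Unadjusted odds ratio, with the reference group in the denominator."""
    return odds(cases_exposed, n_exposed) / odds(cases_ref, n_ref)


# Persistent infection versus non-persistent infection (reference), counts from the text
or_12_weeks = crude_odds_ratio(32, 356, 4_291, 78_902)
or_26_weeks = crude_odds_ratio(19, 326, 3_000, 72_608)

print(f"Crude odds ratio, 12 weeks or more: {or_12_weeks:.2f}")
print(f"Crude odds ratio, 26 weeks or more: {or_26_weeks:.2f}")
```

Both crude ratios are above 1, and the ratio at 12 weeks or more is larger than the ratio at 26 weeks or more, in line with the weaker evidence reported for the later time point.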

Table 3 Prevalence of long COVID in individuals with persistent infection. Individuals with non-persistent infections are set as reference for odds ratio calculations

Discussion

We developed a robust approach for identifying persistent SARS-CoV-2 RNA in individuals with sequenced samples spanning 1 month or longer. Evidence suggests that these represent persistent infections; however, persistence of non-replicating viral RNA cannot be categorically ruled out in all cases. Because viral genetic data are needed to confirm persistent infection, we can only identify persistent infections in individuals with at least two high viral load (Ct ≤ 30) samples. Given this, the number of persistent infections that we identified should be considered a lower bound. Of the 381 persistent infections that we identified among participants of the ONS-CIS, 54 lasted at least 2 months and two over 6 months; in some cases, the infecting lineage had gone extinct in the general population. By contrast, we only identified 60 reinfections by the same major lineage as the primary infection, suggesting that immunity to the same variant remains strong after infection, at least until the lineage has gone extinct (Table 1).

The large number of persistent infections that we uncovered is striking, given the leading hypothesis that many of the variants of concern emerged wholly or partially during long-term chronic infections in immunocompromised individuals1. As the ONS-CIS is a community-based surveillance study, our observations suggest that the pool of people in which long-term infections could occur, and hence potential sources of divergent variants, may be much larger than generally thought. However, we do not know whether the individuals with persistent infection that we identified have other health conditions that may make them more susceptible to these long infections. We estimate that 1 in 1,000 of all infections, and potentially as many as 1 in 200, may become persistent, with intermittent high viral loads, for at least 2 months.

Our results are consistent with a household study37 in which 6% of infections (7 of 109) have been reported to have viral shedding after 30 days since the onset of symptoms, but only two had a Ct ≤ 30 after 25 days, and none after 30 days. By contrast, a study of hospitalized individuals38 has reported prolonged shedding in 18% (17 of 92) of patients. This rate is much higher than that in individuals sampled in the community regardless of symptoms, as in our study, and probably reflects the severity of infection among the hospitalized individuals.

The harbouring of persistent infections in the general community may also help to explain the early detection of cryptic lineages circulating in wastewaters39,40 long before they spread in the population at large. In support of the hypothesis that variants of concern may emerge during prolonged infections, several studies have shown elevated evolutionary rates driven by selection during chronic infections of immunocompromised individuals6,7,8. Among many of the individuals with persistent infection that we identified, we observed long periods of evolutionary stasis at the consensus level, indicating little to no directional selection during infection. In HIV, zero synonymous consensus differences between sequences spanning prolonged periods of within-host infection have also been observed, probably because synonymous mutations are under little or no selective pressure37,38. However, in other persistent infections, we found strong evidence for positive selection and parallel evolution, particularly in S and ORF1ab. In the most extreme cases, we observed one persistent infection with zero consensus change for over 150 days, whereas another persistent infection had 33 substitutions over a 4-month period, 20 of which were non-synonymous, and where the great majority of these mutations emerged during the first 30 days after the first positive sequence.

Most of the persistent infections in our study with at least three positive PCR samples over the course of infection showed a pattern of viral rebound (high to low to high viral load). This suggests that the mechanism of persistent infection is not due to delayed clearance of the virus by the host, but points to the possible presence of actively replicating virus. Other studies have also reported viral rebound both during acute41 and chronic42 infections. These rebounding dynamics also exacerbate the difficulty of distinguishing between persistent infections and reinfections in the absence of sequence data. A common criterion for identifying reinfections is to only consider positive PCR samples that are at least 90 days apart43. An advantage of the genetic approach used in our study is not only the ability to detect reinfections over shorter timescales of less than 60 days but also to rule out reinfection over longer timescales (more than 90 days). Our findings are in broad agreement with recent systematic reviews showing lower rates of reinfection during the first 12 weeks since the initial infection44.

Individuals with persistent infections report fewer symptoms later in a persistent infection than at their first positive sample, or remain asymptomatic throughout infection, but have more than 50% higher odds of long COVID than a group of individuals with non-persistent infection. Although the link between viral persistence and long COVID may not be causal, these results suggest that persistent infections could be contributing to the pathophysiology of long COVID10,45, as also evidenced by the observation of circulating SARS-CoV-2 S1 spike protein in a subset of patients with long COVID months after first infection46. There is also a growing body of evidence of the persistence of replication-competent virus throughout the body months after the start of an infection47,48, and very recently that this persistence is strongly associated with higher risk of long COVID49.

The association between persistent infection and long COVID does not imply that every persistent infection leads to long COVID (only 9% of individuals with persistent infection reported having long COVID), nor does it mean that all cases of long COVID are due to a persistent infection. Indeed, many other possible mechanisms have been suggested to contribute to long COVID, including autoimmunity/inflammation, organ damage, Epstein–Barr virus reactivation and microthrombosis (see ref. 10 for a recent review).

Together, our observations highlight the continuing importance of community-based genomic surveillance both to monitor the emergence and spread of new variants, and to gain a fundamental understanding of the natural history and evolution of novel pathogens and their clinical implications for patients.

Methods

ONS-CIS

This work contains statistical data from ONS, which is Crown Copyright. The use of the ONS statistical data in this work does not imply the endorsement of the ONS in relation to the interpretation or analysis of the statistical data. This work uses research datasets that may not exactly reproduce National Statistics aggregates.

The ONS-CIS is a UK household-based surveillance study in which participant households are approached at random from address lists across the country to provide a representative sample of the population21. All versions of the study protocol are available at https://www.ndm.ox.ac.uk/covid-19/covid-19-infection-survey/protocol-and-information-sheets. All individuals 2 years of age and older from each household who provide written informed consent provide swab samples (taken by the participant or parent or carer for those under 12 years of age), regardless of symptoms, and complete a questionnaire at assessments. The survey offered participants the option of only having one enrolment assessment (taken by approximately 1%), or weekly assessments for only 1 month (taken by approximately 1%; Extended Data Fig. 1). All other enrolled participants (approximately 98%) were assessed weekly for the first month of their enrolment in the survey and then approximately monthly (originally for 1 year; all such participants were approached for re-consent for ongoing follow-up beyond 1 year). The survey had rolling recruitment to meet its target for taking a certain number of swabs from the population each month, but in practice, most recruitments occurred between September and December 2020 (Supplementary Information; also see supplementary table 4 in ref. 50). The rolling recruitment enabled the study to achieve its overall sample numbers (required to address its surveillance objectives) while accounting for participants withdrawing from the study. As is standard, the protocol also allowed a 14-day window around the approximately monthly assessments (shifting any following assessments to avoid swabbing participants again at very short (and variable) notice); crucially, assessments were not missed to meet survey targets.

As the vast majority of recruitment comes from invitations sent to households randomly selected from address lists for which we do not have relevant demographic information, we are not able to compare characteristics of those agreeing and not agreeing to participate. From 26 April 2020 to 31 July 2022, assessments were conducted by study workers visiting each household; from 14 July 2022 onwards, assessments were remote, with swabs taken using kits posted to participants and returned by post or courier, and questionnaires completed online or by telephone. For this analysis, we included data from 2 November 2020 to 15 August 2022, spanning a period from Alpha to Omicron BA.2 sequences within the ONS-CIS dataset (Extended Data Table 1).

To date, of 535,731 participants recruited into the ONS-CIS, 109,417 (20%) have either completed their participation after a single enrolment visit, visits only for the first month or only for the first year (7%) or withdrawn (13%; see Supplementary Information). Moving house was a major reason for completing participation in the survey (as this leads to participants no longer being eligible for follow-up as it is the original address that is sampled), a small number of participants died (0.4%), and in July 2022, the survey moved to a remote data collection approach at which point some participants chose to end their participation. For the time period of this study, 96.2% of swabs had a negative result and 1.9% had a positive result (1.9% were void). For those with positive test results, the mean time since the previous assessment was 35.2 days and to the next assessment was 37.1 days. For those with a negative test, the associated numbers were 31.8 days and 33.0 days. By definition, 100% of first positive samples from each persistent infection had a subsequent assessment. There was no statistical difference in the time between sampling for individuals with persistent infection compared with those testing positive (Supplementary Information).

Sequencing

From December 2020 onwards, sequencing was attempted on all positive samples with Ct ≤ 30; before this date, sequencing was attempted in real time wherever possible, with some additional retrospective sequencing of stored samples. The vast majority of samples were sequenced on Illumina NovaSeq, with a small number using Oxford Nanopore GridION or MinION. One of two protocols was used: the ARTIC amplicon protocol51 with consensus FASTA sequence files generated using the ARTIC Nextflow processing pipeline (v1)52, or veSeq, an RNA sequencing protocol based on a quantitative targeted enrichment strategy19,53 with consensus sequences produced using shiver (v1.5.8)54. During our study period, we identified 94,943 individuals with a single sequence and 5,774 individuals with two or more sequences. Here we only included sequences with 50% or more genome coverage.

Identifying candidate persistent infections

We first identified individuals with two or more sequenced samples taken at least 26 days apart. We chose this cut-off because the majority of individuals with acute infection shed the virus for less than 20 days and no longer than 30 days in the respiratory tract24,55. Given the extreme heterogeneity in the shedding profiles during some acute infections24,55, we also considered a more conservative 56-day cut-off for some analyses. Selection was based on availability of sequences, which were required for genetic analysis; it was not possible to allow for failure to identify any long-term shedding due to participants not having assessments/swabs, tests failing, or subsequent positives having Ct > 30 and therefore not being sent for sequencing. Consequently, some persistent infections are likely to have been missed, so our estimates should be considered a lower bound.

Candidate persistent infections were defined in one of two ways: (1) pairs of sequenced samples that belonged to the same major lineage, and (2) pairs of sequenced samples where one or both had no defined phylogenetic lineage, but where the genetic distance between them was lower than that required to differentiate two major lineages (Extended Data Fig. 9). The major lineages that we considered were Alpha (B.1.1.7), Delta (B.1.617.2), Omicron BA.1 and Omicron BA.2, including their sublineages. We assumed pairs belonging to different major lineages were either co-infections or reinfections with two different virus lineages. Only candidate persistent infections were considered in further analysis.

Identifying persistent infections

We determined whether two sequences from the same individual are from the same infection by whether they share a rare SNP at two or more consecutive time points relative to the population-level consensus. If an intermediate sequence from that individual had an unknown nucleotide at a site (due to poor coverage), whereas the first and last sequences shared a rare SNP, then the intermediate sequence was also assumed to be part of the same infection. Rare SNPs were defined as those that were shared by fewer than a threshold number of sequences, belonging to each major lineage, within the full ONS-CIS dataset (Extended Data Fig. 2). The thresholds were chosen to maximize the number of persistent infections identified while minimizing the number of false positives (see below).

To determine the false-positive rate, for each major lineage, we generated a dataset of 1,000 randomly paired sequences from different individuals in the ONS-CIS, each sampled at least 26 days apart. We determined the proportion of these pairs that would have been incorrectly identified as persistent infections as a function of the threshold for determining whether a SNP is rare (Extended Data Fig. 2). Although the total number of persistent infections that we identified (among the list of candidate persistent infections) grew as the threshold for determining whether a SNP is rare increased, at very high thresholds, the rate of false positives (among the list of randomly paired sequences) was also high. In our study, we chose a threshold of 400 sequences (corresponding to all sequences of the same major lineage within the full ONS-CIS dataset) for all of the major lineages, giving a false-positive rate (identifying an infection as persistent when it was not) of 0–3%. Using this threshold, approximately 92–98% of all sequences from the four major lineages had a rare SNP relative to the major-lineage population-level consensus.
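The rare-SNP criterion and the random-pairing check described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the pipeline used in the study: the representation of each sequence as a set of SNP labels relative to the major-lineage consensus, the record layout and all function names are assumptions made for the example.

```python
import random
from collections import Counter

def rare_snps(snp_sets, threshold):
    """Return SNPs carried by fewer than `threshold` sequences of one major lineage."""
    counts = Counter(snp for snps in snp_sets for snp in snps)
    return {snp for snp, n in counts.items() if n < threshold}

def shares_rare_snp(snps_a, snps_b, rare):
    """True if the two sequences have at least one rare SNP in common."""
    return bool(snps_a & snps_b & rare)

def false_positive_rate(records, threshold, n_pairs=1000, min_gap_days=26, seed=0):
    """Proportion of random pairs from different individuals that share a rare SNP.

    records: list of (individual_id, sampling_day, frozenset_of_snps) tuples
    (a hypothetical layout); at least two records are required.
    """
    rng = random.Random(seed)
    rare = rare_snps([r[2] for r in records], threshold)
    hits = drawn = attempts = 0
    # Cap the number of attempts so the loop ends even if few valid pairs exist.
    while drawn < n_pairs and attempts < 100 * n_pairs:
        attempts += 1
        a, b = rng.sample(records, 2)
        if a[0] == b[0] or abs(a[1] - b[1]) < min_gap_days:
            continue  # same individual, or sampled too close together
        drawn += 1
        hits += shares_rare_snp(a[2], b[2], rare)
    return hits / drawn if drawn else float("nan")
```

Repeating `false_positive_rate` over a range of thresholds would give the kind of trade-off curve described in the text, with higher thresholds labelling more SNPs as rare and therefore producing more spurious matches.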

Identifying reinfections with the same major lineage

Any pair of sequences from the same individual, of the same major lineage and at least 26 days apart was considered a candidate reinfection. Of these, pairs that had at least one nucleotide difference at the consensus level, and did not share any rare SNPs, were classed as reinfections. Pairs that shared no rare SNPs and had no nucleotide differences at the consensus level were classed as undetermined.
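The classification rules above can be summarized as a small decision function. The sketch below assumes that both sequences belong to the same major lineage and that each is represented as a set of SNP labels relative to the consensus, so that a consensus-level difference corresponds to unequal sets; the function name, the labels returned and the handling of ambiguous bases (ignored here) are illustrative assumptions.

```python
def classify_pair(snps_first, snps_second, rare, days_apart, min_gap_days=26):
    """Classify two same-lineage sequences from one individual.

    snps_first, snps_second: sets of SNP labels relative to the consensus.
    rare: set of SNP labels considered rare for this major lineage.
    """
    if days_apart < min_gap_days:
        return "not a candidate"
    if snps_first & snps_second & rare:
        # At least one shared rare SNP: treated as the same infection.
        return "persistent infection"
    if snps_first != snps_second:
        # No shared rare SNP, but a consensus-level difference exists.
        return "reinfection"
    # No shared rare SNP and no consensus-level difference.
    return "undetermined"
```

For example, two identical sequences that carry only common SNPs would be returned as "undetermined", because neither criterion can separate persistence from reinfection with a closely related virus.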

Sample mix-ups could inflate the apparent number of reinfections. In the ONS-CIS, each sample has a unique barcode, a small minority of barcodes are positive, and fewer still have a Ct ≤ 30; therefore, random swapping of barcodes is unlikely to result in a wrong positive sample with Ct ≤ 30 being sent for sequencing. For each weekly sampling batch, we also checked concordance between lineage from the sequencing laboratory and S gene target failure from the testing laboratory; concordance between Ct from the testing laboratory and genome coverage from the sequencing laboratory (high coverage is expected for low Ct, and low coverage for high Ct); and for veSeq, a log-linear relationship between the number of mapped reads from the sequencing laboratory and Ct from the testing laboratory19.

Phylogenetic analysis

For each of the four major lineages, we chose 600 consensus sequences with at least 95% coverage from the ONS-CIS dataset using weighted random sampling, with each sample of major lineage i collected in week j given a weight 1/nij, where nij is the number of sequences of major lineage i collected during week j22. These sequences were added as a background set to the collection of all consensus sequences for samples from persistent infections and reinfections. Mapping of each sequence to the Wuhan-Hu-1 reference sequence was already performed by shiver, and thus a full alignment for each of the four lineages could be constructed using only this.
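One way to draw a background set with weights of 1/nij is weighted sampling without replacement. The sketch below uses the Efraimidis-Spirakis key method for a single major lineage; whether the study sampled with or without replacement is not stated, so this choice, the input layout and the function name are assumptions.

```python
import random
from collections import Counter

def weighted_background_sample(samples, k, seed=0):
    """Draw k sample IDs without replacement, weighting each by 1/n_week.

    samples: list of (sample_id, week) tuples for one major lineage
    (a hypothetical layout). n_week is the number of sequences in that week.
    """
    rng = random.Random(seed)
    per_week = Counter(week for _, week in samples)
    # Efraimidis-Spirakis: key = u ** (1 / weight). With weight = 1 / n_week
    # the exponent is n_week. The k largest keys form the sample.
    keyed = [(rng.random() ** per_week[week], sample_id)
             for sample_id, week in samples]
    keyed.sort(reverse=True)
    return [sample_id for _, sample_id in keyed[:k]]
```

Weighting by the inverse of the weekly count evens out representation over time, so that weeks with intensive sequencing do not dominate the background tree.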

Maximum likelihood phylogenetic trees were constructed using IQ-TREE (v1.6.12)56 with the GTR+gamma substitution model and ultrafast bootstrap57. Each tree was rooted using the collection dates of the samples and the heuristic residual mean square algorithm in TempEst58. Trees were visualized using ggtree59.

Measuring the number of independent appearances of mutations and their fitness effects

To find the frequency with which mutations (not including deletions) that we identified during persistent infections are represented in cross-sectional samples from the population and their between-host level fitness, we used the results from ref. 29 on the estimated number of appearances of mutations from a representative global dataset of approximately 6.5 million SARS-CoV-2 sequences (for number of appearances: https://github.com/jbloomlab/SARS2-mut-fitness/blob/main/results/mutation_counts/aggregated.csv; for estimating the fitness effect of mutations: https://github.com/jbloomlab/SARS2-mut-fitness/blob/main/results/aa_fitness/aamut_fitness_by_clade.csv), as well as a subset of those sequences that are only sampled from England (arguably more relevant to our sequences from the ONS-CIS). When doing this, we controlled for major lineage, meaning, for example, if a mutation occurred in a BA.1 persistent infection, we only considered the number of times it appeared on the BA.1 phylogeny. To map between Pangolin lineages and Nextstrain clades, we assumed B.1.1.7 ≡ 20I, B.1.617.2 ≡ {21A,21I,21J}, BA.1 ≡ 21K and BA.2 ≡ {21L,22C,22D}. We also compared the frequency and fitness effect of mutations that appeared in two persistent infections (that is, recurrent mutations) and those that appeared in only one persistent infection (that is, single mutations) as reported in ref. 29.
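The lineage-to-clade mapping stated above, and the restriction of mutation counts to the matching major lineage, can be written as a lookup. In this sketch the mapping follows the text, whereas the `(clade, mutation, count)` row layout and the function name are hypothetical and do not reflect the actual column names of the referenced CSV files.

```python
# Mapping between Pangolin major lineages and Nextstrain clades, as stated in the text.
PANGO_TO_NEXTSTRAIN = {
    "B.1.1.7": {"20I"},
    "B.1.617.2": {"21A", "21I", "21J"},
    "BA.1": {"21K"},
    "BA.2": {"21L", "22C", "22D"},
}

def appearances_in_lineage(rows, mutation, major_lineage):
    """Sum the appearances of `mutation` over the clades of one major lineage.

    rows: iterable of (clade, mutation, count) tuples (hypothetical layout).
    """
    clades = PANGO_TO_NEXTSTRAIN[major_lineage]
    return sum(count for clade, mut, count in rows
               if mut == mutation and clade in clades)
```

A mutation observed in a BA.1 persistent infection would therefore be looked up with `major_lineage="BA.1"`, so that only appearances on clade 21K contribute.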

Estimating the percentage of infections that are persistent

We identified 381 and 54 infections that lasted 30 days or longer and 60 days or longer, respectively. Comparing this with the number of individuals that had sequenced samples belonging to Alpha, Delta, BA.1 or BA.2, we identified approximately 0.49% (381 of 77,561) and 0.07% (54 of 77,561) of infections with at least one sample that could be sequenced as persistent for 30 days or longer and 60 days or longer, respectively. As the ONS-CIS is a representative sample of individuals from the general population, we can estimate the percentage of all SARS-CoV-2 infections that became persistent for 1 month or longer, and that have intermittent high viral loads. To do this, we need to determine the probability that a persistent infection with one sequenced sample has at least one more sequenced sample. As most persistent infections probably last 1–3 months, and without knowing the true viral kinetics during persistent infection, this can be approximated as the probability that a persistent infection has virus that can be sequenced on any given day of sampling.

At one extreme, if a typical persistent infection has a virus sample that can be sequenced for only 4 days per month (assuming viral dynamics similar to one acute infection each month), only 14% of persistent infections would be detected through approximately monthly sampling. Correcting for this, we would estimate the percentage of detected infections that are persistent in the general population for 30 days or longer to be 3.5%, calculated as the ratio of the estimated prevalence of persistent infections (0.49%) to the detection rate (14%). Similarly, for infections persisting 60 days or longer, the estimated percentage would be 0.5% (0.07%/0.14). At the other extreme, if we assume typical persistent infections have sequenceable virus for 20 days per month and, therefore a detection rate of 71%, we would estimate the percentage of detected infections that are persistent infections in the general population for 30 days or longer to be 0.7% (0.49%/0.71) and for 60 days or longer to be 0.1% (0.07%/0.71).
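The correction above is a single division, reproduced here as a short worked example. The detection rates of 14% and 71% are consistent with 4 and 20 sequenceable days out of a 28-day month; the 28-day denominator is an inference from those figures, and the function name is illustrative.

```python
def corrected_percentage(observed_pct, detection_rate):
    """Scale an observed percentage of persistent infections by the detection rate."""
    return observed_pct / detection_rate

# Detection rates implied by 4 or 20 sequenceable days per 28-day month (assumed).
low_detection = round(4 / 28, 2)    # 0.14
high_detection = round(20 / 28, 2)  # 0.71

for label, observed in (("30 days or longer", 0.49), ("60 days or longer", 0.07)):
    upper = corrected_percentage(observed, low_detection)
    lower = corrected_percentage(observed, high_detection)
    print(f"{label}: {lower:.1f}% to {upper:.1f}%")
```

The two detection rates bracket the estimate, giving a range of roughly 0.7% to 3.5% for infections persisting 30 days or longer and 0.1% to 0.5% for those persisting 60 days or longer.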

Comparing viral load activities and symptoms

To quantify the changes in viral load activities during persistent infections, we compared Ct values at the last time point a sequence was obtained to when the first sequence was collected. Likewise, for reinfections, we compared the changes in Ct value between the primary infection and reinfection. We used a paired Student’s t-test to calculate P values in both cases, as the differences in Ct values were normally distributed for both persistent infections (W = 0.99, P = 0.28) and reinfections (W = 0.99, P = 0.78) as determined by the Shapiro–Wilk test60.

We also tracked 12 symptoms consistently solicited from all participants at every assessment. Symptoms were fever, weakness/tiredness, diarrhoea, shortness of breath, headache, nausea/vomiting, sore throat, muscle ache, abdominal pain, cough, loss of smell and loss of taste. At each follow-up assessment, participants were asked whether these 12 symptoms had been present in the past 7 days (mandatory question completed at all assessments where a swab was taken). Symptom discontinuation was defined as the first occurrence of two successive follow-up visits without reporting symptoms. To compare symptom counts during persistent infections and reinfections, we used the two-sided paired Wilcoxon test, as the symptom differences were not normally distributed (Fig. 3e). For calculation of P values and visualization of histograms and box plots, we used Mathematica (v13.1.0.0).
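The definition of symptom discontinuation can be illustrated with a short function operating on the number of symptoms reported at each follow-up visit. The text does not state whether discontinuation is dated to the first or the second of the two symptom-free visits; the sketch below returns the first, which is an assumption, as is the function name.

```python
def symptom_discontinuation_index(symptom_counts):
    """Index of the first of two successive visits with no reported symptoms.

    symptom_counts: number of the 12 tracked symptoms reported at each
    follow-up visit, in chronological order. Returns None if no two
    successive symptom-free visits occur.
    """
    for i in range(len(symptom_counts) - 1):
        if symptom_counts[i] == 0 and symptom_counts[i + 1] == 0:
            return i
    return None
```

A single symptom-free visit followed by a visit with symptoms does not count, so a sequence such as `[3, 0, 2, 0, 0, 1]` yields index 3 rather than index 1.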

Long COVID analysis

Attributing persistent symptoms to a previous SARS-CoV-2 infection is difficult in the absence of a diagnostic test for long COVID, and long COVID cases are known to be under-recorded in electronic health records61. Long COVID status was therefore self-reported by study participants, so we cannot exclude some participants’ symptoms being caused by a medical condition other than COVID-19. From February 2021, at every assessment, participants were asked “would you describe yourself as having long COVID, that is, you are still experiencing symptoms more than 4 weeks after you first had COVID-19, that are not explained by something else?”.

When estimating long COVID prevalence in this analysis, we considered the first assessment at least 12 weeks and at least 26 weeks after infection. Our comparison group comprised all individuals with a positive PCR test and Ct ≤ 30 at the first positive test, excluding the individuals with persistent infection identified in this study, over the same time span as persistent infections such that the first positive test was within the range of dates of the first positive test among the persistent infection group. Although the underlying study design for ONS-CIS is a cohort study, this specific analysis of long COVID focuses on comparing persistent to non-persistent infections in terms of the risk of subsequent self-reported long COVID (binary outcomes, at least 12 weeks and at least 26 weeks following the first positive test). Some missing data were inevitable, given the timeframe of the study and participant completion or withdrawal (see above); overall, the long COVID question was not completed at 368,161 of 6,797,789 (5.4%) assessments during the study period from 4 February 2021 when it was introduced, with 93% and 86% of participants without persistent infection but with a positive test with Ct ≤ 30 having a response to the long COVID question at least 12 and 26 weeks after infection, respectively (Extended Data Fig. 1). Analysis used complete cases, that is, excluded those who did not have a response to the long COVID question in this timeframe (Extended Data Fig. 1). As these are binary outcomes rather than a time-to-event outcome, either an odds ratio or a relative risk could be used to evaluate the risk of long COVID in individuals with persistent infection; here we used the odds ratio. The fact that some persistent infections were probably missed due to sequencing only being attempted in high viral load samples and due to missed assessments means that our estimates of the impact of persistent infection are likely to be biased towards the null, that is, the true effects of persistent infection are probably larger than we estimate. Follow-up from the start of infection to first long COVID response was similar between persistent and non-persistent infections (Table 3).
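For a binary outcome, the difference between an odds ratio and a relative risk can be seen from a 2×2 table. The sketch below computes both unadjusted measures; the counts are invented purely for illustration and are not data from this study, and the study's own estimate comes from an adjusted logistic regression rather than from a calculation like this one.

```python
def odds_ratio(a, b, c, d):
    """Unadjusted odds ratio from a 2x2 table.

    a: exposed with outcome, b: exposed without outcome,
    c: unexposed with outcome, d: unexposed without outcome.
    """
    return (a / b) / (c / d)

def relative_risk(a, b, c, d):
    """Unadjusted relative risk from the same 2x2 table."""
    return (a / (a + b)) / (c / (c + d))

# Hypothetical counts, for illustration only.
print(odds_ratio(10, 90, 5, 95))     # about 2.11
print(relative_risk(10, 90, 5, 95))  # 2.0
```

When the outcome is uncommon in both groups, the two measures are close, which is why either can serve for a binary outcome of this kind.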

In calculating the odds ratio of long COVID in individuals with persistent infection relative to the comparison group, we used a binary logistic regression model and accounted for confounding variables such as age at the last birthday, sex, Ct value, calendar date, area deprivation quintile group, presence of self-reported long-term health conditions (binary), vaccination status (unvaccinated or single vaccinated, fully vaccinated or booster vaccinated 14–89 days ago, fully vaccinated or booster vaccinated 90–179 days ago, fully vaccinated or booster vaccinated 180 or more days ago) and days from first positive test to long COVID follow-up response. All variables except the last one were defined at the time of the first positive test. Continuous variables (age, Ct value, calendar date and days to follow-up response) were modelled as restricted cubic splines with a single internal knot at the median of the distribution and boundary knots at the 5th and 95th percentiles. Vaccination status was derived from a combination of CIS and National Immunisation Management System (NIMS) data for participants in England, and CIS data alone for participants in Wales, Scotland and Northern Ireland. Given the number of potential confounders included, we did not test for interaction (effect modification). We did not test for goodness of fit because the model was solely used to control for measured confounders of the relationship between persistent positivity and long COVID, which we selected on substantive, rather than empirical, grounds (that is, using a causal inference approach).

Although we controlled for many confounders that could potentially impact our long COVID analysis, of note, age, sex, vaccination status and previous infection, there may still be unknown residual confounders that can influence our results. We were also unable to perform the long COVID analysis for the reinfection group due to the low number of participants in this cohort who reported new-onset long COVID 12 weeks or longer or 26 weeks or longer after infections.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.