Recombination is a common evolutionary feature of positive-strand RNA viruses1. It can increase genetic diversity and accelerate adaptation in viral populations by combining existing linked allelic variation. The signature of frequent recombination is pervasive across Betacoronaviruses in bats and other animal hosts2,3,4,5,6, and its detection is made easier in part by the substantial genetic divergence separating these various coronaviruses. When an individual is contemporaneously infected with two genetically distinct strains of a virus, so-called superinfection7, these viruses can recombine to produce a virus with novel allelic combinations. Although recombination is expected to regularly occur during SARS-CoV-2 infections, it can be difficult to detect in vivo unless it involves genetically distinguishable parental strains: recombination between two identical or nearly identical genomes leaves no detectable molecular trace. Furthermore, as with influenza virus8,9, SARS-CoV-2 superinfections have been only rarely reported in humans10,11,12, likely due to the short mean duration of SARS-CoV-2 infections.

Early claims of recombination in SARS-CoV-213 may have been confounded by sequencing errors or convergent evolution14. As the COVID-19 pandemic progressed, and genetically divergent lineages have evolved, evidence for recombination between these lineages is becoming more convincing15,16. Many of these divergent lineages include highly transmissible variants distinguished by S (spike) genes that, under positive selection, have accumulated multiple mutations associated with increased transmissibility, virulence, and immune escape17,18. All of these variants are descended from the B.1 lineage, defined by four mutations: C241T, C3037T, C14408T, and A23403G. This lineage arose in China and spread to Europe in late-January 2020 and the United States in February 202019.

Here, we provide a detailed characterization of an instance of superinfection from January 2021, identified by the New York City (NYC) Department of Health and Mental Hygiene (DOHMH). We show that this individual was superinfected with two SARS-CoV-2 variants: Alpha (B.1.1.7) and Epsilon (B.1.429)20,21. Further, we characterize evidence for recombination occurring within this superinfected individual, providing an in vivo snapshot of this evolutionary process within SARS-CoV-2.


Index case and named contact partner epidemiology

In December 2020, researchers and public health officials in the United Kingdom identified a rapidly spreading SARS-CoV-2 variant within England, then designated as PANGO lineage B.1.1.721, now designated as the Alpha variant of concern in the WHO nomenclature. In NYC, a SARS-CoV-2 genome sequence classified as belonging to the Alpha lineage was obtained from a sample on 4 January 2021 (the ‘index case’): NYCPHL-002130 (GISAID accession number EPI_ISL_857200). Due to the potential public health importance of Alpha variant cases in NYC in early 2021, NYC DOHMH conducted a public health investigation related to the individual from which this sample had been obtained. This investigation determined that the individual had recently traveled to Ghana (late December/early January) and developed symptoms consistent with COVID-19 while in Ghana. Contact tracing in New York City identified another case of an Alpha variant infection, sampled on 14 January 2021, in a named contact with a similar travel history (the ‘named contact partner’): NYCPHL-002461 (GISAID accession EPI_ISL_883324). The named contact partner had also developed symptoms consistent with COVID-19 while in Ghana, prior to returning the United States.

Alpha (B.1.1.7) variant PCR screening

Typical of the Alpha variant21, NYCPHL-002130 from the index case exhibited S gene target failure (SGTF) phenotype with the TaqPath COVID-19 RT-PCR assay (Table 1). NYC PHL uses the ARTIC amplicon-based protocol V3 to sequence full viral genomes and capture intra-host diversity. All 24 mutations diagnostic of the Alpha variant were found in >90% of reads (Table 2). The viral genome from this index case showed limited intra-host viral diversity (Fig. 1). A single variable site was found at position 23099, with C in 20.4% of reads and A in 79.6% of reads.

Table 1 Cycle threshold (Ct) values from TaqPath assays from index case (NYCPHL-002130) and named contact partner (NYCPHL-002461).
Table 2 Lineage defining mutations and their frequency in the index case (NYCPHL-002130) and named contact partner (NYCPHL-002461).
Fig. 1: The distribution of allelic frequencies in the index case (NYCPHL-002130) and named partner with suspected superinfection (NYCPHL-002461).
figure 1

Frequencies of individual alleles shown as ticks, a smoothed kernel density plot is used to highlight clustering patterns, and colors represent allele types.

Atypical Alpha (B.1.1.7) variant PCR screening

During the initial PCR screening of the sample collected from the named contact partner (NYCPHL-002461-A), the SGTF characteristic of the Alpha variant was not observed (Table 1). Furthermore, genome sequencing revealed substantial intra-host viral diversity within the viral genome, a possible signature of superinfection (Fig. 1). To confirm that this intra-host diversity was not attributable to experimental or sequencing artifacts, the original sample was re-extracted and re-sequenced (NYCPHL-002461-B) and similar SGTF was observed. Additional extractions were then performed in duplicate from the original stock (NYCPHL-002461-C and -D) and sequenced. The same signature of intra-host diversity was confirmed in all four sequenced extractions. Four nucleotide (nt) substitutions differentiating this sequence from the reference genome were identified at >90% frequency: C241T, C3037T, C14408T, and A23403G (Fig. 1; Table 2). These four substitutions were all present in the lineage B.1 virus that is ancestral to the named SARS-CoV-2 variants. Numerous additional substitutions, including A23063T (S N501Y), were present, but at slightly lower frequencies. Nonetheless, this genome was classified as an Alpha variant. Notably, the Δ69/70 and Δ144 deletions were found at >97% in the sequencing reads, despite the lack of SGTF.

NYCPHL-002461-A, -B, and -D extracts exhibited low Ct values for the ORF1ab and N gene targets, ranging between 15 and 16 (Table 1). The S gene target Ct values were around 2 to 3 cycles higher. The difference suggests a reduction of viral template in the S gene target region, but not SGTF. We note NYCPHL-002461-C yielded an invalid result, as the TaqPath assay showed no amplification on all targets, including the MS2 phage extraction-control target.

Intra-host diversity

The presence of multiple intermediate frequency alleles and the lack of SGTF in the TaqPath assay prompted us to investigate the intra-host diversity in the named contact partner, NYCPHL-002461. Using the previously described and validated Galaxy SARS-CoV-2 allelic variation pipeline22, we identified four categories of allelic frequencies: shared, major strain, minor strain, and other (see Fig. 1, interactive notebook at The four replicate sequencing runs for NYCPHL-002461 yielded remarkably similar patterns of these allelic frequencies.

Alleles that fell into the shared category were present at 90% allele frequency in three or more samples. Shared alleles included all four substitutions characteristic of B.1 (Table 2) and two deletions in the S gene (Δ69-70 and Δ144) diagnostic of the Alpha variant.

Major strain occurred at frequencies between 60 and 80% (in at least 3 samples). Major alleles included all 21 substitutions defining the Alpha variant, which we observed at a median allele frequency of 74.1%, and ORF1A deletion (Table 2). The remaining major alleles are shared with genome from the index case.

Minor strain alleles occurred at frequencies between 10 and 25% (in at least 3 samples). All but one of the 12 diagnostic Epsilon mutations was found in this set: A28272T is absent in NYCPHL-002461. All remaining minor alleles have been observed in other Epsilon genomes.

The “other” category encompasses all other variable sites, i.e. those occurring at frequency between 25 and 60% or those found in only one or two samples. The two alleles were found in all four replicate sequences at intermediate frequencies: G7723A (30.3%) and C23099A (46.7%). These frequencies are suggestive of intra-host variation in the major strain.

In contrast to the allelic mixture detected in the named partner (NYCPHL-002461), we observed allele frequencies >90% for all Alpha defining mutations in the sequencing data for the index case, NYCPHL-002130 (Table 2). The C23099A mutation, which was at intermediate frequency in NYCPHL-002461 from the named contact partner, was present at 88.1% in NYCPHL-002130 from the index case, consistent with the transmission of a mixed viral population between these individuals.

Phylogenetic inference with major and minor variants

We identified sub-clades within Alpha and Epsilon that shared substitutions with the major and minor strains (Fig. 2). We inferred a maximum likelihood (ML) phylogenetic tree in IQTree2 for the major strain and 3655 related Alpha (B.1.1.7) genomes containing the C2110T, C14120T, C19390T, and T7984C substitutions found in the major strain (Fig. 2A). We also inferred an ML tree for the minor strain and 2275 related Epsilon (B.1.429) genomes containing the C8947T, C12100T, and C10641T substitutions found in the minor strain (Fig. 2C).

Fig. 2: Phylogenetic consistency of major and minor variants.
figure 2

A Phylogeny of Alpha variant immediate relatives. B Root-to-tip regression for Alpha variant. C Phylogeny of Epsilon variant immediate relatives. D Root-to-tip regression for Epsilon variant. NY-NYCPHL-002461 is the genome deposited in GISAID from the case of putative superinfection. NY-NYCPHL-002130 is the genome from the index case.

Root-to-tip regression analyses show that the NYCPHL-002461 sampling date is consistent with the molecular clock for both the major and minor strain sequences (Fig. 2B/D), indicating that one would expect viruses of this degree of genetic divergence to have been circulating in mid-January 2021. In fact, genomes identical to the major variant were sampled in both NYC (the NYCPHL-002130 index case) and in Ghana on 8 January 2021 (EPI_ISL_944711), consistent with a scenario in which this particular Alpha virus was acquired in Ghana. These three viruses share a common ancestor around 4 January 2021 and are separated from additional viruses sampled in Ghana by two mutations: C912T and C23099A. Notably, the latter mutation appears at intermediate frequency in both NYCPHL-002130 and NYCPHL-002461.

The minor variant is genetically distinct from all other sampled genomes, including any genome sequenced by NYC DOHMH (Fig. 2C). The closest relatives were sampled in California (EPI_ISL_3316023, EPI_ILS_1254173, EPI_ISL_2825578), the United Kingdom (EPI_ILS_873881), and Cameroon (EPI_ISL_1790107, EPI_ISL_1790108, EPI_ISL_1790109). The most similar of these relatives is EPI_ISL_3316023, which was sampled on 11 January 2021 in California and represents the direct ancestor of the minor variant on the phylogeny. The only mutation separating this California genome from the minor variant is T28272A, which is a reversion away from an Epsilon-defining mutation (Table 2).

It is unlikely that this minor variant is a laboratory contaminant, as there are no closely related Epsilon genomes sequenced from NYC. That said, NYC represents the probable source of this Epsilon virus. Of the 145 SARS-CoV-2 genomes sequenced by NYC public health surveillance between 10 January 2020 through 16 January 2020, 4 (2.8%) were Epsilon. A similar proportion of Epsilon genomes deposited in GISAID were sampled by other labs during this same period: 11 out of 431 genomes (2.6%)23. No Epsilon genome has been reported to date from Ghana.

Four-gamete tests of recombination

A preliminary inquiry of the genome sequencing data from the S gene (12 contiguous read fragments) and N gene (nucleoprotein; 3 contiguous read fragments) regions was suggestive of recombinant genome fragments within the named contact partner. To determine whether pairs of polymorphic sites within individual read fragments displayed evidence of recombination we employed three different four-gamete based recombination detection tests: PHI24, MCL, and R2 vs Dist25 (Table 3). The power of each of these tests to detect recombination was seriously constrained by the short lengths of the read fragments and the low numbers of both variant-defining sites and other polymorphic sites with minor allele frequencies >1% within each of the fragments. Only three of the 15 read fragments (read fragments 6 and 8 in the S gene and read fragment 3 in the N-gene) encompassed two or more of the variant-defining sites that were expected to provide the best opportunities to detect recombination. Nevertheless, pairs of sites within four read fragments in the S gene (positions 23123–24467 covering fragments 7, 8, 9 and 10) and one read fragment in the nucleoprotein gene (positions 28986–29378 covering fragment 3) exhibited signals of significant phylogenetic incompatibility with at least two of the three tests (p < 0.05): signals which are consistent with recombination. The only read fragment for which evidence of recombination was supported by all three tests was fragment 3 in the N gene: a fragment that was one among only three that contained multiple variant-defining substitutions. Eight of the fifteen analyzed read-fragment alignments exhibited no signals of recombination using any of the tests, which is unsurprising given the lack within these fragments of both variant-defining substitutions and polymorphic sites with minor allele frequencies greater than 1%.

Table 3 Four gamete recombination test results for 15 sets of aligned read fragments in the S (spike) and N (nucleoprotein) genes.

Targeted sequencing for recombination detection

The four gamete tests on genomic sequencing data is limited by the short length of amplified fragments. To obtain data from longer sequence fragments, we PCR-amplified three regions of the genome from the original nucleic acid extracts, cloned them, and then sequenced individual clones. These longer genomic fragments provide greater resolution for detecting recombination, compared with the short fragments from deep sequencing analysis, because they include more differentiating sites spread out farther across the genome.

The longest cloned region spanned 947 nt within the S gene (positions 22904–23850) and contained 5 nt substitutions differentiating the major and minor strains plus a variable site in the major variant. Of the 104 clones sequenced within this region, 60 (57.7%) were major strain haplotypes, 13 (12.5%) were minor strain haplotypes, whereas the remaining 31 clones (29.8%) contained both major and minor strain mutations, consistent with recombination (Fig. 3). We observed 11 distinct combinations of major and minor strain mutations across these clones, with two distinct haplotypes present in 6 clones apiece. Most recombinant haplotypes (n = 24) are consistent with only a single recombination breakpoint. However, 7 clones are consistent with 2 breakpoints (representing 3 different haplotypes), and 1 clone is consistent with 3 distinct breakpoints.

Fig. 3: Major, minor, and mixed haplotypes in the 947 nucleotide S (spike) gene cloned sequences.
figure 3

Each row represents a sequenced clone (n = 104). Colored markings denote mutations from the reference genome. Major strain mutations are those found in the Alpha variant. Minor strain mutations are those found in Epsilon variant. Other mutations are found at intermediate or low frequencies. Shared mutations are those shared by B.1 viruses.

The second cloned S region spanned 657 nt in the S gene (positions 21442–22098) including the Δ69–70 and Δ144 deletions characteristic of the major strain and two 2 substitutions in the minor strain. Of the 93 clones sequenced, 69 (74.1%) were major strain haplotypes, 17 (18.3%) were minor strain haplotypes, and 7 (7.5%) were mixed haplotypes (Fig. 4). Five of these mixed haplotypes contained only one of the two deletions. One mixed haplotype was consistent with multiple recombination breakpoints. Unlike in the primary sequencing analyses where the Δ69–70 and Δ144 deletions were present in >98% of sequences, Δ69-70 was observed in only 72 (77.4%) clones and Δ144 was observed in only 71 (76.3%). These frequencies are consistent with the frequency of the other major strain substitutions in the primary sequencing analysis.

Fig. 4: Major, minor, and mixed haplotypes in the 657 nucleotide S (spike) gene cloned sequences.
figure 4

Each row represents a sequenced clone (n = 93). Colored markings denote mutations from the reference genome. Major strain mutations are those found in the Alpha variant. Minor strain mutations are those found in Epsilon variant. Other mutations are found at intermediate or low frequencies.

The third, and shortest, cloned region spanned 476 nt of ORF8 (positions 27798–28273), surrounding 4 substitutions defining the major strain and 1 minor strain substitution. Of the 36 cloned sequences, 30 (83.3%) had the major strain haplotype, 2 (5.6%) had the minor variant haplotype, and 4 (11.1%) had mixed haplotypes consistent with a single recombination breakpoint (Fig. 5). Note the discriminating substitutions only span 223 nt of this region.

Fig. 5: Major, minor, and mixed haplotypes in the 476 nucleotide ORF8 cloned sequences.
figure 5

Each row represents a sequenced clone (n = 36). Colored markings denote mutations from the reference genome. Major strain mutations are those found in the Alpha variant. Minor strain mutations are those found in Epsilon variant. Other mutations are found at intermediate or low frequencies.

Three cloned sequences from the 947 nt S gene fragment contained single nucleotide deletions resulting in non-sense mutations. In the 657 nt S gene fragment, we observed 8 clones with similar deletions, detected in both the forward and reverse direction during sequencing. These deletions were seen in the non-recombinant Alpha and Epsilon haplotypes and likely reflect non-functional viral particles, expected to constitute a substantial fraction of genomes within an infected individual26,27.

Consistency between cloning and genome sequencing analyses

In vitro recombination can be introduced by reverse-transcription and PCR amplification, which are part of both genome sequencing and cloning protocols28. These in vitro effects have a strong stochastic component and would result in substantially different recombinant haplotype frequencies across different extracts and PCR experiments. To determine the extent to which these protocols could have led to biased inference of recombination, we compared the haplotype frequencies across the four extracts from NYCPHL-002461, which had each independently been subjected to reverse transcription and PCR amplification, and the frequency of these haplotypes in the cloning experiment, which included PCR amplification.

Within the 947 nt cloned S gene fragment, the major haplotype was present between 76.4% and 78.6%, and the minor haplotype was between 13.7% and 15.4% (Supplementary Table 1). The recombinant haplotype positions 23604 A and 23709 C was present at 3.9% allele frequency (standard deviation of 0.34% across extracts), whereas recombinant haplotype 23604 C and 23709 T was present at 4.3% (standard deviation of 0.37% across extracts). Although the haplotype frequencies among extracts were significantly different (p = 0.029; chi-square test), the magnitude of these differences were unremarkable. Furthermore, there was no significant difference between the frequency of these haplotypes in cloning experiment and extracts (p = 0.190 versus -A; p = 0.189 versus -B; p = 0.357 versus -C; p = 0.206 versus -D; Fisher’s Exact Test).

A similar pattern was observed within the 476 nt cloned fragment in the ORF8 region, which included four discrimination sites: 27972, 28048, 28095, and 28111 (Supplementary Table 2). The predominant recombinant haplotypes were consistent across the four extracts, and the frequencies differed only slightly (p = 0.077; chi-square test). As in S, the frequency of these recombinant haplotypes in the cloning experiment was not significantly different from any of the extracts (p = 0.405 versus -A; p = 0.413 versus -B; p = 0.199 versus -C; p = 0.408 versus -D; Fisher’s exact test).

Hence, in vitro recombination induced by either reverse-transcription or PCR amplification, does not appear to have been the dominant contributor to the recombinant haplotype distribution reported here.

Search for transmission of a circulating recombinant

To determine whether there was onward transmission of a recombinant descendent of these major and minor strains, we queried the 27,806 genomes sequenced by NYC public health surveillance and deposited to GISAID through 5 September 2021. We tested these genomes for mosaicism (3SEQ29; with Dunn-Sidak correction for multiple comparisons) of the major and minor strains; however, we were unable to reject the null hypothesis of non-reticulate evolution for any of these genomes. We also did not find any genomes in the PHL dataset with a superset of the identifying substitutions present in the major and minor variants (e.g., C912T and C27406G) among the genomes in the PHL dataset. There is no evidence of an Alpha/Epsilon recombinant that circulated in New York City.

Since the Dunn-Sidak correction done in the 3SEQ analysis applies a conservative type-1 error threshold of 0.05, we reran the analysis using a more permissive threshold of 0.25 (see methods) and were able to reject the null hypothesis for a single genome (EPI_ISL_2965250; p = 2.24×10−6 and Dunn-Sidak corrected p = 0.117). Although this genome (Fig. 6) contains many of the mutations characteristic of the Alpha variant throughout the genome, it does not possess mutations unique to the major strain nor any Epsilon-specific mutations. Rather, within the putative recombinant regions, the EPI_ISL_2965250 genome has C8809T, C27925T, C28311T, and T28879G. All of these mutations are characteristic of the B.1.526 Iota-variant, prevalent in NYC in early 2021. Therefore, this genome is likely not a descendant of the major and minor strains. Instead it appears to be a recombinant descendant of Alpha and Iota viruses.

Fig. 6: Putative Alpha/Iota variant recombinant and the nucleotide variation present in the major, minor, and reference strains.
figure 6

The distribution of the nucleotide variation found in the major, minor, Iota (B.1.526; EPI_ISL_1635735), and single putative recombinant (EPI_ISL_2965250) strains relative to the reference genome (Wuhan Hu-1; bottom gray sequence).


Here, we report evidence of intra-host recombination of SARS-CoV-2 within a single individual superinfected with Alpha and Epsilon viral variants during the second COVID-19 wave in New York City in early 2021. Because recombinant viruses can be successfully generated and transmitted15 between humans, this finding underscores their potential relevance to the future of the COVID-19 pandemic.

The presence of major and minor strains described within the superinfected individual are unlikely to be the result of bioinformatics error, contamination, or experimental artifacts. The degree of evolutionary divergence of each of the strains from other available SARS-CoV-2 genomes is consistent with other naturally co-circulating occurring viruses in mid-January 2021. Moreover, the major strain genome is identical to contemporaneously sampled genomes from both a named contact and strains circulating in the country from which they had both recently visited. No closely related genome to the minor strain was ever sequenced by NYC DOHMH, lessening the probability of a contaminated sample. Given the relatively low sequencing coverage in NYC in January 2021 and low prevalence of the Epsilon variant, around 2–3% in NYC at the time, it is not unexpected that a closely related genome would not be observed. Furthermore, the major and minor variants were both present in all four extractions of the two aliquots at similar frequencies, indicating that any contamination, if present, would need to have occurred in the original sample swab.

The timing of this superinfection is important, because January 2021 was the peak of the second COVID-19 wave in NYC, a time when numerous variants were circulating and immediately prior to the vaccination roll-out campaign. Accordingly, January 2021 in NYC represents not only the height of potential for superinfection risk up to that point in the pandemic, but also a location where its existence would be most apparent due to the co-circulation of numerous genetically distinct viral variants.

There remain unexplained patterns in the genome sequencing data from the named contact partner. Evidence of a major and minor strain was not apparent at the S deletions Δ69/70 and Δ144 in the genome sequencing, but the cloning analysis showed major and minor alleles at these sites at the expected frequencies. Therefore, it is possible that the ARTIC protocol preferentially sequenced templates containing these deletions, giving a false impression of their predominance in the genomic analysis. Also of interest is the A28272T mutation in the minor strain, which is either a reversion or sequencing artifact. If the base-call at position 28272 in the minor variant is erroneous, then the minor strain would be identical to a virus sampled contemporaneously in California, where the Epsilon variant was first discovered and likely originated20.

Laboratory induced recombination is a common artifact during reverse-transcription and PCR30,31. However, recombination is a pervasive feature of natural coronavirus infection, as it has been observed in bats, camels, and humans2,15,32,33. One would a priori expect to find recombinant viruses in a SARS-CoV-2 superinfected individual. Therefore, it is unlikely that the entirety of the signal for recombination reported here is due to reverse-transcription or PCR-induced recombination. A consistent signal for recombination was observed in the four whole genome sequencing analyses and in cloned-fragment analysis, all suggesting the same recombinant haplotypes present at high frequency.

Our search for Alpha/Epsilon variant recombinants in NYC did not identify genomes that would suggest onward transmission of either of the major or minor strains derived here, or a recombinant offspring. This lack of onward transmission is not surprising, given that the initial index case was contacted by NYC DOHMH personnel and the named contact partner received a prompt COVID-19 diagnosis and was advised to self-isolate.

It is likely that superinfection with SARS-CoV-2 is more common than has been described in the literature, especially given the documentation of circulating recombinant strains of the Alpha variant in the United Kingdom15. Recombinant virus can only be produced within an individual with multiple contemporaneous infections. That said, we caution against assuming superinfection before potential issues of contamination, poor-quality sequencing, or bioinformatics errors have been appropriately dealt with. Ideally, multiple specimens should be collected from the same the individual to enable validation of signal for intra-host variation, though feasibility of collecting multiple swabs may be a challenge. Additionally, evidence for recombination is most robust when multiple individuals with the same signal for recombination are identified15.

The high number and genomic variability recombinant haplotypes that we have identified within a single superinfected individual suggests that recombination is perpetually occurring within SARS-CoV-2 infections. Whether recombination will play a role in the emergence of novel SARS-CoV-2 variants is an open question. Many SARS-CoV-2 variants tend to possess the same mutations, although consistent with recombination, are more likely the product of convergent evolution18. In contrast, the BA.3 lineage within the Omicron variant is likely of recombinant offspring of BA.1 and BA.234. However, the most probable mechanism driving the origin of variants like Alpha and Omicron remains host adaptation within chronically infected individuals35,36. Reduced incidence due to vaccine-induced and naturally-acquired immunity would lower the opportunity for superinfection, and the homogenizing effect of variant-driven selective sweeps (as seen in the Delta and Omicron variants37) will lessen the potential for biological innovation in a recombinant genome. Nonetheless, SARS-CoV-2 molecular surveillance should actively monitor for the emergence of a recombinant variant (e.g., within Omicron and Delta/Omicron recombinants38,39,40).


The UC San Diego IRB granted approval of this work as analysis of Human Subjects exempt surveillance data; SARS-CoV-2 public health surveillance data are collected without requiring individual consent.

Extraction and sequencing

Nasopharyngeal specimens positive for SARS-CoV-2 with Ct < 32 were submitted to NYC PHL for sequence analysis by the NYC DOHMH through the COVID Express clinics. The NYCPHL-002130 and -002461 specimens had Ct values of 19 and 20 cycles, respectively, which allowed for sequencing at NYC PHL. Each specimen was split into separate extraction and archive aliquots. Nucleic acid extraction was performed using the KingFisher Flex Purification System (Thermo Fisher Scientific) from the extraction aliquot. Eleven µL of extract was used to anneal with random hexamers and dNTPs (New England Biolabs Inc., NEB) and reverse transcribed with SuperScript IV Reverse Transcriptase at 42 °C for 50 min. The cDNA product was amplified in two separate multiplex PCRs with ARTIC V3 primer pools (Integrated DNA Technologies) in the presence of Q5 2x Hot Start Master Mix (NEB) at 98 °C for 30 s, and 35 cycles of 98 °C for 15 s and 65 °C for 5 min. The two PCR products were combined and were purified with Agencourt Ampure XP magnetic beads (Beckman Coulter) at a 1:1 sample-to-bead ratio. The bead-cleaned PCR products were quantified using a Qubit 3.0 fluorometer (Thermo Fisher Scientific). Standard protocol was used for library preparation in the NEBNext Ultra II Library Preparation workflow using 90 ng of PCR product (NEB). In short, the ARTIC PCR products were used in an end-repair reaction, which added a 5’-phosphate group and a dA-tail, in a reaction for 30 min at 20 °C. The reaction was heat inactivated for 30 min at 65 °C. NEBNext Adaptor was ligated at 25 °C for 30 min and cleaved by USER Enzyme at 37 °C for 15 min. The product was Agencourt Ampure XP bead-purified at a ratio of 0.6x sample:beads. The bead-cleaned, end-ligated amplicons were subjected to a 6-cycle PCR reaction with NEBNext Ultra II Q5 Master Mix in the presence of NEBNext Multiplex Oligos for Illumina (NEB), which added a sample-specific 8-base index and Illumina P5 and P7 adapters for sequencing on Illumina instruments. The product was purified with Ampure XP beads at a 0.6x sample:bead ratio and quantified, normalized and pooled at equimolar concentration with other libraries, followed by loading onto the Illumina MiSeq sequencing instrument using V3 600-cycle reagent kit, with a V3 flow cell for 250-cycle paired-end sequencing (Illumina). For NYCPHL-002461, the same “extraction” specimen aliquot was used for a second extraction, Extract B. Extracts C, and D were independent extractions, but from the “archived” specimen aliquot. As such, the first extract (A), and extracts B, C, and D were independent samples which underwent independent reverse transcription, ARTIC PCR, library preparation, and sequencing reactions. Genomic sequencing depth was similar across all extracts for both the index case and named contact partner (Supplementary Figure 1).

Potential in vitro recombination that occurred during the four independent extractions, reverse-transcription reactions, and library preparation procedures, such as PCR amplification, would require the events to occur independently at the same stage four times in order to produce the same proportions of major and minor variant haplotypes in the high-throughput sequencing data. To account for in vitro recombination, regions where long complete reads span across major and minor variants in close proximity, <105 nucleotide bases, were examined across all genome alignments of the four NYCPHL-002461 replicates. Reads with SAM (Sequence Alignment Map) Flags 81,83,97,113,145,147,161,177,2129 are included in the analysis. Reads with other flags are excluded from the analysis or are not found in the alignment files. Additionally, reads without a combination of major or minor alleles are excluded. All unique haplotypes at major and minor variant positions are grouped together and quantified. Relative frequencies of the haplotypes are calculated for each region of all four extractions.


Three regions of the SARS-CoV-2 genome from NYCPHL-002461 were cloned. Two contained non-overlapping regions of the S gene: a 947 nt fragment (positions 22904–23850) and 657 nt fragment (positions 21421–22098). The third region included 476 nt fragment of ORF8 (positions 27798–28273).

To perform the annealing step for reverse transcription, 3 µl of the NYCPHL-002461 nucleic acid extract was combined with reverse primer and dNTPs at final concentrations of 154 nM and 769 µM, respectively. Annealing was performed by heating to 65 °C for 5 minutes then cooling to 4 °C. The reverse primer sequences used for the reverse transcription are as follows: 947 nt S fragment primer-R is 5ʹ-CTATTCCAGTTAAAGCACGGTTT, 657 nt S fragment is 5ʹ-AGGTCCATAAGAAAAGGCTGAGA and ORF8 476 nt fragment is 5ʹ-GAGACATTTAGTTTGTTCGTTTA. An elongation mix containing 200 units of SuperScript IV Reverse Transcriptase (Invitrogen) and 40 units of RNaseOUT Recombinant Ribonuclease Inhibitor (Invitrogen) was then added along with dNTPs at a final concentration of 200 µM. The resulting solution was heated to 55 °C for 10 minutes then 80 °C for 10 minutes.

Each cDNA target was PCR amplified using Platinum II Taq polymerase and its accompanying buffer (Invitrogen) supplemented with a final concentration of 200 µM dNTPs and 600 nM of each primer. The reverse primer sequences used for reverse transcription were included along with their corresponding forward primers: 947 nt S fragment primer-F is 5ʹ-TCTTGATTCTAAGGTTGGTGGT, 657 nt S fragment is 5ʹ-AGGGGTACTGCTGTTATGTCT and ORF8 is 5ʹ-GCCTTTCTGCTATTCCTTGT. PCR was performed with the following cycling conditions: an initial hold at 94 °C for 2 minutes, followed by 35 cycles at 94 °C for 15 seconds, 60 °C for 15 seconds and 68 °C for 15 seconds. Cycling was followed by a final extension step at 72 °C for 7 minutes. The PCR products were run on 2% agarose gels, excised and gel purified with the GenElute Gel Extraction Kit (Sigma). The gel-purified PCR products were cloned using the TOPO TA Cloning Kit for Sequencing (Invitrogen).

Individual colonies resulting from the transformation into chemically competent One Shot TOP10 Escherichia coli (Invitrogen) were picked and patch plated for sequencing. Rolling circle amplification and Sanger sequencing of the clones were performed by GeneWiz (New Jersey). All clones were sequenced with the M13 reverse primer. Due to its larger size, the 947 nt S fragment clones were also sequenced with the M13 forward primer.

Major and minor variant calling

We used the Galaxy SARS-CoV-2 variant calling pipeline for paired-end Illumina ARTIC amplicon data22. Briefly, the workflow performs quality control, masks primer sites, maps reads to reference using BWA-mem, calls variants using lofreq, annotates them using SNPEff, and outputs tabular variant call files, thresholded on minimum allele frequency of 0.05. These variants are further visualized in a custom ObservableHQ notebook ( For Table 2, we explored the data for diagnostic alleles with frequency below 0.05, but found none.

Alignment and phylogenetic inference

SARS-CoV-2 genomes were downloaded from GISAID on 5 September 202141. Genomes assigned to the B.1.1.7 or B.1.429 Pangolin lineage42 were aligned to the Wuhan-Hu-1 reference genome using the --6merpair option in Mafft v7.46443. We further refined these alignments to include only those genomes sharing specific synapomorphies with the major and minor allelic variants from NYCPHL-002461: C2110T, C14120T, C19390T, and T7984C for B.1.1.7 viruses and C8947T, C12110T, and C10641T for B.1.429 viruses. Separate maximum likelihood phylogenetic trees for B.1.1.7 (n = 3655) and B.1.429 (n = 2275) were inferred in IQTree2 v2.1.3 under a GTR + F + I model, with the additional NNI search option and a minimum branch length of 1e-9 substitutions/site44.

Molecular clock inference

To determine whether the major and minor allelic variants were contemporaneous with the date of sampling, we estimated clock trees for the B.1.1.7 and B.1.429 phylogenies in TreeTime v0.8.045. We fixed the clock rate to 8 × 10−4 substitutions/site/year under a skyline coalescent model (per NextStrain default parameters for SARS-CoV-2). These trees were also used to infer the time to most recent common ancestor of the allelic variants and their closest relatives.

Four-gamete tests for recombination

When sequences evolve in the absence of both recombination and convergent mutations it is expected that, for any pair of polymorphic sites where one of the sites has either nucleotide X or Y and the other has either nucleotide A or B, no more than three of the four possible combinations of nucleotides (or gametes) at the two sites (i.e. XA, XB, YA and YB) should ever be observed. Given that in reality convergent mutations are always possible, four gamete tests of recombination attempt to detect situations where the numbers of site pairs where all four combinations of nucleotides are observed exceed that expected due to convergent mutations in the absence of recombination. We tested 15 multiple sequence alignments, each containing all observed unique read fragment sequences spanning the S-gene (12 fragments) and N-gene (3 fragments) with three different four gamete tests: (1) the PHI test (implemented in RDP524,41) which considers sites with more than two alternative nucleotide states and uses a permutation-based test to determine whether detected site pairs displaying all four gametes display a degree of spatial clustering along the sequence that is significantly higher than would be expected in the absence of recombination; (2) the MCL recombination detection test (implemented in the pairwise component of the LDHat package25) which uses an approximate maximum likelihood method to infer the population scaled recombination rate needed to explain the observed numbers of site pairs with four gametes and then tests for significant deviation of the inferred recombination rate from zero using a permutation test, and (3) the RvsDist test (implemented in pairwise component of LDHat) which determines the correlation between the R2 measure of linkage disequilibrium between site pairs with four gametes with the physical distance in nucleotides between the site pairs25 and uses a permutation test to detect significant deviations from the expected degree of correlation in the absence of recombination. For both the MCL and RvsDist tests we used a minor allele frequency cutoff of 0.01.

Population level recombination detection

We analyzed the 27,806 genomes sequenced by the NYC PHL and Pandemic Response Lab (PRL) from specimens collected from NYC residents. These genomes were aligned to the Wuhan-Hu-1 reference genome (Genbank accession NC_045512.2; using MAFFT v7.453 (options --auto --keeplength -addfragments)43.

To determine whether there was any onward transmission of a major-minor strain recombinant, we used 3SEQ v.1.729 as a statistical test for recombination in the NYC data. 3SEQ interrogates triplets of sequences for signals of mosaicism in a sequence given two ‘parental’ sequences. We interrogated each of the 27,806 NYC PHL and PRL-generated genomes for mosaicism given the major and minor strains as parents. The resulting p-values are Dunn-Sidak corrected for multiple comparisons (n = 55,612), and we tested for mosaicism at p-value thresholds 0.05 and 0.25. The single nucleotide differences between a putative recombinant and the major and minor strains were visualized using snipit (

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.