Main

In December 2019, a cluster of pneumonia cases epidemiologically linked to an open-air live animal market in the city of Wuhan (Hubei Province), China1,2 led local health officials to issue an epidemiological alert to the Chinese Center for Disease Control and Prevention and the World Health Organization’s (WHO) China Country Office. In early January, the aetiological agent of the pneumonia cases was found to be a coronavirus3, subsequently named SARS-CoV-2 by an International Committee on Taxonomy of Viruses (ICTV) Study Group4 and also named hCoV-19 by Wu et al.5. The first available sequence data6 placed this novel human pathogen in the Sarbecovirus subgenus of Coronaviridae7, the same subgenus as the SARS virus that caused a global outbreak of >8,000 cases in 2002–2003. By mid-January 2020, the virus was spreading widely within Hubei province and by early March SARS-CoV-2 was declared a pandemic8.

In outbreaks of zoonotic pathogens, identification of the infection source is crucial because this may allow health authorities to separate human populations from the wildlife or domestic animal reservoirs posing the zoonotic risk9,10. If stopping an outbreak in its early stages is not possible—as was the case for the COVID-19 epidemic in Hubei—identification of origins and point sources is nevertheless important for containment purposes in other provinces and prevention of future outbreaks. When the first genome sequence of SARS-CoV-2, Wuhan-Hu-1, was released on 10 January 2020 (GMT) on Virological.org by a consortium led by Zhang6, it enabled immediate analyses of its ancestry. Across a large region of the virus genome, corresponding approximately to ORF1b, it did not cluster with any of the known bat coronaviruses indicating that recombination probably played a role in the evolutionary history of these viruses5,7. Subsequently a bat sarbecovirus—RaTG13, sampled from a Rhinolophus affinis horseshoe bat in 2013 in Yunnan Province—was reported that clusters with SARS-CoV-2 in almost all genomic regions with approximately 96% genome sequence identity2. Zhou et al.2 concluded from the genetic proximity of SARS-CoV-2 to RaTG13 that a bat origin for the current COVID-19 outbreak is probable. Concurrent evidence also proposed pangolins as a potential intermediate species for SARS-CoV-2 emergence and suggested them as a potential reservoir species11,12,13.

Unlike other viruses that have emerged in the past two decades, coronaviruses are highly recombinogenic14,15,16. Influenza viruses reassort17 but they do not undergo homologous recombination within RNA segments18,19, meaning that origins questions for influenza outbreaks can always be reduced to origins questions for each of influenza’s eight RNA segments. For coronaviruses, however, recombination means that small genomic subregions can have independent origins, identifiable if sufficient sampling has been done in the animal reservoirs that support the endemic circulation, co-infection and recombination that appear to be common. Here, we analyse the evolutionary history of SARS-CoV-2 using available genomic data on sarbecoviruses. We demonstrate that the sarbecoviruses circulating in horseshoe bats have complex recombination histories as reported by others15,20,21,22,23,24,25,26. Despite the SARS-CoV-2 lineage’s acquisition of residues in its Spike (S) protein’s receptor-binding domain (RBD) permitting the use of human ACE2 (ref. 27) receptors and its RBD being genetically closer to a pangolin virus than to RaTG13 (refs. 11,12,13,22,28)—a signal that suggests recombination—the divergence patterns in the S protein do not show evidence of recombination between the lineage leading to SARS-CoV-2 and known sarbecoviruses. Our results indicate the presence of a single lineage circulating in bats with properties that allowed it to infect human cells, as previously described for bat sarbecoviruses related to the first SARS-CoV lineage29,30,31.

To gauge the length of time this lineage has circulated in bats, we estimate the time to the most recent common ancestor (TMRCA) of SARS-CoV-2 and RaTG13. We use three bioinformatic approaches to remove the effects of recombination, and we combine these approaches to identify putative non-recombinant regions that can be used for reliable phylogenetic reconstruction and dating. Collectively our analyses point to bats being the primary reservoir for the SARS-CoV-2 lineage. While it is possible that pangolins, or another hitherto undiscovered species, may have acted as an intermediate host facilitating transmission to humans, current evidence is consistent with the virus having evolved in bats resulting in bat sarbecoviruses that can replicate in the upper respiratory tract of both humans and pangolins25,32.

Results

Recombination analysis and identification of breakpoint-free genome regions

Among the 68 sequences in the aligned sarbecovirus sequence set, 67 show evidence of mosaicism (3SEQ14; all Dunn–Sidak-corrected P < 4 × 10⁻⁴), indicating involvement in homologous recombination either directly with identifiable parentals or in their deeper shared evolutionary history, that is, due to shared ancestral recombination events. This is evidence for numerous recombination events occurring in the evolutionary history of the sarbecoviruses22,33; specifying all past events in their correct temporal order34 is challenging and not shown here. Figure 1a shows the distribution of all identified breakpoints (using 3SEQ’s exhaustive triplet search) by the number of candidate recombinant sequences supporting them. The histogram allows for the identification of non-recombining regions (NRRs) by revealing regions with no breakpoints. Sorting these breakpoint-free regions (BFRs) by length results in two segments >5 kb: an ORF1a subregion spanning nucleotides (nt) 3,625–9,150 and the first half of ORF1b spanning nt 13,291–19,628 (sequence numbering given in Source Data, https://github.com/plemey/SARSCoV2origins). Eight other BFRs >500 nt were identified, and the regions were named BFR A–J in order of length. Of the nine breakpoints defining these ten BFRs, four showed phylogenetic incongruence (PI) signals with bootstrap support >80%, adopting previously published criteria on using a combination of mosaic and PI signals to show evidence of past recombination events19. All four of these breakpoints were also identified with the tree-based recombination detection method GARD35.

Fig. 1: Breakpoints identified by 3SEQ.

a, Breakpoints identified by 3SEQ illustrated by percentage of sequences (out of 68) that support a particular breakpoint position. Note that breakpoints can be shared between sequences if they are descendants of the same recombination events. Pink, green and orange bars show BFRs, with region A (nt 13,291–19,628) showing two trimmed segments yielding region A′ (nt 13,291–14,932, 15,405–17,162, 18,009–19,628). Regions B and C span nt 3,625–9,150 and 9,261–11,795, respectively. Concatenated region A′BC is NRR1. Open reading frames are shown above the breakpoint plot, with the variable-loop region indicated in the S protein. b, Similarity plot between SARS-CoV-2 and several selected sequences including RaTG13 (black), SARS-CoV (pink) and two pangolin sequences (orange). The shaded region corresponds to the S protein. c, Maximum likelihood phylogenetic trees rooted on a 2007 virus sampled in Kenya (BtKy72; root truncated from images), shown for five BFRs of the sarbecovirus alignment. Nucleotide positions for phylogenetic inference are 147–695, 962–1,686 (first tree), 3,625–9,150 (second tree, also BFR B), 9,261–11,795 (third tree, also BFR C), 12,443–19,638 (fourth tree) and 23,631–24,633, 24,795–25,847, 27,702–28,843 and 29,574–30,650 (fifth tree). Relevant bootstrap values are shown on branches, and grey-shaded regions show sequences exhibiting phylogenetic incongruence along the genome. S. China corresponds to Guangxi, Yunnan, Guizhou and Guangdong provinces. N. China corresponds to Jilin, Shanxi, Hebei and Henan provinces, and the N. China clade also includes one sequence sampled in Hubei Province in 2004.

The extent of sarbecovirus recombination history can be illustrated by five phylogenetic trees inferred from BFRs or concatenated adjacent BFRs (Fig. 1c). BFRs were concatenated if no phylogenetic incongruence signal could be identified between them. When viewing the last 7 kb of the genome, a clade of viruses from northern China appears to cluster with sequences from southern Chinese provinces but, when inspecting trees from different parts of ORF1ab, the N. China clade is phylogenetically separated from the S. China clade. Individual sequences such as RpShaanxi2011, Guangxi GX2013 and two sequences from Zhejiang Province (CoVZXC21/CoVZC45), as previously shown22,25, have strong phylogenetic recombination signals because they fall on different evolutionary lineages (with bootstrap support >80%) depending on what region of the genome is being examined.

Despite the high frequency of recombination among bat viruses, the block-like nature of the recombination patterns across the genome permits retrieval of a clean subalignment for phylogenetic analysis. Conservatively, we combined the three BFRs >2 kb identified above into non-recombining region 1 (NRR1). Removal of five sequences that appear to be recombinants and two small subregions of BFR A was necessary to ensure that there were no phylogenetic incongruence signals among or within the three BFRs. Alternatively, combining 3SEQ-inferred breakpoints, GARD-inferred breakpoints and the necessity of PI signals for inferring recombination, we can use the 9.9-kb region spanning nucleotides 11,885–21,753 (NRR2) as a putative non-recombining region; this approach is breakpoint-conservative because it is conservative in identifying breakpoints but not conservative in identifying non-recombining regions. Using a third consensus-based approach for identifying recombinant regions in individual sequences—with six different recombination detection methods in RDP5 (ref. 36)—gives a putative recombination-free alignment that we call non-recombinant alignment 3 (NRA3) (see Methods).

All three approaches to removal of recombinant genomic segments point to a single ancestral lineage for SARS-CoV-2 and RaTG13. Two other bat viruses (CoVZXC21 and CoVZC45) from Zhejiang Province fall on this lineage as recombinants of the RaTG13/SARS-CoV-2 lineage and the clade of Hong Kong bat viruses sampled between 2005 and 2007 (Fig. 1c). Specifically, progenitors of the RaTG13/SARS-CoV-2 lineage appear to have recombined with the Hong Kong clade (with inferred breakpoints at 11.9 and 20.8 kb) to form the CoVZXC21/CoVZC45-lineage. Sibling lineages to RaTG13/SARS-CoV-2 include a pangolin sequence sampled in Guangdong Province in March 2019 and a clade of pangolin sequences from Guangxi Province sampled in 2017.

Because the SARS-CoV-2 S protein has been implicated in past recombination events or possibly convergent evolution12, we specifically investigated several subregions of the S protein: the N-terminal domain of S1, the C-terminal domain of S1, the variable-loop region of the C-terminal domain, and S2. The variable-loop region in SARS-CoV-2 shows closer identity to the 2019 pangolin coronavirus sequence than to the RaTG13 bat virus, supported by phylogenetic inference (Fig. 2). On first examination this would suggest that SARS-CoV-2 is a recombinant of an ancestor of Pangolin-2019 and RaTG13, as proposed by others11,22. However, on closer inspection, the relative divergences in the phylogenetic tree (Fig. 2, bottom) show that SARS-CoV-2 is unlikely to have acquired the variable loop from an ancestor of Pangolin-2019 because these two sequences are approximately 10–15% divergent throughout the entire S protein (excluding the N-terminal domain). It is RaTG13 that is more divergent in the variable-loop region (Extended Data Fig. 1) and thus likely to be the product of recombination, acquiring a divergent variable loop from a hitherto unsampled bat sarbecovirus28. This is notable because the variable-loop region contains the six key contact residues in the RBD that give SARS-CoV-2 its ACE2-binding specificity27,37. These residues are also in the Pangolin Guangdong 2019 sequence. The most parsimonious explanation for these shared ACE2-specific residues is that they were present in the common ancestors of SARS-CoV-2, RaTG13 and Pangolin Guangdong 2019, and were lost through recombination in the lineage leading to RaTG13. This provides compelling support for the SARS-CoV-2 lineage being the consequence of a direct or nearly direct zoonotic jump from bats, because the key ACE2-binding residues were present in viruses circulating in bats.

Fig. 2: Phylogenetic relationships among SARS-CoV-2 and closely related sequences for subregions of the S protein.

SARS-CoV-2 and RaTG13 are the most closely related (their most recent common ancestor nodes denoted by green circles), except in the 222-nt variable-loop region of the C-terminal domain (bar graphs at bottom). In the variable-loop region, RaTG13 diverges considerably with the TMRCA, now outside that of SARS-CoV-2 and the Pangolin Guangdong 2019 ancestor, suggesting that RaTG13 has acquired this region from a more divergent and undetected bat lineage. The genetic distances between SARS-CoV-2 and RaTG13 (bottom) demonstrate that their relationship is consistent across all regions except for the variable loop. The genetic distances between SARS-CoV-2 and Pangolin Guangdong 2019 are consistent across all regions except the N-terminal domain, implying that a recombination event between these two sequences in this region is unlikely. Uncertainty measures are shown in Extended Data Fig. 1. NTD, N-terminal domain; CTD, C-terminal domain.
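The region-by-region comparison summarized in Fig. 2 can be illustrated with a minimal sketch. The study reads its distances from phylogenetic trees; the uncorrected p-distance used here is a simpler stand-in, and the function name `p_distance`, the toy sequences and the region boundaries are all illustrative assumptions rather than data from the paper.

```python
def p_distance(seq_a: str, seq_b: str, start: int, end: int) -> float:
    """Uncorrected proportion of differing sites between two aligned sequences.

    start and end are 1-based, inclusive alignment columns. Columns where
    either sequence has a gap or an ambiguous base are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    valid = set("ACGT")
    compared = 0
    differing = 0
    for a, b in zip(seq_a[start - 1:end].upper(), seq_b[start - 1:end].upper()):
        if a in valid and b in valid:
            compared += 1
            if a != b:
                differing += 1
    return differing / compared if compared else float("nan")


# Toy aligned sequences (hypothetical, not real sarbecovirus data).
query = "ACGTACGTACGTACGT"
other = "ACGTACGTTCGAACGT"

# Hypothetical region boundaries, standing in for subregions of the S protein.
regions = {"region_1": (1, 8), "region_2": (9, 12), "region_3": (13, 16)}
distances = {
    name: p_distance(query, other, lo, hi) for name, (lo, hi) in regions.items()
}
```

A region whose distance departs sharply from the others, as the variable loop does for RaTG13, is a candidate for having a separate evolutionary history from the rest of the gene.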

Ancestry in non-recombinant regions

Using the most conservative approach to identification of a non-recombinant genomic region (NRR1), SARS-CoV-2 forms a sister lineage with RaTG13, with genetically related cousin lineages of coronavirus sampled in pangolins in Guangdong and Guangxi provinces (Fig. 3). Given that these pangolin viruses are ancestral to the progenitor of the RaTG13/SARS-CoV-2 lineage, it is more likely that they are also acquiring viruses from bats. While pangolins could be acting as intermediate hosts for bat viruses to get into humans—they develop severe respiratory disease38 and commonly come into contact with people through trafficking—there is no evidence that pangolin infection is a requirement for bat viruses to cross into humans.

Fig. 3: Maximum likelihood trees of the sarbecoviruses using the two longest BFRs, rooted on the Kenya/Bulgaria lineage.

Region A has been shortened to A′ (5,017 nt) based on potential recombination signals within the region. Region B is 5,525 nt long. Sequences are colour-coded by province according to the map. Five example sequences with incongruent phylogenetic positions in the two trees are indicated by dashed lines.

Phylogenies of subregions of NRR1 depict an appreciable degree of spatial structuring of the bat sarbecovirus population across different regions (Fig. 3). One geographic clade includes viruses from provinces in southern China (Guangxi, Yunnan, Guizhou and Guangdong), with its major sister clade consisting of viruses from provinces in northern China (Shanxi, Henan, Hebei and Jilin) as well as Hubei Province in central China and Shaanxi Province in northwestern China. Several of the recombinant sequences in these trees show that recombination events do occur across geographically divergent clades. The Sichuan (SC2018) virus appears to be a recombinant of northern/central and southern viruses, while the two Zhejiang viruses (CoVZXC21 and CoVZC45) appear to carry a recombinant region from southern or central China.

TMRCA for NRRs of SARS-CoV-2 lineage

To avoid artefacts due to recombination, we focused on NRR1 and NRR2 and the recombination-masked alignment NRA3 to infer time-measured evolutionary histories. Visual exploration using TempEst39 indicates that there is no evidence for temporal signal in these datasets (Extended Data Fig. 2). This is not surprising for diverse viral populations with relatively deep evolutionary histories. In such cases, even moderate rate variation among long, deep phylogenetic branches will substantially impact expected root-to-tip divergences over a sampling time range that represents only a small fraction of the evolutionary history40. However, formal testing using marginal likelihood estimation41 does provide some evidence of a temporal signal, albeit with limited log Bayes factor support of 3 (NRR1), 10 (NRR2) and 3 (NRA3); see Supplementary Table 1.
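The visual check for temporal signal described above amounts to a linear regression of each tip's genetic distance from the root against its sampling date: the slope approximates the evolutionary rate and the x-intercept approximates the root age. The sketch below shows only that regression. It is not TempEst itself (which also searches for the best-fitting root position), and the dates and divergences are made-up numbers chosen to lie on a perfect line.

```python
from typing import Optional, Sequence, Tuple


def root_to_tip_regression(
    dates: Sequence[float], divergences: Sequence[float]
) -> Tuple[float, float, float, Optional[float]]:
    """Ordinary least-squares fit of root-to-tip divergence on sampling date.

    Returns (slope, intercept, r_squared, root_date). root_date is the
    x-intercept of the fitted line, or None when the slope is not positive
    (that is, when there is no usable temporal signal).
    """
    if len(dates) != len(divergences) or len(dates) < 2:
        raise ValueError("need at least two (date, divergence) pairs")
    n = len(dates)
    mean_t = sum(dates) / n
    mean_d = sum(divergences) / n
    sxx = sum((t - mean_t) ** 2 for t in dates)
    syy = sum((d - mean_d) ** 2 for d in divergences)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, divergences))
    if sxx == 0:
        raise ValueError("all sampling dates are identical")
    slope = sxy / sxx
    intercept = mean_d - slope * mean_t
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    root_date = -intercept / slope if slope > 0 else None
    return slope, intercept, r_squared, root_date


# Hypothetical tips: divergence grows by 0.001 substitutions per site per year.
example_dates = [2000.0, 2005.0, 2010.0]
example_divergences = [0.010, 0.015, 0.020]
slope, intercept, r_squared, root_date = root_to_tip_regression(
    example_dates, example_divergences
)
```

When the sampling window is short relative to the depth of the tree, as in the sarbecovirus datasets, rate variation among branches swamps this trend and the fitted slope becomes uninformative, which is why the study turns to informative rate priors.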

In the absence of a strong temporal signal, we sought to identify a suitable prior rate distribution to calibrate the time-measured trees by examining several coronaviruses sampled over time, including HCoV-OC43, MERS-CoV and SARS-CoV genomes. These datasets were subjected to the same recombination masking approach as NRA3 and were characterized by a strong temporal signal (Fig. 4), but also by markedly different evolutionary rates. Specifically, using a formal Bayesian approach42 (see Methods), we estimate a fast evolutionary rate (0.00169 substitutions per site yr⁻¹, 95% highest posterior density (HPD) interval (0.00131,0.00205)) for SARS viruses sampled over a limited timescale (1 year), a slower rate (0.00078 (0.00063,0.00092) substitutions per site yr⁻¹) for MERS-CoV on a timescale of about 4 years and the slowest rate (0.00024 (0.00019,0.00029) substitutions per site yr⁻¹) for HCoV-OC43 over almost five decades. These differences reflect the fact that rate estimates can vary considerably with the timescale of measurement, a frequently observed phenomenon in viruses known as time-dependent evolutionary rates41,43,44. Over relatively shallow timescales, such differences can primarily be explained by varying selective pressure, with mildly deleterious variants being eliminated more strongly by purifying selection over longer timescales44,45,46. Consistent with this, we estimate a concomitantly decreasing non-synonymous-to-synonymous substitution rate ratio over longer evolutionary timescales: 1.41 (1.20,1.68), 0.35 (0.30,0.41) and 0.133 (0.129,0.136) for SARS, MERS-CoV and HCoV-OC43, respectively. In light of these time-dependent evolutionary rate dynamics, a slower rate is appropriate for calibration of the sarbecovirus evolutionary history. We compare both MERS-CoV- and HCoV-OC43-centred prior distributions (Extended Data Fig. 3) to examine the sensitivity of date estimates to this prior specification.

Fig. 4: Temporal signal and mean evolutionary rate estimates for coronaviruses HCoV-OC43, MERS and SARS.

a–c, Root-to-tip (RtT) divergence as a function of sampling time for the three coronavirus evolutionary histories unfolding over different timescales (HCoV-OC43 (n = 37; a), MERS (n = 35; b) and SARS (n = 69; c)). Decimal years are shown on the x axis for the 1.2 years of SARS sampling in c. d, Mean evolutionary rate estimates plotted against sampling time range for the same three datasets (represented by the same colour as the data points in their respective RtT divergence plots), as well as for the comparable NRA3 using the two different priors for the rate in the Bayesian inference (red points).

We infer time-measured evolutionary histories using a Bayesian phylogenetic approach while incorporating rate priors based on mean MERS-CoV and HCoV-OC43 rates and with standard deviations that allow for more uncertainty than the empirical estimates for both viruses (see Methods). Using both prior distributions, this results in six highly similar posterior rate estimates for NRR1, NRR2 and NRA3, centred around 0.00055 substitutions per site yr⁻¹. The fact that these estimates lie between the rates for MERS-CoV and HCoV-OC43 is consistent with the intermediate sampling time range of about 18 years (Fig. 5). The consistency of the posterior rates for the different prior means also implies that the data do contribute to the evolutionary rate estimate, despite the fact that a temporal signal was not visually apparent (Extended Data Fig. 2). Below, we report divergence time estimates based on the HCoV-OC43-centred rate prior for NRR1, NRR2 and NRA3 and summarize corresponding estimates for the MERS-CoV-centred rate priors in Extended Data Fig. 4. Divergence time estimates based on the HCoV-OC43-centred rate prior for the separate BFRs (Supplementary Table 3) show consistency in TMRCA estimates across the genome.

Fig. 5: Time-measured phylogenetic estimates and divergence times for sarbecovirus lineages using an HCoV-OC43-centred rate prior.

The time-calibrated phylogeny represents a maximum clade credibility tree inferred for NRR1. Grey tips correspond to bat viruses, green to pangolin, blue to SARS-CoV and red to SARS-CoV-2. The sizes of the black internal node circles are proportional to the posterior node support. 95% credible interval bars are shown for all internal node ages. The inset represents divergence time estimates based on NRR1, NRR2 and NRA3. The boxplots show divergence time estimates (posterior medians) for SARS-CoV-2 (red) and the 2002–2003 SARS-CoV virus (blue) from their most closely related bat virus. Green boxplots show the TMRCA estimate for the RaTG13/SARS-CoV-2 lineage and its most closely related pangolin lineage (Guangdong 2019). Boxplots show interquartile ranges, white lines are medians and box whiskers show the full range of posterior distribution. Transparent bands of interquartile range width and with the same colours are superimposed to highlight the overlap between estimates. In Extended Data Fig. 4 we compare these divergence time estimates to those obtained using the MERS-CoV-centred rate priors for NRR1, NRR2 and NRA3.

The divergence time estimates for SARS-CoV-2 and SARS-CoV from their respective most closely related bat lineages are reasonably consistent among the three approaches we use to eliminate the effects of recombination in the alignment. Using the most conservative approach (NRR1), the divergence time estimate for SARS-CoV-2 and RaTG13 is 1969 (95% HPD: 1930–2000), while that between SARS-CoV and its most closely related bat sequence is 1962 (95% HPD: 1932–1988); see Fig. 5. These are in general agreement with estimates using NRR2 and NRA3, which result in divergence times of 1982 (1948–2009) and 1948 (1879–1999), respectively, for SARS-CoV-2, and estimates of 1952 (1906–1989) and 1970 (1932–1996), respectively, for the divergence time of SARS-CoV from its closest known bat relative. The SARS-CoV divergence times are somewhat earlier than dates previously estimated15 because previous estimates were obtained using a collection of SARS-CoV genomes from human and civet hosts (as well as a few closely related bat genomes), which implies that evolutionary rates were predominantly informed by the short-term SARS outbreak scale and probably biased upwards. Indeed, the rates reported by these studies are in line with the short-term SARS rates that we estimate (Fig. 4). The estimated divergence times for the pangolin virus most closely related to the SARS-CoV-2/RaTG13 lineage range from 1851 (1730–1958) to 1877 (1746–1986), indicating that these pangolin lineages were acquired from bat viruses divergent to those that gave rise to SARS-CoV-2. Current sampling of pangolins does not implicate them as an intermediate host.
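To build intuition for why a rate near 0.00055 substitutions per site per year places these splits decades in the past, a strict-clock back-of-envelope calculation can help. This is not the relaxed-clock Bayesian inference used in the study. It assumes a single constant rate on both branches, so that the pairwise distance d satisfies d = r[(t_a − T) + (t_b − T)] for sampling years t_a and t_b and divergence year T. The function name and the pairwise distance of 0.055 are hypothetical values chosen for illustration; only the rate and the sampling years come from the text.

```python
def strict_clock_divergence_year(
    distance: float, rate: float, year_a: float, year_b: float
) -> float:
    """Divergence year of two sampled sequences under a strict molecular clock.

    distance is in substitutions per site, rate in substitutions per site per
    year. The total branch length separating the two tips, expressed in years,
    is distance / rate; it is shared between the two branches leading back
    from the sampling years to the common ancestor.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    if distance < 0:
        raise ValueError("distance cannot be negative")
    total_branch_years = distance / rate
    return (year_a + year_b - total_branch_years) / 2.0


# Rate: posterior centre reported in the text. Distance: HYPOTHETICAL value.
# Sampling years: 2019 for SARS-CoV-2 and 2013 for RaTG13, as stated in the text.
estimate = strict_clock_divergence_year(
    distance=0.055, rate=0.00055, year_a=2019.0, year_b=2013.0
)
```

With these inputs the estimate falls in the mid-1960s. The wide HPD intervals reported above reflect uncertainty in the rate, in the tree and in among-branch rate variation, none of which this point calculation captures.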

Discussion

Identifying the origins of an emerging pathogen can be critical during the early stages of an outbreak, because it may allow for containment measures to be precisely targeted at a stage when the number of daily new infections is still low. Early detection via genomics was not possible during Southeast Asia’s initial outbreaks of avian influenza H5N1 (1997 and 2003–2004) or the first SARS outbreak (2002–2003). By 2009, however, rapid genomic analysis had become a routine component of outbreak response. The 2009 influenza pandemic and subsequent outbreaks of MERS-CoV (2012), H7N9 avian influenza (2013), Ebola virus (2014) and Zika virus (2015) were met with rapid sequencing and genomic characterization. For the current pandemic, the ‘novel pathogen identification’ component of outbreak response delivered on its promise, with viral identification and rapid genomic analysis providing a genome sequence and confirmation, within weeks, that the December 2019 outbreak first detected in Wuhan, China was caused by a coronavirus3. Unfortunately, a response that would achieve containment was not possible. Given what was known about the origins of SARS, as well as identification of SARS-like viruses circulating in bats that had binding sites adapted to human receptors29,30,31, appropriate measures should have been in place for immediate control of outbreaks of novel coronaviruses. The key to successful surveillance is knowing which viruses to look for and prioritizing those that can readily infect humans47.

The difficulty in inferring reliable evolutionary histories for coronaviruses is that their high recombination rate48,49 violates the assumption of standard phylogenetic approaches because different parts of the genome have different histories. To begin characterizing any ancestral relationships for SARS-CoV-2, NRRs of the genome must be identified so that reliable phylogenetic reconstruction and dating can be performed. Evolutionary rate estimation can be profoundly affected by the presence of recombination50. Because there is no single accepted method of inferring breakpoints and identifying clean subregions with high certainty, we implemented several approaches to identifying three classic statistical signals of recombination: mosaicism, phylogenetic incongruence and excessive homoplasy51. Our most conservative approach attempted to ensure that putative NRRs had no mosaic or phylogenetic incongruence signals. A second breakpoint-conservative approach was conservative with respect to breakpoint identification, but this means that it is accepting of false-negative outcomes in breakpoint inference, resulting in less certainty that a putative NRR truly contains no breakpoints. A third approach attempted to minimize the number of regions removed while also minimizing signals of mosaicism and homoplasy. The origins we present in Fig. 5 (NRR1) are conservative in the sense that NRR1 is more likely to be non-recombinant than NRR2 or NRA3. Because the estimated rates and divergence dates were highly similar in the three datasets analysed, we conclude that our estimates are robust to the method of identifying a genome’s NRRs.

Due to the absence of temporal signal in the sarbecovirus datasets, we used informative prior distributions on the evolutionary rate to estimate divergence dates. Calibration of priors can be performed using other coronaviruses (SARS-CoV, MERS-CoV and HCoV-OC43), but estimated rates vary with the timescale of sample collection. In the presence of time-dependent rate variation, a widely observed phenomenon for viruses43,44,52, slower prior rates appear more appropriate for sarbecoviruses that currently encompass a sampling time range of about 18 years. Our approach resulted in similar posterior rates using two different prior means, implying that the sarbecovirus data do inform the rate estimate even though a root-to-tip temporal signal was not apparent.

The relatively fast evolutionary rate means that it is most appropriate to estimate shallow nodes in the sarbecovirus evolutionary history. Accurate estimation of ages for deeper nodes would require adequate accommodation of time-dependent rate variation. While such models have recently been made available, we lack the information to calibrate the rate decline over time (for example, through internal node calibrations44). As a proxy, it would be possible to model the long-term purifying selection dynamics as a major source of time-dependent rates43,44,52, but this is beyond the scope of the current study. The assumption of long-term purifying selection would imply that coronaviruses are in endemic equilibrium with their natural host species, horseshoe bats, to which they are presumably well adapted. While there is evidence of positive selection in the sarbecovirus lineage leading to RaTG13/SARS-CoV-2 (ref. 53), this is inferred to have occurred before the divergence of RaTG13 and SARS-CoV-2 and thus should not influence our inferences.

Of importance for future spillover events is the appreciation that SARS-CoV-2 has emerged from the same horseshoe bat subgenus that harbours SARS-like coronaviruses. Another similarity between SARS-CoV and SARS-CoV-2 is their divergence time (40–70 years ago) from currently known extant bat virus lineages (Fig. 5). This long divergence period suggests there are unsampled virus lineages circulating in horseshoe bats that have zoonotic potential due to the ancestral position of the human-adapted contact residues in the SARS-CoV-2 RBD. Without better sampling, however, it is impossible to estimate whether or how many of these additional lineages exist. While there is involvement of other mammalian species—specifically pangolins for SARS-CoV-2—as a plausible conduit for transmission to humans, there is no evidence that pangolins are facilitating adaptation to humans. A hypothesis of snakes as intermediate hosts of SARS-CoV-2 was posited during the early epidemic phase54, but we found no evidence of this55,56; see Extended Data Fig. 5.

With horseshoe bats currently the most plausible origin of SARS-CoV-2, it is important to consider that sarbecoviruses circulate in a variety of horseshoe bat species with widely overlapping species ranges57. Nevertheless, the viral population is largely spatially structured according to provinces in the south and southeast on one lineage, and provinces in the centre, east and northeast on another (Fig. 3). This boundary appears to be rarely crossed. Two exceptions can be seen in the relatively close relationship of Hong Kong viruses to those from Zhejiang Province (with two of the latter, CoVZC45 and CoVZXC21, identified as recombinants) and a recombinant virus from Sichuan for which part of the genome (region B of SC2018 in Fig. 3) clusters with viruses from provinces in the centre, east and northeast of China. SARS-CoV-2 and RaTG13 are also exceptions because they were sampled from Hubei and Yunnan, respectively. The fact that they are geographically relatively distant is in agreement with their somewhat distant TMRCA, because the spatial structure suggests that migration between their locations may be uncommon. From this perspective, it may be useful to perform surveillance for more closely related viruses to SARS-CoV-2 along the gradient from Yunnan to Hubei.

It is clear from our analysis that viruses closely related to SARS-CoV-2 have been circulating in horseshoe bats for many decades. The unsampled diversity descended from the SARS-CoV-2/RaTG13 common ancestor forms a clade of bat sarbecoviruses with generalist properties—with respect to their ability to infect a range of mammalian cells—that facilitated its jump to humans and may do so again. Although the human ACE2-compatible RBD was very likely to have been present in a bat sarbecovirus lineage that ultimately led to SARS-CoV-2, this RBD sequence has hitherto been found in only a few pangolin viruses. Furthermore, the other key feature thought to be instrumental in the ability of SARS-CoV-2 to infect humans—a polybasic cleavage site insertion in the S protein—has not yet been seen in another close bat relative of the SARS-CoV-2 virus.

The existing diversity and dynamic process of recombination amongst lineages in the bat reservoir demonstrate how difficult it will be to identify viruses with potential to cause major human outbreaks before they emerge. This underscores the need for a global network of real-time human disease surveillance systems, such as that which identified the unusual cluster of pneumonia in Wuhan in December 2019, with the capacity to rapidly deploy genomic tools and functional studies for pathogen identification and characterization.

Methods

Dataset compilation

Sarbecovirus data

Complete genome sequence data were downloaded from GenBank and ViPR; accession numbers of all 68 sequences are available in Supplementary Table 4. Sequences were aligned by MAFFT58 v.7.310, with a final alignment length of 30,927, and used in the analyses below.

HCoV-OC43

We compiled a dataset including 27 human coronavirus OC43 virus genomes and ten related animal virus genomes (six bovine, three white-tailed deer and one canine virus). The canine viral genome was excluded from the Bayesian phylogenetic analyses because temporal signal analyses (see below) indicated that it was an outlier.

MERS-CoV

We extracted a similar number (n = 35) of genomes from a MERS-CoV dataset analysed by Dudas et al.59 using the phylogenetic diversity analyser tool60 (v.0.5).

SARS-CoV

We compiled a set of 69 SARS-CoV genomes including 58 sampled from humans and 11 sampled from civets and raccoon dogs. This dataset comprises an updated version of that used in Hon et al.15 and includes a cluster of genomes sampled in late 2003 and early 2004, but the evolutionary rate estimate without this cluster (0.00175 substitutions per site yr–1 (0.00117,0.00229)) is consistent with the complete dataset (0.00169 substitutions per site yr–1, (0.00131,0.00205)).

Sarbecovirus, HCoV-OC43 and SARS-CoV data were assembled from GenBank to be as complete as possible, with sampling year as an inclusion criterion. MERS-CoV data were subsampled to match sample sizes with SARS-CoV and HCoV-OC43.

Recombination analysis

Because coronaviruses are known to be highly recombinant, we used three different approaches to identify non-recombinant regions for use in our Bayesian time-calibrated phylogenetic inference.

First, we took an approach that relies on identification of mosaic regions (via 3SEQ14 v.1.7) that are also supported by PI signals19. Because 3SEQ is the most statistically powerful of the mosaic methods61, we used it to identify the best-supported breakpoint history for each potential child (recombinant) sequence in the dataset. A single 3SEQ run on the genome alignment resulted in 67 out of 68 sequences supporting some recombination in the past, with multiple candidate breakpoint ranges listed for each putative recombinant. Next, we (1) collected all breakpoints into a single set, (2) complemented this set to generate a set of non-breakpoints, (3) grouped non-breakpoints into contiguous BFRs and (4) sorted these regions by length. A phylogenetic tree was inferred for each BFR >500 nt using RAxML v.8.2.8 (refs. 62,63), the GTR + Γ model and 100 bootstrap replicates.
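Steps (1) to (4) amount to simple interval arithmetic, which can be sketched in Python as follows. This is an illustration only, not the code used in this study: the function name `breakpoint_free_regions`, the representation of each candidate breakpoint as an inclusive 1-based (start, end) range and the toy coordinates are all assumptions, and the input does not follow the 3SEQ output format.

```python
def breakpoint_free_regions(breakpoint_ranges, alignment_length, min_length=500):
    """Return breakpoint-free regions (BFRs) longer than min_length, longest first.

    breakpoint_ranges: iterable of inclusive, 1-based (start, end) ranges
    (an assumed input format, not the 3SEQ output format).
    """
    # (1) Collect every position covered by any candidate breakpoint range.
    breakpoints = set()
    for start, end in breakpoint_ranges:
        breakpoints.update(range(start, end + 1))

    # (2)-(3) Walk the alignment and group non-breakpoint positions
    # into contiguous runs.
    regions = []
    run_start = None
    for pos in range(1, alignment_length + 1):
        if pos not in breakpoints:
            if run_start is None:
                run_start = pos
        elif run_start is not None:
            regions.append((run_start, pos - 1))
            run_start = None
    if run_start is not None:
        regions.append((run_start, alignment_length))

    # (4) Sort by length (longest first) and keep regions longer than min_length.
    regions.sort(key=lambda r: r[1] - r[0] + 1, reverse=True)
    return [r for r in regions if r[1] - r[0] + 1 > min_length]


# Toy example with made-up breakpoint ranges on a 5,000-nt alignment.
toy_ranges = [(1000, 1100), (1050, 1200), (4000, 4010)]
print(breakpoint_free_regions(toy_ranges, alignment_length=5000))
# -> [(1201, 3999), (1, 999), (4011, 5000)]
```

Overlapping breakpoint ranges are merged implicitly because positions are pooled in a set before the complement is taken.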

We considered (1) the possibility that BFRs could be combined into larger non-recombinant regions and (2) the possibility of further recombination within each BFR.

We named the length-sorted BFRs as: BFR A (nt positions 13,291–19,628, length = 6,338 nt), BFR B (nt positions 3,625–9,150, length = 5,526 nt), BFR C (nt positions 9,261–11,795, length = 2,535 nt), BFR D (nt positions 27,702–28,843, length = 1,142 nt) and six further regions (E–J). Phylogenetic trees and exact breakpoints for all ten BFRs are shown in Supplementary Figs. 1–10. Regions A–C had similar phylogenetic relationships among the southern China bat viruses (Yunnan, Guangxi and Guizhou provinces), the Hong Kong viruses, northern Chinese viruses (Jilin, Shanxi, Hebei and Henan provinces, including Shaanxi), pangolin viruses and the SARS-CoV-2 lineage. Because these subclades had different phylogenetic relationships in region D (Supplementary Fig. 4), that region and shorter BFRs were not included in combined putative non-recombinant regions.

Regions A–C were further examined for mosaic signals by 3SEQ, and all showed signs of mosaicism. In region A, we removed subregion A1 (nt positions 3,872–4,716 within region A) and subregion A4 (nt 1,642–2,113) because both showed PI signals with other subregions of region A. After removal of A1 and A4, we named the new region A′. In addition, sequences NC_014470 (Bulgaria 2008), CoVZXC21, CoVZC45 and DQ412042 (Hubei-Yichang) needed to be removed to maintain a clean non-recombinant signal in A′. Region B showed no PI signals within the region, except one including sequence SC2018 (Sichuan), and thus this sequence was also removed from the set. Region C showed no PI signals within it. Combining regions A′, B and C and removing the five named sequences gives us putative NRR1, as an alignment of 63 sequences. We say that this approach is conservative because sequences and subregions generating recombination signals have been removed, and BFRs were concatenated only when no PI signals could be detected between them. The construction of NRR1 is the most conservative as it is least likely to contain any remaining recombination signals.

In our second approach, we wanted to construct non-recombinant regions where our approach to breakpoint identification was as conservative as possible. We call this approach breakpoint-conservative, but note that this has the opposite effect to the construction of NRR1 in that this approach is the most likely to allow breakpoints to remain inside putative non-recombining regions. In other words, a true breakpoint is less likely to be called as such (this is breakpoint-conservative), and thus the construction of a non-recombining region may contain true recombination breakpoints (with insufficient evidence to call them as such). In this approach, we considered a breakpoint as supported only if it had three types of statistical support: from (1) mosaic signals identified by 3SEQ, (2) PI signals identified by building trees around 3SEQ’s breakpoints and (3) the GARD algorithm35, which identifies breakpoints by identifying PI signals across proposed breakpoints. Because 3SEQ identified ten BFRs >500 nt, we used GARD’s (v.2.5.0) inference on 10, 11 and 12 breakpoints. A reduced sequence set of 25 sequences chosen to capture the breadth of diversity in the sarbecoviruses (obvious recombinants not involving the SARS-CoV-2 lineage were also excluded) was used because GARD is computationally intensive. GARD identified eight breakpoints that were also within 50 nt of those identified by 3SEQ. PI signals were identified (with bootstrap support >80%) for seven of these eight breakpoints: positions 1,684, 3,046, 9,237, 11,885, 21,753, 22,773 and 24,628. Using these breakpoints, the longest putative non-recombining segment (nt 11,885–21,753) is 9.9 kb long, and we call this region NRR2.
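The two mechanical steps of this approach, matching breakpoints across methods and choosing the longest interval between supported breakpoints, can be illustrated with a short Python sketch. The function names and the reading of "within 50 nt" as an absolute difference of at most 50 are assumptions made for illustration; only the seven breakpoint positions and the alignment length of 30,927 are taken from the text.

```python
def consensus_breakpoints(gard, threeseq, tolerance=50):
    """Keep GARD breakpoints lying within `tolerance` nt of any 3SEQ breakpoint."""
    return sorted(g for g in gard if any(abs(g - t) <= tolerance for t in threeseq))


def longest_segment(breakpoints, alignment_length):
    """Return the (start, end) pair of consecutive breakpoints that are furthest apart.

    The alignment start (1) and end are treated as segment boundaries.
    """
    edges = [1] + sorted(breakpoints) + [alignment_length]
    segments = list(zip(edges[:-1], edges[1:]))
    return max(segments, key=lambda s: s[1] - s[0])


# The seven supported breakpoints reported in the text.
supported = [1684, 3046, 9237, 11885, 21753, 22773, 24628]
start, end = longest_segment(supported, alignment_length=30927)
print(start, end, end - start)
# -> 11885 21753 9868
```

The longest interval spans 9,868 nt, in line with the approximately 9.9 kb reported for NRR2.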

Our third approach involved identifying breakpoints and masking minor recombinant regions (with gaps, which are treated as unobserved characters in probabilistic phylogenetic approaches). Specifically, we used a combination of six methods implemented in v.5.5 of RDP5 (ref. 36) (RDP, GENECONV, MaxChi, Bootscan, SisScan and 3SEQ) and considered recombination signals detected by more than two methods for breakpoint identification. Except for specifying that sequences are linear, all settings were kept to their defaults. Based on the identified breakpoints in each genome, only the major non-recombinant region was kept in each genome while other regions were masked. To evaluate the performance of this procedure, we confirmed that the recombination masking resulted in (1) a markedly different outcome of the PHI test64, (2) removal of well-supported (bootstrap value >95%) incompatible splits in Neighbor-Net65 and (3) a near-complete reduction of mosaic signal as identified by 3SEQ. If the latter still identified non-negligible recombination signal, we removed additional genomes that were identified as major contributors to the remaining signal. This produced non-recombining alignment NRA3, which included 63 of the 68 genomes.
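The masking step can be pictured with the following minimal Python sketch. It is not RDP5 functionality: the function name `mask_minor_regions`, the event tuples and the toy alignment are hypothetical, and the sketch assumes that each recombination event has already been summarized as a sequence identifier, an inclusive 1-based region and the set of methods that detected it.

```python
def mask_minor_regions(alignment, events, min_methods=3, gap="-"):
    """Replace supported minor recombinant regions with gap characters.

    alignment: dict mapping sequence name -> aligned sequence string.
    events: iterable of (seq_id, start, end, methods) tuples, where start/end
            are 1-based inclusive and methods is a collection of method names
            (a hypothetical format, not RDP5 output).
    min_methods: 3 encodes "detected by more than two methods".
    """
    masked = {name: list(seq) for name, seq in alignment.items()}
    for seq_id, start, end, methods in events:
        if len(set(methods)) < min_methods:
            continue  # insufficient support: leave the region unmasked
        for i in range(start - 1, end):
            masked[seq_id][i] = gap
    return {name: "".join(chars) for name, chars in masked.items()}


# Toy alignment and events (made up for illustration).
toy_alignment = {"s1": "ACGTACGTAC", "s2": "ACGTACGTAC"}
toy_events = [
    ("s1", 3, 6, {"RDP", "MaxChi", "3SEQ"}),  # three methods: masked
    ("s2", 1, 4, {"RDP", "GENECONV"}),        # two methods: kept
]
print(mask_minor_regions(toy_alignment, toy_events))
# -> {'s1': 'AC----GTAC', 's2': 'ACGTACGTAC'}
```

Because gaps are treated as unobserved characters in probabilistic phylogenetic approaches, masked sites contribute no information for the affected genome while the remaining sites are retained.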

Bayesian divergence time estimation

We focused on these three non-recombining regions/alignments for divergence time estimation; this avoids inappropriate modelling of evolutionary processes with recombination on strictly bifurcating trees, which can result in different artefacts such as homoplasies that inflate branch lengths and lead to apparently longer evolutionary divergence times. To examine temporal signal in the sequenced data, we plotted root-to-tip divergence against sampling time using TempEst39 v.1.5.3 based on a maximum likelihood tree. The latter was reconstructed using IQ-TREE66 v.2.0 under a general time-reversible (GTR) model with a discrete gamma distribution to model inter-site rate variation.
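The idea behind a root-to-tip analysis is an ordinary least-squares regression of divergence on sampling time: the slope gives a crude rate estimate and the x-intercept a crude estimate of the root date. The Python sketch below illustrates this under the assumption of a fixed root; it is not TempEst code, the function name is hypothetical and the four data points are synthetic.

```python
def root_to_tip_regression(dates, distances):
    """Least-squares fit of root-to-tip distance against sampling date.

    Returns (slope, root_date, r_squared), where slope is in substitutions
    per site per year if dates are in years, and root_date is the x-intercept.
    """
    n = len(dates)
    mean_t = sum(dates) / n
    mean_d = sum(distances) / n
    sxx = sum((t - mean_t) ** 2 for t in dates)
    syy = sum((d - mean_d) ** 2 for d in distances)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, distances))
    slope = sxy / sxx
    intercept = mean_d - slope * mean_t
    r_squared = sxy ** 2 / (sxx * syy)
    root_date = -intercept / slope
    return slope, root_date, r_squared


# Synthetic, perfectly clock-like data: 0.001 subs/site/yr from a root in 1990.
dates = [2000, 2005, 2010, 2015]
distances = [0.001 * (t - 1990) for t in dates]
slope, root_date, r_squared = root_to_tip_regression(dates, distances)
print(f"rate={slope:.4f} root={root_date:.1f} R2={r_squared:.3f}")
```

A low coefficient of determination, or tips lying far from the regression line, would flag weak temporal signal or outlier sequences.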

Time-measured phylogenetic reconstruction was performed using a Bayesian approach implemented in BEAST42 v.1.10.4. When the genomic data included both coding and non-coding regions we used a single GTR + Γ substitution model; for concatenated coding genes we partitioned the alignment by codon position and specified an independent GTR + Γ model for each partition with a separate gamma model to accommodate inter-site rate variation. We used an uncorrelated relaxed clock model with log-normal distribution for all datasets, except for the low-diversity SARS data for which we specified a strict molecular clock model. For the HCoV-OC43, MERS-CoV and SARS datasets we specified flexible skygrid coalescent tree priors. In the absence of any reasonable prior knowledge on the TMRCA of the sarbecovirus datasets (which is required for grid specification in a skygrid model), we specified a simpler constant size population prior. As informative rate priors for the analysis of the sarbecovirus datasets, we used two different normal prior distributions: one with a mean of 0.00078 and s.d. = 0.00075 and one with a mean of 0.00024 and s.d. = 0.00025. These means are based on the mean rates estimated for MERS-CoV and HCoV-OC43, respectively, while the standard deviations are set ten times higher than empirical values to allow greater prior uncertainty and avoid strong bias (Extended Data Fig. 3). In our analyses of the sarbecovirus datasets, we incorporated the uncertainty of the sampling dates when exact dates were not available. To estimate non-synonymous over synonymous rate ratios for the concatenated coding genes, we used the empirical Bayes ‘renaissance counting’ procedure67. Temporal signal was tested using a recently developed marginal likelihood estimation procedure41 (Supplementary Table 1).
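Partitioning a coding alignment by codon position simply means splitting its columns into three interleaved sets, each of which then receives its own substitution model. A minimal Python sketch follows, assuming an in-frame alignment that starts at the first codon position; the function name and toy sequences are hypothetical, and in practice such partitions are defined in the BEAST configuration rather than by splitting sequences by hand.

```python
def partition_by_codon_position(alignment):
    """Split an in-frame coding alignment into three codon-position partitions.

    alignment: dict mapping sequence name -> aligned coding sequence.
    Returns a list of three dicts, for codon positions 1, 2 and 3.
    """
    return [
        {name: seq[offset::3] for name, seq in alignment.items()}
        for offset in range(3)
    ]


# Toy two-codon alignment (made up for illustration).
toy = {"a": "ATGGCC", "b": "ATGGCT"}
pos1, pos2, pos3 = partition_by_codon_position(toy)
print(pos1, pos2, pos3)
# -> {'a': 'AG', 'b': 'AG'} {'a': 'TC', 'b': 'TC'} {'a': 'GC', 'b': 'GT'}
```

In this toy case the only variable site falls in the third-position partition, which is the kind of rate difference between codon positions that separate models are intended to accommodate.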

Posterior distributions were approximated through Markov chain Monte Carlo sampling, with chains run sufficiently long to ensure effective sample sizes >100. BEAST inferences made use of the BEAGLE v.3 library68 for efficient likelihood computations. We used TreeAnnotator to summarize posterior tree distributions and annotated the estimated values to a maximum clade credibility tree, which was visualized using FigTree.
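For readers unfamiliar with the effective sample size (ESS) criterion, the Python sketch below shows one common textbook estimator, which divides the chain length by an autocorrelation time summed up to the first non-positive lag. This is a generic illustration and not necessarily the estimator used by the BEAST tool chain; the function name and the two toy chains are assumptions.

```python
def effective_sample_size(samples):
    """Estimate ESS as n / (1 + 2 * sum of positive-lag autocorrelations).

    The sum is truncated at the first lag whose autocorrelation is <= 0
    (a simple truncation rule chosen for this sketch).
    """
    n = len(samples)
    mean = sum(samples) / n
    centred = [x - mean for x in samples]
    variance = sum(c * c for c in centred) / n
    if variance == 0:
        raise ValueError("ESS is undefined for a constant chain")
    rho_sum = 0.0
    for lag in range(1, n):
        rho = sum(centred[i] * centred[i + lag] for i in range(n - lag)) / (n * variance)
        if rho <= 0:
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)


# An alternating chain has no positive autocorrelation, so ESS equals n;
# a steadily drifting chain is highly autocorrelated, so ESS is far below n.
print(effective_sample_size([0, 1] * 50))
print(effective_sample_size(list(range(100))))
```

A chain of many thousands of states can therefore still fall short of the ESS >100 threshold if successive states are strongly correlated.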

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.