Main

On 31 December 2019, the Wuhan Municipal Health Commission reported an outbreak of pneumonia on its official website. Subsequently, scientists reported the discovery of a previously undescribed coronavirus obtained from samples of the respiratory system of some of these patients. This virus differed from all known coronaviruses including severe acute respiratory syndrome (SARS) coronavirus (SARS-CoV) and Middle East respiratory syndrome (MERS) coronavirus (MERS-CoV)1,2,3,4,5. The World Health Organization (WHO) named the disease coronavirus disease 2019 (COVID-19) and the International Committee on Taxonomy of Viruses named this new infectious agent SARS-CoV-2 (ref. 6), the seventh coronavirus that can infect humans. SARS-CoV-2 rapidly spread across the globe, producing several variants of concern (VOCs) and developing into a major and devastating pandemic. Here we summarize our current understanding of the emergence, global spread and genetic diversity of SARS-CoV-2.

The emergence of SARS-CoV-2

SARS-CoV-2 related coronaviruses

Many of the early cases of COVID-19 in Wuhan, China, were associated with the Huanan Seafood Market2, which—because of the presence of wildlife at the market—was considered an obvious candidate for the location of the initial zoonotic (that is, cross-species transmission) event. However, none of the animals from the market (including rabbits, snakes, stray cats, badgers and bamboo rats) tested positive for SARS-CoV-2 (ref. 7), and viral genome sequences of environmental samples from the market were not considered to occupy basal positions on the viral phylogeny (although the position of the rooting on the tree is uncertain)8. In addition, some of the early cases of COVID-19 in Wuhan were not epidemiologically linked to the market9, and some were linked to other markets10,11. Therefore, although it has not been resolved fully, the current evidence suggests that the Huanan Seafood Market could be the location of an early ‘superspreading’ event.

From the earliest genomic comparisons, it was clear that SARS-CoV-2 had a genomic organization similar to SARS-CoV (ref. 2). The spike proteins of both viruses have similar three-dimensional structures, suggesting that these viruses might use the same cell surface receptor, human angiotensin-converting enzyme 2 (ACE2) (ref. 2); this was soon confirmed in vitro4,12 and using structural biology12,13. However, SARS-CoV-2 differs from SARS-CoV in two fundamental ways14. First, there are six amino acid positions in the receptor-binding domain (RBD) of the spike protein that mediate the attachment of the SARS-CoV and SARS-CoV-2 spike proteins to the human ACE2 receptor15. However, amino acids at five of the six positions differed between SARS-CoV and SARS-CoV-2 (refs. 2,14). Notably, such differences caused SARS-CoV-2 to have a higher binding avidity to the human ACE2 receptor11, and may have contributed to the higher transmissibility of SARS-CoV-2 compared with SARS-CoV. Second, there is a 12-nucleotide (nt) insertion at the cleavage site of the spike protein of SARS-CoV-2 that has not yet been identified in closely related betacoronaviruses, but that has a complex evolutionary history across the coronaviruses as a whole, indicating that it is evolutionarily volatile16. This insertion encodes four amino acids (PRRA) that can be recognized by the protease furin, which is extensively expressed in different tissues and organs17. This insertion may decrease the overall stability of the SARS-CoV-2 spike, thereby facilitating the adoption of the open conformation that is required for the binding of the spike to human ACE2 (ref. 18); SARS-CoV-2 without this furin-cleavage site shows reduced replication in a human respiratory cell line and was attenuated in laboratory animals19. Notably, amino acid substitutions have been documented at all four positions in the PRRA motif, with a P-to-H substitution (HRRA) identified in more than 487,000 viral genomes as of June 2021.

SARS-CoV-2—like many other members of the genus Betacoronavirus (including SARS-CoV) in the Coronaviridae family—seemingly has its evolutionary roots in those viruses that commonly infect bats2. Not surprisingly, shortly after the identification of SARS-CoV-2, a close relative of SARS-CoV-2 was described; RaTG13 was identified from a bat (Rhinolophus affinis) sample obtained in Yunnan Province, China, in 2013 (ref. 4). Notably, this sample was collected from a mine cave to which four workers were sent to clean bat faeces and who subsequently developed severe pneumonia20. Although RaTG13 exhibits 96.2% sequence identity to SARS-CoV-2 at the scale of the whole genome, it does not possess similar RBD or cleavage-site sequences. Further analyses suggest that RaTG13—rather than SARS-CoV-2—was a recombinant virus, and the two virus lineages probably diverged more than 30 years ago21. Therefore, the SARS-CoV-2 RBD was an ancestral trait shared with bat viruses21.

Subsequently, a number of groups reported the identification of SARS-CoV-2-related coronaviruses in Malayan pangolins (Manis javanica), which were smuggled to Guangxi and Guangdong provinces, China22,23. These pangolin coronavirus genomes exhibited 85.5–92.4% sequence similarity to SARS-CoV-2 (ref. 22). Notably, however, these pangolin-derived coronaviruses formed two sublineages, with the Guangdong sublineage clustering with RaTG13 and SARS-CoV-2 and sharing 97.4% amino-acid similarity to SARS-CoV-2 in the RBD, with identical amino acids at the five critical residues of the RBD. Furthermore, the Guangdong pangolins appeared to have a similar disease manifestation to people with COVID-19 (ref. 24). Therefore, although the role—if any—of pangolins in the origin of SARS-CoV-2 and the ecology of coronaviruses in general is unknown, it is clear that coronaviruses exist in wildlife and that these viruses possess SARS-CoV-2-like RBDs and have a high binding avidity to hACE2.

Furthermore, a previously undescribed bat coronavirus, RmYN02, was reported, which had been collected during routine surveillance of Rhinolophus malayanus bats in Yunnan Province on 25 June 2019 (ref. 25). RmYN02 shared 97.2% sequence identity with SARS-CoV-2 in open-reading frame (ORF) 1ab. ORF1ab is the largest ORF in coronaviruses, with a length of approximately 21,300 nt. In June 2021, we reported four SARS-CoV-2-related coronavirus genomes from Yunnan Province26. Of these, RpYN06, found in Rhinolophus pusillus, exhibited 94.5% sequence identity to SARS-CoV-2. However, the genome (excluding the spike gene, which has a history of recombination) had a similarity to SARS-CoV-2 of 97.2%, making it the closest related genomic backbone to SARS-CoV-2 identified to date. The other three SARS-CoV-2-related coronaviruses were more distantly related to SARS-CoV-2. However, they carried genetically distinct spike genes encoding proteins that could bind to the human ACE2 receptor in vitro, albeit weakly.

SARS-CoV-2-like coronaviruses have also been identified in bat populations from other parts of Asia, including Japan27, Cambodia28 and Thailand29. Notably, although two betacoronaviruses (STT182 and STT200) from Rhinolophus shameli bats sampled in 2010 from Cambodia shared 92.6% nucleotide identity with SARS-CoV-2 across the genome as a whole, they share five of the six critical RBD sites observed in SARS-CoV-2 and the Guangdong pangolin coronavirus28. In September 2021, a preprint described a number of SARS-CoV-2-related coronaviruses identified in Laos, including BANAL-52 from R. malayanus, BANAL-103 from R. pusillus and BANAL-236 from Rhinolophus marshalli, which only possessed one or two amino acid mismatches at the seventeen residues that interact with human ACE2 (ref. 29). In particular, the RBDs of these viruses could bind as efficiently to the human ACE2 protein as could the SARS-CoV-2 Wuhan strain from the early stage of the pandemic.

Emergence pathways of SARS-CoV-2

There are several hypotheses regarding the origin and emergence of SARS-CoV-2 that have been thoroughly clarified in the WHO–China joint report7. These contradictory hypotheses have fuelled ongoing debate, centred on two competing hypotheses: zoonotic emergence (including direct zoonotic introduction or introduction through an intermediate host) and a laboratory escape. The discovery of a growing number of SARS-CoV-2-related coronaviruses from wild animals provides evidence for a zoonotic origin of SARS-CoV-2 (refs. 4,22,23,25,26,27,28,29,30). Notably, none of the SARS-CoV-2-related coronaviruses mentioned above is the direct ancestor of SARS-CoV-2. Any such direct ancestral virus, which has yet to be identified, would be expected to exhibit more than 99% similarity to SARS-CoV-2 across the genome as a whole. However, the discovery of these viruses again highlights that more-closely related viruses in bats and other wildlife species will be identified with enhanced sampling in a broader geographical region, including most parts of Southeast Asia, which has a high diversity of Rhinolophus species26. As bat coronaviruses have seldom been found to transmit efficiently among humans without adaptation and repeated human–animal contacts10, introduction through an intermediate host, such as raccoon dogs, is more likely than a direct zoonotic introduction.

Whether SARS-CoV-2 was introduced through a laboratory accident or whether it has been genetically manipulated is highly debatable. After a thorough analysis of the genetic characterizations of SARS-CoV-2 from both the early and later stages of the pandemic, as well as its close relatives from wild animals, many researchers in the global scientific community have reached the consensus that SARS-CoV-2 is unlikely to have escaped a laboratory and there is no scientific evidence that SARS-CoV-2 has been genetically manipulated10. However, the exact spillover event and emergence process of SARS-CoV-2 is still unclear, and more information from the earliest stage of the epidemic is clearly important to understand how SARS-CoV-2 came into contact with people.

Global genetic diversity of SARS-CoV-2

Genomic surveillance of SARS-CoV-2

Mutations are a natural part of the replication cycle of any RNA virus, leading to the diversification of viral lineages when coupled with inter-host transmission. This is also true for SARS-CoV-2, even though coronaviruses contain certain proofreading mechanisms that enhance genome fidelity31. Genomic surveillance has generated an unprecedented amount of sequencing data for a single virus (Box 1), and has proven an essential tool32,33 for tracing the spread of SARS-CoV-2 at various scales, from individual transmission events to the intercontinental spread of the virus. In addition, it has had a central role in monitoring the evolution of SARS-CoV-2 and identifying new variants with enhanced transmissibility and/or pathogenicity, decreased susceptibility to therapeutic agents, or the ability to evade natural or vaccine-induced immunity (Fig. 1). Genomic surveillance has proved effective for tracking local transmission events and recognizing importation sources and superspreading events in Australia34,35, for informing public-health decision-making in the Netherlands36, and for adopting social-distancing measures to reduce viral spread in Israel37. In January 2021, du Plessis and colleagues described the analysis of 50,887 SARS-CoV-2 genomes38, quantifying the viral genetic structure of the UK epidemic at a fine scale, including the size, spatiotemporal origins and persistence of lineages as well as the effect of intervention measures.

Fig. 1: Phylogenetic tree of SARS-CoV-2 lineages globally and the temporal distribution of major sequence variants.

The phylogenetic analysis was performed using full-length genome sequences of SARS-CoV-2 collected from GISAID as of 12 May 2021. A maximum likelihood tree of 1,715 representative high-quality SARS-CoV-2 sequences carrying specific accumulative mutations was estimated using RAxML (ref. 159), with 1,000 bootstrap replicates and the GTR nucleotide substitution model. The major VOCs (Alpha to Delta) are shown in orange, and the major variants of interest (Epsilon to Lambda) are shown in purple. Both the thickness of each branch in the phylogenetic tree and the shading from light to dark in the heat map indicate the number of sequences carrying specific sets of mutations. Specific nucleotide substitutions are highlighted on the major branches of the tree. The branches with the D614G substitution are coloured blue.

Below, we use Guangdong Province, China and the USA as examples to illustrate how genomic surveillance has facilitated our understanding of this pandemic.

Guangdong, China

Guangdong is a populous province in Southeast China, with a resident population of more than 100 million people. After the SARS-CoV outbreak, believed to have originated in Guangdong39, long-term reforms in public-health agencies have greatly improved the infrastructure and enhanced the capacity of disease control and prevention. The first case of COVID-19 in Guangdong had symptom onset on 1 January and was reported on 19 January 2020 (refs. 11,40). Like many other Chinese provinces, Guangdong experienced three phases (domestic importation, local community transmission and international importation), with an epidemic peak in early February 2020 (ref. 40). Large-scale surveillance (around 1.6 million tests by 19 March 2020 identifying 1,388 cases of COVID-19) and intervention measures were implemented from the beginning of the outbreak, and after 22 February 2020 no more than one case a day was reported40. The genomic epidemiology of SARS-CoV-2 in Guangdong showed that most of the infections before March were imported from Hubei Province, and, in particular, Wuhan. Although some early cases were caused by community transmission, local transmission chains were limited both in size and duration40. These results highlight the efficacy of intensive testing and contact tracing even in such a densely populated urban region. Intensive surveillance also identified two SARS-CoV-2 variants with deletions in the spike gene41. In addition, the Guangdong Centers for Disease Control and Prevention (CDC) successfully identified the imported Alpha and Beta variants on 2 January 2021 (ref. 42) and 6 January 2021 (ref. 43), respectively.

The USA

The first case of COVID-19 in the USA (sequence WA1) was reported on 20 January 2020 in a traveller from Wuhan44. By 15 February 2020, the number of laboratory-confirmed and clinically diagnosed cases of COVID-19 had reached 15 (ref. 45). By combining multiple sources of information, Worobey and colleagues showed that transmission of the WA1 lineage (belonging to lineage A) was successfully contained, and the subsequent larger outbreaks in Washington state might have been caused by multiple independent introductions of the virus from China in late January or early February 2020 (ref. 46). However, evidence from various studies revealed that the early viruses that were present between 29 February and 18 March 2020 in New York City were imported from Europe and other parts of the USA by multiple, independent introductions47. In addition, cryptic transmission and a prolonged period of unrecognized community spread have been documented in northern California48, Washington state49 and New York City50 from late January to March 2020. For example, SARS-CoV-2 sequences sampled from Connecticut during 6–14 March 2020 group with those from Washington state, highlighting long-distance domestic transmission51. Genomic surveillance in Dane and Milwaukee counties in Wisconsin between March and April 2020 provided evidence for reduced viral spread after a state-wide ‘safer at home’ order52. Together, these genomic surveillance studies clearly illustrate the early transmission of SARS-CoV-2 and highlight the efficacy of intensive testing, contact tracing and decreasing public gatherings in containing the spread of SARS-CoV-2.

Mutational diversity of SARS-CoV-2

By January 2021, approximately 25,000 out of the 29,800 sites (the length of the complete SARS-CoV-2 genome) had been shown to carry mutational differences (https://bigd.big.ac.cn/ncov/), and it has been estimated that approximately two mutations are fixed in the SARS-CoV-2 genome per month46,53,54. Although most of these mutations represent standard replication errors, host-dependent RNA editing may also shape the short- and long-term evolution of SARS-CoV-2. Indeed, the SARS-CoV-2 genome is characterized by frequent biased C-to-U hypermutation that is probably due to a human APOBEC-like editing process55,56.
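As a rough illustration of the figures quoted above, the genome-wide estimate of about two fixed mutations per month can be converted into a per-site annual rate. The sketch below uses only the numbers given in the text (29,800 sites, 25,000 variable sites, two mutations per month) and assumes a 12-month year; the function and constant names are illustrative and do not come from the cited studies.

```python
GENOME_LENGTH_NT = 29_800    # genome length quoted in the text
VARIABLE_SITES = 25_000      # sites reported to carry mutational differences
MUTATIONS_PER_MONTH = 2.0    # approximate fixation rate quoted in the text


def substitutions_per_site_per_year(mutations_per_month: float, genome_length: int) -> float:
    """Convert a genome-wide monthly rate into a per-site annual rate."""
    return mutations_per_month * 12 / genome_length


def fraction_of_sites_variable(variable_sites: int, genome_length: int) -> float:
    """Share of genome positions with at least one recorded difference."""
    return variable_sites / genome_length


rate = substitutions_per_site_per_year(MUTATIONS_PER_MONTH, GENOME_LENGTH_NT)
fraction = fraction_of_sites_variable(VARIABLE_SITES, GENOME_LENGTH_NT)
print(f"{rate:.2e} substitutions per site per year")
print(f"{fraction:.1%} of sites with at least one recorded difference")
```

Under these assumptions the rate comes out at roughly 8 × 10^-4 substitutions per site per year, and about 84% of sites carry at least one recorded difference.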

Similar to other coronaviruses, the spike protein of SARS-CoV-2 contains important antigen epitopes57,58. As such, mutations in the spike protein will probably affect the receptor-binding efficiency, potentially lead to immune escape and may even weaken vaccine efficacy. The first notable mutation was A23403G, which caused the D614G amino acid substitution in the spike protein. This mutation might have arisen separately as early as late January 2020 in China and later in Europe, representing an interesting case of convergent evolution, and the frequency of this mutation greatly increased during the outbreak in Europe59,60. There is now compelling evidence that D614G has increased virus infectivity and transmissibility59,60,61,62,63,64, and molecular epidemiological studies suggest that this mutation increased the basic reproduction number (R0) from 3.1 (614D) to 4.0 (614G)60. In addition, a so-called ‘cluster V’ (also called B.1.1.298) SARS-CoV-2 variant was identified in Danish mink that also carried mutations in the spike protein, including Y453F, I692V, M1229I and the deletion of two amino acids (Δ69–Δ70)16,65 (Fig. 2).
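The link between the nucleotide change A23403G and the amino acid change D614G can be checked with a short calculation. The sketch below assumes that the spike coding sequence starts at position 21,563 of the reference genome NC_045512; this coordinate is not stated in the text and should be verified against the reference annotation. The helper names are hypothetical, and the codon table covers only the aspartate and glycine codons needed here.

```python
from typing import Tuple

# Assumption: 1-based start of the spike coding sequence in NC_045512.
SPIKE_CDS_START = 21_563

# Minimal codon table covering only the codons used in this example.
CODON_TABLE = {"GAT": "D", "GAC": "D", "GGT": "G", "GGC": "G"}


def locate_in_cds(genome_pos: int, cds_start: int) -> Tuple[int, int]:
    """Return (codon number, position within the codon), both 1-based."""
    offset = genome_pos - cds_start
    return offset // 3 + 1, offset % 3 + 1


def apply_substitution(codon: str, pos_in_codon: int, new_base: str) -> str:
    """Replace the base at a 1-based position within a codon."""
    return codon[: pos_in_codon - 1] + new_base + codon[pos_in_codon:]


codon_number, pos_in_codon = locate_in_cds(23_403, SPIKE_CDS_START)
print(codon_number, pos_in_codon)

# Either aspartate codon becomes a glycine codon after A-to-G at this position.
for asp_codon in ("GAT", "GAC"):
    mutated = apply_substitution(asp_codon, pos_in_codon, "G")
    print(asp_codon, CODON_TABLE[asp_codon], "->", mutated, CODON_TABLE[mutated])
```

With the assumed start coordinate, position 23,403 falls on the second base of codon 614, and an A-to-G change there turns either aspartate codon into a glycine codon, which is consistent with the D614G label.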

Fig. 2: SARS-CoV-2 spike mutations in the Alpha, Beta, Gamma, Delta and mink cluster V variants.

Three-dimensional structures were modelled with the Swiss-Model program using the spike protein of SARS-CoV-2 (PDB: 7CWU.1.G) as a template. Left, red spheres represent the mutations found in the Alpha160, Beta69, Gamma146 and Delta146 VOCs, as well as the mink cluster V variants65. The amino acid positions of all of the strains are numbered according to the template. Right, the surfaces of the six amino acid residues (L455, F486, Q493, S494, N501 and Y505) at the RBD are coloured cyan. The molecular surfaces of the mutations in the Alpha (purple), Beta (blue), Gamma (yellow), Delta (green), and mink cluster V (pink) variants are highlighted. *Not all Alpha variants have the E484K and S494P mutations. #Not all Delta variants possess the G142D mutation. It should be noted that we only use this figure to highlight the locations of the mutations in the variants based on the three-dimensional structure of one ancestral Wuhan strain (NC_045512), and this figure does not necessarily represent the true three-dimensional structure of the variants.

Not surprisingly, as the number of cases of COVID-19 continued to rise, mutational variants with a likely greater effect on fitness have also emerged, including some that might result in immune escape. Indeed, there are putative escape mutations to the ten human monoclonal antibodies that target the SARS-CoV-2 RBD66. Of particular note are the major SARS-CoV-2 VOCs that arose in late 2020: Alpha (also known as B.1.1.7 and VOC-202012/01), Beta (also known as B.1.351 and 501Y.V2), Gamma (also known as P.1) and Delta (also known as B.1.617.2); these VOCs were first identified in the UK67,68, South Africa69, Brazil70,71 and India72, respectively (Box 2 and Figs. 1, 2).

The emergence of these variant lineages has raised concerns that the virus has entered a new phase in its evolution73,74,75, characterized by ongoing immune escape in the face of increasing levels of infected hosts that probably affects vaccine efficacy, as well as the possibility of selection for increased transmission due to the imposition of nonpharmaceutical interventions (NPIs)74. The Alpha variant has been associated with increased rates of virus population growth67,68 and has been reported to be able to escape neutralization by most monoclonal antibodies targeting the N-terminal domain (NTD) of the spike protein76. However, there is no widespread escape of the Alpha variant from monoclonal antibodies or antibody responses generated by natural infection or vaccination76,77,78, such that its spread may instead reflect increased transmissibility. In particular, some of the Alpha variants acquired additional mutations in the spike protein, especially E484K, and exhibited a substantial loss of sensitivity to the neutralizing activity of vaccine-elicited antibodies and resistance to neutralization by monoclonal antibodies in COVID-19-convalescent plasma79. More worryingly, the Beta variant can escape neutralization by most RBD-targeting monoclonal antibodies and substantially escape from neutralizing antibodies from COVID-19-convalescent plasma76,80,81. Similarly, the Gamma variant shows marked decreases in neutralization with post-vaccination sera82; although, surprisingly, it is considerably less resistant to naturally acquired or vaccine-induced antibody responses than the Beta lineage83. Furthermore, neutralization of the Delta lineage is reduced when compared with ancestral circulating strains77,78, and convalescent sera from patients infected with the Beta and Gamma variants show a markedly higher reduction in neutralization of the Delta lineage77.

In addition to nucleotide substitutions, the SARS-CoV-2 genome has experienced many deletion events. For example, some viruses from Singapore and Taiwan, China carried a 382-nt deletion truncating ORF7b and covering almost the entire ORF8 sequence84,85,86. This variant showed considerably higher replicative fitness in vitro than the wild-type virus84, but seemed to be associated with a milder infection clinically85 and has not been reported in 2021. Su and colleagues described other ORF7b/8 deletions of various lengths, including viruses from Australia (138 nt), Bangladesh (345 nt) and Spain (62 nt)84. Long deletion events were also found in clinical samples from Beijing, with a 120-nt deletion in ORF7a and a 154-nt deletion in ORF8 (ref. 87).

Global spread of SARS-CoV-2

Initial spread of SARS-CoV-2 in China

Generally, China experienced three distinct phases of SARS-CoV-2 transmission: (1) the initial rapid spread in Wuhan; (2) seeding from Wuhan to cause community transmission in other regions of China; and (3) sporadic outbreaks caused by international importations after China controlled the first wave40,87.

Early spread of SARS-CoV-2 in Wuhan

The initial SARS-CoV-2 outbreak in Wuhan can itself be divided into three phases88: (1) rapid transmission before the implementation of the large-scale population ‘lockdown’ of the city on 23 January 2020 (ref. 9), with an estimated effective reproduction number (Re) of 3.5 (95% credible interval, 3.4–3.7) during this period89; (2) reduction of the rate of virus transmission during the period 23 January–1 February 2020 (through lockdown and home quarantine), producing an average Re of 1.2 (95% credible interval, 1.1–1.3)89; and (3) the interruption of transmission through intensified stringent interventions during 2–16 February 2020 (centralized isolation and treatment of cases of COVID-19) and 17 February–8 March 2020 (community screening). Population-based serological surveys conducted during March–May 2020 revealed that the overall seropositivity rate in Wuhan was 3.2–4.4% (refs. 90,91,92,93), indicating that many cases went undetected due to asymptomatic and mild infections and the limited laboratory-diagnosis capacity during the early stages of the outbreak89,94,95. However, city-wide nucleic acid screening of SARS-CoV-2 between 14 May and 1 June 2020 among nearly 10 million residents of Wuhan only found around 300 individuals who had asymptomatic infections after the lockdown was lifted on 8 April 2020 (ref. 96) and no symptomatic local cases related to the initial wave have been reported in the city after 10 May 2020.

Spread from Wuhan to other provinces

The coincidence of the emergence of SARS-CoV-2 and the large-scale seasonal migration (Chunyun, starting from 10 January 2020) for the Chinese Lunar New Year holiday probably exacerbated the seeding of the virus across China97,98. Movement restrictions from Wuhan, the key transportation hub in central China, commenced on 23 January 2020, and reduced the peak population numbers leaving the city 2 days before the Lunar New Year. Unfortunately, however, the disease had spread to every province in mainland China by this time99,100. In general, after the rapid implementation of stringent and integrated NPIs, the Re in provinces outside Hubei decreased below the epidemic threshold (1.0) from 8 February 2020 (ref. 101). Compared with Wuhan, the seropositivity rate in cities outside Wuhan was much lower. According to a national COVID-19 sero-epidemiological survey in China during March–May 2020 (ref. 92), only 0.44% of the sampled population in other cities of Hubei were positive, and only 2 out of more than 12,000 people outside Hubei tested positive, suggesting that SARS-CoV-2 transmission was well contained across the country during the first wave99,102,103.
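To put the seropositivity figures side by side, the short sketch below compares the rates quoted in the text: 3.2–4.4% in Wuhan (from the surveys described for the Wuhan outbreak), 0.44% in other cities of Hubei, and 2 positives among the people sampled outside Hubei. The text gives the last sample size only as "more than 12,000", so using exactly 12,000 is an assumption that makes the computed rate an upper estimate. The names are illustrative.

```python
# Seropositivity figures quoted in the text (surveys during March to May 2020).
WUHAN_RANGE = (0.032, 0.044)     # 3.2% to 4.4% in Wuhan
HUBEI_OUTSIDE_WUHAN = 0.0044     # 0.44% in other cities of Hubei
POSITIVES_OUTSIDE_HUBEI = 2
SAMPLED_OUTSIDE_HUBEI = 12_000   # assumption: the text says "more than 12,000"


def fold_difference(reference: float, other: float) -> float:
    """How many times larger the reference rate is than the other rate."""
    return reference / other


outside_hubei = POSITIVES_OUTSIDE_HUBEI / SAMPLED_OUTSIDE_HUBEI
low, high = (fold_difference(w, HUBEI_OUTSIDE_WUHAN) for w in WUHAN_RANGE)

print(f"Outside Hubei: at most about {outside_hubei:.3%}")
print(f"Wuhan versus other Hubei cities: {low:.1f}- to {high:.1f}-fold higher")
```

On these numbers, seropositivity in Wuhan was roughly 7 to 10 times that of other Hubei cities, and the rate outside Hubei was below 0.02%.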

Frequent international importation events

More than 6,000 incoming travellers from abroad who were infected with SARS-CoV-2 had been reported in mainland China by 15 June 2021, although reverse-transcriptase–polymerase-chain-reaction (RT–PCR) testing at the border control and a 14-day centralized quarantine implemented in China since March 2020 greatly reduced any transmission risk. For example, in Guangzhou, Guangdong Province in southern China, 73.5% of the imported positive cases were detected at the immigration checkpoint and 19.0% during centralized quarantine in hotels104. Although SARS-CoV-2 is predominantly associated with respiratory transmission, since June 2020, multiple Chinese provinces have detected SARS-CoV-2 RNA or live virus on packages of frozen products105. Indeed, cold-chain food or package contamination was proposed to have triggered the resurgence in Beijing in June 2020 (ref. 106) as well as other sporadic outbreaks in China105, although this warrants further investigation. It is notable that the number of confirmed cases was low in the Xinfadi outbreak, Beijing, in June 2020. Similarly, all of the COVID-19 outbreaks in China triggered by international inbound travellers were small-scale, with a few sustained cases. This was mainly due to the citywide, grid-based mass-screening protocol using RT–PCR testing107.

Intercontinental spread of SARS-CoV-2

From China to other regions

The global spread of SARS-CoV-2 shows how rapidly geographically disparate countries can be reached by an emerging pathogen108,109 (Fig. 3a). Two distinct transmission phases of international exportations of SARS-CoV-2 were identified at the early stage of the pandemic110. In the first phase, many international airline passengers left Wuhan for hundreds of destinations across the world during the two weeks before the Wuhan lockdown92. Cities across Asia, Europe and North America were the main destinations and reported several imported cases during the early stage of the COVID-19 outbreak109,111, and the WHO declared a Public Health Emergency of International Concern on 30 January 2020. Containment of the outbreak in China and, in particular, the implementation of travel restrictions since late January 2020 considerably reduced the further spread of SARS-CoV-2 outside China99,100,102,112,113.

Fig. 3: Global spread of SARS-CoV-2 and cases reported across countries.

a, The date of the first reported case of COVID-19 in each country, territory or area, though the origin of SARS-CoV-2 has not been determined for almost two years. The areas without data are shown in grey. b, Reports of VOCs (now denoted VOC Alpha to Delta) based on records published in the COVID-19 weekly epidemiological update by the WHO (https://covid19.who.int/), as of 10 August 2021. c, The seven-day rolling average of the number of confirmed cases of COVID-19 reported by continent. The orange vertical dashed line indicates the date of COVID-19 declared as a pandemic by the WHO. d, e, The weekly proportion of case number in the top-50 ranked countries with the highest number of cases of COVID-19 (d) and the available mobility data (e), as of 8 August 2021. The weekly proportion was calculated as the case count in a specific week and country, divided by the total number of cases reported in each country. e, The changes in human mobility (by 8 August 2021) in the 50 countries presented in d, compared to the normal mobility from 3 January to 6 February 2020. Each row in d and e represents a country, grouped by continent and then sorted by the latitudes of capital cities from north to south (the country list is available in Supplementary Table 1). The grey dotted vertical lines in d and e from left to right indicate the first week of April, July and October in 2020, and January, April and July in 2021, respectively. The dataset of case numbers was obtained from the data repository collated by the Johns Hopkins University (github.com/CSSEGISandData/COVID-19). The anonymized and aggregated data of population mobility in transit stations were obtained from the Google COVID-19 Community Mobility Reports (www.google.com/covid19/mobility/). The administrative boundary maps were obtained from Natural Earth (www.naturalearthdata.com).
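The two summary quantities described in this legend, the seven-day rolling average (panel c) and the weekly proportion of cases per country (panel d), can be sketched as follows. This is a minimal illustration and not the authors' code: a trailing window is assumed for the rolling average (the legend does not say whether the window is trailing or centred), and the country names and counts are made up.

```python
from typing import Dict, List


def rolling_mean(daily_cases: List[float], window: int = 7) -> List[float]:
    """Trailing rolling mean; the first window-1 days produce no value."""
    return [
        sum(daily_cases[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(daily_cases))
    ]


def weekly_proportions(weekly_cases: Dict[str, List[int]]) -> Dict[str, List[float]]:
    """Divide each week's count by that country's total over all weeks."""
    result: Dict[str, List[float]] = {}
    for country, counts in weekly_cases.items():
        total = sum(counts)
        result[country] = [c / total if total else 0.0 for c in counts]
    return result


# Hypothetical counts for two made-up countries, for illustration only.
example = {"Country A": [10, 30, 60], "Country B": [5, 5, 10]}
proportions = weekly_proportions(example)
smoothed = rolling_mean([7, 7, 7, 7, 7, 7, 7, 14])
print(proportions)
print(smoothed)
```

Because each country's weekly counts are divided by its own total, the rows of such a heat map are comparable in shape even when absolute case numbers differ by orders of magnitude.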

From Europe to other regions

However, international travel outside China from mid-February to late-March 2020 facilitated the second phase of international SARS-CoV-2 spread and onward transmissions110,114, with the epicentre quickly shifting to the Middle East115 and Europe (Fig. 3c). Although France was the first country to identify cases of COVID-19 in Europe, Italy soon became the first major hotspot in the continent111,112,116,117, whereas Spain, Belgium and the UK reported the highest numbers of deaths in Europe during the first wave118. The virus exported from Europe acted as a major source of global spread47, and the WHO eventually declared a pandemic on 11 March 2020. Countries quickly placed restrictions on flights from Europe during March–April 2020, although these measures could not fully prevent introduced transmission75,114.

By late March 2020, cases surged in the USA, with North America becoming the global epicentre119,120. By the end of 2020, the total number of confirmed cases recorded in the USA had passed 20 million, including more than 350,000 reported deaths. Although the first case of COVID-19 in the USA was reported in a traveller returning from China on 20 January 2020 (ref. 44), phylogenetic evidence suggests that importations from Europe mainly contributed to the wide spread of the virus across the country110,119. Latin America and south Asia have also been badly affected. SARS-CoV-2 was confirmed in Brazil on 25 February 2020 and a month later it was found in every state, with confirmed cases exceeding 1 million on 19 June 2020 (refs. 121,122). Although the first case of COVID-19 was confirmed in India on 30 January 2020 and the situation was seemingly under control until the end of March 2020 (ref. 123), India has reported the second highest number of cases of COVID-19 since September 2020 (ref. 124). Most African countries experienced community transmission by 31 May 2020, with most imported cases returning from Europe and the USA125, and it is believed that the disease is generally underreported across Africa due to the limited testing and healthcare capacities126,127,128,129.

Spread of secondary waves across countries

NPIs—such as travel restrictions, case isolation and contact tracing, physical distancing, face covering, hand washing and even the closures of businesses and schools—have been widely implemented to reduce the transmission of SARS-CoV-2 (refs. 112,130,131). Full or partial lockdowns during specific periods have also been imposed in many countries118. Although the effectiveness of different interventions and their combinations have varied, these measures have had an important role in the response to the first wave of the pandemic132,133.

Unfortunately, after the relaxation of these interventions, an increase in population movements and the spread of new variants with a higher transmissibility, a new wave of infections has swept through many nations since October 2020 (refs. 134,135,136; Fig. 3d, e and Supplementary Table 1). The first US wave in 2020 mainly affected the northeast of the USA137, whereas the second wave in summer 2020 mainly hit the south and west, and almost every state has seen a spike in cases during the third wave since October 2020 (ref. 138). Brazil has experienced a major second wave since November 2020 and even had a death toll second only to the USA in early 2021 (ref. 139). New SARS-CoV-2 variants also spread throughout Europe after travel resumed in summer 2020 (refs. 140,141), with the highest daily number of cases recorded in many countries between October 2020 and March 2021. After NPIs were implemented together with a second or third lockdown, and combined with ongoing and large-scale vaccination efforts, many countries passed the second wave by the end of May 2021. This has reduced the pressure on healthcare systems and given countries time to vaccinate people at the greatest risk of severe disease142.

However, the emergence and rapid spread of various SARS-CoV-2 VOCs and variants of interest (VOIs) that are more contagious and/or potentially evade immunity have triggered new waves in many countries (Fig. 3b and Extended Data Fig. 1). For example, India experienced a major second wave from March to June 2021, mostly due to the Delta variant. As of 10 August 2021, a total of 142 countries, territories and areas across the world had reported the Delta variant72 (Extended Data Fig. 1d), including countries with mass vaccination of their populations, such as the UK and Israel143. In particular, community transmission of this variant has been reported in many countries72. In mid-June 2021, the WHO declared that the Delta variant had displaced most of the other VOCs and had become the dominant lineage across the world143,144.

Challenges and outlook

Even though it is of vital importance to the prevention of future emerging infectious diseases that will inevitably affect human populations, our current understanding of the initial SARS-CoV-2 spillover event is limited. Although the closest relatives to SARS-CoV-2 are found in horseshoe bats, it remains unclear whether the virus directly moved from bats to humans or was passed through an intermediate animal host—as was the case for previous coronavirus epidemics—although the latter seems more reasonable7,10.

The genomic surveillance of SARS-CoV-2 is by far the largest pathogen genomic sequencing project undertaken, with more than 2.8 million complete genomes generated as of August 2021. This endeavour has played an essential part in the prevention and control of COVID-19 and shed light on the transmission patterns of SARS-CoV-2 at different scales, such as the time and source of the introduction events, the spatiotemporal characteristics of local spread, the role of superspreading events, and the viral factors associated with fitness, transmissibility, infectivity and disease severity. Of particular note is the identification of the major SARS-CoV-2 VOCs, as well as several variants of interest (denoted Epsilon to Lambda)145,146 that emerged in different countries and have caused an increased proportion of cases both locally and globally.

The emergence of these SARS-CoV-2 variants has shaped the complex global transmission dynamics of COVID-19. More importantly, there is mounting evidence147 that these SARS-CoV-2 variants are associated with decreased neutralizing titres in patients who recovered from COVID-19 and in vaccine recipients, and that they escape, to various degrees, neutralization by the monoclonal antibodies that target the NTD and RBD of the spike protein. However, genomic surveillance would be more informative if coupled with a system for the risk assessment and phenotyping of these mutations. For example, the infectivity and antigenicity of 106 mutations in the SARS-CoV-2 spike were assessed using pseudotyped viruses148. Deep mutational scanning has also been used to assess all single amino acid variants of the SARS-CoV-2 spike protein66,149. In addition, data are increasingly available on the antigenic variation of SARS-CoV-2 variants, carrying different sets of single amino acid mutations, with respect to monoclonal antibodies and vaccines. A risk assessment system that integrates pathogen surveillance, immune escape data and near real-time human mobility metrics is desirable, although it may be confounded by the different classes of neutralizing and NTD antibodies, vaccine strategies and even host heterogeneity.

That the major SARS-CoV-2 VOCs have reduced the efficacy of monoclonal antibodies and vaccines has posed serious challenges to the control of the COVID-19 pandemic. First, although vaccines can protect people infected with SARS-CoV-2 variants against severe disease, vaccine manufacturers are exploring redesigns of their products to obtain more effective protection and, eventually, to prevent virus transmission. Second, the suboptimal protection provided by vaccines150 and the deployment of antibody-based treatments of limited or undemonstrated efficacy151 have raised concerns that this would accelerate the emergence of new variants, although there is a strong argument for mass vaccination even if vaccines can only provide partial immunity152,153. Third, this has also raised the possibility that SARS-CoV-2 will become a recurrent seasonal infection154,155. Fourth, because vaccines cannot completely prevent transmission of the major variants, some NPIs such as face covering might have to be implemented to reduce transmission of the virus, as unlimited, large-scale spread of the variants would probably generate more new variants.

The genomic surveillance of SARS-CoV-2 is also facing several major challenges. First, despite this enormous endeavour, in reality only a very small proportion (around 2%) of cases have been sequenced. In addition, the majority of sequences come from a small number of countries and, remarkably, as of August 2021, around 50% of genomes have been generated in the UK and the USA, which have led the worldwide effort in this respect. By contrast, other countries with major outbreaks, such as India and Brazil, have sequenced a much smaller number of cases, which may cause delays in identifying variants with previously undescribed phenotypic characteristics. Therefore, it is likely that there are additional new variants that have not yet been detected given the limited genomic surveillance in a number of regions. Indeed, because the major VOCs are genetically divergent, it is possible that they have been circulating cryptically in unsampled locations, or have emerged in individuals with a chronic infection who shed the virus for extended periods156,157. Second, the complex transmission dynamics caused by different SARS-CoV-2 variants and their continuous evolution clearly necessitate increased genomic surveillance in a world with global connectivity and travel networks reshaped by the pandemic. Third, it is possible that recombination among viruses will also change the genetic structure of SARS-CoV-2, perhaps generating viruses with an altered phenotype. Indeed, there have already been suggestions of recombination between the Alpha and Epsilon variants in California in early 2021 (ref. 158). Similarly, the potential recombination of SARS-CoV-2 and other mild human coronaviruses should not be neglected.

In summary, SARS-CoV-2 has led to an increased understanding of coronavirus evolution and the virus has entered a new evolutionary phase characterized by the frequent emergence and spread of variants that affect immune escape and reduce the efficacy of vaccines. Of particular concern is that the limited genomic surveillance in many low-income countries may cause delays in identifying variants with previously undescribed phenotypic characteristics. To contain the current and future pandemics, we urgently call for closer international cooperation, increased vaccine supply and sharing, rapid information exchange, and the establishment of both the infrastructure and trained personnel required for the effective genomic surveillance of SARS-CoV-2 and other emerging viruses.