Genomic surveillance of SARS-CoV-2 in Portugal was rapidly implemented by the National Institute of Health in the early stages of the COVID-19 epidemic, in collaboration with more than 50 laboratories distributed nationwide.
By applying recent phylodynamic models that allow integration of individual-based travel history, we reconstructed and characterized the spatio-temporal dynamics of SARS-CoV-2 introductions and early dissemination in Portugal.
We detected at least 277 independent SARS-CoV-2 introductions, mostly from European countries (namely the United Kingdom, Spain, France, Italy, and Switzerland), which were consistent with the countries with the highest connectivity with Portugal. Although most introductions were estimated to have occurred during early March 2020, it is likely that SARS-CoV-2 was silently circulating in Portugal throughout February, before the first cases were confirmed.
Here we conclude that the earlier implementation of measures could have minimized the number of introductions and subsequent virus expansion in Portugal. This study lays the foundation for genomic epidemiology of SARS-CoV-2 in Portugal, and highlights the need for systematic and geographically-representative genomic surveillance.
Plain language summary
Analysing SARS-CoV-2 genetic material and how it changes over time can help us understand how the virus spreads between countries and determine the impact of control measures. In this study, we investigated SARS-CoV-2 transmission and evolution in the early stages of the COVID-19 pandemic in Portugal. In particular, we reconstructed the routes and timeliness of viral introductions into the country and assessed the relative contribution of each introduction in terms of how the epidemic evolved over time. We detected at least 277 independent introductions, mostly from European countries (namely the United Kingdom, Spain, France, Italy, and Switzerland), which were consistent with the countries with the highest connectivity with Portugal. This study reflects an unprecedented effort in the field of the infectious diseases in Portugal, highlighting the need for systematic and geographically-representative surveillance to aid public health efforts to control the virus.
SARS-CoV-2 (Severe Acute Respiratory Syndrome Coronavirus 2), the causative agent of COVID-19, is a novel betacoronavirus that was first reported in December 2019 in Wuhan, China1,2. By 29 March 2021, it had already caused more than 126 million cases and 2.7 million deaths worldwide3,4. In order to control the virus arrival and spread, many countries adopted rigid public health measures, including complete border closures and general lockdowns, with tremendous consequences at economic and social levels. At the early stages of an epidemic, the success of public health measures is particularly dependent on their timely implementation, which requires comprehensive diagnosis/surveillance systems that are able to efficiently trace where the virus is being introduced and circulating5,6,7. Taking advantage of the extraordinary advances in sequencing technologies, modern surveillance systems are progressively relying on genomic epidemiology as a crucial tool for outbreak investigation and for tracking virus evolution and spread7,8,9. Genomic surveillance of SARS-CoV-2 can be particularly useful to: (i) understand the contribution of “new introductions” versus “local transmission” to the number of new cases at continent/country/regional levels; (ii) evaluate the impact of non-pharmaceutical interventions on the outcomes of transmission chains; (iii) characterize the genetic variability that may negatively affect molecular diagnostic tests; (iv) monitor genetic variability affecting antigens and targets of antiviral drugs with potential impact on the development/effectiveness of prophylactic (vaccines) and therapeutic measures; and (v) investigate potential associations between genetic variants and infectious load, patient immunological status, clinical outcomes (e.g., infection duration, disease severity, etc.)5,10.
Acting as the National Reference Laboratory for SARS-CoV-2, the Portuguese National Institute of Health (INSA) Doutor Ricardo Jorge rapidly established a genome-based molecular surveillance strategy for SARS-CoV-2 in Portugal, setting up a large nationwide network involving more than 50 laboratories. A bilingual website (https://insaflu.insa.pt/covid19) was launched, providing updated data regarding the analysis of the SARS-CoV-2 genetic diversity and geotemporal dynamics, based on state-of-the-art methodologies for real-time tracking pathogen evolution11,12. Also, “situation reports” with major highlights are being released periodically to participating laboratories, national and regional public health authorities, and other stakeholders.
Despite all the advantages of genomic surveillance, the uneven geographic sampling of viral genomes can severely skew phylogeographic inferences based on discrete trait ancestral reconstruction13, therefore hindering the ability to accurately trace the seeding and dissemination patterns of SARS-CoV-2. The COVID-19 pandemic has been characterized by an unprecedented amount of genomic data and associated metadata, such as information on the patients’ recent movements prior to having developed any symptoms.
In the present study, we reconstruct and characterize the spatio-temporal dynamics of SARS-CoV-2 introductions and early dissemination in Portugal using newly developed phylodynamic models that allow integration of individual-based travel history, in order to obtain a more realistic reconstruction of the viral dynamics13. This includes inferences of the timelines of the first introductions, geographic location of ancestral lineages, and the contribution of detected introductions to the epidemic evolution.
Samples used in this study were collected as part of the ongoing national SARS-CoV-2 laboratory surveillance conducted by INSA, Portugal, in collaboration with Instituto Gulbenkian de Ciência (IGC). SARS-CoV-2 positive samples (either clinical specimens or RNA) were provided by a nationwide network, consisting of more than 50 laboratories, that was established at the beginning of the epidemic in Portugal. Anonymized date of sample collection, date of illness onset, and travel history were provided by laboratories and Regional and National Health Authorities. Geographical data presented in this study refers to the Region (“Health Administration region”) of the patients’ residence or, when no information was available, to the Region of exposure or of the hospital/laboratory that collected/sent the sample.
SARS-CoV-2 genome sequencing and assembly
SARS-CoV-2 positive RNA samples were subjected to genome sequencing using a whole-genome amplification strategy with tiled, multiplexed primers14 and the ARTIC Consortium protocol (https://artic.network/ncov-2019; https://www.protocols.io/view/ncov-2019-sequencing-protocol-bbmuik6w), with slight modifications, as previously described15. Analysis of sequence read data was conducted using the bioinformatics pipeline implemented in INSaFLU (https://insaflu.insa.pt/; https://github.com/INSaFLU), which is a web-based (and also locally installable) platform for amplicon-based next-generation sequencing data analysis16. Sequence inspection and validation was performed as previously described15.
Classification by clades and lineages
We explored the diversity of INSA sequences using a variety of nomenclature strategies, namely Nextstrain (using https://clades.nextstrain.org/; 9 November 2020), GISAID (https://www.gisaid.org/; 23 July 2020) and Phylogenetic Assignment of Named Global Outbreak LINeages (cov-lineages.org) (https://pangolin.cog-uk.io/; 16 October 2020)17. While Nextstrain and GISAID clade nomenclatures provide a less detailed categorisation of globally circulating diversity, cov-lineages.org classification is focused on identifying highly specific lineages that are actively transmitting in the population. Classification is provided in Supplementary material (Supplementary Data 1).
Assessment of genome sequencing by country
To assess the contribution of each country to the set of publicly available SARS-CoV-2 genomes and to determine the proportion of the number of genomes on the total number of reported COVID-19 cases (genome sampling) of a given country during the study period (until 31 March 2020), we obtained the number of cases per country from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv) and the number of genomes from GISAID (by 8 August 2020). Only the genomes with collection date until 31 March 2020 were considered. When a given genome lacked the collection day date, only specifying the month of collection, it was assigned to the last day of the respective month. When the number of genomes on a given day was higher than the number of cases, the number of cases was considered for graphical representation. Final data on the assessment of genome sampling by country is provided in Supplementary material (Supplementary Data 2).
Selecting a genomic background dataset
For the phylogenetic analyses, we downloaded full-length viral genome sequences from GISAID (https://www.gisaid.org/) on 6 August 2020 with collection dates before 1 April 2020 (Supplementary Data 2). For computational efficiency in the downstream operations, we analysed the A and B lineages separately. Multiple sequence alignments with a reference genome (MN908947.3) were performed using MAFFT v7.458 with parameter –addfragments18. Sequences with fewer than 75% unambiguous bases were excluded, as well as duplicate sequences defined as having identical nucleotide composition, collected on the same date and in the same country. The resulting dataset was trimmed at the 5′ and 3′ ends resulting in a multi-sequence alignment with 29,780 nucleotides. Sequences with date information only at the year-level were also excluded. This dataset was subjected to multiple iterations of phylogeny reconstruction using IQ-TREE multicore software version 1.6.1219 with parameters -m GTR+G, and exclusion of outlier sequences whose genetic divergence was incongruent with sampling date using TempEst software version 1.5.220, resulting in 1632 and 22,124 sequences for the A and B datasets, respectively. GISAID acknowledgment table for the background dataset is provided as Supplementary Material (Supplementary Data 3).
The magnitude of the lineage B datasets prohibits a full Bayesian inference approach in a reasonable timeframe. To overcome this constraint, we used a subsampling strategy that removes sequences such that monophyletic clusters that consist entirely of sequences from a particular country are represented by a single sequence. The excess sequences in a country-specific monophyletic clade do not contribute any additional information to the between-country diffusion process we aim to infer21. This process resulted in a dataset with 13,489 sequences (B_CS). Despite the almost 40% downsampling in the B lineage dataset, its size is still excessive for timely computational inferences. To further address this issue, we have built a phylogeny using IQ-TREE, as described previously, and partitioned the tree into six monophyletic clades (B_CS1 through B_CS6). These clades were examined for outlier sequences whose genetic divergence and sampling date were incongruent using TempEst software version 1.5.220. All data sets exhibited a positive correlation between genetic divergence and sampling time and appear to be suitable for phylogenetic molecular clock analysis20 (Supplementary Fig. 1).
Bayesian evolutionary inference of SARS-CoV-2 detected in Portugal
A total of 1275 SARS-CoV-2 genome sequences (obtained from positive samples collected until 31 March 2020) from Portugal were analysed in this study (INSA’s collection, as of 23 July 2020; Supplementary Data 1). Our interest lies in estimating the viral evolutionary history and spatial diffusion process during the early epidemics in the country. Travel history data is of particular importance when analyzing low diversity data, such as that for SARS-CoV-2, using Bayesian joint inference of sequence and location traits because sharing the same location state can contribute to the phylogenetic clustering of taxa13. For each of the datasets (A, and B_CS1 through B_CS6), we performed a joint genealogical and phylogeographic inference of time-measured trees using Markov chain Monte Carlo (MCMC) sampling implemented in the Bayesian Evolutionary Analysis Sampling Trees (BEAST) package22. We applied a Hasegawa-Kishino-Yano 85 (HKY85) 23 substitution model with gamma-distributed rate variation among sites24. We used an uncorrelated lognormal relaxed molecular clock to account for evolutionary rate variation among lineages25 and specified an exponential growth coalescent prior in our analyses.
To integrate the travel history information obtained from (returning) travelers, we followed Lemey et al.13 and augmented the phylogeny with ancestral nodes that are associated with a location state (but not with a known sequence), and enforced the ancestral location at a point in the past of a lineage. We specified normal prior distributions on the travel times informed by an estimate of time of infection and truncated to be positive (back-in-time) relative to sampling date. Specifically, we used a period of 14 days (incubation period of 99% of patients26 where travel history information was collected for all recent movements), and a period between symptom onset and testing with an estimated mean of 4.70 days for the patients in the INSA cohort (estimated from data available for 717/1275 individuals), and a standard deviation of 4.06 days to incorporate the uncertainty on the period between symptom onset and testing. The location traits associated with taxa and with the ancestral nodes were modeled using a bidirectional asymmetric discrete diffusion process27. We ran and combined at least eight independent MCMC analyses for 50 million generations, sampling every 50,000th generation and removed 10% as chain burn-in. Stationarity and mixing were investigated using Tracer software version 1.7.128, making sure that effective sample sizes for the continuous parameters were greater than 100. We used the high-performance computational capabilities of the Biowulf cluster at the National Institutes of Health (Bethesda, MD, USA) (http://biowulf.nih.gov) to perform these analyses. Portuguese clusters were assumed for phylogeographic summaries if their topology posterior probability was ≥0.001. If the excluded genomes had known travel history, they were re-integrated along with same-cluster sequences if those did not cluster in a clade-defining polytomy (recovered a total of 14 sequences, 7 of them with travel history). The location of the most recent common ancestor (MRCA) of Portuguese clusters is usually inferred as Portugal, thus we compared the estimated locations and times for the parent nodes of the MRCA (for simplicity, here on referred to as PMRCA) across Portuguese BEAST clades representing the origin and timing of seeding events into Portugal.
Real-time data sharing of SARS-CoV-2 genetic diversity and geotemporal spread in Portugal
A website (https://insaflu.insa.pt/covid19) was launched on 28 March 2020 for real-time data sharing on SARS-CoV-2 genetic diversity and geotemporal spread in Portugal. This site gives access to “situation reports of the study and provides interactive data navigation using both Nextstrain (https://nextstrain.org/)11 and Microreact (https://microreact.org/)12 tools. As of 23 July 2020, genomic and phylogenetic analysis were performed using the SARS-CoV-2 Nextstrain pipeline version from 23 March 2020 (https://github.com/nextstrain/ncov), with slight modifications15. For data navigation, an IQ-TREE-derived19 phylogenetic tree enrolling the 1275 studied sequences, and the associated metadata, can be visualized interactively at https://microreact.org/project/cM6KURnU7rUpqdAnBq5DAf/a2d3840e.
Samples were obtained in the frame of the ongoing national SARS-CoV-2 genomic surveillance coordinated by the Portuguese National Institute of Health (INSA), being collected as part of the routine clinical care and laboratory procedures of the laboratories/hospitals (“Portuguese network for SARS-CoV-2 genomics”) collaborating in this system. This study was approved by the Ethical Committee (“Comissão de Ética para a Saúde”) of INSA, dismissing the need for individuals’ informed consent. Designations of all genome sequences are fully anonymized, and no identifying information of the associated patients is provided. Anonymized date of sample collection, date of illness onset and travel history were provided by laboratories and Regional and National Health Authorities.
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Epidemiological trends and circulating diversity during the early COVID-19 pandemic in Portugal
The first COVID-19 confirmed cases in Portugal were reported on 2 March 2020 after laboratory confirmation by the National Institute of Health (INSA) Doutor Ricardo Jorge. As the COVID-19 epidemic progressed in the country, INSA, as the National Reference Laboratory, gradually supported and validated the extension of the laboratory network throughout the country, where 37 laboratories were already set up to perform molecular testing of SARS-CoV-2 by March 31, 2020. With COVID-19 cases exponentially increasing in Portugal (Fig. 1a) and Europe (with alarming trends in Italy and Spain)4, Portugal adopted rigid public health measures, including suspension of flights from/to Italy (10 March 2020), closure of land borders and schools (16 March 2020), suspension of flights to non-EU countries and general lockdown (18 March 2020). INSA rapidly implemented and coordinated the nationwide genomic surveillance of SARS-CoV-2 (https://insaflu.insa.pt/covid19) with a particular focus on providing a comprehensive picture of the introductions, genetic diversification, and propagation of SARS-CoV-2 during the early-stage pandemic at a country scale. A total of 1275 SARS-CoV-2 positive samples collected in March were successfully subjected to virus genome sequencing (Supplementary Data 1), which corresponds to 15.5% (1275/8251) of all COVID-19 cases confirmed during the first month of the pandemic in Portugal (Fig. 1b). The viral genomic sequence sampling ranged from 10.7% (523/4910) in the Northern Health Administration region (the region with the highest number of confirmed cases) to 85.4% (41/48) in Madeira Archipelago (the region with the lowest number of confirmed cases) (Fig. 1c). Despite the delayed epidemic trajectory in comparison with other European countries, Portugal was among the countries generating the highest volumes of SARS-CoV-2 genomic data during the early COVID-19 pandemic (Fig. 1d and Supplementary Data 2).
The relative frequency trends of SARS-CoV-2 clades and lineages in Portugal during March 2020 (Supplementary Fig. 2) did not differ significantly from the scenario observed at the European level5. Most of the analyzed genomes (89.8%) belong to the 20A/G, 20B/GR, and 20C/GH clades (Nextstrain/GISAID nomenclatures), which display, among other genetic markers, the D614G amino acid replacement in the Spike protein. The SARS-CoV-2 G614 variant was most likely introduced in Europe by mid-January/early February 2020 and became dominant at a worldwide level29. Clades 19 A/L/V/O (dominant during early pandemic in China) and 19B/S (rare in Europe, with the notable exception of Spain)5 were found at the relative frequencies around 7.5% and 2.8% in March, respectively, and showed a decreasing frequency trend similarly to what has been observed at the global level. When applying the Pango lineages (cov-lineages.org) nomenclature, the main B lineage (roughly covering 19A/20A/20B/20C Nextstrain clades) was dominant in March, showing ample diversification into sublineages. Among these, it is highlighted the increased frequency of the B.1.1 sublineages (roughly corresponding to clade 20B/GR), as well as of the B.1, B.1.5, and B.1.91 lineages (all mostly including 20A/G virus). Of note, the B.1.91 sub-lineage, and part of its ancestor sublineage B.1, correspond to the Spike G614 + Y839 variant that massively disseminated in Portugal during the early epidemic (22% and 59% of the sampled genomes from the North and Center regions of Portugal by 30th April)15.
Introductions and early spread of SARS-CoV-2 in Portugal
In order to assess the origins and measure the number of SARS-CoV-2 introductions in Portugal prior to 31 March 2020, we performed Bayesian phylodynamic analyses integrating the rich travel history data. The global TMRCA for the A lineage tree ranged between 11 November 2019–18 December 2019 (95% highest posterior density) and the global TMRCA for the B lineage trees ranged between 11 November 2019–3 February 2020 (95% highest posterior density). Travel history could be collected for 619 (48.5%) out of 1275 confirmed cases detected in March, with 128 (10.0%) reporting international travels within the potential incubation period (i.e., 14 days before clinical onset date) and 491 (38.5%) reporting no travels within the same period (Fig. 2a).
Overall, we detected 277 clades (16 from lineage A and 261 from lineage B) representing SARS-CoV-2 introductions in Portugal until 31 March 2020, involving a total of 979 out of the 1275 sequences analyzed. Of these, 296/979 genomes belong to clades that include sequences from cases with known travel history (Fig. 3b). Overall, the phylogeographic reconstruction revealed that UK, France, and Spain seeded 69% of all introductions into Portugal (Fig. 3), mostly to the Lisbon and Tagus Valley region, which was estimated to have received ~30% of all introductions, followed by the North (27%), Center (10%), Algarve (4%), Azores (2%), Alentejo (2%), and Madeira (1%) regions (Figs. 4 and 5A and Supplementary Data 4).
We also examined the number of infections generated by introductions from different countries (Figs. 3, 4, and 5B and Supplementary Data 5). Interestingly, despite the UK, France, and Spain having seeded the majority of introductions, at least one seeding event linked to travel history from Italy generated the largest outbreak recorded in the early stages of the COVID-19 epidemic in Portugal (cfr. below)15. We noticed a similar pattern for an introduction from Thailand, however, this event had a low supported topology posterior probability (Figs. 4 and 5B and Supplementary Data 5).
The spatial analysis reconstructed 16 introductions of lineage A into Portugal encompassing 33 Portuguese sequences. The majority of this lineage viral flow was seeded by Spain (69%), with 41% of all of these being introduced into Lisbon & Tagus Valley, 12% into the Center and Alentejo regions, respectively, and 6% into the North region. The remainder of all lineage A seeding events were estimated to have originated from Australia, France, Italy, Oman, Saudi Arabia and were mostly captured by the ability to integrate travel history data into the phylogeographic reconstruction (Supplementary Fig. 3 and Supplementary Data 6). The introductions resulted in transmission clusters of varying size, with the largest, seeded by Spain, generating a transmission chain of at least 7 cases in Alentejo.
A similar phylogeographic analysis for lineage B resulted in the reconstruction of 261 introductions covering 946 Portuguese genomes. The majority of lineage B introductions was seeded by the United Kingdom (UK) (36%), Spain (20%), and France (14%). The introductions resulted in transmission clusters of varying sizes. The largest cluster was seeded by Italy and generated a transmission chain of at least 252 cases in the North and Center regions. Switzerland (4%), United Arab Emirates (3%), United States of America (3%), Netherlands (2%) also contributed to the remainder of lineage B introductions and establishment in the country (Supplementary Fig. 4 and Supplementary Data 7).
These analyses revealed a median time lag of 14 days (range = 3 and 53 days; Fig. 6) between the time to the PMRCA (TPMRCA; see Methods section for details), representing the earliest an introduction could have occurred, and surveilled genomes. In particular, the TPMRCA of most clades (~58%) occurred between the last week of February and the first week of March (Supplementary Data 8). The temporal reconstruction estimated the earliest introduction of lineage A into Portugal to have occurred as early as 7 March 2020 [95% Highest Posterior Density (HPD): 4 March 2020–10 March 2020], from Spain to the Lisbon & Tagus Valley region. The earliest seeding event for lineage B was estimated to have taken place on 2 February 2020 (95% HPD: 30 January 2020–5 February 2020) via the United Kingdom into the North region. Of note, 108 out of the 277 introductions were seeded by Nextstrain 20B clade viruses. Because of both the high abundance of 20B sequences in the dataset and the low genetic diversity within this clade at the time (leading to polytomy-rich topologies), the TPMRCAs of these introductions might be less accurate in representing the importation date into Portugal. Notwithstanding, our data generally indicates that most introductions were seeded before main public health interventions and when the cumulative number of detected cases was still very low (Figs. 1 and 6), suggesting a period of several weeks where undetected transmission likely occurred (Fig. 6).
The present study provides a comprehensive description of the early establishment of the COVID-19 pandemic in Portugal based on a genomic epidemiology and phylodynamics approach. We detected multiple independent SARS-CoV-2 introductions, mostly from European countries (namely the UK, Spain, France, Italy, and Switzerland), which was broadly consistent with the available travel history data, as well as with the countries with most frequent flight connectivity with Portugal (INE, 2020) and/or with the highest number of Portuguese immigrants30,31.
The genomic surveillance efforts in Portugal culminated with the generation of 1275 SARS-CoV-2 whole genomes, which represents 15.5% of all confirmed cases in March, making Portugal the country generating the 5th highest number of SARS-CoV-2 genomic sequences during the early COVID-19 pandemic. In particular, our data uncovered the importation of extensive SARS-CoV-2 genomic diversity during the early epidemic in Portugal, as supported by the identification of at least 277 seeding events, covering all Nextstrain/GISAID clades and dozens of Pango lineages. Despite the vast genetic diversity introduced, 32% of the 1275 analyzed sequences could be linked to just ten out of the 277 detected introductions. In particular, a single introduction (from Italy) of a Spike Y839 variant represented about 20% of all sampled genomes collected until the end of March15. Moreover, 56% (155/277) of BEAST clades involve one single sequence (singletons), thus suggesting that most introductions have not seeded substantial local transmission.
In general, these data suggest that the implemented preventive and early control measures have been successful in mitigating the establishment of community transmission from most SARS-CoV-2 independent introductions. On the other hand, it also highlights that the few introductions not captured by public health surveillance and control can still seed large community transmission events and give rise to a significant number of cases. This underlies the challenges in defining a public health strategy aimed at preventing sustained community transmission while maintaining open borders and global connectivity. Portugal presents a geopolitical and demographic context involving high circulation of both migrant workers and tourists, and close international networks with multiple countries from different latitudes (particularly from Europe, South America, and Africa). More stringent border restrictions, including flight and land borders closure, were implemented after March 10, when 74% (205/277) of the detected introductions were estimated to have already occurred. Despite the delay in the epidemic start in Portugal in comparison with other European countries, which certainly contributed to a more favorable situation during the pandemic’s first epidemic wave, the earlier implementation of measures (close borders, travelers testing, etc.) could have largely minimized the number of introductions and subsequent virus expansion.
The detection lag observed between the estimated time of the most recent common ancestor and the collection date of the earliest sample of each BEAST clade denotes that cryptic transmission might have occurred to some extent. Although most introductions were estimated to have occurred during the last week of February and, especially, during the first week of March 2020, SARS-CoV-2 was silently circulating in Portugal a few weeks before the first confirmed local cases on 2 March 2020. For instance, 12 out of 19 sequences collected during the first week of the epidemic were obtained from patients with no travel history, including patients linked to the introduction leading to the highest number of cases during the first wave. These results are reasonable epidemiologically given the period needed for the diagnostics infrastructure to be set up. Also, the period of cryptic transmission reported in this study for Portugal is within the estimated for other countries during the early pandemic period32,33.
Phylodynamic modeling relies on the accumulation of genetic diversity over time for the estimation of the evolutionary rate and timing of other relevant events. Being a recently emerged pathogen with a relatively low substitution rate (in comparison with other RNA viruses) and a large genome, it is challenging to study the early dynamics of SARS-CoV-2, as a large proportion of sequences collected across time and space are very closely related genetically, thus complicating the reconstruction of the phylogenetic topology. To address this, we opted to focus on a subset of the results with higher topological support. Furthermore, due to the computational burden required for phylodynamic analysis, it is unfeasible to include all available genomic sequences. One limitation of this and other studies trying to infer the origin of introductions is the sequencing sampling bias by country. For this reason, we resorted to filtering and subsampling strategies that aim at reducing the number of genomes that do not significantly contribute to the evolutionary or phylogeographic reconstructions. Despite being less pronounced, there were still discrepancies in the number of genomes included across countries. For instance, it is very likely that the number of introductions via the UK (the country generating the highest volume of sequences) is overestimated, while the number of introductions from countries with none or few available genome sequences is underestimated. However, the integration of travel history during the phylodynamic inferences allows for a more accurate reconstruction of the spatial pathways from both over and undersampled locations. Additionally, using this recently developed model13 provides insight into the genetic diversity circulating in countries for which genomic surveillance is still lacking. This makes the present work among the first to apply this novel model, which allowed us to gain insight into the early circulating diversity in countries like Saint Thomas and Prince, for which (as of 15 February 2021) there are no sequences available, and other undersampled countries such as the United Arab Emirates and Qatar.
Overall, our findings highlight the use of genomic data to trace the introduction and spread of an emerging virus, showing the need of systematic, continuous, and geographically representative genomic surveillance to detect and monitor the emergence and dissemination of biologically and/or epidemiologically relevant variants. Together with open data sharing, the timely generation of SARS-CoV-2 genomic data has been shown to be an invaluable tool to guide national and international public health authorities towards the identification and control of highly transmissible and/or immune evading variants34.
In this context, by laying the foundation of the genomic epidemiology of SARS-CoV-2 in Portugal involving an unprecedented collaborative effort at national and international levels, this work and concomitant capacity building, was pivotal for Portugal’s response to the current and upcoming needs of genomic-informed surveillance and epidemiology of COVID-19, as strongly recommended by ECDC and WHO10,35,36.
SARS-CoV-2 genome sequences generated in this study were uploaded to the GISAID database (https://www.gisaid.org/). Accession numbers can be found in Supplementary material (Supplementary Data 1). Source data for the main figures in the manuscript can be accessed as: (i) Fig. 1. Daily reported COVID-19 confirmed cases in Portugal (https://covid19.min-saude.pt/relatorio-de-situacao/) (Fig. 1a–c); SARS-CoV-2 genome sequences reported/obtained in Portugal, until 31 March 2020 (Supplementary Data 1; https://microreact.org/project/cM6KURnU7rUpqdAnBq5DAf/a2d3840e) (Fig. 1b, c); Cumulative total number of COVID-19 confirmed cases and SARS-CoV-2 genome sequences reported by country between January 22 and March 31 (Supplementary Data 2 and 3; https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv); (ii) Fig. 2. Sequences linked to detected introductions, stratified by travel history data (https://microreact.org/project/cM6KURnU7rUpqdAnBq5DAf/a2d3840e); (iii) Fig. 3. Number and size of SARS-CoV-2 introductions per country (https://microreact.org/project/cM6KURnU7rUpqdAnBq5DAf/a2d3840e and Supplementary Data 8); Figs. 4e and 5. Relative number of transitions between countries of origin and different regions in Portugal (Supplementary Data 4; (https://microreact.org/project/cM6KURnU7rUpqdAnBq5DAf/a2d3840e) and relative number of infections generated by introductions from countries of origin and different regions in Portugal. (Supplementary Data 5; (https://microreact.org/project/cM6KURnU7rUpqdAnBq5DAf/a2d3840e); Fig. 6. Daily reported COVID-19 confirmed cases in Portugal (https://covid19.min-saude.pt/relatorio-de-situacao/) and size, mean date estimates (and respective 95% highest posterior density intervals) of parent nodes of the MRCA and the earliest date of sample collection for each BEAST cluster (Supplementary Data 8).
Wu, F. et al. Author Correction: a new coronavirus associated with human respiratory disease in China. Nature 580, E7 (2020).
Zhou, P. et al. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 579, 270–273 (2020).
World Health Organization (WHO). Weekly operational update on COVID-19 https://www.who.int/docs/default-source/coronaviruse/weekly-updates/wou_2021_29-mar_project_cleared.pdf?sfvrsn=c0cd7458_3&download=true (2021).
Dong, E., Du, H. & Gardner, L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Infect Dis. 20, 533–534 (2020). Erratum in: Lancet Infect Dis. 2020 Sep;e215. PMID: 32087114; PMCID: PMC7159018. https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6.
Alm, E. et al. Geographical and temporal distribution of SARS-CoV-2 clades in the WHO European Region, January to June 2020. Euro Surveill. 25, pii=2001410 (2020).
Gámbaro, F. et al. Introductions and early spread of SARS-CoV-2 in France, 24 January to 23 March 2020. Euro Surveill. 25, 2001200 (2020).
du Plessis, L. et al. Establishment and lineage dynamics of the SARS-CoV-2 epidemic in the UK. Science 371, 708–712 (2021).
Fauver, J. R. et al. Coast-to-coast spread of SARS-CoV-2 during the early epidemic in the United States. Cell. 181, 990–996.e5 (2020). Epub 2020 May 7.
Lemieux, J. E. et al. Phylogenetic analysis of SARS-CoV-2 in Boston highlights the impact of superspreading events. Science. 371, eabe3261 (2021). Epub 2020 Dec 10. PMID: 33303686; PMCID: PMC7857412.
Genomic sequencing of SARS-CoV-2: a guide to implementation for maximum impact on public health. Geneva: World Health Organization; 2021. Licence: CC BY-NC-SA 3.0 IGO. https://www.who.int/publications/i/item/9789240018440.
Hadfield, J. et al. Nextstrain: real-time tracking of pathogen evolution. Bioinformatics 34, 4121–4123 (2018).
Argimón, S. et al. Microreact: visualizing and sharing data for genomic epidemiology and phylogeography. Microb. Genom. 2, e000093 (2016).
Lemey, P. et al. Accommodating individual travel history and unsampled diversity in Bayesian phylogeographic inference of SARS-CoV-2. Nat. Commun. 11, 5110 (2020). PMID: 33037213; PMCID: PMC7547076.
Quick, J. et al. Multiplex PCR method for MinION and Illumina sequencing of Zika and other virus genomes directly from clinical samples. Nat. Protoc. 12, 1261–1276 (2017).
Borges, V. et al. Massive dissemination of a SARS-CoV-2 Spike Y839 variant in Portugal. Emerg. Microbes Infect. 2, 1–58 (2020).
Borges, V., Pinheiro, M., Pechirra, P., Guiomar, R. & Gomes, J. P. INSaFLU: an automated open web-based bioinformatics suite “from-reads” for influenza whole-genome-sequencing-based surveillance. Genome Med. 10, 46 (2018).
Rambaut, A. et al. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nat. Microbiol. 5, 1403–1407 (2020).
Katoh, K., Asimenos, G. & Toh, H. Multiple alignment of DNA sequences with MAFFT. Methods Mol. Biol. 537, 39–64 (2009).
Nguyen, L. T., Schmidt, H. A., von Haeseler, A. & Minh, B. Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274 (2015).
Rambaut, A., Lam, T. T., Max Carvalho, L. & Pybus, O. G. Exploring the temporal structure of heterochronous sequences using TempEst (formerly Path-O-Gen). Virus Evol. 2, vew007 (2016).
Hong, S. L. et al. In search of covariates of HIV-1 subtype B spread in the United States-a cautionary tale of large-scale Bayesian phylogeography. Viruses 12, 182 (2020).
Suchard, M. A. et al. Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evol. 4, vey016 (2018).
Hasegawa, M., Kishino, H. & Yano, T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22, 160–174 (1985).
Shapiro, B., Rambaut, A. & Drummond, A. J. Choosing appropriate substitution models for the phylogenetic analysis of protein-coding sequences. Mol. Biol. Evol. 23, 7–9 (2006).
Drummond, A. J., Ho, S. Y., Phillips, M. J. & Rambaut, A. Relaxed phylogenetics and dating with confidence. PLoS Biol. 4, e88 (2006).
Lauer, S. A. et al. The incubation period of Coronavirus Disease 2019 (COVID-19) from publicly reported confirmed cases: estimation and application. Ann. Intern. Med. 172, 577–582 (2020).
Edwards, C. J. et al. Ancient hybridization and an Irish origin for the modern polar bear matriline. Curr. Biol. 21, 1251–1258 (2011).
Rambaut, A., Drummond, A. J., Xie, D., Baele, G. & Suchard, M. A. Posterior summarization in Bayesian phylogenetics using Tracer 1.7. Syst. Biol. 67, 901–904 (2018).
Nadeau, S. A., Vaughan, T. G., Scire, J., Huisman, J. S. & Stadler, T. The origin and early spread of SARS-CoV-2 in Europe. Proc. Natl Acad. Sci. USA 118, e2012008118 (2021).
United Nations, Department of Economic and Social Affairs. Population Division (2019). International Migrant Stock 2019 (United Nations database, POP/DB/MIG/Stock/Rev.2019).
Alto Comissariado para as Migrações (ACM). (n.d.). Saber mais sobre as migrações portuguesas. Available at: https://www.acm.gov.pt/-/saber-mais-sobre-as-migracoes-portuguesas-#.
Bedford, T. et al. Cryptic transmission of SARS-CoV-2 in Washington state. Science. 370, 571–575 (2020).
Davis J. T. et al. Cryptic transmission of SARS-CoV-2 and the first COVID-19 wave. Nature https://doi.org/10.1038/s41586-021-04130-w (2021).
European Centre for Disease Prevention and Control (ECDC). Risk related to spread of new SARS-CoV-2 variants of concern in the EU/EEA – 29 December 2020 (ECDC, 2020). https://www.ecdc.europa.eu/sites/default/files/documents/COVID-19-risk-related-to-spread-of-new-SARS-CoV-2-variants-EU-EEA.pdf.
World Health Organization. Genomic sequencing of SARS-CoV-2: a guide to implementation for maximum impact on public health. Geneva: Licence: CC BY-NC-SA 3.0 https://apps.who.int/iris/handle/10665/338480 (2021).
European Centre for Disease Prevention and Control (ECDC). Sequencing of SARS-CoV-2: first update. https://www.ecdc.europa.eu/sites/default/files/documents/Sequencing-of-SARS-CoV-2-first-update.pdf (ECDC, 2021).
We gratefully acknowledge to Sara Hill and Nuno Faria (University of Oxford) and Joshua Quick and Nick Loman (University of Birmingham) for kindly providing us with the initial sets of Artic Network primers for NGS; Rafael Mamede (MRamirez team, IMM, Lisbon) for developing and sharing a bioinformatics script for sequence curation (https://github.com/rfm-targa/BioinfUtils); Philippe Lemey (KU Leuven) for providing guidance on the implementation of the phylodynamic models; Joshua L. Cherry (National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health) for providing guidance with the subsampling strategies; and all authors, originating and submitting laboratories who have contributed genome data on GISAID (https://www.gisaid.org/) on which part of this research is based. The opinions expressed in this article are those of the authors and do not reflect the view of the National Institutes of Health, the Department of Health and Human Services, or the United States government. This study is co-funded by Fundação para a Ciência e Tecnologia and Agência de Investigação Clínica e Inovação Biomédica (234_596874175) on behalf of the Research 4 COVID-19 call. Some infrastructural resources used in this study come from the GenomePT project (POCI-01-0145-FEDER-022184), supported by COMPETE 2020 - Operational Programme for Competitiveness and Internationalisation (POCI), Lisboa Portugal Regional Operational Programme (Lisboa2020), Algarve Portugal Regional Operational Programme (CRESC Algarve2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF), and by Fundação para a Ciência e a Tecnologia (FCT).
The authors declare no competing interests.
Peer review information
Communications Medicine thanks the anonymous reviewers for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Borges, V., Isidro, J., Trovão, N.S. et al. SARS-CoV-2 introductions and early dynamics of the epidemic in Portugal. Commun Med 2, 10 (2022). https://doi.org/10.1038/s43856-022-00072-0