Global disparities in SARS-CoV-2 genomic surveillance

Genomic sequencing is essential to track the evolution and spread of SARS-CoV-2, to optimize molecular tests, treatments, and vaccines, and to guide public health responses. To investigate global SARS-CoV-2 genomic surveillance, we used sequences shared via GISAID to estimate the impact of sequencing intensity and turnaround times on variant detection in 189 countries. In the first two years of the pandemic, 78% of high-income countries sequenced >0.5% of their COVID-19 cases, while 42% of low- and middle-income countries reached that mark. Around 25% of the genomes from high-income countries were submitted within 21 days, a pattern observed in 5% of the genomes from low- and middle-income countries. We found that sequencing around 0.5% of the cases, with a turnaround time <21 days, could provide a benchmark for SARS-CoV-2 genomic surveillance. Socioeconomic inequalities undermine global pandemic preparedness, and efforts must be made to help low- and middle-income countries improve their local sequencing capacity.

More than 2 years into the COVID-19 pandemic, many countries continue to face large epidemics of SARS-CoV-2 infections 1, mostly driven by the emergence and spread of novel viral variants 2, and by unequal access to vaccines, especially earlier in the pandemic 3-6. Genomic surveillance has been critical to study many rapidly evolving pathogens 7, and has been employed to investigate SARS-CoV-2 evolution and spread, to design and optimize diagnostic tools and vaccines, and to rapidly identify and assess viral lineages with altered epidemiological characteristics, including variants of concern (VOCs) such as Alpha/B.1.1.7, Beta/B.1.351, Gamma/P.1, Delta/B.1.617.2 and Omicron/B.1.1.529. These lineages pose increased global public health risks due to their greater transmissibility and potential immune escape from neutralizing antibodies induced by natural infections and/or vaccines 8,9. Variants of interest (VOIs) also require continued monitoring for changes in transmissibility, disease severity, or antigenicity 10. Variants with higher epidemic potential demand more specific measures, proportional to the risk they pose, and to implement them, policy makers need to know "what" pathogen is present locally, "where" it circulates in the community, "when" such variants may arrive, "why" they pose greater risk, and "who" is most at risk 11. Without answers to these questions, efficient public health policies cannot be implemented, and lives are unnecessarily impacted (high morbidity: long COVID, sequelae) or lost (high mortality). Throughout this pandemic, genomic information has been instrumental for planning measures to curb the impacts of variants in low-, middle-, and high-income countries that implemented evidence-based policies in response to the emergence and spread of VOCs 12-27.
To help guide public health responses to evolving variants, it is essential to track the diversity of SARS-CoV-2 lineages circulating worldwide in near real-time 8,28,29. Data generators around the world have been submitting an unprecedented number of SARS-CoV-2 genomes to publicly accessible databases: up to June 9th, 2022, >11.3 million consensus sequences (FASTA) were shared via the EpiCoV database hosted by the GISAID Data Science Initiative 30. Over 5.5 million sequences can also be found in the archives of the International Nucleotide Sequence Database Collaboration 31, together with >4.5 million raw read sequences (FASTQ) 32. By way of comparison, 1,614,498 influenza sequences have been shared via GISAID since 2008 33. Despite improvements in models for equitable sharing of pathogen genomic data 34, there are striking differences in the intensity of genomic surveillance worldwide. Here we examine global publicly accessible SARS-CoV-2 genomic surveillance data from the first 2 years of the COVID-19 pandemic (from March 2020 to February 2022) to identify key aspects associated with sequencing intensity and timely variant detection, and investigate the consequences of surveillance disparities.

Global disparities in SARS-CoV-2 genomic surveillance
To investigate spatial and temporal heterogeneity in SARS-CoV-2 genome sequencing intensity, we explored the percentage of COVID-19 cases sequenced each week per country from March 2020 to February 2022 (Fig. 1 and Supplementary Data 1). It has been proposed that at least 5% of SARS-CoV-2 positive samples should be sequenced to detect viral lineages at a prevalence of 0.1 to 1.0% 35, but we found that only 13 out of 189 countries (6.8%) worldwide had 5% or more of their total confirmed cases sequenced, while 86 out of 189 countries had <0.5% of confirmed cases sequenced (Figs. 1, 2A and S1). Throughout the first 2 years of the pandemic, only seven countries or territories depended mostly on the sequencing capacity of other countries, having 25% or more of their genomes sequenced abroad (Fig. S2 and Supplementary Data 2). Until late February 2022, while the total number of reported cases was relatively similar in high-income countries (HICs) and low/middle-income countries (LMICs) (i.e., 232.7 and 199.1 million cases, respectively), HICs submitted 10-fold more sequences per COVID-19 case (3.53% and 0.35% sequenced cases, respectively) (Supplementary Data 3). Countries that faced mostly
Another key aspect of genomic surveillance is timeliness, which we evaluated by looking at the turnaround time (TAT; defined as the time in days between sample collection and genome submission to GISAID) of SARS-CoV-2 genome sequencing across 19 geographic regions (Fig. 2C; see also ref. 36). We observed that following the detection of more transmissible variants (VOCs) in late 2020, almost all geographic regions decreased their TAT (Fig. 2C; see also Fig. S6). Countries in Northern Europe, which had the fastest TAT (Fig. 2C), decreased their median TAT from 20 to 10 days in the second pandemic year.
The overall global decrease in TAT also matches a series of bulletins and guidelines for SARS-CoV-2 sequencing, which were published by the WHO and ECDC in early 2021, in the aftermath of the detection of the Alpha VOC 37-40. In the second pandemic year, we only observed large increases in TATs for Northern and Western Africa (Fig. 2C). When we compared the timeliness of countries based on their income classes, improvements were observed in all classes except low-income countries, which had a higher median TAT in the second pandemic year (median change from 71 to 109 days of TAT, see Fig. S5B). Rapid generation and sharing of pathogen sequence data from regularly collected samples is essential to maximize the public health impact of genomic data 41,42. The VOCs Alpha and Gamma, for example, reached up to 50% frequency within 2-3 months of their emergence in the UK and Manaus, respectively 43,44, while, with its faster epidemic spread, Omicron took less than a month to reach predominance in South Africa 45. These examples illustrate that rapid TATs are essential for the early recognition and timely assessment of VOC transmissibility 41. The fast detection and characterization of VOCs and VOIs, both in HICs and LMICs, provides positive examples of how rapid genomic surveillance efforts can aid public health responses, both locally and globally. Genomic surveillance, especially in LMICs, has provided critical information on the early spread and transmissibility of four novel VOCs (Beta, Gamma, Delta, and Omicron), an important achievement that also set the foundations for pandemic preparedness in areas that are most at risk for the emergence of zoonotic diseases.
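As an illustration of the TAT definition used above (days between sample collection and genome submission), the following minimal Python sketch computes a median TAT and the share of genomes submitted in fewer than 21 days. The function names and the example records are hypothetical, and the strict "<21 days" cutoff is an assumption; this is not the code used in the study.

```python
from datetime import date
from statistics import median


def turnaround_days(collected: date, submitted: date) -> int:
    """TAT in days between sample collection and genome submission."""
    return (submitted - collected).days


def summarize_tat(records, cutoff_days=21):
    """Return (median TAT, share of genomes with TAT below cutoff_days)."""
    tats = [turnaround_days(c, s) for c, s in records]
    share_fast = sum(t < cutoff_days for t in tats) / len(tats)
    return median(tats), share_fast


# Hypothetical records: (collection date, submission date)
records = [
    (date(2021, 3, 1), date(2021, 3, 11)),  # 10 days
    (date(2021, 3, 1), date(2021, 3, 21)),  # 20 days
    (date(2021, 3, 1), date(2021, 4, 30)),  # 60 days
]
median_tat, share_fast = summarize_tat(records)
```

Applied per region or per income class, the same two summaries correspond to the quantities discussed in this section.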
In countries with limited sequencing capacity and/or long TATs, more affordable PCR-based tests, such as RT-PCR tests that distinguish VOCs based on target failures (for example, "S gene target failure"), have been extremely valuable to provide evidence of the spread of a few variants, such as the VOCs Alpha and Omicron, which contain specific deletions that lead to target failures 46 . These tests, however, can only be deployed once enough genomes of a new lineage are sequenced, not only to verify its public health relevance, but also to confirm the presence and high prevalence of unique alleles (with deletions or extensive genetic changes) that allow differential RT-PCR detection. Thus, without rapid sequencing and genomic characterization in the first place, as we observed for Omicron in late 2021 45,46 , low-cost PCR-based methods cannot be developed nor deployed.

Sampling strategies for rapid variant detection
We then investigated the impact of genome sequencing intensity and TAT on the detection of SARS-CoV-2 lineages. First, we found that the number of globally observed lineages correlates with the number of SARS-CoV-2 genomes available per country (Pearson's r = 0.96, p value < 0.0001) and the overall proportion of sequenced cases in each country (Pearson's r = 0.51, p value < 0.0001) (Fig. S7), similar to what has been observed for the UK 47 . This suggests that limited genome sequencing intensity delays the identification and response to new viral lineages with altered epidemiological and antigenic characteristics.
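For readers who want to reproduce this kind of association on their own country-level table, a minimal Python sketch of Pearson's r follows. The function name and the toy numbers are hypothetical; the sketch does not compute p values and makes no assumption about any transformation the authors may have applied to the data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-country values: genomes available and lineages observed
genomes = [120, 950, 4300, 15000, 88000]
lineages = [8, 25, 60, 140, 410]
r = pearson_r(genomes, lineages)
```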
To investigate strategies for rapid variant detection, we simulated the impact of the percentage of sequenced cases and TAT on the reliable detection of previously identified SARS-CoV-2 lineages using metadata from Denmark, which has one of the most comprehensive SARS-CoV-2 genome surveillance systems (see "Materials and methods", Fig. S8). Here, we assumed a recommended scenario of random sampling, whereby samples for virus genomic sequencing are selected independently of sample metadata such as age, sex, or clinical symptoms 48. When calculating the probability of detecting at least one genome of a rare lineage (0-5% prevalence) under different sequencing intensities, we found that sequencing at least 300 genomes per week is required to detect, with a 95% probability, a lineage that is circulating in a population at a weekly prevalence of 1%. For a weekly prevalence of 5%, this number decreases to 75 genomes per week (Fig. 3A). These figures are independent of the outbreak and population size of a given location, assume representative sampling, and can only tell if a lineage is present, not how prevalent it is. By simulating a scenario of non-random sampling, focused on the most populous region of a country, we observed that the power to detect lineages decreases, but remains moderately useful when TAT is below 21 days and sequencing intensity is at least 0.5% of all cases (Fig. S9). For other countries, successful detection of domestic lineages from individual regions will also depend on the distribution of population density and human mobility, aspects that are worthy of further investigation in future research. On average, genome surveillance programmes in high-income countries should be able to detect circulating virus lineages at 5% prevalence with maximum probability with their current TATs and sequencing intensities, and under the assumption of random sampling (Fig. 3B and Table 1).
However, under a scenario of random sampling, low-income countries that typically sequence an average of 10 genomes per week may miss a SARS-CoV-2 lineage circulating at up to 21.7% prevalence (Table 1). This will substantially limit the lines of inquiry available to such countries from genomic sequencing data (Table 1). Within the range of 0.05-5% of sequenced cases considered here, increasing sampling intensity, and to a lesser extent reducing TAT, strongly improves the rapid detection of viral lineages (Fig. 3B).
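The weekly detection probabilities discussed above can be sketched with the Poisson model that the methods describe for Fig. 3A: the probability of not drawing zero from a Poisson distribution whose mean is the product of lineage prevalence and the number of sequenced cases. The Python function names below are hypothetical, and the sketch assumes random, representative sampling.

```python
from math import ceil, exp, log


def detection_probability(prevalence, genomes_per_week):
    """P(at least one genome of the lineage) under a Poisson model.

    The Poisson mean is prevalence * genomes_per_week, so the probability
    of drawing zero genomes of the lineage is exp(-mean).
    """
    return 1.0 - exp(-prevalence * genomes_per_week)


def genomes_needed(prevalence, target=0.95):
    """Smallest weekly genome count reaching the target detection probability."""
    return ceil(-log(1.0 - target) / prevalence)


p_one_percent = detection_probability(0.01, 300)   # mean of 3 lineage genomes
p_five_percent = detection_probability(0.05, 75)   # mean of 3.75 lineage genomes
n_one_percent = genomes_needed(0.01)
```

Under this model, 300 genomes per week at 1% prevalence gives a Poisson mean of 3 and a detection probability just above 95%, in line with the figure quoted in the text.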
Next, we simulated 25 scenarios with 100 replicates, in which we varied sampling frequency (from 0.05 to 5%) and TAT (from 7 to 35 days) to compute the probabilities of detecting at least one genome of a given lineage before the lineage reaches a cumulative size of 100 cases (Fig. 3B), using as "ground truth" a dataset from a well-characterized setting (see "Materials and methods" and Fig. S8). The simulated scenario shows that when sequencing percentages of 5% per week and TATs of 7 days are achieved in a given setting, a viral lineage is always detected before it reaches 100 cases. When the proportion of sequenced cases per week decreases by 100-fold, to 0.05%, the probability of the timely detection of a viral lineage before it reaches 100 cases decreases to 4% for TATs of 7 days, and further declines to 2.0% when TAT is 35 days (Fig. 3B). These estimates, however, apply to a scenario of random sampling. The power to detect lineages decreases when the sampling is non-random, for example, when focusing only on the most populous region of a country; however, sequencing at least 0.5% of the reported cases with a TAT <21 days remains an important factor in successful detection even in non-random sampling scenarios (Fig. S9).
Article https://doi.org/10.1038/s41467-022-33713-y
For an optimistic scenario of 0.5% sequenced cases (achieved by 78% of HICs and 40% of LMICs) and a TAT of 21 days (observed in 25% of the genomes submitted by HICs, and in 5% of those submitted by LMICs) (Supplementary Data 4), we found a 34% probability of detecting a lineage before it reaches 100 cases. Throughout the pandemic, many countries reported weekly incidences as high as 100 cases per 100,000 inhabitants (Figs. 1C, S3 and S4). For example, in a scenario of high incidence, for Manaus, a city with 2.2 million inhabitants in the Amazonas state located in the North of Brazil, the 0.5% sequencing threshold would correspond to 11 randomly selected genomes per week. With a 21-day TAT, this would allow the detection of a given lineage with a 34% probability (Fig. 3B). For São Paulo city (12.4 million inhabitants), this number increases to 62 genomes per week. For Brazil (212.6 million inhabitants), this would correspond to 1063 weekly genomes selected from a random population of samples, in the above-mentioned scenario of high incidence. Although the 0.5% ratio of sequenced cases per week in near real-time is a reasonable benchmark for SARS-CoV-2 genomic surveillance in 78% of high-income countries (Supplementary Data 4), this often comes as a result of close coordination between diagnostic centers and well-funded, decentralized infrastructures to integrate sequencing data and sample-associated metadata (see e.g. ref. 49).
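The weekly genome targets quoted above follow from simple arithmetic: population multiplied by weekly incidence gives weekly cases, and 0.5% of those cases gives the number of genomes. A minimal Python sketch, with a hypothetical function name and the populations and incidence taken from the text:

```python
def weekly_genome_target(population, incidence_per_100k=100,
                         sequenced_fraction=0.005):
    """Genomes per week needed to sequence a given fraction of weekly cases."""
    weekly_cases = population * incidence_per_100k / 100_000
    return round(weekly_cases * sequenced_fraction)


# Populations as stated in the text; incidence of 100 cases per 100,000 per week
manaus = weekly_genome_target(2_200_000)
sao_paulo = weekly_genome_target(12_400_000)
brazil = weekly_genome_target(212_600_000)
```

At lower incidence the same 0.5% benchmark translates into proportionally fewer genomes per week.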

Factors associated with genomic surveillance capacity
While many HICs were able to rely on previously established networks and laboratory infrastructure to perform molecular testing and sequencing 50,51, many LMICs, including Brazil, South Africa, and India, where four VOCs were first detected 43,52-54, have faced additional challenges to the rapid expansion of genomic surveillance 51,55,56. Pathogen genomics complements but often competes for limited resources with other aspects of pandemic response, for instance, surveillance and testing capacity, medical supplies, laboratory reagents, public health and social measures, and vaccine development 57. To investigate how socioeconomic factors can impact the SARS-CoV-2 genomic surveillance response around the world, we explored the correlation between the percentage of sequenced COVID-19 cases in each country and 20 country-level socioeconomic and health quality covariates (Fig. 4 and Supplementary Data 5). We found that the percentage of sequenced cases is significantly associated with expenditure on research and development (R&D) per capita (r = 0.47, p value <0.0001) (Fig. 4A), gross domestic product (GDP) per capita (r = 0.37, p value <0.0001) (Fig. 4B), sociodemographic index (r = 0.31, p value <0.001) (Fig. 4C), and established influenza virus genomic surveillance capacity prior to the COVID-19 pandemic (r = 0.30, p value <0.001) (Fig. 4D and Supplementary Data 6).
A total of 74% (140 out of 189) of the countries that submitted SARS-CoV-2 genomes to GISAID had also shared influenza virus sequences via that same database in 2019. When compared by income class, we observed that the majority of upper-middle-income countries (UMCs; 77%) and HICs (78%) currently sequencing SARS-CoV-2 had already reported influenza virus sequences in public databases up to 2019. For low-income countries (LICs), this drops to 37.5%, suggesting that many LICs initiated or enhanced their genome sequencing programs during the COVID-19 pandemic. While disparities in investment in national health, research, and development continue to impact the ability of countries to scale up genomic sequencing intensity 28,51,58, recent improvements in genomic surveillance by many LMICs (Fig. S5) and the association of sequencing efforts with established genomic surveillance capacity paint an encouraging picture for future pandemic preparedness programs.
When we explored correlations with mean TAT (Supplementary Data 7), we found that healthcare access and quality index (r = −0.56, p value <0.0001), universal health coverage (r = −0.56, p value <0.0001), health worker density (r = −0.56, p value <0.0001), and health expenditure per capita (r = −0.54, p value <0.0001) are significantly correlated with mean TATs (Fig. S10 and Supplementary Data 7). Our results quantify only correlations between socioeconomic covariates, sequencing intensity, and TAT, and cannot be interpreted as causal. Future studies should focus on additional variables that may affect genomic surveillance, especially in LMICs, such as training laboratory and bioinformatic personnel, metadata standards, costs associated with imported consumables, and shipment delays that may be exacerbated by border closures and travel restrictions 28,55,56,58,59 . Other factors associated with delays in reporting VOCs include social and political stigma and perceived negative impact on travel when reporting potential VOCs, and concerns of having findings scooped and published by other researchers 60 . Longer TATs are also expected in countries where virus genomics activities are focused on retrospective genomic studies to investigate SARS-CoV-2 reinfections 61 , vaccine breakthrough infections 62 , and past epidemic dynamics 63,64 .

Discussion
Leveling up pathogen genomic surveillance efforts, particularly in LMICs, should be a priority to improve pandemic preparedness worldwide 60. Our findings demonstrate that global SARS-CoV-2 genomic surveillance efforts are currently highly unbalanced, and contingent upon socioeconomic factors and pre-pandemic laboratory and surveillance capacity. Our results suggest that sequencing 0.5% of total confirmed cases, with a TAT below 21 days, could provide a benchmark for genomic surveillance studies targeting SARS-CoV-2 and future emerging viruses. Alongside the guidance provided by the WHO and other international public health authorities (see refs. 37,38,40,65-69), ongoing surveys to understand barriers to virus genome sequencing and sampling selection strategies will provide valuable information for future surveillance programs. Implementation of metagenomic approaches for virus discovery, followed by virus-genome specific sequencing approaches, could help overcome existing limitations of molecular and syndromic surveillance strategies 70. Adoption of standardized protocols for representative genomic surveillance strategies 40,48, establishment of data and minimal metadata standards, efficient and facilitated access to information following equitable data sharing agreements 65, and collaboration between academia, public health laboratories, private laboratories and other stakeholders will be essential to maximize the cost-effectiveness and public health impact of genomic surveillance. While a random sampling strategy may provide accurate information on SARS-CoV-2 variant emergence and frequency, we note that genome sampling strategies should be considered pathogen- and question-specific 48,65,66. For example, non-random selection of samples stratified by disease severity may be required to identify genes or mutations associated with clinical outcomes 71.
There are several global efforts underway to improve genomic sequencing capacities around the world, including the AFRO-Africa Centre for Disease Control, the Pan American Health Organization COVIGEN Network, Regional Genomic Surveillance Consortium from WHO Southeast Asia Region, and the ACT-A WHO Global Risk Monitoring Framework. Global efforts must be made to improve in-country genomic surveillance capacity, and to provide sustainable research funding for strengthening sequencing capacity and outbreak analytics, particularly in LMICs. Improved pathogen surveillance at the human, animal and human-animal interfaces is also urgently needed 72 . Retaining existing and expanding local capacity efforts acquired during the SARS-CoV-2 pandemic will be critical to contain and respond to the next "Disease X" 72 .

Genomic surveillance and epidemiological data
To obtain the percentage of sequenced cases for each country, per week and cumulative, we used metadata related to the "country of exposure" of genomes submitted to GISAID 30 up to March 18th, 2022, collected from EW (epidemiological week) 10 of 2020 (March 1st, 2020) to EW 8 of 2022 (February 26th, 2022). We obtained global daily COVID-19 case counts from Johns Hopkins University, Center for Systems Science and Engineering (http://github.com/CSSEGISandData/COVID-19), and population data for each country from the United Nations' Department of Economic and Social Affairs 73. Countries were grouped by income using the current classification by the World Bank 74. We calculated weekly percentages of COVID-19 cases sequenced per country by aggregating and dividing genome and case counts per EW, using a custom pipeline, "subsampler" (http://github.com/andersonbrito/subsampler) 75.
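The aggregation step described above can be sketched in a few lines of Python. This is an illustrative simplification, not the "subsampler" pipeline itself: the function names are hypothetical, the inputs are assumed to be genome collection dates and daily new case counts for a single country, and weeks are assumed to run from Sunday to Saturday, consistent with EW 10 of 2020 starting on March 1st, 2020 and EW 8 of 2022 ending on February 26th, 2022.

```python
from collections import Counter
from datetime import date, timedelta


def week_ending(day: date) -> date:
    """Saturday that closes the Sunday-to-Saturday week containing `day`."""
    return day + timedelta(days=(5 - day.weekday()) % 7)


def weekly_percent_sequenced(genome_dates, daily_cases):
    """Percentage of reported cases sequenced per week, keyed by week end.

    genome_dates: iterable of collection dates, one per genome.
    daily_cases: dict mapping a date to the number of new cases that day.
    """
    genomes = Counter(week_ending(d) for d in genome_dates)
    cases = Counter()
    for day, count in daily_cases.items():
        cases[week_ending(day)] += count
    return {wk: 100.0 * genomes[wk] / n
            for wk, n in sorted(cases.items()) if n > 0}


# Hypothetical toy data for one country
daily_cases = {date(2020, 3, 1): 100, date(2020, 3, 7): 100,
               date(2020, 3, 8): 50}
genome_dates = [date(2020, 3, 2), date(2020, 3, 9)]
weekly = weekly_percent_sequenced(genome_dates, daily_cases)
```

Weeks with zero reported cases are skipped to avoid division by zero; how such weeks are handled in the actual pipeline is not stated in the text.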

Analysis of covariates correlated with genomic surveillance capacity
Covariates related to health systems were available from the Institute for Health Metrics and Evaluation (IHME) 76, GDP data were also available from IHME 77, and data on R&D expenditure per capita were available from UNESCO 78. For the covariates from IHME 76

Simulation of scenarios of genome sampling
As shown in Fig. 1, Denmark has one of the most comprehensive genomic surveillance programs of the COVID-19 pandemic, sequencing around 14.5% of its reported cases up to February 26th, 2022 (2,733,807 cases and 396,994 genomes with >70% coverage; access date: March 18th, 2022) 79. In order to simulate the impact of the percentage of sequenced cases and the TAT (time between sample collection and genome submission) on the detection of previously identified SARS-CoV-2 lineages in a given country, we used metadata from genomes obtained by the Danish COVID-19 genome consortium, with collection dates between EW 10 of 2020 (March 1st) and EW 8 of 2022 (February 26th) 79.
To evaluate the impact of temporal delays between reported dates of sample collection and dates of genome submission to GISAID, we generated lists of genomes with adjusted submission dates, to simulate TATs representing delays of between 7 and 35 days (5 weeks) between sample collection and genome submission. Considering the high percentage of sequenced cases per EW in Denmark (often above 20%), we produced several genome datasets by simulating scenarios with different percentages of sequenced cases per EW (0.05, 0.1, 0.5, 1 and 5%). In doing so we were able to simulate 25 scenarios (with 100 replicates each) with combinations of different TATs and percentages of sequenced cases, in order to assess how these two parameters may impact our ability (expressed as a probability) to detect circulating lineages. Specifically, we randomly sampled each column of the observed data (considering them to be case counts across all circulating lineages) according to the targeted percentage of sequenced cases which would become available after a given TAT, ignoring rare lineages that never reached 100 sampled genomes. Each combination of percentage of sequenced cases and TAT yielded one table of genomes available across the EWs. This procedure was repeated 100 times to mitigate random sampling effects, and results were used to generate a probability of detection for each circulating lineage. Summarizing the 100 replicates led to detection probabilities for each lineage in each epidemiological week. To simulate uneven geographic distribution of sequenced cases, we also simulated a scenario analogous to the one described above, but in which only the sequencing intensity in Hovedstaden, Denmark's capital region, was used in simulations and compared to actual lineage frequency data for all of Denmark (Fig. S9). Figure 3A shows the probability of not drawing 0 from a Poisson distribution whose mean is the product of lineage prevalence and sequenced cases. In Fig. 3B, we show the probability, computed across simulation replicates at a given sampling frequency and delay, of obtaining at least one detection of a given lineage before it reaches a cumulative size of 100 cases in the full dataset without delays ("ground truth", see Fig. S8). Figure 3C-G similarly maps this out in time, asking how long it takes for a given lineage to be detected, using the first instance of the lineage in the "ground truth" dataset as its emergence.
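The logic of the timely-detection simulation can be illustrated with a simplified Python sketch. It is not the authors' implementation: it handles a single lineage at daily rather than weekly resolution, the function name and toy data are hypothetical, and it assumes that a lineage counts as detected in time when at least one sampled genome is submitted (collection day plus TAT) no later than the collection day of the lineage's 100th case in the full dataset.

```python
import random


def timely_detection_probability(collection_days, fraction, tat_days,
                                 threshold=100, replicates=100, seed=0):
    """Share of replicates in which a lineage is detected before `threshold` cases.

    collection_days: collection day (integer) of every case of one lineage
    in the full "ground truth" dataset.
    fraction: probability that any given case is sequenced.
    tat_days: delay between collection and submission.
    """
    rng = random.Random(seed)
    days = sorted(collection_days)
    if len(days) < threshold:
        raise ValueError("lineage never reaches the case threshold")
    deadline = days[threshold - 1]  # day on which the threshold case is collected
    hits = 0
    for _ in range(replicates):
        # Randomly keep each case with probability `fraction`, then delay it.
        submitted = [d + tat_days for d in days if rng.random() < fraction]
        if submitted and min(submitted) <= deadline:
            hits += 1
    return hits / replicates


# Hypothetical lineage with 5 new cases per day for 100 days
toy_lineage = [i // 5 for i in range(500)]
p_high = timely_detection_probability(toy_lineage, fraction=0.05, tat_days=7)
p_low = timely_detection_probability(toy_lineage, fraction=0.0005, tat_days=7)
```

Because both calls use the same seed, the cases sampled at the lower fraction are a subset of those sampled at the higher one, so the estimated probability cannot increase when sequencing intensity drops. Looping over the five sequencing percentages and a set of TAT values would give a grid of scenarios comparable in shape to the 25 scenarios described above.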

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
The findings of this study are based on metadata associated with 8,949,097 sequences available on GISAID up to March 18th, 2022, and accessible at https://doi.org/10.55876/gis8.220330me. Epidemiological data of global reported cases were downloaded from the GitHub account of the CSSE at Johns Hopkins University (https://github.com/CSSEGISandData/COVID-19). All relevant data used in this study are available as Supplementary files in this manuscript, and on the following GitHub repository: https://github.com/andersonbrito/paper_2022_metasurveillance.

Code availability
The pipeline used to calculate the percentages of sequenced cases per country is available on the following GitHub repository: https://github.com/andersonbrito/subsampler 75.