Global landscape of SARS-CoV-2 genomic surveillance and data sharing

Genomic surveillance has shaped our understanding of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants. We performed a global landscape analysis on SARS-CoV-2 genomic surveillance and genomic data using a collection of country-specific data. Here, we characterize increasing circulation of the Alpha variant in early 2021, subsequently replaced by the Delta variant around May 2021. SARS-CoV-2 genomic surveillance and sequencing availability varied markedly across countries, with 45 countries performing a high level of routine genomic surveillance and 96 countries with a high availability of SARS-CoV-2 sequencing. We also observed a marked heterogeneity of sequencing percentage, sequencing technologies, turnaround time and completeness of released metadata across regions and income groups. A total of 37% of countries with explicit reporting on variants shared less than half of their sequences of variants of concern (VOCs) in public repositories. Our findings indicate an urgent need to increase timely and full sharing of sequences, the standardization of metadata files and support for countries with limited sequencing and bioinformatics capacity.


Nomenclature of SARS-CoV-2 variants
The dynamic SARS-CoV-2 nomenclature system from Phylogenetic Assignment of Named Global Outbreak Lineages (PANGOLIN) adopted a phylogenetic framework to identify new lineages 1 , which is the nomenclature that is used for the sequences deposited in public genomic datasets. On 31 May 2021, the WHO announced a new naming system that uses the letters of the Greek alphabet (e.g., Alpha for B.1.1.7) for an easy and coherent application 2 and is hereafter used in our study. Based on a comparative assessment of the phenotypic impact of the SARS-CoV-2 variants, those with significant signals were classified into variants of concern (VOCs) and variants of interest (VOIs).

Literature search for genomic surveillance strategy
Data on the genomic surveillance strategy were supplemented by a literature search. We searched PubMed and Europe PMC for peer-reviewed and preprint studies that characterized the country-level strategies for SARS-CoV-2 genomic surveillance from January 1, 2020, to October 31, 2021. The search was performed using the following terms: "SARS-CoV-2", "COVID-19", "sequencing", and "genomic surveillance". Articles published in English and containing information about genomic surveillance and sequencing capability were included.
The data from the literature were entered into the structured dataset (Supplementary Table 3).

Classification of genomic surveillance and sequencing availability
We classified the surveillance strategy of each country into four categories: 1) high level of routine genomic surveillance, 2) moderate level of routine genomic surveillance, 3) low level of routine genomic surveillance, and 4) limited genomic surveillance. Sequencing 5% of positive samples has been recommended for African countries by the WHO 3 and for European countries by the European Commission 4 , and one modelling study demonstrated that sequencing 5% of positive specimens allows the detection of emerging variants at a prevalence level of 0.1% to 1.0% 5 . Therefore, sequencing 5% of positive samples is regarded as the definition of a high level of routine genomic surveillance in our study. However, we considered that countries with high burdens of newly confirmed cases probably struggle to achieve 5% sequencing proportions despite sequencing tens of thousands of positive specimens per week. Therefore, we also adopted an alternative definition for a high level of routine genomic surveillance: the sequenced sample size is sufficient to detect a new variant at a prevalence of 1.0%, with the required size tailored to the location's range of weekly case numbers. The sample sizes were those recommended by the ECDC guideline 6 . For example, if a country has 75,000 cases per week (range: 50,001-100,000) and needs to detect a new variant at a prevalence of 1.0%, the required sample size for classification as a high level of routine genomic surveillance is 1,500 per week. The classifications of the other three categories were similar (Extended Table 1).
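The two alternative criteria for a high level of routine genomic surveillance can be illustrated with a minimal sketch. The function name and its parameters are hypothetical; the required weekly sample size is passed in by the caller rather than derived, since only the value for the 50,001-100,000 band (1,500 per week) is given in the text.

```python
def is_high_level_surveillance(weekly_cases, weekly_sequenced, required_sample_size):
    """Return True if either definition of a high level of routine surveillance is met.

    required_sample_size is the weekly sample size for the country's case-number
    band (assumption: looked up by the caller from the guideline table; it is
    not computed here).
    """
    if weekly_cases <= 0:
        raise ValueError("weekly_cases must be positive")
    # Criterion 1: at least 5% of positive samples are sequenced.
    meets_proportion = weekly_sequenced / weekly_cases >= 0.05
    # Criterion 2: the sequenced sample size reaches the band-specific requirement.
    meets_sample_size = weekly_sequenced >= required_sample_size
    return meets_proportion or meets_sample_size


# Worked example from the text: 75,000 cases per week, 1,500 sequences required.
# 1,500 sequences is only 2% of cases, but the sample-size criterion is met.
print(is_high_level_surveillance(75_000, 1_500, 1_500))
# 1,000 sequences meets neither criterion.
print(is_high_level_surveillance(75_000, 1_000, 1_500))
```

The analogous thresholds for the other three categories are not reproduced here and would need to be taken from Extended Table 1.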
In addition, we categorised global sequencing availability into three categories: high availability, moderate availability, and low availability. High availability was defined as the ability to collect viral isolates from clinical samples and conduct in-country genomic sequencing. Countries that use regional sequencing networks or are required to ship samples to external laboratories abroad were placed in the moderate availability category (Extended Table 2). The regional networks in Africa contain several reference laboratories that provide services to countries in their subregions 7 ; therefore, the countries where the reference laboratories were located were defined as "high availability", while the countries served by those reference laboratories were defined as "moderate availability". If countries had no sequencing capabilities and had low levels of supportive sequencing services from external laboratories and other organizations, we placed them in the low availability category.

Data cleaning for genomic data and aggregated dataset
We used the genomic data from GISAID to assess the sequencing technology and metadata completeness, since its metadata contained more variables than the metadata in the 2019nCoVR repository. We used the genomic data from 2019nCoVR to conduct the other analyses, since this dataset merged and deduplicated the sequences from multiple genomic repositories.
To clean the genomic data, we first checked for duplicates within and between repositories. Second, we removed sequences from non-human hosts or without an assigned PANGO lineage, as well as those that did not originate from the 194 Member States. Then, all assigned PANGO lineages were classified into eight categories. When the exact sampling day was not provided but the month was, we selected the middle of the month as the sampling date.
To clean the officially aggregated data, we preferentially used aggregated results obtained by sequencing rather than by PCR screening assays. The Alpha variant cases included B.1.1.7 cases with or without the E484K mutation, as well as the Q sub-lineages. The Delta variant included lineage B.1.617.2 and its AY sub-lineages. Regarding the reporting dates in the aggregated dataset, we employed a fixed three-week lag to extrapolate the date of collection 8,9 , unless tailor-made delay information was known for a country.
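The two date-handling rules described above (mid-month imputation for genomic records and a fixed three-week lag for aggregated reports) can be sketched as follows. The function names are hypothetical, and taking the 15th as the "middle of the month" is an assumption made for this illustration.

```python
from datetime import date, timedelta

MID_MONTH_DAY = 15                # assumption: "middle of the month" taken as the 15th
DEFAULT_LAG = timedelta(weeks=3)  # fixed three-week lag between collection and reporting


def impute_sampling_date(year, month, day=None):
    """Return the sampling date, using the middle of the month when the day is missing."""
    if day is None:
        day = MID_MONTH_DAY
    return date(year, month, day)


def estimate_collection_date(report_date, lag=DEFAULT_LAG):
    """Extrapolate the collection date from a reporting date.

    Pass a different lag when country-specific delay information is known.
    """
    return report_date - lag


print(impute_sampling_date(2021, 5))               # day missing, mid-month used
print(impute_sampling_date(2021, 5, 3))            # full date kept as is
print(estimate_collection_date(date(2021, 6, 22)))  # three weeks before the report
```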

Checking the variant classifications
We performed several analyses to check the consistency of the variant classifications across nomenclature systems. First, we checked the consistency between "Lineage call" and "Scorpio call" for the same sequences. Given the limited processing capacity of the online nomenclature tools, we selected a convenience sample of sequences collected at four timepoints (midpoints of December 2020, March 2021, June 2021, and September 2021). Then, we ran those sequences through the Pangolin tool (https://pangolin.cog-uk.io/) to obtain the assigned "Lineage call" and "Scorpio call" for each sequence. We selected specific variants (including Alpha, Beta, Gamma, Delta, Lambda, Mu, Kappa, Epsilon, Theta, Iota, Eta, and EU1) as classified by each nomenclature system to calculate the assignment consistency. Sequences without an assigned variant designation were not included in the consistency analysis. For example, if a sequence was assigned to "none" by "Lineage call" but to "alpha-like" by "Scorpio call", we excluded this sequence from the consistency calculation. As a result, there was a high consistency (100.0%) between "Lineage call" and "Scorpio call" (Supplementary Table 8).
Second, we also evaluated the degree of consistency between Pangolin and Nextstrain nomenclature systems for the same sequences downloaded as above. We ran these same sequences in the Pangolin tool (https://pangolin.cog-uk.io/) and Nextclade (https://clades.nextstrain.org/results) to assign lineage or clade classifications, respectively.
Similarly, we selected those specific variants as classified by each nomenclature system to calculate the assignment consistency. We found that the degree of consistency reached 99.9% (Supplementary Table 8).
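The consistency calculation can be illustrated with a short sketch of percent agreement between two sets of calls for the same sequences. The function name is hypothetical, and the sketch assumes that both sets of calls have already been mapped to common variant names (for example, B.1.1.7 to Alpha); sequences left unassigned by either system are excluded, as described above.

```python
def assignment_consistency(calls_a, calls_b, unassigned=("none", "", None)):
    """Percent agreement between two lists of variant calls for the same sequences.

    Pairs in which either call is unassigned are excluded from the calculation.
    Returns None when no pair remains to be compared.
    """
    if len(calls_a) != len(calls_b):
        raise ValueError("both lists must cover the same sequences")
    compared = [
        (a, b)
        for a, b in zip(calls_a, calls_b)
        if a not in unassigned and b not in unassigned
    ]
    if not compared:
        return None
    agreeing = sum(1 for a, b in compared if a == b)
    return 100.0 * agreeing / len(compared)


# Illustrative (made-up) calls: the third sequence is excluded because one
# system left it unassigned, so three pairs are compared and all three agree.
first_calls = ["Alpha", "Delta", "none", "Delta"]
second_calls = ["Alpha", "Delta", "Alpha", "Delta"]
print(assignment_consistency(first_calls, second_calls))
```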
Although the Pangolin tools are continuously updated and improved, and the nomenclature sometimes changes at a rapid pace, incorrect lineage assignments still occasionally occur, especially when a sequence contains a certain proportion of ambiguous sites 10 . However, genomic databases (e.g., GISAID) assign a lineage designation only to sequences that have less than 5% ambiguous nucleotide sites in the coding regions 10 ; therefore, the proportion of misallocations is expected to be very low.
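As an illustration of the 5% ambiguity threshold, the following sketch computes the share of ambiguous sites in a nucleotide string. It is not the repository's actual procedure: the function names are hypothetical, any character other than A, C, G or T is counted as ambiguous, and the caller is assumed to pass only the coding regions as an ungapped string.

```python
UNAMBIGUOUS_BASES = set("ACGT")


def ambiguous_fraction(sequence):
    """Fraction of sites in the sequence that are not A, C, G or T."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    ambiguous = sum(1 for base in sequence.upper() if base not in UNAMBIGUOUS_BASES)
    return ambiguous / len(sequence)


def passes_ambiguity_filter(sequence, threshold=0.05):
    """True if the sequence has less than the threshold share of ambiguous sites."""
    return ambiguous_fraction(sequence) < threshold


# Toy sequences: 4 ambiguous sites out of 100 passes; 20 out of 100 does not.
print(passes_ambiguity_filter("ACGT" * 24 + "NNNN"))
print(passes_ambiguity_filter("ACGTN" * 20))
```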

Classification of sequencing technologies
We carefully cleaned the sequencing technology information of each sequence in GISAID and divided the sequencing technologies into three types: first-generation sequencing, second-generation sequencing, and third-generation sequencing (Supplementary Table 9). When only information about the sequencing assay/panel was available and it was compatible with second-generation sequencing, we considered that those sequences were generated by second-generation sequencing. We also assumed that all Illumina® platforms generating SARS-CoV-2 sequences are categorized as second-generation sequencing 11 .

Supplementary Figure 1. Proportions of cases sequenced by income groups
Each dot represents one country that deposited at least 10 sequences from May 1, 2021 to September 30, 2021. The green dots refer to the low-income and lower middle-income countries in which at least 2.5% of cases were sequenced, which include Papua New Guinea, Nigeria, Congo, Rep., and The Gambia, while the red dots refer to the 14 high-income countries (Brunei Darussalam, Chile, Andorra, Greece, Israel, Malta, The Bahamas, Barbados, Trinidad and Tobago, Bahrain, Kuwait, Oman, United Arab Emirates, Seychelles) in which less than 2.5% of cases were sequenced. The blue and black horizontal dotted lines represent sequenced percentages of 5.0% and 2.5%, respectively.
Note: the sequenced percentage is a rough proxy due to the potential non-sharing of some genomic data and underreporting of confirmed cases. Administrative boundaries were obtained from the database of Global Administrative Areas (GADM).

incidence levels per 100 people in low-income and lower middle-income countries
The blue horizontal dotted lines represent a sequenced percentage of 1.5%. Country names are shown where the proportion of cases sequenced exceeded 1.5%. Among the low- and lower middle-income countries, high sequencing percentages are mainly distributed in locations with low COVID-19 incidence rates. In addition, the African reference laboratories for SARS-CoV-2 genomic surveillance are partly located in Nigeria, Kenya, The Gambia, and Ghana, which may be another reason why these countries perform well.

incidence levels per 100 people in high-income countries
The blue horizontal dotted lines represent a sequenced percentage of 1.5%. Among the high-income countries, high sequencing percentages are also mainly distributed in locations with low COVID-19 incidence rates. In addition, some countries have a low extent of public availability of genomic data (e.g., Greece), which may be another reason why these countries appear to perform poorly (as this analysis was based on the genomic data available in public repositories).

Supplementary Table 1

Classification | Definition
High availability | Able to collect viral isolates from clinical samples and conduct in-country genomic sequencing.
Moderate availability | Able to collect viral isolates from clinical samples, but genomic sequencing requires additional support from external sequencing laboratories, including the following scenarios: 1) samples need to be shipped to regional reference laboratories or high-income countries for sequencing; 2) commercial kits for detecting SARS-CoV-2 variants are purchased or received as a donation; 3) the sequencing laboratory was established with support from others during the COVID-19 pandemic.
Low availability | Lacks sequencing capability and has little supportive sequencing service from external laboratories.

Supplementary Table 3. Country-specific SARS-CoV-2 genomic surveillance strategy
a Here we used the 15th of each month to replace dates for which only month information was available and adopted the median day to replace those for which only a date range was available.
b Arrival time from other countries. Here we assumed that the sample was collected immediately after landing.