Emergence and spread of a SARS-CoV-2 lineage A variant (A.23.1) with altered spike protein in Uganda

Here, we report SARS-CoV-2 genomic surveillance from March 2020 until January 2021 in Uganda, a landlocked East African country with a population of approximately 40 million people. We report 322 full SARS-CoV-2 genomes from 39,424 reported SARS-CoV-2 infections, thus representing 0.8% of the reported cases. Phylogenetic analyses of these sequences revealed the emergence of lineage A.23.1 from lineage A.23. Lineage A.23.1 represented 88% of the genomes observed in December 2020, then 100% of the genomes observed in January 2021. The A.23.1 lineage was also reported in 26 other countries. Although the precise changes in A.23.1 differ from those reported in the first three SARS-CoV-2 variants of concern (VOCs), the A.23.1 spike-protein-coding region has changes similar to those in VOCs, including a change at position 613, a change in the furin cleavage site that extends the basic amino acid motif and multiple changes in the immunogenic N-terminal domain. In addition, the A.23.1 lineage has changes in non-spike proteins including nsp6, ORF8 and ORF9 that are also altered in other VOCs. The clinical impact of the A.23.1 variant is not yet clear and it has not been designated as a VOC. However, our findings of emergence and spread of this variant indicate that careful monitoring of this variant, together with assessment of the consequences of the spike protein changes for COVID-19 vaccine performance, is advisable.

From June to August 2020, the lineage B.1 and B.1.393 strains were abundant, similar to patterns observed in Kenya 10 (Fig. 1b), although lineage A viruses did not decline as seen in the US and Europe. Lineage A.23 strains were first observed in two prison outbreaks in Amuru and Kitgum, Uganda in August 2020; by September-November, A.23 was the major lineage circulating throughout the country (Fig. 1c). The A.23 virus continued to evolve into the A.23.1 lineage, first observed in late October 2020. Given the diversity of virus lineages found in the country from March until November 2020, it was unexpected that by late December 2020 to January 2021, lineage A.23.1 viruses represented 90% (102 of 113 genomes) of all viruses observed in Uganda (Fig. 1d). In all time periods, the SARS-CoV-2-positive samples were obtained from multiple clinical and surveillance locations throughout Uganda (Extended Data Fig. 5b), indicating that the differences are unlikely to be due to sampling different subpopulations in the country at different times.

Virus sequence diversity
All newly and previously generated Uganda genomes that were complete and high-coverage (n = 322) were used to construct a maximum-likelihood phylogenetic tree (Fig. 2).
A number of A and B variant lineages were observed briefly at low frequencies and may have undergone extinction, similar to patterns observed in the UK 11,12. Although based on limited sampling, genomes identified from truck drivers are often observed basal to community clusters (Fig. 2), suggesting the importance of this route in the introduction and spread of the virus into Uganda. Most of the genomes from truck drivers sampled at ports of entry (POEs) bordering Kenya belonged to lineage B.1 and B.1.393, which is consistent with the pattern reported in Kenya 10. However, genomes identified from truck drivers from Tanzania and from the Elegu POE bordering South Sudan, albeit in small numbers, belonged to both the A and B.1 lineages. Continued monitoring of truck drivers coming in and out of Uganda provides a useful description of the inland circulation of strains in this part of the world, where genomic surveillance is not as detailed as in other parts of the world.

Emergence of A.23 and A.23.1
Outbreaks of SARS-CoV-2 infections were reported in the Amuru and Kitgum prisons in August 2020 (refs. 13,14). The SARS-CoV-2 genome sequences from individuals in the prisons belonged exclusively to lineage A (Fig. 2), with three amino acid changes encoded in the spike protein (F157L, V367F and Q613H; Fig. 3) that now define lineage A.23. By October 2020, lineage A.23 viruses were also found outside of the prisons in a community sample from Lira (a town 140 km from Amuru), in two samples from the Kitgum hospital, in several community samples from Kampala, Jinja, Mulago, Tororo and Soroti, as well as in samples collected from two truck drivers at the POE bordering Kenya. By November 2020, the A.23 viruses had spread further to northern Uganda in Gulu and Adjumani, as observed in this study. Lineage A.23 viruses were not seen in Uganda (or anywhere in the world) before August 2020 (Fig. 3c), yet A.23 viruses accounted for 32% of the viruses in Uganda (Fig. 1) from June to August 2020 and 50% of the observed viruses in September-November 2020. In late October, A.23.1, a variant evolving from A.23 with an additional change in the spike protein (P681R), was observed (Fig. 3b).

Important changes observed in the spike protein
The spike protein is crucial for virus entry into host cells and for tropism, and is a critical component of COVID-19 vaccine development and monitoring. The changes in spike protein observed in Uganda and the global A.23 and A.23.1 viruses are shown in Fig. 3b. Many amino acid changes were single events with no apparent transmission observed. However, the initial lineage A.23 genomes from Amuru and Kitgum encoded three amino acid changes in the exposed S1 domain of the spike protein (F157L, V367F and Q613H; Fig. 3b). The V367F change is reported to modestly increase infectivity 15 and the Q613H change may have similar consequences as the D614G change observed in the B.1 lineage found predominantly in Europe and the US; in particular, D614G was reported to increase infectivity, spike trimer stability and furin cleavage 15-18. These changes were not observed in previously reported genomes from Uganda 8. Of some concern, the E484K and N501Y amino acid changes in the receptor-binding domain were observed in the A.23 viruses identified in the Adjumani cases on 9-11 November 2020 (Fig. 3b). These two amino acid changes have been shown to substantially compromise vaccine efficacy and antibody treatments.
Of concern, the recent Kampala and global A.23.1 virus sequences from December 2020 to January 2021 now encoded 4 or 5 amino acid changes in the spike protein (now defining lineage A.23.1) plus additional protein changes in nsp3, nsp6, ORF8 and ORF9 (Figs. 3b and 4). The substitution of proline by arginine at spike position 681 importantly adds a positively charged amino acid adjacent to the cleavage site for the host furin protease. An identical proline to arginine change enhances the fusion activity of the SARS-CoV-2 spike protein in in vitro experiments and this has been proposed to increase spike protein cleavage by the cellular furin protease 19; importantly, a similar change (P681H) is encoded by the recently emerging VOC B.1.1.7, which had spread to 75 countries as of 5 February 2021 (refs. 5,20). There are also changes in the spike N-terminal domain, a known target of immune selection, observed in samples from the Kampala A.23.1 lineage, including P26S and R102I (Fig. 3b). Additionally and importantly, an A.23.1 strain identified in Kampala on 11 December 2020 carried the E484K change in the receptor-binding domain, which raises further concern about this particular variant because it may gain higher transmissibility and enhanced resistance to vaccines and therapeutics. Outside of the spike protein, a single nucleotide change (G27870T) leading to early termination of ORF7b (E39*) was observed in the A.23.1 genomes from the community cases in Tororo in late December 2020. Although the clinical implication of this change is yet to be determined, it is important to document such changes for further follow-up.

Lineage A designations
The viruses detected in Amuru and Kitgum met the criteria for a SARS-CoV-2 lineage 4,21 by clustering together on a global phylogenetic tree, sharing an epidemiological history and a single geographical origin and encoding multiple defining single-nucleotide polymorphisms (SNPs). These features, especially the three spike changes F157L, Q613H and V367F, define the A.23 lineage. Continued circulation and evolution of A.23 in Uganda was observed and two additional changes in spike, R102I and P681R, were observed in December 2020 in Kampala, with the latter amino acid change adding to the list of defining SNPs for the sublineage A.23.1 (F157L, V367F, Q613H and P681R). Additional changes in non-spike regions also define the A. 23
Screening of SARS-CoV-2 genomic data from GISAID (12 March 2021) showed that the A.23 and A.23.1 viruses were present in 26 countries outside of Uganda (Fig. 3c). A.23 was first observed in Uganda in August 2020, subsequently in the US in October and in Kenya and Rwanda in December (Fig. 3c). The first A.23.1 genomes in Uganda were detected in community cases in Mbale on 28 October 2020 and in Jinja on 29 October 2020, and the lineage was spreading across the country by early November 2020. Outside Uganda, A.23.1 was found in England and Cambodia from the end of November and in Rwanda from the beginning of December. Of note, international flights out of Uganda were restarted on 1 October 2020 with flights to Europe, Asia and the US. Phylogenetic analysis supported the evolution of A.23 to A.23.1 (Extended Data Fig. 1).

Additional changes in the A.23 and A.23.1 genomes from Uganda compared to other VOC genomes
Although a main focus has been on spike protein changes, there are changes in other genomic regions of the SARS-CoV-2 virus accompanying the adaptation to human infection. We employed profile Hidden Markov Models (pHMMs) prepared from 44 amino acid peptides across the SARS-CoV-2 proteome 22 to detect and visualize protein changes from the early lineage B reference strain NC_045512. Measuring the identity score (bit-score) of each pHMM across a query genome provides a measure of protein changes in 44 amino acid steps across the viral genome (Fig. 4). This method applied to the A.23 and A.23.1 genome sequences revealed the changes in spike and changes in the transmembrane protein nsp6 and interferon modulators ORF8 and ORF9 (Fig. 4).
We asked if a similar pattern of evolution was appearing in VOCs as SARS-CoV-2 adapted to human infection. We gathered the sets of genomes described in the initial published descriptions of these VOCs (B.1.1.7 (ref. 5 Fig. 4b). Lineage B.1.351 encodes nsp3, nsp6, RDRP, spike and ORF6 changes (Fig. 4c) and lineage P.1 encodes nsp3, nsp6, RDRP, nsp13, spike and ORF8 and ORF9 changes (Fig. 4d). Although the exact amino acids and positions of change within the proteins differ in each lineage, there are some striking similarities in the common proteins that have been altered. Of interest, the nsp6 change present in B.1.1.7, B.1.351 and P.1 is a 3-amino acid deletion (106, 107 and 108) in a protein loop of nsp6 predicted to be on the exterior of the autophagy vesicles on which the protein accumulates 24. Of the three nsp6 amino acid changes of lineage A.23.1, L98F is in the same exterior loop region, whereas the M86I and M183I changes are predicted to be in intramembrane regions but adjacent to where the protein exits the membrane 24 (Extended Data Fig. 2). The A.23.1 ORF8 gene encodes changes in the C-terminal domain (Extended Data Fig. 3). A compilation of the amino acid changes in A.23.1 and the VOC lineages is found in Supplementary Table 1 with proteins that are altered in all four lineages marked in red.

Discussion
We report the emergence and spread of a SARS-CoV-2 variant of the A lineage (A.23.1) with multiple protein changes throughout the viral genome. The pattern of A.23.1 emergence and dominance has also been observed in the neighbouring country of Rwanda 25. A similar phenomenon recently occurred with the B.1.1.7 lineage, detected first in the southeast of England 5 and now globally, and with the B.1.351 lineage in South Africa 6 and the P.1 lineage in Brazil 26, suggesting that local evolution (perhaps to avoid the initial population immune responses) and spread may be a common feature of SARS-CoV-2. Importantly, lineage A.23.1 shares many features found in the lineage B VOCs, including alteration of key spike protein regions, especially the angiotensin-converting enzyme 2 binding region, which is exposed and immunogenic, the furin cleavage site and the 613/614 change that may increase spike multimer formation. The VOC and A.23.1 strains also encode changes in a similar region of the nsp6 protein, which may be important for altering cellular autophagy pathways that promote replication. Changes or disruption of ORF7, ORF8 and ORF9 are also present in the VOC and A.23.1 lineages. ORF8 changes or deletion probably indicate that this protein is unnecessary for replication in humans; similar deletions accompanied SARS-CoV-2 adaptation to humans 27,28.
This study has potential limitations. We report the results of full-genome virus sequencing in a resource-limited region during a period with severe restraints on reagent procurement, travel and laboratory staffing; thus, total numbers were limited to 322 full genomes. Ideally, all positive cases in the country would be sequenced but this was practically not possible. On the other hand, the genome to case percentage we reported was 0.79% (322 genomes/40,490 cases), which is comparable with the case sequencing rate reported in South Africa (0.2%) and Nigeria (0.37%). The geographical origin of the genomes (Extended Data Fig. 5) shows coverage across the country. Certainly, given the small number of genome sequences available from this study and from the region, we should caution that the particular evolutionary pathway proposed in this study (A.23 emergence in Uganda in August, evolution to A.23.1 and then spread to the region and globally) is supported by the available sequencing data but limited by the less than 100% sequence/case coverage and limited sampling in the region. Alternate pathways are possible if, for example, A.23.1 had evolved in Tanzania or another unsampled country and then moved into Uganda. Additionally, Uganda and the East Africa region do not have the resources to provide the detailed surveillance and diagnostic testing seen in Europe or North America, so national or sentinel surveillance may not be as detailed and comprehensive as that occurring in the north. Moreover, the MinION technology, like all other sequencing technologies currently in use (Illumina, Ion Torrent, Sanger Dideoxy), has a sequencing error profile. Nonetheless, MinION has been used to generate about 40% of over 1 million SARS-CoV-2 sequences now available in GISAID and is accepted as a reasonable sequencing technology.
To limit any potential MinION sequencing errors in our sequences, we have reported and analysed only complete, high-coverage sequences (>10,000-fold coverage) and have manually checked all single nucleotide changes and deletions in the assembled genomes.
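The genome-to-case percentage quoted above is simple arithmetic and can be checked directly. The following is a minimal sketch; the function name is illustrative, and the two case counts are the ones given in the Discussion (40,490) and in the abstract (39,424). Both give a value close to 0.8%.

```python
def sequenced_percentage(genomes: int, reported_cases: int) -> float:
    """Return the percentage of reported cases that have a sequenced genome."""
    if reported_cases <= 0:
        raise ValueError("reported_cases must be positive")
    return 100.0 * genomes / reported_cases


# Case counts as quoted in the text (Discussion and abstract, respectively).
discussion_pct = sequenced_percentage(322, 40_490)
abstract_pct = sequenced_percentage(322, 39_424)

print(f"Discussion denominator: {discussion_pct:.3f}%")
print(f"Abstract denominator:   {abstract_pct:.3f}%")
```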
Independent of pangolin lineage assignation, it is clear that a SARS-CoV-2 lineage emerged (A.23) and evolved into a sublineage (A.23.1) that dominated the epidemic in Uganda by January 2021. This can be confirmed independently of pangolin use since we examined the maximum-likelihood phylogenetic trees (Fig. 2 and Extended Data Fig. 1), where the A.23 cluster of genomes is basal to the second, A.23.1 cluster. Also independent of pangolin use, the set of spike substitutions observed in genomes identified as A.23 (F157L, V367F, Q613H) is a clear subset of the substitutions observed in the genomes designated A.23.1 (F157L, V367F, Q613H, P681R). Further support for the pangolin lineage assignation can be seen in the global timing of the observations of the two lineages illustrated in Fig. 3c, with the lineage A.23 cases observed before the A.23.1 samples. Certainly, the temporal pattern could have occurred by chance in a few places due to sequencing capacity and coverage. However, the global temporal pattern, particularly in countries with very extensive sequencing efforts like the UK and US, indicates that the phenomenon is consistent with A.23.1 evolving from A.23 and consistent with the lineage classification by the pangolin tool.
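The subset relationship between the defining spike substitutions can be stated as a set comparison. This is a minimal illustration using only the substitutions listed in the text; the variable names are hypothetical.

```python
# Defining spike substitutions as listed in the text.
A23_SPIKE = {"F157L", "V367F", "Q613H"}
A23_1_SPIKE = {"F157L", "V367F", "Q613H", "P681R"}

# A.23 is a proper subset of A.23.1: every A.23 change is retained
# and at least one further change has been added.
is_proper_subset = A23_SPIKE < A23_1_SPIKE

# The substitution(s) gained on the way from A.23 to A.23.1.
gained = sorted(A23_1_SPIKE - A23_SPIKE)

print(is_proper_subset, gained)  # True ['P681R']
```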
We suspect that emerging SARS-CoV-2 lineages may be adjusting to infection and replication in humans and it is notable that the VOC and A.23.1 lineages share some common features in their evolution. The spike changes are best understood due to the massive global effort to define the receptor and develop vaccines against the infection. The analysis reported in Fig. 4 reveals common functions of SARS-CoV-2 that have been altered in all four variants, especially nsp6, ORF8 and ORF9. The functional consequences of the additional non-spike changes warrant additional studies and the current analysis may help focus efforts on the proteins that are commonly changed in the variant lineages. Finally, determining the susceptibility of A.23.1 to vaccine immune responses is of great importance as vaccines become available in this part of Africa.

Statistics and reproducibility.
No statistical method was used to predetermine sample size. The experiments were not randomized and the investigators were not blinded to allocation during the experiments and outcome assessment.
Sample collection, whole-genome MinION sequencing and genome assembly. SARS-CoV-2 PCR with reverse transcription-positive samples were obtained from the Central Public Health Laboratories (Kampala, Uganda). All testing facilities across the country contribute to the sample collection at Central Public Health Laboratories and the sample catchment area is country-wide, including clinical sites, testing sites at border crossings and commercial laboratories testing the entry and exit of international travellers. The fraction of genomes per district compared to cases per district is shown in Extended Data Fig. 5a and the geographical source of the samples across Uganda is shown in Extended Data Fig. 5b. The samples reported in this manuscript span the period from the first positive case in Uganda (21 March 2020) until 23 January 2021. We attempted to sequence all samples that could be shared with us and each sample was only sequenced once (no replication was performed).
The nucleic acid extracted from samples was converted to complementary DNA and amplified using a SARS-CoV-specific 1,500-base pair amplicon spanning the entire genome as described previously 29 . The resulting DNA amplicons were used to prepare sequencing libraries, barcoded individually and then pooled to sequence on MinION R.9.4.1 flow cells, according to the manufacturer's standard protocol.
Genome assemblies were performed as described previously 8. Briefly, reads from FAST5 files were base-called and demultiplexed using Guppy v.3.6 running on the UMIC HPC. Adaptor and primer sequences were removed using Porechop v.0.2.4 (https://github.com/rrwick/Porechop) and the resulting reads were mapped to the reference genome Wuhan-1 (GenBank NC_045512.2) using minimap2-2.17 (r941) 30 and consensus genomes were generated in Geneious Prime 2021.1.1 (Biomatters). Genome polishing was performed in Medaka v.1.3.4 and SNPs and mismatches were checked and resolved by consulting raw reads. To limit any possible MinION sequencing errors in our sequences, we have reported only high-coverage sequences and have manually checked all single nucleotide changes and deletions in the assembled genomes; non-supported changes have been replaced with Ns.
For the phylogenetic analyses of the Uganda lineage A. 23
The pHMM domain analysis of A.23/A.23.1 and VOC genomes was performed as described previously 22 with some changes. A database of pHMMs was generated from the first 65 lineage B SARS-CoV-2 genome sequences. All 3 forward open reading frames of each genome were translated computationally and then sliced into 44-amino acid segments, with consecutive segments overlapping by 22 amino acids. All 44-amino acid query peptides were then clustered with the uclust module from usearch11.0.667_i86osx32 (ref. 39) and their original identity and coordinates determined by BLASTp search against a protein database made from the NC_045512 reference strain.
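The translation and slicing step can be sketched as follows. This is an illustration only, not the pipeline used in the study: the function names are hypothetical, the standard genetic code is assumed, codons containing ambiguous bases are rendered as "X", and trailing residues that do not fill a complete 44-amino acid window are dropped (the text does not say how such ends were handled).

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nt_seq: str, frame: int) -> str:
    """Translate one forward reading frame (0, 1 or 2); stops become '*'."""
    seq = nt_seq.upper()[frame:]
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")  # 'X' for ambiguous codons
        for i in range(0, len(seq) - 2, 3)
    )


def slice_windows(protein: str, size: int = 44, step: int = 22):
    """Return (start, peptide) pairs; neighbouring windows share size - step residues."""
    return [
        (start, protein[start:start + size])
        for start in range(0, len(protein) - size + 1, step)
    ]


def genome_peptides(nt_seq: str):
    """Yield (frame, start, peptide) for all three forward frames."""
    for frame in range(3):
        for start, peptide in slice_windows(translate(nt_seq, frame)):
            yield frame, start, peptide
```

With a window of 44 and a step of 22, each residue away from the ends of a translated frame is covered by two windows, so a single substitution lowers the score of two adjacent domains.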
Query sets of genomes were processed to remove any genomes containing ambiguous nucleotides, which disrupt the HMM scoring process. The hmmscan function from HMMER v.3.3.2 (ref. 40) was used with the early B database. Query matches were identified using an E-value cut-off of 0.0001; the bit-score values for each hit (a measure of the distance between the query 44-amino acid peptide and the lineage B reference) were collected. Bit-scores for each domain were normalized by dividing each query score by the maximum score for that domain (x/x_max). In all analyses, the original lineage B NC_045512 reference genome was included to define the maximum bit-score.
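The x/x_max normalization can be sketched as below. This is an assumption-laden illustration rather than the study's code: the input is taken to be a nested mapping of genome to domain to bit-score (already parsed from hmmscan output), the function name and domain labels are hypothetical, the example scores are invented, and a domain with no hit in a genome is treated as a score of 0.

```python
def normalize_bitscores(scores_by_genome):
    """Divide each domain bit-score by the maximum score seen for that domain.

    scores_by_genome: {genome_id: {domain_id: bit_score}}
    The reference genome should be included so that it can define x_max.
    """
    # Maximum bit-score per domain across all genomes in the set.
    max_by_domain = {}
    for domain_scores in scores_by_genome.values():
        for domain, score in domain_scores.items():
            max_by_domain[domain] = max(score, max_by_domain.get(domain, 0.0))

    normalized = {}
    for genome, domain_scores in scores_by_genome.items():
        normalized[genome] = {
            # Assumption: a domain without a hit contributes a score of 0.
            domain: domain_scores.get(domain, 0.0) / x_max
            for domain, x_max in max_by_domain.items()
            if x_max > 0
        }
    return normalized


# Invented scores for two hypothetical domains, for illustration only.
example = {
    "NC_045512": {"spike_d31": 100.0, "nsp6_d4": 80.0},
    "query_genome": {"spike_d31": 90.0, "nsp6_d4": 80.0},
}
result = normalize_bitscores(example)
```

In this scheme a normalized value of 1.0 means the query peptide scores as well as the best-scoring genome for that domain, and values below 1.0 flag domains that carry changes relative to it.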

Data collection
We use published software and programs to collect the data, as described in the manuscript.

Data analysis
We use published software and programs to analyse the data, as described in the manuscript.

Data
All genome sequences reported here are deposited in GISAID and available under accession numbers EPI_ISL_954226 to EPI_ISL_954300. A second tranche of genome sequences has been deposited and we are waiting for accession numbers.