Temporal lineage replacements and dominance of imported variants of concern during the COVID-19 pandemic in Kenya

Background: Kenya’s COVID-19 epidemic was seeded in early March 2020, with waves peaking in early August 2020 (wave 1), late November 2020 (wave 2), mid-April 2021 (wave 3), late August 2021 (wave 4), and mid-January 2022 (wave 5).

Methods: Here, we present SARS-CoV-2 lineages associated with the five waves through analysis of 1034 genomes, which included 237 non-variants of concern and 797 variants of concern (VOC) that had increased transmissibility, disease severity or vaccine resistance.

Results: In total, 40 lineages were identified. The early European lineages (B.1 and B.1.1) were the first to be seeded. The B.1 lineage continued to expand and remained dominant, accounting for 60% (72/120) and 57% (45/79) of genomes in waves 1 and 2, respectively. Waves 3, 4 and 5 were dominated by VOCs: Alpha 58.5% (166/285), Delta 92.4% (327/354) and Omicron 95.4% (188/197), respectively. Beta accounted for 4.2% (12/284) during wave 3 and 0.3% (1/354) during wave 4. Phylogenetic analysis suggests multiple introductions of variants from outside Kenya, more so during the first, third, fourth and fifth waves, as well as subsequent lineage diversification.

Conclusions: The data highlight the importance of genome surveillance in determining circulating variants to aid interpretation of phenotypes such as transmissibility, virulence and/or resistance to therapeutics/vaccines.

1) The study appears to use samples collected majorly from some parts of Western, Rift Valley and central regions of Kenya (although this is not clearly stated), how representative is your sample for the whole of Kenya to generalize the findings for the country? A table and/or map showing the sampling density across the country and time will be useful for the readers to understand your findings.
2) The criteria for selecting the samples sequenced needs clarification. The laboratory identified 4,109 positives during the study period but only 483 were genome sequenced. Other than Ct value <33, were there any additional criteria for a sample to be selected for sequencing? There is need to detail how many samples were considered at every stage and perhaps showing this in a sample flow gram.
3) Although work describing the original laboratory methods has been referenced as necessary, the methods in this work have minimal details available. For instance, the whole genome sequencing procedure needs to include reagent cat. no. plus manufacturer name, quantities and concentrations used, number of reaction pools, thermocycling conditions, quality controls (e.g. +ve/-ve control) and extent of multiplexing during sequencing.
4) Related to (3) above, the bioinformatic analysis needs details. For instance: the parameters used with ngs mapper v1.5; the nature of curation done using the Nextclade Web; how the subsampling was done to end up with the few context genomes, given that there are over 4M genomes on GISAID; the exact versions of Pangolin and pangoLEARN used; how multiple sequence alignments were produced; and details of the augur pipeline and parameters.
5) The background section needs references, especially when describing the Kenya epidemic. There is now significant published literature on the Kenya/Africa epidemic, including genomics work, e.g. https://pubmed.ncbi.nlm.nih.gov/34618602/, https://www.nature.com/articles/s41467-021-25137-x, https://pubmed.ncbi.nlm.nih.gov/34473191/, https://www.science.org/doi/10.1126/science.abj4336, and https://www.medrxiv.org/content/10.1101/2021.07.01.21259583v1. This needs to be discussed in the introduction to be clear on what knowledge gaps this study is attempting to fill and the context of the results here.
Minor observations
-It will be interesting to see some analysis on the estimated number of introductions for the different VOCs/VOI and a formal inference on the potential source
-Why is there no Beta specific tree, albeit there being fewer in numbers?
-GISAID requires login to access the genomic data, do the authors plan to deposit the genomic data in a fully public database e.g. GenBank?
-Line 26, not sure about the evidence of CIVET cats as intermediate hosts for SARS-CoV-2
-Line 14 and line 124, delta lineage is "B.1.617.2" NOT "B.1.167.2"
-Line 62, is Congo neighboring Kenya?
-Table 1: Are there participants who were repeatedly sampled? These will skew the participant demographics;
-It will be interesting to see the demographics of the tested individuals (positive vs negative) as this is a less biased sample compared to what is shown on Table 1 in understanding the underlying demographics
-In the methods, add information on how you defined wave (criteria) 1, 2, and 3 periods.
-Parts of line 110-111 are repeated in the discussion section line 179-180. This should be removed from the results section
-Line 206, I don't understand what you mean by "passed"
-It will be great if the authors can also highlight and discuss the limitations of this work.
-Other than genomes, do the authors plan to make the scripts and metadata used in their analysis public? Not seen any link for this.
-If the authors have acquired more genomic data during the review period, they should be encouraged to include it into this report to improve sample size

Reviewer #3 (Remarks to the Author):
This article is well written, original and in adequacy with the current problematic of the variants of SARS-CoV-2. Analysis of variants associated with epidemic peaks in countries is important to understand the phenomenon of viral ecology shift in favor of more contagious variant which could with selective advantages.

Reviewer #4 (Remarks to the Author):
Comments on Kimita et al. "A genomics dissection of Kenya's COVID-19 waves: temporal lineage replacements and dominance of imported variants of concern" The authors report an analysis of SARS-CoV-2 genomic sequences from a portion of the Kenyan Covid-19 epidemic. The work has the potential to be informative about viral movement into and evolution within Kenya. There are however a number of issues with the manuscript and the analysis that could be improved.
1. Line 9 "Kenya's COVID-19 epidemic was slow to peak. It was seeded early in March 2020 and did not peak until late July 2020 (wave 1), mid-November 2020 (wave 2) and late March 2021 (wave 3)." Line 244: "Three COVID-19 waves occurred in Kenya, and by the time of pressing, a 4th wave had emerged."
From the Our World in Data website (https://ourworldindata.org/coronavirus) there were 4 distinct waves in the OWID data with the most recent wave peaking in mid-August, nearly three months ago. For completeness all waves should be covered, including genomic sequencing across the 4th wave.
The publications from the other groups in Kenya who generated these data (with much hard work) should be cited.
5. Line 9: "The data highlight the importance of genome surveillance in determining circulating variants to aid in public health interventions." How are these sequence data used to aid in public health interventions? This should be better described.
6. Line 25: "zoonotic spillover event believed to have been from a progenitor bat coronavirus and civet cats as intermediates (Zhou et al., 2020)." I don't know of any evidence of civet cats as an intermediate for SARS-CoV-2 and there is nothing in the Zhou et al. 2020 reference to support this statement. Perhaps the authors are confusing this with SARS-CoV.
7. Figure 1. Odd X axis, the unit changes, sometimes every day, sometimes every 2 or 3 days. The authors would be better off plotting lineages detected by month or by week at best; it would also be easier to see the lineage changes. The lineages are almost impossible to discern, with a poor choice of color (e.g. there are 6 nearly identical dark blue/blacks used), and the X axis unit is too fine. Also, case numbers across the country should be plotted as a reference to show when the three waves occurred.
8. Figure 2: Venn plot showing unique and shared lineages across the three COVID-19 waves. Without quantitative visualization, this figure is fairly non-informative and could be dropped.
9. Figure 3. "Time-scaled phylogenetic tree of Kenyan samples against global isolates. The tree was constructed with 112 genomes sampled from GISAID and 323 genomes from this study. Thin lines represent context global samples, while thick lines represent Kenyan samples. The different colors on circular tips of branches represent the Pango lineages." How were these 112 GISAID genomes selected? Not clear which nodes are from this study. Not clear what question is being asked with this analysis.
10. Figure 4. Not clear what question is being asked with this analysis. How were global B.1.1.7 sequences selected? Do we need to distinguish every global country? It is hard to discern the Kenyan nodes with similar red, dark oranges used for Kenya, Denmark, Libya, South Sudan.
11. Line 158 "Wuhan/WHO1/2019 reference" Not clear what this is. The authors should specify a GenBank or GISAID accession number.
12. Figure 5. "Phylogenetic tree of the B.1.617.2 lineage from our study samples and those from across Africa. The tree was constructed with 893 genomes, including those from Kenya (n=33), those from other parts of Africa (n=812) and early B.1.617.2 lineages (n=46) traceable to India. Kenyan samples are shown as circular red branch tips. The red stars show the earliest delta variant introduction in late April 2021 from Nairobi samples, while samples from Kisumu (the county that had the first major delta variant outbreak) are contained in clades represented by blue and purple stars."
13. Why were only genomes from African countries included in the global data used for Figure 5? There is much tourist and European traffic into Kenya and a substantial Indian population, so it is equally likely that B.1.617.2 variants entered from many parts of the world. What is the question being asked with this analysis?
14. Line 102 and Table 1 demographics. Alone, the demographic data of the sequenced samples are not very meaningful without the sample parameters for all COVID-19 cases and for all of Kenya for comparison. Is the median age of 33 different from the median age of all diagnosed COVID-19 cases and from the median age of the total population? The data from all cases is needed if the authors want to argue that there is no bias in the sequenced samples. I suspect the pattern will be similar to Kenya from the region sampled. But again, what is the question being asked with this analysis?
15. Line 109 "Each wave was preceded by low infection rates, probably as variants competed through narrow transmission bottlenecks that selected the fittest variants, some of them to eventually become the dominant variants in succeeding waves (Lythgoe et al., 2021)." This is speculation and should not be in the results section. There could be a lot of reasons for these patterns and the authors present this as a fact, which can be misleading. Also, I doubt that the selection occurred in Kenya; the VOCs B.1.1.7 and B.1.351 entered from the UK or South Africa already fit.
16. Line 21: "has literally been the 2020/21 blockbuster virus" Considering the large number of deaths and the long-term sequelae of this infection, referring to the virus as a "blockbuster" (which has bestseller book and movie connotations) seems glib and an inappropriate word choice. I would change this.

The authors made no mention of how they aligned their sequences or which tools and settings were used. No information on the phylogenetic analysis was provided, with no mention of the tree model or its properties. The authors should include details of all bioinformatic methods/tools used, not a summary.
We thank the reviewer for catching this omission. Details requested by the reviewer are now provided. See lines 140 to 155 and lines 156 to 161 in the MS with tracked changes.

Results
The information provided in relation to the sequenced samples is not sufficient. The authors are advised to include other information such as clinical condition/outcome, occupation, recent travel history, economic status, etc.
Additional epidemiological data available to us are now provided in Table 1 and Supplementary Table 1. See lines 166 to 172 in the MS with tracked changes. Unfortunately, being a testing laboratory, we did not receive enough epidemiological information.
The structure of the results section also needs to be improved; the results seem to be lumped together, which may confuse readers. The authors also need to provide a separate table detailing the lineages identified in each wave and their sources/locations. Table 1 also needs to be expanded to include the additional information mentioned above.
The Results section has been restructured with subtitles to make reading easier. A table detailing the lineages identified in each wave, with their sources and locations, is now provided as Supplementary Table 1. Also, as stated above, we did not receive enough epidemiological information.
Lines 129 to 175 are written in the form of figure legends and not results. The authors are advised to move them into a separate section under the heading "Figure legends", and to write out the data generated under the results section rather than combining both.
We are not sure we understand the reviewer's concern. Figure legends are provided following their first mention. Nevertheless, we have tried to make the legends and results more distinct.
The authors also need to support their findings with evidence showing spatial dispersal and phylogeographic movements of the various strains identified through the 3 waves under consideration in this study; a good example of such analysis can be seen in "Tegally et al. Nat Med. 2021 Mar;27(3):440-446." This will go a long way in highlighting the relationships between the origins and multiple introductions of the virus variants through the waves.
Unfortunately, most of our samples did not come with enough epidemiological information to allow the analysis proposed by the reviewer. We have nevertheless speculated on sources, especially for VOCs.

Reviewer #2 (Remarks to the Author):
1) The study appears to use samples collected majorly from some parts of Western, Rift Valley and central regions of Kenya (although this is not clearly stated), how representative is your sample for the whole of Kenya to generalize the findings for the country? A table and/or map showing the sampling density across the country and time will be useful for the readers to understand your findings.
The reviewer is correct that our study genomes are not representative of the whole country. We now state the number of samples received from each region (see lines 171 to 172). We have also included Supplementary Table 1 that shows place and date of sample collection. Lastly, we acknowledge the data skew as a limitation (see lines 465 to 468 in the manuscript with tracked changes).
2) The criteria for selecting the samples sequenced needs clarification. The laboratory identified 4,109 positives during the study period but only 483 were genome sequenced. Other than Ct value <33, were there any additional criteria for a sample to be selected for sequencing? There is need to detail how many samples were considered at every stage and perhaps showing this in a sample flow gram.
We have now provided additional information on the criteria for sample selection. We say "Of the 1089 COVID-19 nasal specimens that passed the threshold for whole genome sequencing (Cts <33), 45 were dropped because they did not pass the threshold required for assigning Pango lineages. Ten additional samples were dropped because they lacked date of collection. The remaining 1034 genomes collected between May 2020 and January 2022 were used to monitor the evolution of SARS-CoV-2 lineages across the five COVID-19 waves." See lines 124 to 127 in the manuscript with tracked changes.
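The sample flow described in this response (Ct < 33, then Pango lineage assignable, then date of collection present) can be written as a short filter chain. The following is a minimal sketch in Python and not the authors' actual pipeline: the `Sample` record, its field names and the toy data are hypothetical, and only the Ct threshold and the reported counts (1089 - 45 - 10 = 1034) come from the text above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

CT_THRESHOLD = 33.0  # from the response: only samples with Ct < 33 were sequenced


@dataclass
class Sample:
    # Hypothetical record layout, for illustration only.
    sample_id: str
    ct: float
    lineage: Optional[str]     # None when no Pango lineage could be assigned
    collected: Optional[date]  # None when the date of collection is missing


def sample_flow(samples):
    """Count samples remaining (and dropped) at each selection stage."""
    sequenced = [s for s in samples if s.ct < CT_THRESHOLD]
    with_lineage = [s for s in sequenced if s.lineage is not None]
    analysed = [s for s in with_lineage if s.collected is not None]
    return {
        "positive": len(samples),
        "sequenced": len(sequenced),
        "dropped_no_lineage": len(sequenced) - len(with_lineage),
        "dropped_no_date": len(with_lineage) - len(analysed),
        "analysed": len(analysed),
    }


# Toy data: one sample fails each stage and one passes all three.
toy = [
    Sample("a", 21.4, "B.1", date(2020, 7, 30)),
    Sample("b", 35.0, None, date(2020, 8, 2)),   # Ct above threshold
    Sample("c", 28.9, None, date(2021, 4, 11)),  # no lineage assigned
    Sample("d", 30.1, "B.1.617.2", None),        # no collection date
]
flow = sample_flow(toy)

# Consistency check on the counts reported in the response.
reported_analysed = 1089 - 45 - 10  # equals 1034
```

Counts of this kind, one per stage, are what a sample flow diagram of the sort requested by the reviewer would display.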
3) Although work describing the original laboratory methods has been referenced as necessary, the methods in this work have minimal details available. For instance, the whole genome sequencing procedure needs to include reagent cat. no. plus manufacturer name, quantities and concentrations used, number of reaction pools, thermocycling conditions, quality controls (e.g. +ve/-ve control) and extent of multiplexing during sequencing.
We thank the reviewer for catching these discrepancies. The Methods section has been elaborated to include more details than previously provided. See lines 90 to 103 in the manuscript with tracked changes.
4) Related to (3) above, the bioinformatic analysis needs details. For instance: the parameters used with ngs mapper v1.5; the nature of curation done using the Nextclade Web; how the subsampling was done to end up with the few context genomes, given that there are over 4M genomes on GISAID; the exact versions of Pangolin and pangoLEARN used; how multiple sequence alignments were produced; and details of the augur pipeline and parameters.
We again thank the reviewer for catching these discrepancies. More details are now provided in the Methods section. See lines 129 to 161 in the manuscript with tracked changes.

We have now expanded the references in the background section and included the relevant literature, as suggested by the reviewer.
Minor observations
-It will be interesting to see some analysis on the estimated number of introductions for the different VOCs/VOI and a formal inference on the potential source
Unfortunately, most of our samples did not come with enough epidemiological information to allow the analysis proposed by the reviewer. We have nevertheless speculated on sources, especially for VOCs.

See lines 277 to 359 in the manuscript with tracked changes.
-Why is there no Beta specific tree, albeit there being fewer in numbers?
We have now included a Beta VOC phylogenetic tree. See Figure 4, described in lines 310 to 324 in the manuscript with tracked changes.
-GISAID requires login to access the genomic data, do the authors plan to deposit the genomic data in a fully public database e.g. GenBank?
Yes, once our manuscript is published, we will deposit the genomic data in a freely accessible public database.
-Line 62, is Congo neighboring Kenya?
This has been corrected (see line 84 in the manuscript with tracked changes).
-Table 1: Are there participants who were repeatedly sampled? These will skew the participant demographics;
No, the data do not include participants with repeat sampling.
-It will be interesting to see the demographics of the tested individuals (positive vs negative) as this is a less biased sample compared to what is shown on Table 1 in understanding the underlying demographics
We agree with the reviewer that a comparison of demographics between individuals who tested positive vs negative would be interesting. Our study, however, was on individuals who provided usable genome sequences, not on all those who tested positive.
-In the methods, add information on how you defined wave (criteria) 1, 2, and 3 periods.
-It will be great if the authors can also highlight and discuss the limitations of this work.
A paragraph highlighting limitations of the study has now been added. See lines 465 to 468 in the manuscript with tracked changes.
-Other than genomes, do the authors plan to make the scripts and metadata used in their analysis public? Not seen any link for this.
We thank the reviewer for this suggestion. We had not planned on doing this, but come to think of it, we should. We will eventually make them accessible to the public.
-If the authors have acquired more genomic data during the review period, they should be encouraged to include it into this report to improve sample size
Great suggestion. We have done just that. We now include data collected up to January 2022.

Reviewer #3 (Remarks to the Author):
This article is well written, original and in adequacy with the current problematic of the variants of SARS-CoV-2. Analysis of variants associated with epidemic peaks in countries is important to understand the phenomenon of viral ecology shift in favor of more contagious variant which could with selective advantages.

Reviewer #4 (Remarks to the Author):
Comments on Kimita et al. "A genomics dissection of Kenya's COVID-19 waves: temporal lineage replacements and dominance of imported variants of concern" The authors report an analysis of SARS-CoV-2 genomic sequences from a portion of the Kenyan Covid-19 epidemic. The work has the potential to be informative about viral movement into and evolution within Kenya. There are however a number of issues with the manuscript and the analysis that could be improved.

This has been corrected. See lines 10-18 in the manuscript with tracked changes.
Line 244: "Three COVID-19 waves occurred in Kenya, and by the time of pressing, a 4th wave had emerged."
From the Our World in Data website (https://ourworldindata.org/coronavirus) there were 4 distinct waves in the OWID data with the most recent wave peaking in mid-August, nearly three months ago. For completeness all waves should be covered, including genomic sequencing across the 4th wave.
Great suggestion. We have done just that. We now include data collected up to January 2022, covering all five waves.
The publications from the other groups in Kenya who generated these data (with much hard work) should be cited.
We could not find a specific paper on the Beta variant, but we have cited the policy brief reported by the Wellcome Trust group in Kilifi, Kenya (KEMRI, 2021). We say "The Beta VOC (B.1.351 lineage) appears to have been introduced in Kilifi County, coastal Kenya in January 2021 (KEMRI, 2021)". See lines 415 to 422 in the manuscript with tracked changes.
5. Line 9: "The data highlight the importance of genome surveillance in determining circulating variants to aid in public health interventions." How are these sequence data used to aid in public health interventions? This should be better described.
We have rephrased the sentence to say "The data highlight the importance of genome surveillance in determining circulating variants to aid interpretation of phenotypes such as transmissibility, virulence and/or resistance to therapeutics/vaccines". See line 26 in the manuscript with tracked changes.
6. Line 25: "zoonotic spillover event believed to have been from a progenitor bat coronavirus and civet cats as intermediates (Zhou et al., 2020)."
7. Figure 1. Odd X axis, the unit changes, sometimes every day, sometimes every 2 or 3 days. The authors would be better off plotting lineages detected by month or by week at best; it would also be easier to see the lineage changes. The lineages are almost impossible to discern, with a poor choice of color (e.g. there are 6 nearly identical dark blue/blacks used), and the X axis unit is too fine. Also, case numbers across the country should be plotted as a reference to show when the three waves occurred.
We agree with the reviewer that the x-axis scale was too fine. We have now collapsed the x-axis into a month-year system and also collapsed the lineages based on their virulence, i.e. VOC/VOI vs. non-VOC/VOI. Case numbers across the country have now been plotted and are available as "Extended data 1".
8. Figure 2: Venn plot showing unique and shared lineages across the three COVID-19 waves. Without quantitative visualization, this figure is fairly non-informative and could be dropped.
We humbly disagree with the reviewer. Although the Venn plot is not quantitative, it provides a good visual summary of lineage relationships across the waves.
As stated in the text, the figure shows the phylogenetic placement of the study genomes in a global context. In the new figure, we place more emphasis on VOIs or VOCs, and color these branches appropriately.
10. Figure 4. Not clear what question is being asked with this analysis. How were global B.1.1.7 sequences selected? Do we need to distinguish every global country? It is hard to discern the Kenyan nodes with similar red, dark oranges used for Kenya, Denmark, Libya, South Sudan.
Figure 4 has been re-done. The analysis sought to identify how the Alpha variants in Kenya relate phylogenetically to those sub-sampled globally. We also track lineage introductions and subsequent diversification. The study samples are now shown as deep blue circular tips, other Kenyan genomes as light blue tips and global genomes as yellow circular tips.

This tree, now
11. Line 158 "Wuhan/WHO1/2019 reference" Not clear what this is. The authors should specify a GenBank or GISAID accession number.
Thank you for noting this omission. The GenBank accession number has been added. See line 303 in the manuscript with tracked changes. We have also added GenBank/GISAID accession numbers for the SARS-CoV-2 references we used to root all the trees.
12. Figure 5. "Phylogenetic tree of the B.1.617.2 lineage from our study samples and those from across Africa. The tree was constructed with 893 genomes, including those from Kenya (n=33), those from other parts of Africa (n=812) and early B.1.617.2 lineages (n=46) traceable to India. Kenyan samples are shown as circular red branch tips. The red stars show the earliest delta variant introduction in late April 2021 from Nairobi samples, while samples from Kisumu (the county that had the first major delta variant outbreak) are contained in clades represented by blue and purple stars."
13. Why were only genomes from African countries included in the global data used for Figure 5? There is much tourist and European traffic into Kenya and a substantial Indian population, so it is equally likely that B.1.617.2 variants entered from many parts of the world. What is the question being asked with this analysis?
The reviewer raises a good point. We have now changed the approach and used a global sub-sample rather than a regional one, given the high traffic between Kenya and the rest of the world. The analysis sought to identify how the Delta variants in Kenya relate phylogenetically to those sub-sampled globally. We also track first introductions and subsequent diversification, especially for clusters such as AY.46 and AY.16 that were driving the outbreak in Kenya.
14. Line 102 and Table 1 demographics. Alone, the demographic data of the sequenced samples are not very meaningful without the sample parameters for all COVID-19 cases and for all of Kenya for comparison. Is the median age of 33 different from the median age of all diagnosed COVID-19 cases and from the median age of the total population? The data from all cases is needed if the authors want to argue that there is no bias in the sequenced samples. I suspect the pattern will be similar to Kenya from the region sampled. But again, what is the question being asked with this analysis?
We humbly disagree with the reviewer's view on Table 1. First, the Table shows data on individuals who contributed genomes, not to be confused with those who had SARS-CoV-2 by PCR. There is a bias in the sequenced samples in that only those with enough viremia (Ct <33) were sequenced. In the section titled "Sample acquisition" we say: "Of the 63,542 tested, 9.0% (n=5,754) were positive for SARS-CoV-2 at varying cycle thresholds (Ct). Of these positive samples, 1089 with Cts <33 were selected for whole genome sequencing". See lines 91 to 92 in the manuscript with tracked changes.
15. Line 109 "Each wave was preceded by low infection rates, probably as variants competed through narrow transmission bottlenecks that selected the fittest variants, some of them to eventually become the dominant variants in succeeding waves (Lythgoe et al., 2021)." This is speculation and should not be in the results section. There could be a lot of reasons for these patterns and the authors present this as a fact, which can be misleading. Also, I doubt that the selection occurred in Kenya; the VOCs B.1.1.7 and B.1.351 entered from the UK or South Africa already fit.
The statement has now been removed.
16. Line 21: "has literally been the 2020/21 blockbuster virus" Considering the large number of deaths and the long-term sequelae of this infection, referring to the virus as a "blockbuster" (which has bestseller book and movie connotations) seems glib and an inappropriate word choice. I would change this