Detailed phylogenetic analysis tracks transmission of distinct SARS-CoV-2 variants from China and Europe to West Africa

SARS-CoV-2, the virus causing the COVID-19 pandemic, emerged in December 2019 in China and raised fears that it could overwhelm healthcare systems worldwide. Mutations of the virus are monitored by the GISAID database, from which we downloaded sequences from four West African countries (Ghana, Gambia, Senegal and Nigeria) dated February 2020 to April 2020. We subjected the sequences to phylogenetic analysis employing the nextstrain pipeline. We found country-specific patterns of viral variants and supplemented these with data on novel variants from June 2021. Until April 2020, variants carrying the crucial Europe-associated D614G amino acid change were predominantly found in Senegal and Gambia, whereas Ghana and Nigeria showed combinations of late variants with and early variants without D614G. In June 2021, all variants carried the D614G amino acid substitution. Senegal and Gambia again exhibited variants transmitted from Europe (Alpha or Delta), Ghana a combination of several variants, and Nigeria the original Eta variant. Detailed analysis of distinct samples revealed that some might have circulated latently and some reflect migration routes. The distinct patterns of variants within the West African countries point to their global transmission via air traffic, predominantly from Europe, and only limited transmission between the West African countries.


Results
Phylogenetic tree and diversity. The phylogenetic tree shown in Fig. 1a displays similarities of West-African virus sequences with representative reference sequences from China and multiple European countries. The tree can be divided into two major branches resulting from the A23403G (D614G) substitution. The branch at the bottom is directly associated with the first recorded sequences from Wuhan, China and does not carry the D614G amino acid substitution. The Nigerian samples cluster with these early Chinese samples in the bottom branch of the tree. The branch on top is associated with sequences prevalent in Europe, as demonstrated by reference sequences from Germany, France, Italy, Austria, the Netherlands and the UK. Ghanaian samples are about equally distributed between the top (European) and bottom branch of the tree. Senegalese samples cluster closely with the French reference sample at the top of the tree.
The phylogenetic tree can be viewed interactively via the nextstrain.org framework under the URL: https://nextstrain.org/community/wwruck/wa.
The split of the tree by the A23403G (D614G) substitution into two major branches corresponds to the highest diversity found at that location (Fig. 1b). This mutation resides within the spike protein.
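A per-site diversity profile of this kind can be illustrated with Shannon entropy computed column by column over an alignment. The following is a minimal sketch only: whether entropy is the exact quantity plotted in Fig. 1b is an assumption, and the function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in nats) of one alignment column; gaps and Ns are ignored."""
    counts = Counter(base for base in column.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def site_entropies(aligned_sequences):
    """Per-site entropy for a list of equal-length aligned sequences."""
    lengths = {len(seq) for seq in aligned_sequences}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    return [column_entropy("".join(col)) for col in zip(*aligned_sequences)]


# Toy alignment (hypothetical, not real genomes): only the third site varies.
toy = ["ACAT", "ACAT", "ACGT", "ACGT"]
entropies = site_entropies(toy)

# 1-based position of the most diverse site in the toy alignment
most_diverse_site = max(range(len(entropies)), key=entropies.__getitem__) + 1
```

In a real alignment, a site at which the samples split into two large groups, as at position 23403 here, would stand out as a peak in such a profile.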
Association with clades. We associated the West-African and reference samples via their characteristic amino acid substitutions with clades according to the GISAID nomenclature. The phylogenetic tree in Fig. 2 is coloured by these clades. The West-African samples are distributed over all clades, thus suggesting introductions from China and European countries. However, each of the investigated countries has a specific pattern: most Senegalese samples have close similarity with the French reference, most Nigerian samples cluster in the early Chinese-based clade S, Ghanaian samples are spread over all clades, and the three Gambian samples are distributed over clades V, GR and GH. Within clade S, there are putatively West-African-specific substitutions at the branches at C24370T and G22468T. Ghanaian samples predominate in the branch associated with the C24370T mutation. The branch determined by the mutation G22468T (Supplementary Figure 1) may reflect migration routes because in the nextstrain analysis of all of Africa there are also samples from Mali and Tunisia in this branch (https://nextstrain.org/ncov/africa?f_region=Africa, accessed August 14, 2020). Two of the non-French-related Senegalese samples emanate from the C24370T and G22468T branches whilst the other (Senegal/136) has strong similarity with Spanish end-of-February samples from the early clade S (Supplementary Figure 2), pointing to multiple introductions to Senegal from France, Spain and African countries.
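The association of samples with clades via characteristic substitutions can be sketched as a simple rule-based lookup. The marker sets below are a simplified assumption based on publicly described GISAID clade definitions and are not taken from this study; the function name is hypothetical, and the sketch does not model GISAID's additional categories.

```python
# Simplified nucleotide markers per clade (assumption for illustration only).
D614G = "A23403G"                      # spike D614G, shared by G, GH and GR
GH_MARKER = "G25563T"                  # assumed marker distinguishing GH
GR_MARKER = "G28882A"                  # assumed marker distinguishing GR
S_MARKERS = {"C8782T", "T28144C"}      # assumed markers of clade S
V_MARKERS = {"G11083T", "G26144T"}     # assumed markers of clade V


def assign_clade(mutations):
    """Return a GISAID-style clade label for a collection of nucleotide changes."""
    muts = set(mutations)
    if D614G in muts:
        if GH_MARKER in muts:
            return "GH"
        if GR_MARKER in muts:
            return "GR"
        return "G"
    if S_MARKERS <= muts:
        return "S"
    if V_MARKERS <= muts:
        return "V"
    # No marker found: treated as the reference-like clade L in this sketch.
    return "L"


# Hypothetical samples described only by their nucleotide changes
example_g_derived = assign_clade(["C241T", "C3037T", "A23403G", "G25563T"])
example_early = assign_clade(["C8782T", "T28144C", "C24370T"])
```

Branch-specific changes such as C24370T do not alter the clade label in this scheme; they only subdivide a clade further, as described for clade S above.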
Timeline of clade distribution. In the temporal course of the clade distribution in Fig. 3, the increasing share of the Europe-associated G-clades becomes obvious. The G-clades harbor the putatively more infective D614G amino acid substitution 1,12. Surprisingly, the later Europe-associated G-clades (G, GH, GR) emerged before the earlier clades L, S and V in West African sequenced samples. This could be due to founder effects introduced from France, which is closely connected to Senegal and displays a similar clade distribution, and to migration and travel routes, as in the first registered Nigerian case, who was infected in Italy 13. Furthermore, the China-based L-, V- and S-clade samples were obtained in mid-March 2020, a time point within the Wuhan lockdown when the epidemic in China was nearly over. Thus, the virus may have circulated in several countries before the first samples were sequenced. Surprisingly, the abundance of the S-clade is relatively high, mainly due to the contribution from Nigeria and Ghana. However, without the S-clade distribution, the change in abundance resembles the global one with a delay of about 2-4 weeks.
Figure 4 shows that West African countries have acquired distinct patterns of China- and Europe-based clades. The first row contains the clade distribution charts of the West African countries investigated here whilst the second row contains charts of countries with comparable distributions. Nigeria has the highest percentage of the China-based early clades (L, S, V). Ghana has nearly equally distributed percentages of China- and Europe-based clades (G, GH, GR) and in that sense has similarities with the German distribution. Senegal's clade distribution resembles the one from France but also includes a few samples from the early China-based clades. There were only three sequences from Gambia, two from Europe-based clades GR and GH and one from China-based clade V.
That pattern resembles the one from Italy when the clade G is substituted by the G-derived GH clade, which however does not imply a connection to Italy but rather a similar combination of Chinese and European-related clades. The UK distribution in the last row also shares similarities with the Gambian distribution, but as it also includes Chinese clades it also resembles the one from Ghana. The Dutch distribution, which is quite similar to the German one, also resembles the clade distribution from Ghana. Last but not least, there are the quite distinct distributions from the US West and East Coast (California, CA and New York, NY). The Californian chart is similar to the Nigerian one because of the high percentage of Chinese-based clades, while the chart from New York has a percentage of clade GH comparably high to the ones from France and Senegal.
Highest diversity is at the A23403G (D614G) substitution splitting the tree into the bottom (Chinese) and top (European) branch. This variant has been reported to increase infectivity 1,12. Graphics were generated using the nextstrain pipeline including software Augur (version 7.0.2, https://docs.nextstrain.org/projects/augur/en/stable/index.html) and TreeTime (version 0.7.6, https://github.com/neherlab/treetime) 28.
We set out to further explore the above-mentioned surprising observation (Fig. 3) that in West Africa the early clades emerged after the later Europe-associated G-clades. Possible explanations could be (1) latent circulation of the early clades in West Africa or (2) later introduction of the earlier clades. To find evidence for one of these alternatives, we examined in detail the phylogeny of samples from the earlier clades. We picked two examples from the early clades: sample Senegal/136/2020 comes from a phylogenetic branch predominated by Spanish samples but also including samples from Asia and Latin America (Suppl. Figure 2), and several West African samples from Nigeria (dated March 29th, 2020), Ghana and Senegal in the phylogenetic branch in Suppl. Figure 3 have a long latency time of about 2 months to the common predecessor estimated at January 29th, 2020. Thus, there is evidence for a combination of both explanations: SARS-CoV-2 samples of the early clades may have circulated latently in West Africa since January 2020, but additionally there might have been introductions of the early clades from Europe and Asia or via maritime trade.
Figure 3. Temporal course of clade distribution confirms the gaining share of the Europe-associated G-clades harboring the putatively more infectious D614G amino acid substitution (February-April 2020). Interestingly, the younger Europe-associated G-clades emerged earlier in West African sequenced samples. This could be due to founder effects by introductions from France, which is closely connected to Senegal and displays a similar clade distribution. Furthermore, the China-based L-, V- and S-clade samples started in mid-March at a time when the epidemic in China was nearly entirely suppressed. Thus, the virus may have circulated in several countries before the first samples were sequenced. Surprisingly, the abundance of the S-clade is relatively high, mainly due to Nigeria and Ghana, but without that exception the clade distribution resembles the global one with a delay of about 2-4 weeks. The plot was generated using R (version 3.
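The weekly clade shares underlying a timeline such as Fig. 3 could be tabulated along the following lines. This is a sketch in Python rather than the R code used for the published plot; the function name and the sample records are hypothetical placeholders, not data from this study.

```python
from collections import Counter, defaultdict
from datetime import date


def clade_share_by_week(samples):
    """Map (ISO year, ISO week) to {clade: fraction of that week's samples}.

    `samples` is an iterable of (collection_date, clade) pairs.
    """
    counts = defaultdict(Counter)
    for collected, clade in samples:
        iso = collected.isocalendar()
        counts[(iso[0], iso[1])][clade] += 1

    shares = {}
    for week, clade_counts in sorted(counts.items()):
        total = sum(clade_counts.values())
        shares[week] = {clade: n / total for clade, n in clade_counts.items()}
    return shares


# Hypothetical records for illustration only
records = [
    (date(2020, 3, 2), "G"),
    (date(2020, 3, 4), "GH"),
    (date(2020, 3, 16), "S"),
    (date(2020, 3, 17), "S"),
    (date(2020, 3, 18), "G"),
    (date(2020, 3, 19), "V"),
]
weekly = clade_share_by_week(records)
```

Plotting the fractions per week as stacked areas or bars gives a temporal course of clade distribution; with few samples per week the shares fluctuate strongly, which is one reason the limited sample size matters for this kind of analysis.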

Discussion
In this phylogenetic analysis of SARS-CoV-2 sequences from the West African countries Gambia, Ghana, Nigeria and Senegal, we identified country-specific patterns of earlier (L, S, V) and later Europe-associated (G, GR, GH) clades. In Senegal and Gambia, the later Europe-associated clades were predominant, in Ghana earlier and later clades were more equally distributed, and in Nigeria the earlier clades were predominant among the samples downloaded from the GISAID database in June 2020.
Figure 4. West African countries display distinct patterns of China- and Europe-based clades (until April 2020). Nigeria has the highest percentage of the China-based early clades (L, S, V) and Ghana has nearly equally distributed percentages of China- and Europe-based clades (G, GH, GR). Senegal has a similar clade distribution to France but also a few samples from the early China-based clades. In Gambia there were only three sequences, two from Europe-based clades GR and GH and one from China-based clade V. The charts were generated using R (version 3.
second quartile, with the fourth quartile having the highest risk. There was a lack of data for Senegal and Gambia, hinting at no or only low-level direct air traffic connection to China and thus suggesting a predominant introduction from Europe. Against our expectations, we found that the later European-associated clades (G, GR, GH) emerged before the earlier Chinese-based clades (L, S, V) in the registered cases in the investigated West African countries. We propose the following hypothesis as an explanation for this surprising observation: the early clades were already circulating within the populations before the later European-associated clades were introduced. A higher disease severity of the later European clades might then be a possible explanation for their earlier detection.
Intriguingly, most of the cases investigated in this study occurred within the time interval of the Wuhan lockdown between January 23rd and April 8th, 2020. Thus, transmission of the early clades must have taken place very early or via intermediate countries or other Chinese provinces. Besides the later Europe-associated G-clades, the early clades were also circulating in Europe and on the US West Coast; for example, the Senegal sample no. 136 from the early clade S bears similarity with Spanish samples (Suppl. Figure 2). Other explanations for the relatively long latency may be founder effects, in that by chance individuals infected with the later clades travelled to West Africa before individuals infected with the earlier clades, or slower means of transportation such as ships commuting between China, America, Europe and West Africa.
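The date arithmetic behind these statements is easy to check. The sketch below uses only dates given in the text: the Nigerian samples dated March 29th, 2020, the common predecessor estimated at January 29th, 2020 (both from the Results), and the lockdown interval stated above. The variable names are illustrative.

```python
from datetime import date

sample_date = date(2020, 3, 29)     # Nigerian samples, as dated in the Results
ancestor_date = date(2020, 1, 29)   # estimated date of the common predecessor
lockdown_start = date(2020, 1, 23)  # start of the Wuhan lockdown
lockdown_end = date(2020, 4, 8)     # end of the Wuhan lockdown

# Latency between the estimated common predecessor and the sampled genomes
latency_days = (sample_date - ancestor_date).days

# Whether the sampling date falls within the lockdown interval
within_lockdown = lockdown_start <= sample_date <= lockdown_end

print(latency_days, within_lockdown)
```

The difference is 60 days, which corresponds to the latency of about 2 months, and the sampling date lies inside the lockdown interval.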
Based on previous reports 1, it is likely that the later G clades will replace the early clades in Nigeria and Ghana. Whether that correlates with the severity of the disease still needs to be addressed. Brufsky infers such a correlation from the higher mortality on the East Coast of the USA, with predominantly D614G-carrying G-clades, compared to the West Coast, with the predominant early clades 2. Becerra-Flores et al. found significant correlations between the percentage of D614G and case fatality on a country-by-country basis 20. However, others find evidence for higher transmissibility and also higher viral load but no evidence for higher disease severity 1,21,22. A correlation between the D614G amino acid substitution associated with the G-clades and case fatality in the West African countries can only be identified at a marginal level of r = 0.28 (Supplementary Table 1). The case fatality is fortunately rather low, ranging from 0.6 in Ghana up to 3.2 in Gambia. Other factors such as climate, sunlight exposure 23 and associated vitamin D 24, medical infrastructure and demographics might influence the etiopathology even more. There are also prospects of decreased disease severity, as Benedetti et al. argue that SARS-CoV-2 will mutate continuously and attenuate naturally to become endemic at a low mortality rate 25, as has been observed with earlier viruses 26.
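A correlation coefficient such as the r reported above is commonly computed as Pearson's r between the per-country share of D614G-carrying samples and the case fatality. Whether Pearson's r is the statistic behind the value in Supplementary Table 1 is an assumption, and the input values below are hypothetical placeholders that do not reproduce the data of this study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


# Placeholder inputs (hypothetical): D614G share in percent and case fatality
# for four countries, in the same country order in both lists.
d614g_percent = [20.0, 45.0, 70.0, 95.0]
case_fatality = [1.4, 0.9, 2.1, 1.2]

r = pearson_r(d614g_percent, case_fatality)
```

A value of r close to 0 indicates little linear association, while values near 1 or -1 indicate a strong positive or negative one.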
The limitations of this study are the sample size, possible selection bias of the samples and the intrinsic incompleteness of the phylogenetic analysis, which may lead to altered results when more samples are included. Nonetheless, this is the first study of its kind, and the data and concept should form the basis for a more extensive analysis as an increased number of sequenced samples becomes available.
Supplementary Figure 4 shows the positive rates and total tests performed in the West African countries without and with reference curves from Switzerland, a country which applied no extreme strategies to manage
The distinct distribution plots of SARS-CoV-2 variants in the West African countries showed that there may be more virus transmission via air traffic, predominantly from Europe, than via regional traffic between the West African countries themselves. Indications for this are limited to marginal occurrence of the putatively Nigerian Eta variant 14,15
In conclusion, in this phylogenetic analysis of SARS-CoV-2, we found distinct patterns of viral variants: until April 2020, the later Europe-associated G-clades were predominant in Senegal and Gambia, whereas Ghana and Nigeria showed combinations of the earlier (L, S, V) and later clades. Intriguingly, the later clades emerged before the earlier clades, which could simply be due to founder effects or to latent circulation of the earlier clades. In June 2021, introductions from Europe were again predominant in Gambia (Alpha) and Senegal (Alpha and Delta from India via Europe), with a combination of Europe-associated and other variants in Ghana and a putatively Nigerian-originating Eta variant in Nigeria. Only a marginal correlation between the G-clades in the West African countries until April 2020 and mortality could be identified. Mortality is fortunately at a rather low level, thereby disproving fears that the pandemic would massively overwhelm the health systems in Africa. The rather young population and the climate might be factors favoring this low infectivity and fatality rate in comparison to Western countries, but nevertheless a cautious balance between health protection and economics might prevent future disastrous outbreaks.

Methods
Sample collection. We downloaded SARS-CoV-2 viral sequences for West African samples and reference samples from European, North and South American countries and China from the GISAID database of June 2020. The samples used in this study are shown in Table 1.
Construction of the phylogenetic tree. The phylogenetic tree was constructed using a pipeline adapted from the Zika virus pipeline on the nextstrain.org web page 28. Here, the command from the Zika pipeline was adapted to --keep-polytomies to keep all samples. Metadata information was manually supplemented with country and region information and associated with the tree via a call to Augur: augur traits --tree wa_tree.nwk --metadata metadata_countries.tsv --output wa_traits.json --columns region country --confidence.
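For scripted use, the Augur call quoted above can be assembled as an argument list. This is a hedged sketch: the helper name is hypothetical, the double-dash option spelling assumes that the single dashes in the extracted text stand for standard long options, and `--output` is kept as given in the text although other Augur releases may expect a differently named output option (check `augur traits --help` for the installed version).

```python
import shlex


def build_augur_traits_command(tree, metadata, output, columns, confidence=True):
    """Assemble the argument list for an `augur traits` call (nothing is executed)."""
    cmd = [
        "augur", "traits",
        "--tree", tree,
        "--metadata", metadata,
        "--output", output,     # option name as quoted in the text; may differ by version
        "--columns", *columns,
    ]
    if confidence:
        cmd.append("--confidence")
    return cmd


# File names as quoted in the Methods text
cmd = build_augur_traits_command(
    tree="wa_tree.nwk",
    metadata="metadata_countries.tsv",
    output="wa_traits.json",
    columns=["region", "country"],
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` in an environment where Augur is installed; the sketch only builds and prints the command.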
Augur was called to infer ancestral states of discrete character again using
World map chart. The world map chart was built using the R-package rworldmap 32. Clade distribution pie charts were copied to the distinct country locations. Connections between countries were based on the nextstrain Africa analysis and our own auspice analysis. Further connections between countries were retrieved from literature on virus introductions into countries or regions. The first patient on the West Coast of the United States returned from a journey to Wuhan, China 33. The first introductions in New York came from multiple independent infected individuals mainly from Europe 7. The first cases in France and Europe were Chinese travelers from the predominantly affected Hubei province who entered the country in mid-January and tested positive on January 24th, 2020 34. Patient zero in Germany was a Chinese resident from Wuhan visiting Germany 35. The Italian outbreak started with two Chinese travelers who arrived in Milan-Lombardy, went to Rome later on and tested positive on January 31st, 2020 36. The first Italian citizen was confirmed positive for COVID-19 on February 21st, 2020 in Lombardy 36. In the Netherlands, the first patient, diagnosed on February 27th, 2020, had probably been infected on a trip to Northern Italy between February 18th and 21st 37. The first cases in the UK returned from travels to the Chinese Hubei province and tested positive for SARS-CoV-2 on January 30th, 2020 38.
June 2021 update of the variant distribution and test-positive-rate plots. Variant distribution plots were downloaded from the website nextstrain.org 28,31 on June 22nd, 2021 filtering for Africa and the distinct countries Gambia, Ghana, Nigeria and Senegal. Plots of test-positive-rates and tests per thousand people were downloaded from the website OurWorldInData.org/coronavirus 39 selecting the countries Gambia, Ghana, Nigeria and Senegal for both plots. Data from Switzerland, a country which had neither an extreme pandemic management strategy nor an extreme outbreak, was integrated as reference in additional plots.