The evolution and phylodynamics of serotype A and SAT2 foot-and-mouth disease viruses in endemic regions of Africa

Foot-and-mouth disease (FMD) is a major livestock disease with direct clinical impacts as well as indirect trade implications. Control through vaccination and stamping-out has successfully reduced or eradicated the disease from Europe and large parts of South America. However, sub-Saharan Africa remains endemically affected with 5/7 serotypes currently known to be circulating across the continent. This has significant implications both locally for livestock production and poverty reduction but also globally as it represents a major reservoir of viruses, which could spark new epidemics in disease free countries or vaccination zones. This paper describes the phylodynamics of serotypes A and SAT2 in Africa including recent isolates from Cameroon in Central Africa. We estimated the most recent common ancestor for serotype A was an East African virus from the 1930s (median 1937; HPD 1922–1950) compared to SAT2 which has a much older common ancestor from the early 1700s (median 1709; HPD 1502–1814). Detailed analysis of the different clades shows clearly that different clades are evolving and diffusing across the landscape at different rates with both serotypes having a particularly recent clade that is evolving and spreading more rapidly than other clades within their serotype. However, the lack of detailed sequence data available for Africa seriously limits our understanding of FMD epidemiology across the continent. A comprehensive view of the evolutionary history and dynamics of FMD viruses is essential to understand many basic epidemiological aspects of FMD in Africa such as the scale of persistence and the role of wildlife and thus the opportunities and scale at which vaccination and other controls could be applied. Finally we ask endemic countries to join the OIE/FAO supported regional networks and take advantage of new cheap technologies being rolled out to collect isolates and submit them to the World Reference Laboratory.

Southern African Territories (SAT) 1, 2 and 3 serotypes found predominantly in sub-Saharan Africa, serotype C (which has not been reported since 2004 6 ) and Asia1 (found in Asia).
FMD has historically been thought of as important only in middle or high income economies with highly improved breeds and intensive production systems. Introductions into these systems can have devastating effects such as in the UK 2001 outbreak 7 . FMD is often not considered a priority in endemic regions where low production and extensive livestock systems predominate 8 . However, there is growing evidence to support the alternative argument that at the individual household level FMD is important in terms of direct losses of production 9 , lower fertility and loss of other livestock services such as draft power 10 . Furthermore, it has indirect impacts preventing access to international markets, limiting genetic improvement and preventing development of diary production 8,10,2 . Also, there is a global societal good to control of FMD in endemic areas, as these reservoirs of disease present a clear and present risk to disease free areas or areas using vaccination, through trade in animal products and animal movements as demonstrated by the recent rapid expansion of the serotype O/ME-SA/IND-2001 lineage across the Middle East and North Africa Southeast and East Asia from Indian sub-continent 11,12 .
Sub-Saharan Africa (SSA) is an enormous geographical area with a huge diversity of cultures, cattle rearing practices and wildlife, including the Cape buffalo (an important reservoir of SAT serotypes), which influence the epidemiology of FMD in different regions 13 . Currently 5 of the 7 serotypes of FMD virus (O, A, SAT1, SAT2 and SAT3) are reported to be circulating in SSA 5,14,13 , and West and Central Africa have had at least 4 of these serotypes (O, A, SAT1 and SAT2) in recent years [15][16][17][18] . However, because much of SSA has endemic FMD there are relatively few submissions to the World Reference Laboratory (WRLFMD), at The Pirbright Institute, from Africa. This lack of basic surveillance data on FMD viruses has major implications for control both locally, regionally and internationally.
FMD control has been very successful in many parts of the world, such as Europe and South America and is making significant impacts now in S.E. Asia 19 . These have largely been based around vaccination campaigns at 4-6 month intervals and movement controls. In addition, recent support from the Food and Agricultural Organization (FAO) through the European Commission for the Control of FMD (EuFMD) together with the World Organization for Animal Health (OIE) has led to the development of the progressive control pathway (PcP) for FMD 19 that helps countries frame their control efforts and provides realistic step-wise targets for national governments to help focus efforts. A key element in this pathway is the development of robust surveillance systems to allow regular collection of FMD viruses for typing and sequencing as well as vaccination. However, the details of what individual countries in Africa are doing in regards to surveillance and vaccination and the number of outbreaks they experience is not generally available. Endemic areas in SSA experience regular clinical outbreaks although the serotypes may be different with each epidemic wave 4,20 . Through better surveillance information, understanding of livestock movements 21 and contact with wildlife and international cooperation and coordination it may be possible to isolate and control individual serotypes in certain zones or to use natural geographical barriers to help zone off areas for control. The emerging view is that FMD control produces a significant amount of public good, justifying the need for national and international public investment in control 2 .
The FMD virus genome encodes for a single polyprotein that is post-translationally processed into 4 capsid proteins (VP1-4) and 10 non-structural proteins, bounded by 5 and 3 untranslated regions 22 . The evolutionary changes observed in phylogenetic studies of FMD viruses reflect both genetic drift of strains through repeat selection via transmission bottle necks as small sub-populations of viruses spread from animal to animal [23][24][25] , as well as possible recombination events in the non-structural coding regions 26,27 . Molecular epidemiology and evolution of FMDV has traditionally been studied using the sequences coding for VP1 (627-657 nt, depending on serotype), the most variable capsid structural protein containing relevant antigenic domains. This region reliably classifies the viruses into serotypes, topotypes and strains for most practical epidemiological applications 13 .
In 2012 we conducted a follow-up study to our 2000 study 15,4 , of circulating FMD viruses in the Adamawa and the Northwest Regions of Cameroon, to investigate the genetic changes in the serotypes and strains over the intervening 12 years and understand their relationship to the relatively small number of other sequences submitted from Africa over this period particularly in relation to the North African outbreaks 28 . This is important as FMD surveillance in SSA is still fragmented and at very low coverage. This paper describes the latest patterns of serotypes A and SAT2 in SSA, as these were the dominant strains circulating in Cameroon at the time, estimates the dates of most recent common ancestors for key lineages and topotypes, and discusses the different rates of spread of key lineages and topotypes.

Results
Circulation patterns in serotype A. A set of 875 serotype A VP1 sequences was compiled from GenBank, the WRLFMD and Cameroon studies, and used to create a maximum likelihood phylogenetic tree ( Fig. 1) showing the relationship between strains on different continents. The data set consists of African (19%), Asian (63%), European (3%) and South American (15%) sequences, and shows three global clades of predominantly Asian (purple), South American and European (blues) and African (red) taxa. There are three African sequences within the South American and European clade (from Morocco andLibya 1977-1983), and seven African sequences within the Asian clade (from Libya in 2009 and Egypt 2010-2013), showing that there are occasional incursions of non-African lineages into Northern Africa, however the majority of the circulation in Africa is from African strains, and no sequence from elsewhere clusters with the main African clade.
Next, to look at the transmission pattern of serotype A within Africa, we used sequences from Africa clustering within the main African clade (154 in total -also see Fig. S1) to perform a time resolved phylogeographic analysis using BEAST. Region labels were added to the sequences (Central, Eastern, Northern and Western; there were no Southern), and the number of transmissions between regions were inferred using an asymmetric discrete traits model on the time resolved trees. Figure 2 shows the time resolved tree of the African serotype A sequences, www.nature.com/scientificreports www.nature.com/scientificreports/ with the branches and nodes coloured by inferred discrete region trait and the size of the ancestral nodes reflecting the probability of the inferred region at that point.
The tree in Fig. 2 indicates that all these serotype A sequences have an Eastern Africa most recent common ancestor in the 1930's (median 1937 with 95% HPD: 1922-1950), and suggests that there is an interchange of viruses between Northern, Central and Western Africa (genotypes A-IV, V and VI), and between Eastern only, or Eastern and Northern Africa separately (genotypes A-I and A-VII). From the tree, it can also be seen that there is only one case of Central to Eastern transmission (within A-IV) related to a sample from Eritrea in 1998 (A/ ERI/3/98), highlighting that although Eastern-Central transmission is possible, it does not seem to occur very frequently.   Figure 3 (and Fig. S2) shows the spatial distribution of the sampling locations, together with the approximate spatial extent of the genotypes. From these it can be seen that A-I, II, III and A-VII are in Eastern Africa, and overlap with A-IV in the Northern region only.
Overall, the pattern suggests that the serotype A viruses are potentially circulating in at least 2 quite separate pools in Central-Western Africa and Eastern Africa, and that Cameroon and Central Africa more widely may be an important bridging zone between virus pools (Fig. 3). However as with all these analyses in Africa, note the actual virus sampling coverage is extremely low so phylogenetic analyses might give a distorted view of the true transmission pattern.

Serotype SAT2 Diversity.
A set of all available good quality SAT2 VP1 sequences was compiled from GenBank, the WRLFMD and Cameroon studies, of these only 9 were from outside Africa. These were in Saudi Arabia, Bahrain, Yemmen and Palistine (Gaza Strip), showing that the SAT2 serotypes are indeed mostly restricted to Africa with only occasional incursions into the Middle East. Using a set of 334 SAT2 VP1 sequences (see also  Fig. S6) shows the approximate spatial distribution of these topotype clades from the sampling locations. Similar to the overall pattern observed for serotype A, the SAT2 strains also show distinct topotype clades in Eastern-Southern Africa and Central-Western Africa, and the SAT2 strains from Cameroon fall into topotype VII, which is very widely geographically spread.
The time resolved maximum clade credibility tree for all 334 SAT2 sequences (Fig. S5) additionally indicates that the topotype clades are diverse; in contrast to the serotype A tree, the SAT2 tree shows a much longer time to most recent common ancestor (TMRCA for SAT2 is 1709 with 95% HPD:1502-1814), implying that the SAT2 serotype has been circulating in Africa for at least a few centuries. Topotype VII appears to have a common ancestor from the 1950s (1954: 95% HPD:  in East Africa and the more recent isolates in Cameroon from 2012/13 all appear to have a common ancestor with those from 2000 but not apparently directly descended from them. Genetic Diversity. The average pairwise distance between all 154 African serotype A amino acids sequences is 13% (28 amino acids -also see Figs S7 and S9), and the average pairwise distance between the 72 sequences in A-IV, V and VI is 8% (20 amino acids). The genetic diversity between SAT2 topotype clades as estimated from amino acid sequences is quite large; and the average pairwise distances between sequences from one topotype to another is 18% (41 amino acids from alignment length of 224), whereas the average pairwise distance within a topotype is 7% (17 amino acids -also see Figs S8 and S9). It is likely that the differences between SAT2 topotypes represent distinct antigenic variants 29 , and in comparison the average amino acid difference between H3N2 human seasonal influenza vaccine updates is 12 (range 6-20) amino acids in HA1 (length 344, and the most variable part of the HA surface protein, equivalent to the partial VP1 sequences here).      www.nature.com/scientificreports www.nature.com/scientificreports/ under positive and negative (purifying) selection in the receptor binding site RGD, and Table 2 summarises the sites under directional selection.
In all clades there were many sites under negative selection as expected, and the RGD site is conserved and under negative selection in the SAT2 serotypes. However, in serotype A not all positions of the RGD site are conserved or are detected as under negative selection, particularly in clade A-VII. None of the clades have positively selected RGD sites. In clades SAT2-IV, SAT2-VII, and in the A clades, there are a few sites under positive selection and these tend to occur in the two surface exposed antigenic regions identified by Reeve et al. 30,29 . Additionally, several sites identified as under directional selection also occur in these antigenic regions. Interestingly, A-IV has more sites under directional selection than the other clades, including sites near the RGD sites. The sites under directional selection in the antigenic regions suggest that particular antigenic variants might be becoming more widespread over time (especially for A-IV).

Discussion
FMD remains a major global problem both for developed economies, where it has largely been eradicated but remain under threat of reintroductions and in low and middle income economies where it is under varying levels of control but still impacts production and causes trade blocks. SSA remains as a major reservoir of at least 5 serotypes of FMD viruses and until controls are improved in these areas it will remain a threat to global livestock production. This analysis using phylogeographical approaches highlights a number of key issues in FMDV epidemiology in SSA. It suggests that serotype A may have only been introduced relatively recently, and that East Africa may have been the origin, in the late 1930s, for the North, Central and West African isolates studied. In contrast the SAT2 serotype appears to have much older origins going back hundreds of years (and may have originated in Africa thousands of years ago in wildlife). Interestingly, the phylogenetic tree is also dominated by East African viruses and suggests that the appearance of SAT2 in Central Africa may be linked to East Africa and a common ancestor from the 1970s. The isolates from 2005 and later in Central and West Africa and the outbreaks in North Africa all appear to have a common ancestor in the 1970s or 1980s.
It is impossible to fully understand the degree of virus exchange between different regions as these areas are considered endemic for FMD and have not been routinely submitting isolates or reporting outbreaks. However, it is clear that from time to time there are transmission events between regions. However, again the lack of available isolates makes interpretation of these data difficult and there is a real risk of over interpretation. Furthermore, it is also not possible to say if vaccination is driving evolution of viruses in the region due to lack of data on outbreaks, vaccination coverage and vaccine use.
This analysis also highlights that although there is cross over between regions there appear to be antigenically distinct viruses circulating in different parts of Africa and therefore there needs to be more work done to identify strains that will provide protection in these different regions. Additionally, it seems that the clades SAT2-VII and A-IV containing Central, Western and Northern strains are not only distinct from their Eastern and Southern counter parts, but appear to be undergoing faster evolution and faster spatial spread. There are signatures of positive and directional selection in antigenic regions close to the receptor binding site of the VP1 protein in these clades, which are consistent with the faster evolutionary rates, and consequently provide an intriguing hypothesis to be tested in the future: that the increased spread of these clades could be due to a genetic change in the virus. An alternative hypothesis is that the increased spread of these clades is due to increased cattle movements (or human movements of animal products) particularly between the Central and Northern African regions. This analysis has also highlighted our lack of data and understanding of the epidemiology of FMD viruses in SSA and points to the need for a major scaling up of surveillance and sampling across SSA to start to understand key questions on the flow and circulation of FMD viruses in SSA, the scale of persistence of strains and the role of wildlife. Without much higher coverage, whole genome sequences and spatial resolution of phylogenetic data it will be difficult to design strategies for FMD control across the continent. However, the development of simple to use tools such as the lateral flow device 31 which could be extended to include African serotype specific bands 32 would give local veterinary services the means to type viruses in the field. This would be a major step forward for local veterinary staff and livestock keepers to understand the complexity of the epidemiology of FMD due to the number of different serotypes and start the education process needed to drive controls from the bottom up. Furthermore, these devices can be safely and cheaply submitted to reference laboratories where the virus can be eluted and sequenced 33 . Additionally mobile field sequencing technology is also now available as has been used in the Ebola outbreak in West Africa 34 . This also has the potential to increase the number of sequences, particularly  www.nature.com/scientificreports www.nature.com/scientificreports/ whole genomes and thus help improve local understanding as well as international understanding of the phylogeography of FMD viruses, recombination events and vaccine driven selection in SSA.

Methods
Most of the sequences used in this study were taken from existing databases but a small set of new isolates from Cameroon were collected for this study. Details of field sample collection and processing are given in the supplementary material. This field work was carried out with the Cameroonian Ministère de l'Élevage des Pêches et Industries Animales (MINEPIA) staff, following local guidelines and regulations, as part of their disease surveillance activities and had approval from the Cameroonian Academy of Science.
GenBank Data. Global data sets of the VP1 region from serotypes A and SAT2 were downloaded from GenBank; an initial search for VP1 (or 1D) regions gave >900 and >400 sequences respectively. Of these we included good quality sequences in the final set with at least 240 nucleotides and both date and country information, together with the new sequences from Cameroon, resulting in data sets of 875A and 394 SAT2 VP1 sequences (95% and 75% of which had at least 600 nucleotides respectively).
African sequences were extracted from the global data sets and used for more detailed analysis. For the serotype A sequences, the 10 sequences from Northern Africa (6 Egypt, 2 Libya, 2 Morocco) which clustered with non-African sequences were excluded, as was Kenya/1965 (the outgroup). For the SAT2 sequences, the 9 of non-African origin were excluded (all from the Middle East). 9 serotype A and 9 SAT2 sequences from previously unreported isolates are also included in the data sets. These were collected in 2012 from the Adamawa and North West Regions of Cameroon and the sample codes and accession numbers of the sequenced samples are given in Table 3. A further 3 sequences from the Extreme North Region of Cameroon were provided by Ludi et al. 35 . Sequences from the same location, isolated within one week of each other and with less than 3 nucleotides difference were considered the same and the duplicates were removed from the final data set, leaving 154 serotype A sequences in the main monophyletic African clade, and 334 serotype SAT2 sequences. phylogenetic analysis. Alignment. Initial nucleotide alignment was performed using ClustalW in BioEdit, the sequences were manually adjusted to the coding frame, and then aligned using MUSCLE in MEGA6 on translated sequences.
Maximum Likelihood Trees. Maximum Likelihood trees on the global data sets were generated using RAxML with the general time reversible (GTR) model and gamma distributed site to site rate variation (4 categories) and 100 bootsraps.
Phylodynamics and Phylogeography. Time scaled phylogenetic trees were inferring using BEAST 1.8 36 . Initial model selection was performed by comparing the fit of substitution models, clock models and effective population size tree prior models calculated using AICM with 100 bootstraps in Tracer 1.6, and Path sampling (PS) and Stepping Stone sampling (SS) in BEAST 1.8 37,38 . Models were fitted on serotype A sequences using an MCMC chain with length 50,000,000 sampling every 5,000 and discarding the first 10% of samples as burn-in. The SRD06 substitution model 39 (which uses two HKY models with site to site rate variation, one model for positions 1 and 2, and another for position 3) was favoured above the GTR model with gamma distributed site to site rate variation (4 categories) with a Bayes Factor >100 (AICM, PS, SS), and the uncorrelated relaxed lognormal clock model  www.nature.com/scientificreports www.nature.com/scientificreports/ was favoured over the strict clock with Bayes Factor >100 (AICM, PS, SS). The constant population size tree priors were favoured over flexible skygrid effective population size tree priors with Bayes Factor >6 for AICM 40 , but the reverse was true when using PS and SS (Bayes Factor >10). Therefore following Hall et al. 41 we chose to use the flexible skygrid effective population size over time model for serotype A, although this made less than 1% difference to the TMRCAs for recent clades. Since the SAT2 sequences were very diverse and the TMRCA was much longer than for the serotype A sequences, we used a constant population size model when analysing all the topotypes together but a skygrid model when analysing individual topotypes.
A posterior set of 1000 trees for each of serotype A and SAT2 data sets was composed from a minimum of two independent runs, and the location states were reconstructed upon these trees in BEAST 1.8 using (i) asymmetric discrete location trait models with Markov Jumps and (ii) continuous latitude and longitude homogeneous Brownian motion diffusion models 42,43 , (a random jitter of 0.01 was applied to the GPS sampling coordinates in order to give each isolate a unique location).

Selection tests.
Sites under positive (diversifying) and negative (purifying) selection were detected within the SAT2-I, IV, VII and A-I, IV, VII clades using HyPhy on the DataMonkey server at https://www.datamonkey.org/ to estimate the non-synonymous to synomyous (dN/dS or dN-dS) ratios. SLAC (Single Ancestor Counting), FEL (Fixed Effects Likelihood), MEME (Mixed EffectsModel of Evolution) and FUBAR (Fast Unbiased Bayesian AppRoximation) methods were used (ref1) and a GTR (General Time Reversible, also known as "REV") model was used as the basis for the codon models. On the HyPhy server, neighbour joining trees are automatically created from the uploaded sequences, and it is these which are used as the phylogeny within the codon-based selection tests. These selection tests have different detailed assumptions and performance depending on the size and variability of the data, but here a site was reported to be under positive or negative selection in the main text if it was significant under any of the tests with a p-value of 0.01 or FUBAR ≥0.99 posterior probability.
The SLAC 44 method reconstructs the maximum likelihood ancestral sequences at each site in the alignment up the phylogenetic tree according to an overall fitted codon model (disallowing stop codons), and then counts the number of synonymous vs non-synonymous substitutions. This is the fastest and crudest method. FEL 44 is also a site-by-site model, but here the rates of non-synonymous and synonymous substitution for each site are estimated independently as extra parameters besides the branch lengths and overall site variability. A likelihood ratio test is performed for each site to test whether a one parameter model (non-synonymous rate is the same as the synonymous rate) or a two parameter model (the rates are different) is applicable and hence the type of selection the site is under. In FUBAR 45 , independent non-synonymous and synonymous substitution rates are also inferred for each site (using Bayesian inference), and these are considered as random effects drawn from overall discretised distributions. MEME 46 is another random effects model for individual sites, but rather than assuming that the selection pressure must be constant for a site over all time (and all branches), it allows the detection of both pervasive and episodic diversifying evolution along only some of the branches.
To further investigate the type of selection that individual sites were under, we also performed a Directional Evolution for Protein Sequences (DEPS) analysis 47 on the HyPhy server; this analysis used only amino acid sequences and uploaded rooted neighbour joining trees, created from the nucleotide sequences and the oldest sequence per serotype as outgroup. The best fitting amino acid substitution model was determined from each clade separately and was JTT 48 for serotype A A-I, IV, VII and SAT2-VII, and WAG 49 for SAT2-I and IV. DEPS analyses considers the number and type of amino acid transitions per site and reports sites where the transitions indicate a: selective sweep site, consensus replacement site, repeated substitutions site (convergent evolution), rare residue substitution, or highly polymorphic site.
Recombination analyses. Since the presence of recombinant sequences can affect the phylogenetic results and results from selection tests, the VP1 sequences from the SAT2-I, IV, VII and A-I, IV, VII clades were screened for recombination using Single Break Point analysis with a GTR model on the HyPhy server, but no recombination was found.