A traditional evolutionary history of foot-and-mouth disease viruses in Southeast Asia challenged by analyses of non-structural protein coding sequences

Recombination of rapidly evolving RNA-viruses provides an important mechanism for diversification, spread, and emergence of new variants with enhanced fitness. Foot-and-mouth disease virus (FMDV) causes an important transboundary disease of livestock that is endemic to most countries in Asia and Africa. Maintenance and spread of FMDV are driven by periods of dominance of specific viral lineages. Current understanding of the molecular epidemiology of FMDV lineages is generally based on the phylogenetic relationship of the capsid-encoding genes, with less attention to the process of recombination and evolution of non-structural proteins. In this study, the putative recombination breakpoints of FMDVs endemic to Southeast Asia were determined using full-open reading frame sequences. Subsequently, the lineages’ divergence times of recombination-free genome regions were estimated. These analyses revealed a close relationship between two of the earliest endemic viral lineages that appear unrelated when only considering the phylogeny of their capsid proteins. Contrastingly, one lineage, named O/CATHAY, known for having a particular host predilection (pigs) has evolved independently. Additionally, intra-lineage recombination occurred at different breakpoints compared to the inter-lineage process. These results provide new insights about FMDV recombination patterns and the evolutionary interdependence of FMDV serotypes and lineages.

Recombination-free region r1, non-structural protein L pro . The r1 sequence (530 nt long) was located entirely within the L pro protein-coding region. The phylogenetic topology of r1 nucleotide sequence showed a distinct clustering pattern from that observed for nucleotide sequence encoding the capsid proteins (Fig. 3). The two sublineages identified in the O/Mya-98 lineage (sublineages A and B) grouped more closely with A/Sea-97 than with other lineages, having an estimated time to most recent common ancestor (tMRCA) at 1994 (95% high posterior density (HPD) 1990-1997; Table 2). Asia-1 viruses classified within the As1/Gr-IV were related to the common ancestor of the O/Mya-98 and A/Sea-97 lineages. O/PanAsia lineage viruses were more closely related to the As1/ Gr-V lineage with the tMRCA estimated to date back to 1990 (95% HPD 1984(95% HPD -1994 Recombination-free region r2, capsid proteins VP2, VP3 and VP1. Region r2 (1494 nt long) contained most of the structural proteins including partial VP2, complete VP3 and partial VP1 coding regions (Fig. 3). As expected, the traditional grouping of serotypes, topotypes, and lineages were clearly reproduced by the phylogeny of this region. There was an early time of divergence between viruses of serotype Asia-1 and serotypes A and O in 1889 Recombination-free region r3, non-structural protein 2B. Overall tMRCA estimated for r3 (246 nt long), encoding partial 2B, were more recent than the ones estimated by other regions, however, the grouping pattern had a similar topology as exhibited by other NSP coding regions (Fig. 3)    Mean substitution rates. The molecular clock was computed for each recombination-free genome region and for each lineage ( Table 3). As expected, region r2 (structural proteins coding region) had the highest overall substitution rate across all lineages 6.52 × 10 −3 (95% HPD 4.85-8.40 × 10 −3 ) compared to all other regions. The second highest substation rate occurred in region r1, L pro (5.88 × 10 −3 −95% HPD 4.70-7.08 × 10 −3 ) followed by region r9 3D 5.44 × 10 −3 (95% HPD 3.54-7.32 × 10 −3 ). Lower substitution rates were estimated for other NSP coding regions r3 (2B) and r8 (3D) with a mean rate of 5.02 × 10 −3 (95% HPD 3.47-6.64 × 10 −3 ) and 5.00 × 10 −3 (95% HPD 3.60-6.44 × 10 −3 ) respectively. Region r7 (3A-3C) had a lower substitution rate compared to all other regions estimated at 4.29 × 10 −3 (95% HPD 3.22 × 10 −3 , 5.38 × 10 −3 ). In general, a slightly lower rate was calculated for As1/Gr-V viruses followed by CATHAY in r1, r2, r7 and r8, while very similar distribution of substitution rates in all lineages were estimated for r3 (Table 3).   Intra-lineage recombination. The computed mean within-lineage p-distance for the sequences coding the

Discussion
In this study, we have characterized the ancestral relationships between the viral lineages endemic to Southeast Asia using full ORF sequences. These analyses indicate that recombination plays an important role in FMDV evolution, and provides a new perspective of the importance of the co-circulation of different lineages and serotypes in endemic regions.
This study provides evidence for genetic association of the NSP coding regions between the endemic lineages that co-circulate within Southeast Asia. The most apparent association was found between O/Mya-98 and A/Sea-97 viruses, which have been endemic to Southeast Asia for decades, and their NSP are genetically closely related. Furthermore, the three recent recombinant viruses identified in this study are a result of exchange of genomic sequences between these two lineages (A/Sea-97 and O/Mya-98). Viruses from As1/Gr-IV also shared recent common ancestors with O/Mya-98 and A/Sea-97 in most of their NSP genomic regions. As1/Gr-IV has a long history of endemic circulation in Southeast Asia 17 , which may have allowed the exchange of NSP coding regions with the endemic serotypes A and O. However, a more detailed relationship with Asia-1 is limited by the few full ORF sequences available for analyses.
Another interesting aspect of the endemic lineage O/Mya-98, which had previously been classified into two different sublineages: A and B (based upon VP1 coding region), is the evolutionary process inferred by the analyses of the NSP coding regions 10 . Using ORF sequences, we determined that these viral sublineages were not only divergent in their structural proteins but also had distinct ancestral relationships in their NSP regions. r1 (L pro ), r7 (3 A, 3B, and 3 C), and r9 (3D) had a remarkable consistency with Mya-98 A-B capsid-based time divergence estimates of 1999-2000, while r3 and r8 phylogenies yielded an even earlier divergence time. The initial recombination event in these regions (r3 and r8) may have triggered the divergence of the two lineages, later evidenced  by the differences in the capsid coding regions. However, it is not possible to conclude this hypothesis without further field and experimental evidence. By contrast, O/PanAsia and As1/Gr-V viruses have been endemic to Southeast Asia only since approximately 2000 and 2005, respectively, and no recombinants between these viruses and the earlier Southeast Asia endemic lineages were detected. A study which included analysis of 12 representative viruses of serotype O and Asia-1 had previously suggested recombination inferred by a higher identity of NSP, as well as the 5′ UTR and 3′ UTR between an O/PanAsia and an As1/Gr-V virus from China 32 . Using a combination of recombination detection analysis and Bayesian time divergence estimation, we were able to further characterize the ancestral relationship between O/PanAsia and As1/Gr-V and a potential more recent recombination event in 2B. Common recent ancestors of O/PanAsia and As1/Gr-V and possible recombination events are consistent with the historical geographic circulation of these viruses. O/PanAsia lineage emerged in India during the 1980s, whereas known circulation of As1/Gr-V can also be traced to India at similar times (1976)(1977)(1978)(1979)(1980)(1981) 17,18,25 . Only one As1/Gr-VI full genome sequence was available, which NSP were closely related to O/PanAsia lineage. However, because only one viral sequence from this lineage was included, it is difficult to infer if this was a result of an isolated recombination event, or if this ancestral association was imprinted in all As1/Gr-VI lineage descendants.
Overall, these results indicate that recombination may not be equally likely between all FMDV lineages. Copy-choice recombination is the most likely mechanism used by FMDV, and it requires the RNA polymerase to switch between the RNA molecule templates guided by sequence similarities. Critical genetic similarities between NSP may be necessary for copy-choice occurring between certain lineages. This study suggest that these similarities may be shared by the O/Mya-98-A/Sea-97-As1/G-IV groups, and the O/PanAsia-As1/G-V groups 38 although further field and experimental data may be needed to support this hypothesis.
The CATHAY topotype was first identified in 1970 in Hong Kong pig farms and subsequently spread sporadically into several Southeast Asian countries 20,24 . Divergence between CATHAY topotype and other serotype O viruses within capsid-coding regions was estimated at 1973 (95% HPD 1960(95% HPD -1984, close to the period when this topotype was first described. Divergence inferred by NSP coding regions was estimated at similar tMRCA times (r1, 1976; r3, 1983; r7, 1972; r8, 1963, and r9, 1989). Phylogenetic analyses of all regions are consistent in suggesting that CATHAY viruses evolved independently from all other co-circulating lineages, as previously suggested 29 . The lack of recombination with other picornaviruses from the same species can result in the emergence of a new species or critical deterioration of the viral sequences, as previously proposed 37 . However, neither of these evolutionary theories may be resolved by the analyses described herein. Further analyses may be facilitated by acquisition of additional sequences of CATHAY strains of FMDV. Intra-lineage recombination is more difficult to detect compared to recombination across distant viral lineages. A previous study of O/PanAsia lineage using scanning windows did not detect intra-lineage recombination 39 , in contrast with the results obtained herein. Because of the high nt identity within FMDV lineages, methods that have a higher sensitivity to detect recombination may be useful to reveal the evolutionary association within FMDVs lineages 40 . Detection of recombination among highly homologous viruses is challenging due to the analytic limitations in differentiating recombination from point mutations. This detection becomes even more complex when considering the principles of clonal interference, competition of mutants, and quasispecies. Additionally, the biological significance of intra-lineage recombination may be different from that occurring between lineages. It has been described that such differences may be related to the distinction between short geo-temporal scale (intra-lineage) versus longer and more regional or global scales of recombination (inter-lineage) 41 . Furthermore, it has also been suggested that for rapidly evolving RNA viruses, recombination in a short time scale is a mechanism that promotes viral diversity and through which deleterious mutations are purged 37,42 . Inter-lineage recombination, by contrast, may contribute to confer critical changes that dramatically improve the fitness of the virus, as well as maintaining a stable global gene pool of a species 37 .
Although recombination of positive-sense RNA viruses is a well-established fact 27,40 , the mechanisms that enable these events at the cellular, host organism, and population levels are poorly understood. Recombination is believed to be dependent upon simultaneous replication of distinct viruses within the same cell, co-occupying the same replication complex. Numerous studies, especially in African countries have reported serological evidence of individual animals infected by multiple FMDV serotypes [43][44][45][46][47][48] , which highlights the common occurrence of multiple serotypes affecting herds in endemic countries. Additionally, one study in Pakistan, demonstrated detection and sequencing of two viruses from one sample with different serotypes (A and Asia-1), and another sample with two different sublineages of A/Iran-05 49 . Hypothetically, co-infection might occur during either the acute or post-acute phases of FMDV infection, each of which presents unique mechanistic challenges. The acute phase of FMD is relatively short, lasting 5-10 days 4,50 . The extremely high levels of virus replication at lesion sites would favor recombination; however, the activation of the innate immune response during early infection would highly impede a second strain from superinfecting an animal during the acute phase of infection. Alternatively, recombination could occur during the post-acute convalescent or carrier stages of dual infections that can last for months-years 51,52 . Heterologous co-infection in this period would be favored by the lack of innate, pro-inflammatory immune responses 53 , and absence of heterologous adaptive immune responses 54 . However, relatively low levels of viral replication and the limited anatomic distribution of infection sites during the carrier stage would not favor such events. Overall, we support the hypothesis that recombination would be most likely when a post-acute, carrier ruminant becomes superinfected with a heterologous strain and develops acute FMD. This scenario would preclude the interference of innate and adaptive immunity and would include the high levels of replication and viremia of the superinfecting virus. Furthermore, we propose that this recombination would be most likely to occur in the nasopharyngeal epithelial cells where FMDV has been demonstrated to establish and maintain persistent infection 55,56 . This proposed localization of recombination to the nasopharyngeal cells of ruminants is supported by the finding that the lineage which has evolved most independently is O/CATHAY which is more associated with clinical disease in pigs (that do not develop persistent infection), and less reported in cattle 57,58 .
The high frequency of picornaviruses recombination highlights the importance testing for recombination before reconstructing FMDV phylogenies. Most molecular epidemiologic studies of FMDV have been performed using phylogenies of VP1 or P1 regions. Under such conditions, recombination may not have greatly interfered with the assumptions. However, as whole genome sequence becomes more readily available for data analysis, assessment of recombination should always be considered as a factor that may unduly bias phylogenetic analyses and the dating of ancestral intermediates 36 .
In conclusion, we have identified novel evolutionary events within a limited set of full ORF sequences of FMDV field strains from Southeast Asia. Knowledge of viral circulation in Southeast Asia, recombination detection, and time divergence estimation were used to determine the unique evolutionary relationships amongst these viruses. Intra-lineage recombination was shown to be a recurring component of FMDV evolution, however, further analyses are required to acquire a deeper understanding of its biological significance. Inter-lineage recombination occurs in viruses with higher homologous NSP coding regions and seems to follow patterns defined by historical co-circulation. Overall, enhanced knowledge of the specific functions of the NSP and their contributions to viral fitness changes acquired by recombination may contribute to greater understanding of how specific events lead to outcompeting strains. Elucidation and assimilation of the specific mechanisms of viral infection at the molecular, host organism, and population levels will be critical to achieving the ultimate goal of understanding and predicting the emergence of novel FMDV strains with regional and pandemic spread potential. The continuous development of sequencing technologies and associated analytic tools have the potential to allow for full genome sequencing of FMDVs to be used as a routine surveillance tool which will further enable similar analyses and provide greater insights to FMDV evolution under natural conditions.

Methods
Data source. FMDV sequences available in GenBank public database were included in the analyses described herein. Viral sequences in GenBank were located searching the term 'Foot-and-mouth disease virus' (organism) and filtered by nt length >6000 to include all potential sequences of full or near full genome. Additionally, 18 FMDV sequences were obtained from samples collected and sequenced for this study at Plum Island Animal Disease Center, NY, USA. These samples were obtained from lesions in clinically affected animals during FMD outbreaks and from persistently infected animals collected in Vietnam. Further 10 sequences were obtained from samples held at the repository of the World Reference Laboratory for FMD (WRLFMD), Pirbright, United Kingdom. The lineages of all sequences of viruses described herein were identified following phylogenetic comparisons of the VP1 coding region and determined to belong to Southeast Asia endemic lineages A/Sea-97 (n = 17), O/PanAsia (n = 29), O/Mya-98 (n = 30), O/CATHAY, (n = 8), Asia-1(As1)/Gr-V (n = 8), As1/Gr-IV (n = 2) or As1/Gr-VI (n = 1), based upon VP1 protein coding region phylogenic classification (Table S1). Viral sequencing. At PIADC, Sanger sequencing of the full open reading frame (ORF) was performed according to a previously described method 59 . Briefly, complete bidirectional ORF sequences were obtained by amplifying each FMDV with three sets of primers. These overlapping fragments of approximately 3000 nt in length were subsequently bidirectionally sequenced by several specifically designed primers spanning the whole ORF. At PIADC, NGS sequencing of the full ORF was also performed according to a previously described method 60 . To accommodate running the SISPA-generated libraries on an Illumina MiSeq, New England Biolabs (NEB), end prep and ligation modules (E7546S, E7595S) with Bio Scientific barcoded adapters (514113) were used. ORF consensus sequences were generated after mapping assembly of the reads to the closest reference genome available in public databases followed by de novo assembly of the mapped reads. Near-complete genome sequences at The Pirbright Institute were determined as previously described using an Illumina MiSeq. 61 and assembly of raw paired-end reads to consensus-level sequences was undertaken using SeqMan NGen ® and SeqMan Pro ™ (Lasergene package version 12; DNAStar, Inc.).
Sequence alignment. The full ORF polyprotein-coding genetic region was analyzed. To validate the alignment, each of the protein coding region nt sequences (VP4-1, 2A, 2B, 2C, 3A, 3B, 3C and 3D) were aligned separately using MUSCLE, and subsequently concatenated. Because of the high heterogeneity among VP1 protein coding sequence of viruses from different serotypes, the amino acid translated sequences were aligned and then back translated into nt sequences in MEGA7 62 .

Inter-lineage recombination identification of recombination breakpoints. Recombination anal-
ysis of 96 complete ORF sequences was carried out using RDP4 software 63 . We set the parameters to identify recombination events with multiple comparison correction (Bonferroni) and a 0.05 significance. The following algorithms implemented within RDP4 were used (1) RDP using window size = 40 (2) Geneconv (default settings) (3) Bootscan using window size = 300, step size = 20, bootstrap = 100, and Neighbor-Joining tree with a Jin and Nei model option. (4) Chi square using variable sites per window = 60 (5) Chimaera using variable sites per window = 60 and (6) Siscan using window size = 300. This analysis allowed us to (1) identify and characterize unique recombinants, which were further analyzed using a similarity plot implemented in Simplot 3.5.1 64 , where non-recombinant reference sequences from each lineage was used to represent the relationship between the query (recombinant) sequence and each lineage. (2) identify recombinant lineage (when parental recombination regions were inferred for more than one sequence from the same lineage), and (3) build a breakpoint recombination distribution plot to identify recombination-free regions using a window size of 300 nt.

Phylogenetic analysis of recombination-free regions. A recombination breakpoints distribution plot
(with window size = 200 nt) was constructed to identify the regions within the genome where recombination was not present. The recombination-free regions were obtained as separate alignments. Likelihood-mapping algorithm implemented in TreePuzzle 65 was used to determine if the phylogenetic signal contained in each of the recombination-free regions was appropriate to reconstruct the phylogeny. Briefly, the adequacy of phylogenetic signal was defined when >80% of all possible three trees constructed by any four sequences in the alignments were resolved based on the support of the internal tree branches.
Phylogenies and divergence time estimation of FMDV using regions free of recombination with adequate phylogenetic signal were reconstructed using a Bayesian framework implemented in BEAST 1.8.4 66 . A lognormal uncorrelated relaxed clock, a Bayesian skyline tree prior and the GTR + I + G nucleotide substitution model were used for all recombination-free regions. Analyses were run for 10 8 iterations or until all parameters reached an effective sample size >200. Mixing and convergence of the chains was assessed using Tracer v1.6 67 . The maximum clade credibility (MCC) tree was depicted in Figtree 1.4.3 68 .
The time to most recent common ancestor (tMRCA) and 95% high posterior density (HPD) of different lineages in each of the regions was extracted from annotated MCC tree to determine the evolutionary relationship of the different lineages. All viruses from serotype O, A and Asia-1 (As1) across regions were named according to their VP1 lineage classification (conventionally named O/Mya-98, O/PanAsia, O/CATHAY, A/Sea-97, As1/ Group(Gr)-V, As1/Gr-IV, As1/Gr-VI). If any of the sequences in NSP coding regions clustered with a different lineage monophyletic group, it was considered an inter-lineage 'recent recombinant' , and analysis of specific recombination was further explored using Simplot with a 200-300 nt sliding window. Bayesian phylogenetic analyses were run using computational resources available in CIPRES 69 . Genetic distances of nucleotide sequences were computed using the Kimura-2 parameters substitution using MEGA7 62 .
Intra-lineage recombination. The intra-lineage mean pairwise genetic p-distance for full ORF was computed using MEGA7 62 . We used the Homoplasy test, which has a higher sensitivity to detect recombination in alignments with high nt identity (such as the intra-lineage ORF identity) 70 . This test accounts for the substitution patterns and rate of variation among sites to increase power and avoid false positives generated by high rates of variation across sites usually observed in virus evolution 40 . A 'homoplasy ratio' is calculated which ranges from zero (clonal population) to one (population under free recombination). We computed the Pairwise Homoplasy Index tests using the algorithm developed in PhiPack 70 . We ran the analyses for lineages currently present in SEA: A/Sea-97, O/PanAsia, O/Mya-98 and O/CATHAY using different window sizes of 300, 400 and 600 nt in steps of 25. We computed and plotted the specific p-values (null hypothesis of no recombination) along the viral genome to determine potential intra-lineage recombination regions.