Phylogeography of Kyasanur Forest Disease virus in India (1957–2017) reveals evolution and spread in the Western Ghats region

The Kyasanur Forest Disease (KFD) has become a major public health problem in the State of Karnataka, India where the disease was first identified and in Tamil Nadu, Maharashtra, Kerala, and Goa covering the Western Ghats region of India. The incidence of positive cases and distribution of the Kyasanur Forest Disease virus (KFDV) in different geographical regions raises the need to understand the evolution and spatiotemporal transmission dynamics. Phylogeography analysis based on 48 whole genomes (46 from this study) and additionally 28 E-gene sequences of KFDV isolated from different regions spanning the period 1957–2017 was thus undertaken. The mean evolutionary rates based the E-gene was marginally higher than that based on the whole genomes. A subgroup of KFDV strains (2006–2017) differing from the early Karnataka strains (1957–1972) by ~2.76% in their whole genomes and representing spread to different geographical areas diverged around 1980. Dispersal from Karnataka to Goa and Maharashtra was indicated. Maharashtra represented a new source for transmission of KFDV since ~2013. Significant evidence of adaptive evolution at site 123 A/T located in the vicinity of the envelope protein dimer interface may have functional implications. The findings indicate the need to curtail the spread of KFDV by surveillance measures and improved vaccination strategies.


Results
Next-generation sequencing of KFD virus isolates. Though majority of the isolates were sequenced by the Sanger sequencing method, the NGS method was implemented for the sequencing of some of the recent isolates (Supplementary Table S1). In addition, we also undertook sequencing of a few isolates by the NGS method for which the sequence was also done by the Sanger sequencing method. Concurrence of sequence information was noted (data not shown). In the case of NGS, the reference mapping of the generated FASTQ file was performed to an already available KFDV genome, as the isolates were known to be KFD positive. In all the samples, a significant number of reads were identified to be of KFD viral genomic RNA. It was found that almost 4-65% of reads were of KFDV origin depending on the viral load and the quality of the RNA library (Supplementary  Table S2). It was further noted that more KFDV reads were retrieved in the Tissue culture Fluid (TCF) as compared to KFDV reads from mice brain suspension which may be due to contamination with the host sequences (of mice origin). phylogeny and diversity analysis. Phylogenetic analysis based on the 48 whole genome sequences revealed a major group with strong bootstrap support, consisting of the recent strains from 2006 to 2017. The older KFDV strains from 1957 to 1972 though different, did not form a homogeneous cluster. The overall nucleotide (nt) divergence within all the KFDV strains based on the full genome was 2.24% while that between the two groups was 2.76%. The overall amino acid (aa) divergence within all the KFDV strains based on the full genome was 0.75% while that between the two groups was 0.86%. The values of nt/aa divergence in different genes/proteins are provided in Table 1. The highest percent nt difference in between the two groups was noted in the capsid gene (3.27%), followed by the E gene (3.22%) and NS2 gene (2.91%). The highest percent aa difference in between the two groups was noted in the capsid protein (3.17%), followed by the prM (1.66%), and NS2 (1.4%) proteins. The aa substitutions delineating the two groups were C: N56S; prM: F130L and NS1: S271N, NS3: G221R (except in KA_MCL13H113_H_2013 and KA_121863_H_2012) (Supplementary Table S3). A comparison of the source species-wise divergence (Table 2) in the different genes revealed that maximum nt divergence (2.39%) was observed between human and tick species, followed by that between human and monkey (2.23%) and lowest between monkey and tick (2.06%). In all three species, the maximum nt differences were noted in the capsid gene followed by the E gene and NS2 gene respectively.
Consideration of the enlarged dataset of 76 E-gene sequences revealed that the overall nt and aa divergence was 2.61% and 0.33% respectively. Supplementary Table S4 enlists the mutations noted in the E-gene sequences.
Selection pressure analysis. The ratios of nonsynonymous to synonymous substitutions (dN/dS) determined for each KFDV gene (Table 1) showed no strong indication of positive selection. The selection pressure analysis (Table 3) 10 . Modeling of the KFDV glycoprotein was performed using the crystal structure of the Tick-borne encephalitis virus (TBEV), another tick borne virus belonging to the Flaviviridae family, available in the PDB (506 A.PDB) 9 . Mapping of the site revealed that aa123 was in close vicinity of the envelope dimer interface region ( Fig. 1), that plays a crucial role during membrane fusion and infectivity 22 . No annotation of the NS1 protein of KFDV is available, and hence the site in NS1 that showed evidence of being positively selected, could not be functionally ascertained. In NS2, 357 G/S/C (1487) is noted to map on to the NS2B region while the NS3 site 14 R/G/T (1505) falls in the serine protease domain. phylogeography and migration patterns. Based on the whole genome dataset. The estimated mean rate of nucleotide substitution of KFDV was 4.2 × 10 −4 subs/site/year with 95% Highest Posterior Density limits (HPD) (3.2, 5.35) x 10 −4 based on the best fit uncorrelated exponential clock model (Table 4). Based on this rate, the inferred mean root age was 64 years (HPD: 60.7, 69.4 years), indicating an emergence time ~November-1952 (June-1947, April-1956) (Fig. 2). The geographical ancestral state for KFDV was Karnataka with probability 1.0. Majority of the earliest strains (1959-1972) formed a cluster (node A) with the earliest reference strain P9605 of 1957 from a human host. The sequences of 1957 from tick and monkey and two other sequences of 1958 and 1960 were distinct and found to share a common ancestor with the more recent strains of 2006-2017 though with a low nodal posterior support of 0.44 (node A'). These recent strains had a divergence time estimate of 37.7 (24, 53) years back {March-1979{March- (1964{March- -1993)} (node B) with Karnataka as the ancestral source (state probability 1). The A106 2006 isolate formed a sub-group with strains from Karnataka during the period 2013-2016 and a single monkey strain from Goa of 2016 (161919). The common ancestor of these strains (node C) emerged around 31 years ago (~1986) from Karnataka. An indigenous group of strains from Karnataka over the period 2012-2017 evolved with a common ancestor (node D) estimated ~2004. Two sub-groups emerging from nodes E and F were noted to have a common ancestor (node G) that emerged from Karnataka with lower state probability (0.88). An independent entry to Tamil Nadu is noted as strains of 2013 from Tamil Nadu and Karnataka emerged from a common ancestor ~July 2009 with Karnataka as the source (probability 0.94). Dispersion is observed in Maharashtra around June 2012 as the common ancestor of strains circulating over the period 2016-2017 emerged from node F with Maharashtra as the source (state probability 1). Two separate movements to Goa are also noted  Table 3. Selection pressure analysis of KFDV isolates using the methods (SLAC, FEL, REL and MEME) available in the Datamonkey server. Sites identified to be under positive selection pressure based on the statistically significance level* (shown in bold font) by at least two of the methods are shown in bold.   Table 4. Estimates of substitution rates and root ages for KFDV sequences based on different clock models for the two datasets (a) Whole genomes (b) E-gene sequences.

Discussion
The KFDV is enzootic to India and maintained in Haemaphysalis spinigera ticks, as the main vector 23 , though at least 16 tick species 24 and different species of mammals have been shown to be involved in the natural cycle of KFDV. Though molecular clock studies undertaken by earlier workers, gave the estimates of the rates of molecular evolution of the KFDV as well as the time scales of divergence including the time to the most recent common ancestor (tMRCA), limited information is provided regarding the viral dispersal patterns. Therefore, application of phylogeographic reconstruction was undertaken to study the dispersion pathways of KFD viruses within India. We undertook the complete genome sequencing of KFD viruses as well as additional 28 E-gene sequences as representative isolates from human, ticks and monkey samples collected from the KFD affected areas in the recent and past times. While selecting samples for the whole genome sequencing, it was ensured that there was no sampling bias. Thus though during the period, December 2011 -March 2012, the KFDV positivity was high (60%) in Karnataka, only 4 isolates were sequenced. The enlarged and more complete dataset thus generated www.nature.com/scientificreports www.nature.com/scientificreports/ was used for phylogeography and other evolutionary analyses. The study would help understand the evolution, spatiotemporal patterns of dispersal of the KFDV within Karnataka and between the affected states and so also determine estimates of the time of divergence of newly evolved sub-lineages of KFDV.
Based on the p-distance calculations, it was noted that the genomes of KFDV (1957-2017) are overall highly conserved (2.24%/0.75% divergence at nt/aa level). The level of divergence, especially in the 1957 to 1972 time frame, was found to be even lower (0.31%/0.30% at the nt/aa level). Though this considerably low genetic diversity is a surprising observation in case of vector borne flaviviruses in general, another tick-borne flavivirus, a KFDV variant, referred to as the Alkhurma hemorrhagic fever virus (AHFV), did show very low levels of diversity at the genomic level 8 (1.1%/0.8% divergence at the nt/aa level) in isolates obtained from human cases over the course of 15 years. The low genetic diversity was also similar to that of an unrelated tick-borne member of the Bunyaviridae family, Rift Valley Fever Virus 25 . The lower diversity within KFDV isolates and specifically when the spread was restricted to the single state of Karnataka in the early time frame may thus reflect limited transmissibility of tick-borne viruses in comparison to mosquito-borne viruses.
The evolutionary rate obtained for KFDV based on whole genome as well as the E-gene sequences in this study was almost the same and is comparable to that obtained through the Bayesian coalescent analysis of partial E and NS5 gene sequences of around 50 KFDV (1957 to 2006) and AHFV 26,27 isolates 21 . With a mean rate of 6.4 × 10 −4 subs/site/year, the tMRCA for the KFDV isolates was estimated to have occurred ~1948, just nine years before the disease identification 21 . Another study 8 based on a Bayesian coalescent phylogenetic analysis of few KFDV (3) and 18 AHFV full-length sequences, revealed a slower rate of evolution (9.2 × 10 −5 subs/site/year). The present study which estimated the KFDV to have evolved around 1953 (based on whole genome data) or 1952 (based on E-gene data) therefore corroborates the findings based on partial KFDV sequences rather than the latter study including predominantly AHFV sequences and emphasizes that the evolutionary rates of tick-borne and mosquito-borne viruses are similar. The rate of substitution based on the E-gene is noted to be negligibly higher than that observed for the whole genomes and can be explained in terms of the higher percent nucleotide difference in the E-gene when compared to the whole genome (2.61% versus 2.24%).
Both the whole genome sequence data, as well as the E-gene sequences, revealed that all the earlier strains do not form a single homogenous cluster. Few strains of 1957-58 and 1960 have evolved distinctly and may be representing the spread within Shimoga district as well as other districts of Karnataka. Notably, the sequences of these strains (KA_601203, KA_W1930, KA_G11338 and KA_W-377) possessed a mutation A123T in the E gene as was noted in several of the later strains circulating in Maharashtra, Goa, Kerala and Tamil Nadu (Supplementary  Tables S3 and S4). Significantly, the E-gene MCC tree (Fig. 3)  In addition, a spillover of KFD cases to the border areas in the neighboring states of Tamil Nadu, Goa, Maharashtra, and Kerala 28 has also been noted. Dispersal of the viruses to similar eco-regions in the bordering states may be accomplished through the movement of animals presumably carrying ticks. The movement of KFDV from Karnataka to Goa around mid-2008 is indicated from the analysis of the E-gene sequence data (Fig. 3). Further, an exportation of the virus from Goa to Karnataka was also noted in the form of isolates MG720092, MG720091 of 2016 from Belgaum, Karnataka. This can be explained as cases of KFDV in cashew nut workers contracted in Goa during cashew nut harvesting from a KFD-affected village in Sattari taluka of Goa 18  To study the migration patterns of the KFDV, the transition rates with significant BF values (>3) and posterior probability values (>0.8) were identified (Supplementary Table S5). Significant BF values and posterior probability values were noted between Karnataka/Goa, Goa/Maharashtra, Karnataka/Tamil Nadu, Karnataka/ Maharashtra and Maharashtra/Tamil Nadu. The Bayesian skyline plot (Fig. 4A,B) reflecting the genetic diversity of KFDV reveals an almost stable phase or slightly increasing trend over the period from 1955 to 2015, followed by an increase beyond this period which is more remarkable from the whole genome data (Fig. 4B). This trend is reflective of the proliferation of the virus to the neighboring states of Karnataka and also the recent exchange of viruses between neighboring states. Figure 5 provides a schematic representation of the migratory pathways along with the migration times, on a map of the study area. As is clear from this figure, the transmissions of KFDV are along the narrow mountainous stretch, known as the Sahyadri range or Western Ghats, parallel to the western coast of the Indian peninsula traversing the States of Kerala, Tamil Nadu, Karnataka, Goa, and Maharashtra. The free movements of tick-infested monkeys in these forests along with changes in agricultural and occupational practices that encourage close proximity to humans and/or their dwellings can explain the KFDV spread from the Karnataka epicenter to movements between the states.
Mapping of the sites noted to showing evidence of positive selection, as well as major mutational sites on the three-dimensional E-protein structure models, was undertaken for evaluating possible functional correlations. Though among the five sites predicted to be positively selected, the sites prM: 152 V/I/A, E: 123 A/T, NS2: 357 G/S/C and NS3: 14 R/G/T (1505) could be mapped to specific functional domains of these proteins, further, studies may be warranted for experimental examination of their functional role. Similarly the role of other substitutions noted in the E-protein such as D239N, G230E would need further investigation. Notably certain specific mutational changes (in the C, prM, NS1 and NS3 proteins) were noted between the earlier strains of the period 1957-1972 and the recent strains (2006-2018) wherein the virus was noted to have spread its geographical distribution from Karnataka to other states Maharashtra, Goa, Tamil Nadu, and Kerala stretching from north to south in the Western Ghats. The fact that an indigenous group of strains from Karnataka also possessed these mutations, indicate that the mutations are not a factor for viral adaptation to the terrain. The mutations could have facilitated the transmission to the newer areas through some advantage to the virus in the vector or hosts, though it cannot be denied that they could be the accumulation of random mutations. Unfortunately, the non-availability of KFDV isolates during the interim period due to lack of a well-equipped containment laboratory then and the absence of experimental evidence at this time point does not permit a more definite interpretation. Similarly though mutations have been noted in several genes specific to the different phylogenetic groups reported in this study and corresponding to different geographical locations, experiments are necessary to identify any possible significance of these mutations.
In order to understand whether the passage history could influence the introduction of mutations in the sequenced isolates, we compared the whole genome sequences of representative isolates at different passage levels (data not shown). It was found that tissue culture passage could lead to negligibly low nucleotide substitutions (upto to three) and upto two amino acid mutations at passage levels three and above. None of the mutational sites showed evidence of being positively selected (data not shown) and most of the isolates in this study are of passage 1 and 2, indicating the robustness of the findings of this study.
In summary, the present genomic and phylogeography study revealed the evolution of the virus and also the regions linked with the transmission dynamics of the KFDV. The earliest migratory routes to Kerala, Maharashtra and Tamil Nadu, as well as Goa, are most likely from the State of Karnataka. Thereafter exchange of viruses www.nature.com/scientificreports www.nature.com/scientificreports/ between neighboring regions is being noted in the recent past. The study highlights that the ongoing transmissions need to be curtailed through intense surveillance activities and enhanced vaccination coverage efforts.

Materials and Methods
Ethics statement. This study was prior approved by the scientific advisory committee, Institutional Animal Ethics committee (IAEC) and adhered the regulation of Committee for the Purpose of Control And Supervision of Experiments on Animals (CPCSEA). The lyophilized old KFD virus isolates (mice brain suspension) or propagated in BHK-21 tissue culture were used for sequencing the genome. The virus isolation was attempted from KFD positive blood/serum samples in BHK-21 cells. The standard international guidelines and Institutional Human Ethics committee (IHEC) guidelines were followed to anonymized the identity of KFD cases. Prior approval was obtained from IHEC NIV/IEC/2018/D-9 for including these isolates in the study. The informed consent was taken for recent human samples. vitro and in vivo method). Old KFDV (lyophilized virus mouse brain suspensions) isolated during the year 1957-1972 were procured from the virus repository of ICMR-National Institute of Virology, Pune, India. The isolates were stored in cold chain and were used directly in the study after reconstituting in 10% phosphate buffer supplemented with bovine serum (BAPS). The recent KFDV positive clinical specimens spanning the year 2006 to 2018 were used for virus isolation, in CD8 infant mouse and BHK-21 cells, by selecting samples from different geographical locations, hosts, and vectors. These samples [monkey (9), human (25), and tick pools (13)] were referred/collected during different KFD outbreaks.

KfD virus isolation (in
The samples having Ct value less than 28 were selected for genome sequencing. In the absence of the older clinical samples and to ensure adequate material for primer sequencing using the traditional Sanger sequencing protocol, virus isolation for the recent clinical samples also was undertaken. For in vivo propagation, serum samples/tick homogenates/monkey tissue homogenates were inoculated in infant mice and observed for sickness on daily basis. On sickness, the mice brain was extracted, homogenate and suspended in 10% BAPS and stored at −80 °C until further use. The KFDV were propagated in BHK-21 cells at 37 °C for 5 days, and the supernatants were harvested for RNA extraction when cytopathic effects (CPE) were observed. The tissue culture fluid (TCF) containing viral particles were obtained from cell lysates after three freeze-thaw cycles. The cells were incubated with either mice brain suspension or TCF was used for the total RNA extraction and for sequencing of KFDV genomes. The virus propagation experiments were performed in biosafety level 4 (BSL-4) laboratories and inactivated. After inactivation of the virus in trizol reagent (Invitrogen), the vials were moved in a BSL-2 facility with proper precautions and safety measurements for RNA extraction and sequencing.
KFD virus sequencing using Sanger's method. The KFDV isolates were aliquoted and inactivated in the BSL-4 facility. RNA was extracted from KFDV isolates in BSL-2 facility, by using Trizol Reagent (Invitrogen), followed by the QIAamp Viral RNA Mini Kit (Qiagen) column purification. The extracted RNA was stored at −80 °C until use. A set of 50 primers were designed, using the reference KFDV strain P9605 (Accession number: JF416958), for the whole genome sequencing of KFDV that are enlisted in Supplementary Table S5 The cDNA products from the above step were electrophoresed on 1.5% Agarose gel. The products were excised and purified through the QIAquick Gel Extraction Kit (Qiagen). The purified cDNA products were subjected to the cyclic Sequencing PCR, by using BigDye ™ Terminator Cycle Sequencing Kit (Invitrogen). The cyclic sequencing reaction conditions followed were 96 °C (1 m), 50 °C for (5 s), and 60 °C (4 m), for 25 cycles. Further, the products were purified by using DyeEx 2.0 kit (Qiagen) and sequencing was performed using the ABI PRISM ® 3100 Automated DNA Sequencer. The results of the sequenced chromatogram data were assembled by using the Sequencher 5.1 software (Accelrys Inc.).
KFD virus sequencing using next-generation sequencing method. One mL of mice brain suspension/tissue culture fluid was used for RNA extraction. RNA library preparation and its quantification were performed using the protocol described by Yadav et al. 29 , and loaded in the Illumina Miniseq platform. The FASTQ files generated after the completion of the run were analyzed using CLC Genomics Workbench software version 10.1 (CLC, Qiagen).
The list of 46 KFDV isolates of this study from different sources (tick pool, monkeys, and human) along with their sampling years, locations, passage history and their GenBank accession numbers are provided (Supplementary Table S1). Of a total of 46 isolates, Sanger sequencing was done for 40 isolates while six isolates were sequenced using the next-generation sequencing method. Additionally five isolates were sequenced by both the methods to check for concurrency.
Phylogeny and genetic diversity. The KFDV whole genome sequences obtained in this study (n = 46) and a two whole genome available of the KFDV reference strain from India (G11338 and W-377) isolated in 1957 (GenBank accession number JF416959.1 and JF416960.1) were aligned using MEGA and the length of the alignment was restricted to 10,248nt. Phylogeny was carried out using the neighbor-joining method available in MEGA v6 30 using the Kimura 2-parameter nucleotide (nt) substitution model with 1000 bootstrap replicates. The proportion of nt differences was estimated by pairwise comparisons using the p-distance model. Estimation of the gene-specific ratio of nonsynonymous to synonymous substitutions (dN/dS) was also done in MEGA.
Selection pressure and functional analyses. Selection pressure analysis based on the whole genome dataset and also on the enlarged dataset of the E-gene region was performed in the Datamonkey server 31 using the single likelihood ancestor counting (SLAC), fixed effects likelihood (FEL), random effects likelihood (REL), mixed effects model of evolution (MEME) and (Fast, Unconstrained Bayesian AppRoximation) FUBAR methods. Codon sites were identified to be under positive selection pressure based on the statistical significance level (P ≤ 0.1 for SLAC, FEL, and MEME, Bayes factor >50 for REL or posterior probability >0.9 for FUBAR) by at least two of the methods. To co-relate, the mutational sites to functional sites, the available protein structural data of other flaviviruses was used. The E-protein of KFDV was modeled as a trimer using the crystal structure of TBEV (506 A.PDB) as a template the sequence of which showed 80% identity with the KFDV sequence (strain P9605). The SWISS-MODEL server (http://swissmodel.expasy.org/interactive#structure) 32 was used for homology modeling and validation of the modeled structure was carried out using the PROCHECK tool from structure analysis and verification server (SAVES) [https://services.mbi.ucla.edu/SAVES/]. The 3D structure of the protein was visualized using BIOVIA Discovery Studio Visualizer (www.accelrys.com/products). www.nature.com/scientificreports www.nature.com/scientificreports/ Phylogeographic reconstruction of KFDV and migration pattern. The phylogeographic analyses based on 48 KFDV whole genome sequences was carried out in this study. Representative isolates based on sampling time and geographic location were selected for the study to ensure that there was no sampling bias. In addition, the analysis was also carried out on a separate dataset of 76 E-gene sequences by including the E-gene data that was available in GenBank (Supplementary Table S8). The best-fit model of nucleotide substitution selected on the basis of Akaike Information Criterion (AIC) in jmodeltest-2.x was found to be the GTR (General Time Reversible) + G4 (gamma with 4 categories) + I (proportion of invariant sites) model for the for either dataset. Temporal information (year of isolation) of sequence data was used to estimate the evolutionary rate and ancestral times, by generating a maximum clade credibility (MCC) tree using the Bayesian Markov chain Monte Carlo (MCMC) approach as implemented in BEAST 1.8.2 33 employing the relaxed uncorrelated lognormal and uncorrelated exponential clock models with the Bayesian Skyline tree prior. Two independent runs of the chain were carried out, each with 50 million generations and sampling frequency of 1000 and combined with 20% burn-in for the whole genome and the E-gene datasets. The convergence of the chain was analyzed by using Tracer1.5 and effective sample size (ESS) values of >200 indicated sufficient level of sampling. The MCC tree was visualized using FigTree 1.4.2. The spatial information (state of isolation) of the isolates was further used to infer the geographical dispersal patterns of the virus by fitting a standard symmetric substitution with the Bayesian stochastic search variable selection (BSSVS) 34 in BEAST. Four geographical locations (Karnataka, KA; Tamil Nadu, TN; Maharashtra, MH; Goa, GA) were selected and coded as discrete states. The MCC tree obtained was input to the program SPREAD 1.0.5 35 to visualize and analyze the transmission pathways. To infer diffusion rates between any pair of locations Bayes Factor (BF) values available in SPREAD, were calculated.

Data availability
The datasets analyzed during the current study are available in the Genbank database.