Global evolutionary history and spatio-temporal dynamics of dengue virus type 2

DENV-2 spread throughout the tropical and subtropical regions globally, which is implicated in deadly outbreaks of DHF and DSS. Since dengue cases have grown dramatically in recent years, about half of the world’s population is now at risk. Our timescale analysis indicated that the most recent common ancestor existed about 100 years ago. The rate of nucleotide substitution was estimated to be 8.94 × 10−4 subs/site/year. Selection pressure analysis showed that two sites 160 and 403 were under positive selection, while E gene is mainly shaped by stronger purifying selection. BSP analysis showed that estimating effective population size from samples of sequences has undergone three obvious increases, additionally, Caribbean and Puerto Rico maintained higher levels of genetic diversity relative to other 6 representative geographical populations using GMRF method. The phylogeographic analysis indicated that two major transmission routes are from South America to Caribbean and East&SouthAsia to Puerto Rico. The trunk reconstruction confirmed that the viral evolution spanned 50 years occurred primarily in Southeast Asia and East&South Asia. In addition, phylogeographic association-trait analysis indicated that the viral phenotypes are highly correlated with phylogeny in Nicaragua and Puerto Rico (P < 0.05).

Scientific RepoRts | 7:45505 | DOI: 10.1038/srep45505 1944, DENV was first isolated from the sera of American soldiers in India. Since then, dengue epidemics have been mostly reported from Southeast Asia, particularly after the end of World War II 10 . Severe dengue (DHF) was first recognized in the 1950s in Philippines and Thailand. It was not until 1981 that DENV-2 strain of Asian origin was introduced into Cuba with a significantly increased severity during subsequent epidemics, causing the first epidemic of DHF in the Americas 11 . The disease is now endemic in more than 100 countries in the WHO regions, among which the Americas, South-East Asia and Western Pacific are the most seriously affected. In 2008, cases across these three regions exceeded 1.2 million. In 2010, Puerto Rico experienced the largest outbreak in its history with 5382 confirmed infections and 20 deaths. Nearly 2.4 million cases were reported annually in the member states of 3 WHO regions (http://www.who.int/en/) in 2010, 2013 and 2015. The number of new cases arising each year is still increasing globally.
Dengue is transmitted by female mosquitoes mainly of the species Aedes aegypti, to a lesser extent, Ae. albopictus. The mosquitoes also transmit Zika, Yellow fever, and Chikungunya infection. The disease has been widespread throughout the tropics and now a global threat. Nearly all tropical/subtropical nations had reported explosive increases. Geographical limits of dengue fever transmission are strongly determined by climate such as rainfall and temperature 12 . In addition, air travel has been considered as a crucial factor driving the intracontinental spread 13 . One recent estimate indicates 390 million dengue infections per year (95% credible interval 284-528 million), of which 96 million (67-136 million) manifest clinically 14 . Today, severe dengue disease burden affects most Asian and Latin American countries and DHF/DSS has become a leading cause of death among children in these regions 15 . Due to the fact that dengue viruses have evolved rapidly and spread worldwide, genotypes associated with increased virulence have expanded from South and Southeast Asia to the Pacific and Americas 16 .
In recent decades, reports of dengue infections in human migration from the tropics and subtropics have been increasing. Not only has the number of cases increased but also the explosive outbreak spread to new regions 17 . Many phylogenetic studies on DENV-2 have documented the viral spread within individual countries or specific regions 18,19 , so the geographic origins, migration patterns have been poorly investigated. Additionally, E protein is the major protein on the surface of the virion, a N-glycosylated dimeric membrane fusion protein that mediates virus binding and fusion to host cell membrane 20,21 , and thus might be under strong selective pressure from the host immune response. Most importantly, E protein contains the main epitopes for vaccine development and is used as a reagent for diagnostic purposes. Therefore, we focus on evolutionary analysis of E protein and try to reveal the population dynamics, migration direction and geographic origin of DENV-2 in global.

Results
Phylogenetic analyses. To determine the phylogenetic relationships among dengue-2 viruses, phylogenetic trees based on E gene sequences were inferred using the maximum likelihood (ML) and the Markov chain Monte Carlo (MCMC) methods. The ML tree allows clearly distinguish the topological structure, and the bootstrap values (> 75) are displayed on the tree with rainbow color (Fig. S1). The isolates derived from the same geographic area tend to cluster together, which makes the strains displayed a marked geographic distribution. Generally, the isolates in the tree can be divided into five genotypes: American/Asian (AM/AS), Asian type 1 (Asian 1), Asian type 2 (Asian 2), Cosmopolitan, and American (Fig. S1). Further observation indicated that geographic-close regions share the same DENV-2 genotype and this would serve as a control to show that DENV-2 genotype variation is a major underlying determinant of epidemics. In genotype AM/AS, the strains may be separated into two distinct subgroups, which are mainly from the countries in the Americas and some from Southeast Asia (10 isolates from Viet Nam and 2 isolates from Cambodia). While the majority of Southeast Asia strains were present in Asian 1. Asian 2 is composed of strains from New Guinea C (NGC), Taiwan and other regions. The Cosmopolitan genotype has a wide geographical distribution, including Southeast Asia, India, China, Guam, Saudi Arabia and Burkina Faso. The American genotype has one strain (DENV-2/ID/1183DN/1977). In order to verify the reliability of grouping, we have included three American genotype viral sequences which have been previously reported, such as DENV-2/Tonga/1974 (AY744147), DENV-2/CO/BID-V3358/1986 (GQ868592) and DENV-2/PE/IQT2913/1996 (AF100468), as the reference for the genotype (Fig. S2). Hence, it is convincing that we named the American genotype solely based on one virus isolated from Indonesia. Maximum clade credibility tree depicted the proportional relationship between branch length and time, representing the genealogy of the virus strains by analyzing 194 sequences. Comparatively, the topology of the MCC tree is congruent with that of the ML tree (Figs 1, S1 and S3). Interestingly, when compared to the tree topology of influenza virus characterized by a ladder-like trunk and short terminal branches, we obtained a unique equidistant phylogeny with short terminal branches. The equidistant topology indicated that all prevalent lineages in the study are transient under a selection enhancing host immunity. Natural selection. To avoid high rates of false positives, the recombinant sequences were removed before quantifying selection pressure. Putative positively selected sites were identified using Single-Likelihood Ancestor Counting (SLAC), Fixed Effects Likelihood (FEL), Random-Effects Likelihood (REL), Mixed Effects Model of Evolution (MEME) and Fast Unconstrained Bayesian AppRoximation (FUBAR). Site found to be statistically significant for selection pressure by two methods was considered positive (P < 0.1, posterior probability (PP) > 0.9, or Bayes Factor > 20). Four codons under positive selection were identified by REL method, whereas MEME analysis detected ten diversifying selection sites. Taken together, positions 160 and 403 were identified as being under potential diversifying selection (Fig. 2a). Unfortunately, the SLAC and FEL analyses did not find any sites affected by positive selection but suggested 191 sites and 255 sites influenced by negative selection, respectively. Also, the FUBAR method did not detect any codons under pervasive diversifying selection exceeding a PP > 0.9. It should be noted that no positive selection was detected in sites 390 and 402 in our study, which is located within or adjacent to the putative receptor-binding domain (domain III) implicated as a potential virulence determinant in mice in American DENV-2 genotype 22,23 . Obviously, most of sites in the E gene are under strong negative selection, indicating strongly conserved. The purifying selection shows a typical highly adapted phenotype, which is probably caused by constraints imposed by protein structure and function. Then, we correlated the positively sites with the time-scale of DENV-2 migration, most codon positions were under negatively selection. Comparatively, selection pressure in the early viruses involved an increase in amino-acid mutations, leading to the increased positive selection most likely in response to host immune pressure. While the decreased positive selection or increased negative selection were observed in the most recent period (Table S1, Fig. S4). Totally, the positive selection on DENV-2 is relatively weak when compared to the viruses that appear to be under strong host immune pressure, such as influenza A virus 24,25 and hepatitis C virus 26 . Glycosylation focus on the importance of viral virulence and immune evasion, so we predicted glycosylation sites in envelope glycoprotein. Probable 23 O-linked and 2 N-linked sites were presented in Table S2 and Fig. 2a. The N-linked glycosylation site 153 is conserved in most flaviviruses, while the position 67 is unique for dengue viruses. All glycosylation sites detected in this study are not under positive selection. Collectively, purifying selection may be more important in shaping E protein evolution, leading functionally and structurally conserved.
Evolutionary rate and divergence date estimate. After molecular clock and demographic models selection using Akaike information criterion through MCMC (AICM), the relaxed molecular clock (uncorrelated exponential) and an exponential growth tree prior are the best combination in our data analysis (Table 1, Fig. 2b). For the sampled population, the mean evolutionary rate was estimated to be 8.94 × 10 −4 substitutions/site/year  23,1943.74), and genotype American (n = 1) in 1977. While in the root-to-tip regression analysis for DENV-2 population, the nucleotide substitution rate was estimated to be 1.45 × 10 −3 nucleotide subs/site/year, and the TMRCA could date back to 1889.51 (Table 1, Fig. 2c). The analysis indicated that DENV-2 evolved almost 1.6-fold faster and diverged approximately 24 years earlier than those estimated by using the BEAST program. Although there are some slight differences between two methods, they still fall within the confidence interval.

Reconstruction demographic history.
To explore the demographic history of the sampled population and genotypes over the study period, the relative effective population size over time was inferred by analyzing the genetic diversity using the Bayesian skyline model. As shown in Fig. 3a, the population has had a steady size until a first decay inferred around 1975. The decrease was followed by a sharp increase started around 1987-1990, coinciding with the period when genotype AM/AS viruses were introduced (Fig. 3b). This genetic diversity reached a plateau that remained steady until the end of 1997. The geographic expansion of the DENV-2 entailed two major waves since 1997, and the first reached its peak before 2001. The effective population size of three genotypes   Table 1. Comparison of BEAST models. Comparison of parameter estimates and AIC values among combinations of demographic (constant population size, exponential growth and Bayesian skyline-plot) and molecular clock (strict and uncorrelated molecular clock, UCED and UCLD) models for the reconstruction of the evolutionary and demographic history using BEAST. remaining relatively stable during the period, the Cosmopolitan genotype may be the major contributor to the genetic diversity of DENV-2 ( Fig. 3b). Correspondingly, Philippines experienced its worst dengue outbreak during 2000-2001 27 . The second sharp increase in diversity occurred reaching higher levels of genetic variability than that of the first, peaking around 2005, and approximately 14,000 people were infected during the worst dengue outbreak on record for Singapore in the year 28 . The diversity of DENV-2 declined sharply during 2005-2009, which may be the use of mosquito repellent to protect human against mosquito and to break the chain of transmission 29 , but the genetic diversity in Asian 1 showed an upward trend in this period. The population size of DENV-2 remained stable after that. Our observation showed that DENV genotype variation is a major underlying determinant of epidemics. To further understand the changing ecology and epidemiology of the disease, the epidemiological data of DENV serotype from the global were collected from WHO (Fig. 3c). Although the event of exponential increase in global dengue cases during 1976-1977 was not detected by the Bayesian skyline plot (BSP) analysis, the analysis is generally consistent with the epidemiological dengue cases of all four DENV serotype.
To infer the time-resolved phylogeny in representative areas, an estimation of the relative change in the population size of DENV-2 over time was made from Gaussian Markov random field (GMRF) Bayesian Skyride coalescent model, which differs from the BSP by not requiring the specification of a user-defined prior on the number of population size changes in the history of the sample. We compared the relative genetic diversity of 8 different geographic regions with coinciding epidemic trends, representing the tropics and subtropics areas (Fig. 4). Phylogeographic and migration pattern analyses of DENV-2. As phylogeographic analysis of DENV transmission regionally is conceivable, we first carry out a phylogeographic analysis within Asia. Locations that are epidemiologically linked are identified using the Bayes Factor (BF) test under Bayesian Stochastic Search Variable Selection (BSSVS). The transition between Singapore and China received the strong support by the BF criterion (BF = 708.7) (Fig. S6a). In order to understand the migration process acting along DENV-2 phylogeny in Asia, Fig. S6b projects each of the branches of the MCC phylogeny onto a geographic map. The earliest routes began in India, continued to China and Philippines and then Cambodia before 1970. In 1971-1985, the virus appeared to spread from China to Viet Nam and from India to Indonesia. In the next few years, it is important to note that the circulation of the virus resulted in a very high incidence of epidemics in the three countries (Thailand, Viet Nam and Cambodia) during 1986-2005. After 2006, the virus had expanded their geography distribution to Pakistan. Subsequently, we expanded to transcontinental analysis to understand the geographic origins of the DENV-2 virus in the world. The phylogeographical analysis shows that Nicaragua, Puerto Rico, Brazil, Caribbean, South America, Southeast Asia, Viet Nam and East&South Asia played central roles in the dispersal of DENV-2. To illustrate the spatial diffusion of the virus within the tropics and subtropics, a joint analysis of discrete trait models was performed to estimate the overall genetic transmission process. Figure 5 shows the maps of the routes when calculating a BF for the most significant non-zero rates, and we used a BF cutoff of 3 to determine significance. And the rates of viral migration across the eight regions were estimated. Fourteen migration events with strongly support were summarized in Fig. 5a and Table S3. Among them, two significant migration links were established with mean rates from 0.92 to 0.937 (decisive support with BF > 1,000), with one originating from South America to Caribbean, and another from East&SouthAsia to Puerto Rico (Fig. 5a, Table S3), state counts support the idea that East&SouthAsia may have acted as a seeding population (Fig. 5b).
For a more detailed understanding the routes of invasion of DENV-2, eight different data sets were used to represent the global epidemic periods: 1964-1990, 1991-2000, 2001. During 1964-1990, there are six significant migration links with higher supported diffusion rates (Fig. 5c), and state change counts clearly reflected this dynamic with outward migration from Southeast Asia and inward migration from Puerto Rico dominating (Fig. 5d). From 1991 to 2000, one strongly supported pathway (BF > 100) from Viet Nam was apparent, indicating migration from Viet Nam to Brazil (Fig. 5e), meanwhile state counts suggest that a migration out of Viet Nam is probable (Fig. 5f). In the period from 2001 to 2013, three strong epidemiological links (BF > 1,000) from East&South Asia to Puerto Rico and Nicaragua and from Southeast Asia to Brazil were illustrated in Fig. 5g, and these were supported by the state change counts with migration from East&South Asia being much greater than any other location included in our analysis (Fig. 5h). The inferred spatial dynamics suggested that multiple geographical regions might act as a potential seed for local epidemics. Additionally, we reconstructed genealogical tree with time-scale and inferred ancestral locations of each branch through a Bayesian phylogeography framework. The most probable location of each branch of the tree was assigned to different colors and the calibrating time-scale was shown on the bottom, and the Southeast Asia has the highest posterior probability of being the location of the tree root (Fig. S5), indicating that the five genotypes seem to have originated from the Southeast Asia.
A structured coalescent approach was used to calculate the proportion of the trunk for each geographic region. A higher proportion implies that the corresponding region is more likely to be the source of the virus. The mean proportion and credible interval of the trunk assigned to each geographic region were summarized in Table 2. The trunk of the genealogy predominantly resides in Southeast Asia (32%) and East&South Asia (15%) from 1964 to 2013 (Fig. 6a). As previously established, migration patterns near the root of the tree support Southeast Asia as source populations (Fig. S5). We observed that samples from the Southeast Asia during the period from 1964-1985 coalesce to the trunk of the genealogy rapidly, and from 1986 to 1993, Puerto Rico samples are closer to the trunk than samples from others (Fig. 6b).
Scientific RepoRts | 7:45505 | DOI: 10.1038/srep45505 To assess the strength of geographical association with sampling locations across the entire tree, the global trait association tests of phylogeographic structure were performed and the null hypothesis of no association between sampling location and phylogeny is nearly always rejected. The analysis shows a very strong geographic clustering of strains by area of origin (P = 0 for both association index (AI) and parsimony score (PS) statistics). The monophyletic clade (MC) statistic was used to test the extent of phylogenetic clustering of individual regions, the result showed that population subdivision was significant for most of the localities (P < 0.05). The most significant geographic correlation of lineages was found in Nicaragua and Puerto Rico, reflecting predominantly local evolution in these regions (Table 3).

Discussion
The dengue virus has become a leading cause of illness and death in the tropics and subtropics, emerging as a serious public health problem in the world. Some 50-100 million new infections are estimated to occur annually in more than 100 endemic countries, with a documented further spread to previously unaffected areas (http:// www.who.int/). The viruses co-circulate in many countries with multiple serotypes, because infection with one serotype does not provide long-term protection against the other serotypes. For a long time, there is no specific antiviral therapy or vaccine in clinical use for the disease. But now, the first dengue vaccine approved in Mexico on December 9, 2015 7 . In spite of this, continuous surveillance of the virus can facilitate the early detection of novel emerging variants, and acts as a fundamental part of integrated preventive strategy. Previous studies were mainly about specific country and region 7,30-33 , or only focused on certain genotype 19 . Therefore, the challenge is to analyze evolutionary history and geographic spread of the virus globally. We utilized a larger and more temporally and geographically spread data to understand epidemics and endemic diseases, and provided effective control for epidemiology.
In this study, a consensus ML tree and Bayesian tree were constructed using 194 nucleotide sequences isolated from geographically distinct regions. However, low bootstrap support values of > 60% and posterior probability of > 0.53 are shown above the branch with one strain (DENV-2/ID/1183DN/1977) (Fig. 1, Fig. S1). As we have mentioned previously, we have discussed that the strain from American genotype with the reported strains (DENV-2/Tonga/1974, DENV-2/CO/BID-V3358/1986 and DENV-2/PE/IQT2913/1996) as the reference (Fig. S2). The result indicated that the isolate from Indonesia can be named the American genotype. The strains from America share a common ancestor with the genotypes circulated in Cambodia and Viet Nam, that is, AM and AS lineages have co-circulated at certain point in the past 34 38 . What causes differences between the estimates? A lot of sylvatic strains and more endemic strains were included in the previous analyses, resulting in an earlier TMRCA value and lower the evolutionary rate. E protein mediating receptor binding and endosomal fusion is the major antigenic determinant of DENV. Some amino acid sites may be located in or near B-or T-cell epitopes for neutralizing antibody and T-cell responses, and the positive selection acting on the sites is associated with immune evasion. In some of the earlier studies, positions 360, 390 and 402 were reported to be under positive selection 9,39,40 . Although the mutation N → D390 exists in American genotype, no statistically significant positive selection was detected at the site by five methods. The site 390 has been identified as a hot spot for DENV-2, because it may be crucial in determining the virulence. The site 402 locates within the stem-anchor region adjacent to the putative receptor-binding domain, which confers neurovirulence in mice 23 . Our analysis showed that two previously unreported sites (160 and 403) were under weak positive selection detected by two methods simultaneously. The positively selected site 403 may have a similar function with site 402, but the validation need more experimental evidences. Although previous studies reported sporadic positive selection in E protein amongst four DENV serotypes 21 , purifying selection is clearly the dominant evolutionary pressure acting on DENV. DENV-2 E protein evolved to a high level of fitness for alternate infection between mosquitoes and primates, resulting in most amino acid changes being deleterious and purified quickly from the population by selection 9 . As a consequence, we believe that there is a limit to TMRCA estimation for rapidly evolving viruses and strong purifying selection can lead to an underestimation of viral origins.
Phylodynamics of dengue virus focuses on transmission dynamics in an effort to shed light on how the dynamics impact viral genetic variation, and understanding the population dynamics of the virus can provide important insights into epidemiology and virus evolution. Our Bayesian skyline plots showed the demographic history of DENV-2, a decay was observed during 2005-2009. While population size of Indian Cosmopolitan strains remained almost constant in the period 2005 to 2009 41 , the changes in genetic diversity of other genotypes may resulted in the decay of DENV-2 population size. Furthermore, the event of exponential increase in global dengue cases during 1976-1977 was not detected by the DENV-2 BSP analysis, it is possible that the increase of cases is due to other three DENV serotypes. Because epidemiologic data for single serotype is unavailable in the public database, we are unable to obtain the data for DENV-2 serotype. Nevertheless, the overall temporal trend of the epidemiological dengue cases of all four DENV serotypes is generally consistent with the DENV-2 BSP analysis, indicating that DENV-2 serotype remains predominant in the four serotypes. Epidemiology is a central feature of public health policy and disease management, and prompt case detection and appropriate clinical management can reduce the mortality from severe dengue. In fact, dengue disease has been neglected by the scientific and pharmaceutical industries with no new drug development 42 . Management of dengue disease relies on early diagnosis and prevention with vaccines, but the cause of concern for the local medical community is often unaware of diagnosis and management of the new disease, which leads to avoidable morbidity and mortality. While Mir et al. 33 investigated DENV-2 migration using similar tools, the study was geographically limited to the Americas including the Caribbean. By contrast, our study attempted a global phylogeographic analysis of DENV-2 transmission. Yet phylogeographic analysis of DENV-2 transmission regionally is conceivable, it is difficult to conceive how transmission can occur via long distances over short periods. Even if globalization and travels were taken into account, one would expect to see a concentration of virus dissemination within neighbouring  Table 3. Phylogeny-trait association tests of the phylogeographic structure of DENV-2 using BaTS.
regions, in particular the ones with a smaller time scale. Therefore, we conducted additional analysis within Asia for emergence and dissemination of DENV-2. Subsequently we inferred the evolutionary history in the context of global spread on a much larger scale than the previous study for transcontinental analysis. The aim of our phylogeographic research was to reconstruct the spatial and temporal dynamics based on the analysis for the data set composed of 188 sequences sampled from 1964 to 2013. We have shown that the genetic population structure of DENV-2 arises in part from global migration dynamics, with the most important contributions from South America and Southeast Asia, and the Southeast Asia might be the geographical origin of DENV-2, which is supported by Rico-Hesse et al. 43 . The migration route from South America to Caribbean and East&South Asia to Puerto Rico were strongly supported by BF test. This will lead to geographical expansion of the disease, just like the epidemics expanded from Southeast Asia countries to India 44 . Our results in turn could guide the policy makers for controlling the global dengue transmission, highlighting the usefulness of phylogeography for public health surveillance and the analysis of disease transmission within the regions. Besides anthropogenic factors significantly influencing the DENV migration, some factors involved in the viral dispersal, such as unplanned rapid urbanization, proliferation of mosquitoes, bad sewer and waste management, increased air travel and global warming.
Our use of the structured coalescent model to analyze DENV evolution represents a significant step forward over previous methods. More importantly, the sampling date, sampling location and an underlying model of the demographic process were incorporated into coalescent method, the details will allow for more accurate reconstruction. By analyzing a large number of sampled trees, our estimates of the trunk location over time have a degree of confidence associated with them, confidence intervals are attached to proportions in all regions. Additionally, it is considerable that samples from Southeast Asia are the closest to the trunk, because of this, we can assume that the trunk of the genealogy resides in Southeast Asia. During 1986 to 1993, Puerto Rico samples are closer to the trunk than the strains from other regions, but are not close to the trunk in absolute terms.
Mathematical models have been recognized as useful epidemiological tools in exploring complicated relationships underlying infectious disease transmission processes. The limitation is the fact that modelling exercise is inevitably based on assumptions. One can argue about whether they are too simplistic or impractical and therefore hamper quantitative conclusions about the reality. The reality is always more complex, so epidemiology or DENV transmission will always involve some approximation, and parameter estimates will always involve a certain degree of error and bias. As long as the scientist understands the nature of the problems, approximation, error, and bias do not, by themselves, invalidate the utility of an estimate. The major limitation about the study is that reconstruction of the viral movement pattern was limited to transmission among population sampled, while unsampled populations may play a role in the spread and persistence of DENV-2. The number of samples and the strength of migration are highly correlated in our analysis. The migration signal may be weak due to insufficient sample. In contrast, the small number of samples could be the result of the low prevalence of DENV-2 and a corresponding weak migratory connection with other regions. To overcome this bias, subsampling is typically conducted to ensure that each region has the same number of sequences.

Conclusion
Our study provides novel and significant insights into the temporal and spatial dynamics of the DENV-2. The estimated TMRCA suggested that DENV-2 emerged in 1913. Purifying selection is a dominant force acting on E gene to drive the evolution of the virus. Our study suggested that Southeast Asia is the origin of DENV-2 and the hotbed for its spread, and that escalation in genetic diversity lead to epidemics seen in Philippines and Singapore. It is interesting to note that the major routes of viral movement between East&South Asia and Puerto Rico and between South American and Caribbean. The research into virus intercontinental transmission is necessary to understand molecular diversity and to manage public health.

Materials and Methods
Sequence dataset. We have downloaded all DENV-2 E gene sequences with known time (year) and geographical locality (country) of isolation from GenBank (http://www.ncbi.nlm.nih.gov/). Strains used in this study are presented in the following format: two letters ISO country code/strain name/year of isolation. In order to minimize the problem of over-representation of some countries, such as the Puerto Rico, 194 sequences were subsampled from the larger set of 946 sequences by location and time to create a more equitable spatiotemporal distribution, excluding sequences with 100% identity and removing the recombinants detected by RDP, Chimaera, BootStan, GENECONV, MaxChi and SiScan methods implemented in RDP4 program 45 to avoid inferential biases. The sequences, which covers a temporal range of 71 years (1944-2014) from 30 countries, were aligned by MAFFT 46 . The final alignments included 1,485 nucleotides, and were edited manually with Bioedit 47 . The strain names and accession numbers are listed in Table S4. The specific sampling problem is described in detail in the supporting information Text S1.
Phylogenetic tree inference. Phylogenetic tree was inferred from the alignment using the ML approach implemented in PHYML v3.1 48 with 100 bootstrap replicates. The best fit substitution model, general-time-reversal incorporating invariant sites and a gamma distribution (GTR + I + G), was chosen by Akaike Information Criterion (AIC) using jModelTest v0.1 49 (see Text S1 for a detailed comparative analysis). The output from PHYML is mapped onto the reference tree with TreeGradients v1.03 50 , plotting the bootstrap value at the nodes in continuous rainbow color gradient from the root of the tree toward the tips. The existence of temporal information in the data set was tested by regression analysis of root-to-tip genetic distance in the ML (PhyML) tree against sampling date using TempEst v1.5 (http://tree.bio.ed.ac.uk/software/tempest/). The substitution rate remains constant through time, the distance equals the genetic distance from our collected sequences to the root and time equals the sampling time.
Scientific RepoRts | 7:45505 | DOI: 10.1038/srep45505 Natural selection. Site-specific selection pressure was assessed using five varying methods implemented in HYPHY 2.2.3 51 and available through the web-based interface Datamonkey (http://www.datamonkey.org/), including SLAC, FEL, REL, MEME and FUBAR. In these five methods, we utilized TrN model as nucleotide substitution bias model and significance levels set to P < 0.1, Bayes factor > 20 or PP > 0.9.
Glycosylation sites were detected using NetNGlyc v1.0 52 and NetOGlyc v3.1 53 methods that employ algorithms based on neural networks. Structure homology modeling of E protein was performed using the SWISS-MODEL serve (http://swissmodel.expasy.org). To visualize the locations of glycosylation sites and positive selected sites, each identified amino-acid residue was mapped onto a three-dimensional structure of E protein (Protein Data Bank code: 3J27) using PyMOL (Molecular Graphics System, version 1.7.4 Schrödinger, LLC).
Time-scaled phylogeography reconstruction. We evaluated the performance of different demographic (constant population size, exponentially growing population and Bayesian skyline plot) and molecular clock (strict and uncorrelated relaxed clock) models by comparing an improved variant of AICM values 54 . To test whether our results are robust to the coalescent prior used, we estimated the TMRCA parameter under six different coalescent prior models. Exponential growth coalescent tree prior is the best model used in the study. The rates of evolutionary (subs/site/year) and TMRCA of DENV-2 were estimated using BEAST v1.8.0 package 55,56 , which employs a Bayesian MCMC algorithms to infer population genetic histories from molecular sequences. For this analyses, we used the SRD06 codon position model 57 , to allow partitioning for codon positions (1 + 2 positions and 3 position), with the HKY85 + Γ substitution model applied to these codon divisions. The length of MCMC chain was run for 300 million steps and the log parameter values were sampled at every 30,000 steps after discarding 10% as burn-in. We also used the BEAGLE parallel computation library to enhance the speed of the likelihood calculations 50  Estimating population dynamics of Dengue virus type 2. We utilized the BSP, a method for estimating past population dynamics through time from a sample of molecular sequences, without dependence on a prespecified parametric model of demographic history, to assess the change of relative genetic diversity (Neτ ; Ne represents the effective population size and τ is generation time) of DENV-2 through time. Then we estimated evolutionary dynamics of DENV-2 isolated from 8 representative geographical regions: Nicaragua, Puerto Rico, Brazil, Caribbean, South America, Southeast Asia, Viet Nam and East&South Asia. To avoid the risk of the model over-parameterization and maximize the reliability of the results, we maintained similar total numbers of sequences for each region across the entire study period. We removed areas with sample size less than 10, and the remaining 188 sequences were used for further analysis. The estimation of the relative change in the population size of DENV-2 over time was made from GMRF Bayesian Skyride coalescent model 58 . For each regional data set, the GMRF skyride plots and the maximum clade credibility trees were reconstructed, and visualized in Tracer and FigTree respectively.
Discrete phylogeographic analyses and phylogeny-trait association. For understanding the spatial dynamics of DENV-2, phylogeographic analyses were performed to infer diffusion rates and migration patterns simultaneously through time. Three different data sets were used to represent these global epidemic periods: 1964-1990, 1991-2000, and 2001-2013. The asymmetric substitution model with the BSSVS was used to infer asymmetric diffusion rates between any pairwise location state. Meanwhile, BF test was conducted for significance of individual rates to infer the significant epidemiological links. If BF > 3, the phylogeographic link between two locations is considered to be statistically significant and BF > 1,000 is regarded as a very strong support of the degree of rates. The outputs of the Bayesian phylogeographic analyses, as generated by BEAST, were summarized using MCC tree. All demographic parameter settings were the same as in the BEAST analysis described earlier.
Further the tree nodes were annotated with their modal location states via color labeling. Then, we used SPREAD v1.06 59 to convert the estimated divergence times and the spatially-annotated time-scaled phylogeny to a spatiotemporal movement, and visualized the significant migration pathways of the virus.
As the continuous time Markov chain (CTMC) transitions between spatial locations in the phylogeny cannot be directly observed, "Markov jumps" analysis was used to count the number of location state transitions, and the total number of states counts were plotted to present the migration in and out of each location. The waiting times between transitions (Markov rewards) can be tracked on the phylogeny to infer the spatial history of the trunk lineage (described in more detail in Text S1). The program PACT (http://www.trevorbedford.com/pact) was used to perform trunk extraction and processing, which is able to estimate the mean and 95% credible interval for the proportion of the trunk for each geographic region.
To correlate viral phenotypes with phylogeny, we calculated the values of AI, PS and MC based on the posterior samples of trees produced by BEAST v1.8.0 using the Bayesian tip-association significance testing (BaTS) 60 . These statistics methods are described in more detail by Parker et al. 60 . A subset of 900 trees was selected from the combined posterior distribution and subsampled using LogCombiner v1.8.0. We used 100 state randomizations to create the null distribution to test the significance of our observed data. In the present study, P value < 0.05 was considered significant for all statistics analyses calculated by BaTS (we have a more detailed description in Text S1).