Abstract
For most pathogens, transmission is driven by interactions between the behaviours of infectious individuals, the behaviours of the wider population, the local environment, and immunity. Phylogeographic approaches are currently unable to disentangle the relative effects of these competing factors. We develop a spatiotemporally structured phylogenetic framework that addresses these limitations by considering individual transmission events, reconstructed across spatial scales. We apply it to geocoded dengue virus sequences from Thailand (N = 726 over 18 years). We find infected individuals spend 96% of their time in their home community compared to 76% for the susceptible population (mainly children) and 42% for adults. Dynamic pockets of local immunity make transmission more likely in places with high heterotypic immunity and less likely where high homotypic immunity exists. Age-dependent mixing of individuals and vector distributions are not important in determining spread. This approach provides previously unknown insights into one of the most complex disease systems known and will be applicable to other pathogens.
Introduction
As with other endemic pathogens, widespread, sustained co-circulation of dengue viruses (DENV) effectively masks the dynamics of individual lineages^{1,2,3}. The co-occurrence of unrelated transmission chains means we still have only a limited understanding of how DENV spreads, including the roles of human mobility of both infected individuals and the surrounding susceptible population, age-specific mixing, and local heterogeneities in serotype-specific population immunity and mosquito density. These mechanistic knowledge gaps help explain our failures to control a pathogen that continues to cause 50 million annual symptomatic infections globally^{4}. The use of pathogen sequences has the potential to help. However, existing phylogeographical approaches can only provide limited mechanistic insight into drivers of spread as they focus on rates of flow between locations present in a phylogeny, based on assumptions of mass action and using traits attached to the observed sequences (i.e., cases) only^{5,6}. In addition, fewer than 1% of dengue infections will currently be sequenced from any one country^{4}. Critically, existing phylogeographic approaches do not consider that viral flow is made up of sequential transmission events, with each event arising from a complex interplay of individual, population, environmental and viral-level factors. Further, the bulk of available sequences typically come from a few locations, with most locations providing no data. Many existing phylogeographic approaches will infer a viral flow between observed locations without considering that the transmission events linking two observed sequences will be unobserved and will often occur in unsampled locations.
Here we develop an inference framework that fills this knowledge gap by explicitly considering individual transmission events. By using the generation time distribution for dengue (Supplementary Fig. 1), we derive estimates of the number of generations that separate each pair of sequences in a time-resolved phylogeny and consider viral mobility for a single-transmission generation. This shift of focus to single-transmission generations, rather than overall viral flow, allows us to develop detailed mechanistic models of how viruses are moving at a tractable and interpretable scale. For example, we model population movement separately for infected individuals and for the susceptible population (mainly children), each relative to adults, allowing for transmission to occur in the infector’s community, in the infectee’s community or in a tertiary location. We assume the local scale of movement of the Aedes mosquito means only human mobility can drive spread between locations^{7}. We allow for the disabling symptoms of dengue to result in reduced mobility in cases compared to the susceptible population^{8}, for transmission to occur in an age-structured manner^{3}, and for the spatial heterogeneity in vector distributions and the dynamic nature of local serotype-specific population immunity to affect where successful transmissions occur^{9}. Using the transmission probabilities for a single generation, we probabilistically integrate over all possible pathways for the total number of transmission generations that link the observed locations for each pair of sequences, thereby capturing movement in unsampled locations. In parallel, we incorporate the probability of sequencing (i.e., observing) an infection at each space–time unit, thereby explicitly incorporating space–time biases in sampling.
We fit our models in a maximum likelihood framework that incorporates uncertainty from the evolutionary processes, including topological uncertainty in the phylogenetic trees, uncertainty in the generation time distribution and sampling uncertainty using a bootstrap approach.
We apply our framework to dengue in Thailand, a country that has suffered from self-sustained dengue circulation for decades^{3,10}. We use 726 sequences obtained from seven different provinces sampled over an 18-year period (1995–2012) (Fig. 1A, D), from which we build time-resolved phylogenies (Fig. 1E–H and Supplementary Figs. 2–5)^{11}. In Bangkok, the home location of the cases was also geocoded (N = 467) (Fig. 1B). To inform our model, we use data from a major mobile phone operator in Thailand that empirically captures how adults move (N = 11.4 million subscribers; 26% market share), modelled estimates of the probability of Aedes aegypti occurrence (Supplementary Fig. 6) and the long-term spatiotemporal distribution of serotypes (Fig. 1C)^{3,12}. We explore viral movement over two different spatial scales: within central Bangkok (N = 337 1-km^{2} grid cells throughout the centre of the city) and nationwide (N = 76 provinces with a mean area of 6700 km^{2} each).
Results and discussion
Mobility of susceptible and infected populations
Using our framework, we estimate that in Bangkok, susceptible individuals spend 76% (95% CI, 57–95%) of their time within their home cell as compared to 42% for adults (95% CI, 41–43%) (Fig. 2A and Supplementary Figs. 7–9). As dengue susceptibility is concentrated in children (Supplementary Fig. 10), our findings of reduced mobility in susceptible individuals suggest that children are less likely to travel far from their home than adults. To explore the consistency of this finding with observed differences in mobility by age, we use data from a separate study from Thailand that asked individuals (N = 2011) of all ages about their daily travel. Consistent with our findings of reduced mobility in susceptible individuals, we find that there is a strong relationship between age and reporting having stayed within 1 km of their home in the prior week (Supplementary Fig. 10). Incorporating the probability of being susceptible by age suggests that susceptible individuals are 1.5 (95% CI: 1.2–1.9) times as likely to report staying within 1 km of their home in the last 7 days, consistent with the 1.8 (95% CI: 1.3–2.2) times difference estimated by our model (Supplementary Fig. 10).
Using our model, we find that infected individuals in Bangkok are even less mobile than susceptible individuals with 96% (95% CI: 87–100%) of infected individuals’ time being within their home cell (Fig. 2A). Importantly, these estimates of differential mobility hold for the intervening unseen transmission events, as well as the observed cases in the phylogeny. This shows that on average, infected individuals are more likely to stay in and around their home. This suggests that some subclinical DENV infections may still result in severe enough symptoms to change the daily routine and limit mobility. Further observational studies are needed to understand how movement changes across the spectrum of disease severity^{13}. We also observed similar differences in mobility patterns at the national scale, with cases spending 96% of the time within their home province (95% CI: 86–100%), compared to 95% for the susceptible population (95% CI: 89–100%) and 87% for adults (95% CI: 86–88%) (Fig. 2B).
Role of local immunity, vector and age
Local serotype-specific immunity also appears important, with transmission being more likely to occur in places that have seen increases of other (heterotypic) serotypes circulating in the previous two years and less likely to occur in places with increased cases of the same (homotypic) serotype within the same timeframe (Fig. 2D, E and Supplementary Figs. 8 and 9). However, the overall incidence of reported cases in the home cell or province of the susceptible population is not associated with differences in transmission risk, highlighting the complex relationship between observed case incidence and underlying infection risk. More direct measures of local immunity through population-representative seroprevalence studies may provide a more nuanced picture of the role of immunity in patterns of spread^{14,15}.
We find that the probability of Aedes aegypti presence is not linked to transmission risk (Fig. 2F and Supplementary Figs. 8 and 9), although, for Bangkok, this may be driven by limited heterogeneity in estimated presence across the city (Supplementary Fig. 6). This does not rule out a role for the vector in characterising heterogeneity in risk. In particular, the relationship between the modelled probabilities of occurrence that we have used and vector density remains unclear. After accounting for age-specific patterns of immunity, we find no evidence of age-dependent transmission, with 0.12 (95% CI: 0.09–0.42) of sequential infections in a transmission chain occurring between individuals with an age difference of <2 years. This is not statistically different from models with no age structure in transmission, suggesting that the intermediary vector removes the effect of assortative mixing (Fig. 2G and Supplementary Fig. 11).
Characterising model fit
To assess the performance of our model, we repeatedly refit the model on data where we remove all sequences from a subset of locations (held-out locations). We find our model is able to accurately estimate the probability of observing viruses in the held-out locations, both within Bangkok and at the nationwide scale (correlation between observed and estimated locations of held-out sequences of 0.94 in Bangkok and 0.94 nationwide) (Fig. 2C). This demonstrates our framework can characterise viral movement in unobserved locations and highlights how non-representative sequencing approaches can still provide accurate descriptions of overall virus mobility. However, sampling biases do need to be explicitly incorporated, as assuming unbiased observation results in very different parameter estimates, including falsely high estimates of between-location population movement (Supplementary Fig. 12). Using a simulation framework, we show we are able to accurately recover known parameter values, even under biased observation (Supplementary Fig. 13).
Simulating transmission across spatial scales
We use our fitted model to characterise the movement of the virus at each transmission generation. We use a simulation approach that introduces viruses into randomly selected provinces and uses the fitted mobility matrices to see where transmission occurs over 20 transmission generations. Averaging over repeated simulations, we find that the virus is 4.3 (95% CI: 2.4–7.0) times as likely to have travelled to Bangkok after a single-transmission generation as compared to a randomly selected province. After 20 transmission generations (equivalent to ~1 year of sequential transmissions), we find the virus is 11.4 (95% CI: 6.3–19.4) times as likely to have infected at least one individual in the capital as compared to at least one individual living in a randomly selected province (Fig. 3A and Supplementary Fig. 14). The flow to larger cities is not restricted to Bangkok, with the likeliest destinations after 20 generations also being where the largest population centres are located (Fig. 3A). Substantial heterogeneity is also observed at the local scale, with the virus tending to go to the hyper-urban city centre in Bangkok (Fig. 3B). Overall, within Bangkok, we find that 34% of infections occur outside the 1-km^{2} home grid cell of an infected individual (95% CI: 26–43%). This is despite infected individuals spending only an average of 4% of their time outside their home cells, highlighting the importance of considering mobility in both infected and susceptible populations when considering viral spread (Fig. 3C). After 20 generations, only 2.6% of viruses are still within the same Bangkok cell and 34% are within the same province (Fig. 3D).
While most transmissions occur within the home cell of an infected individual, we find that when transmissions do occur further away, local heterogeneity in patterns of serotype-specific population immunity means that the pathway taken by viruses depends on the serotype (Supplementary Fig. 15). Within Bangkok, on average, the likeliest location after a single-transmission event was the same across serotypes 85% (95% CI: 81–89%) of the time, dropping to only 44% (95% CI: 38–53%) overlap after 20 generations. These effects are not observed at a larger scale, where the likeliest destination province remained largely the same across serotypes.
Using this same simulation approach, we explore how far a virus will have spread a year after its introduction in a randomly selected location. We find that the virus will have infected individuals from, on average, 27% of all provinces (95% CI: 15–38%) and in 32% of cells within central Bangkok (95% CI: 25–43%) (Supplementary Fig. 16). We find that local immunity and the reduced mobility of cases compared to the susceptible population have minimal effect on the number of locations affected; however, if the mobility of susceptible individuals matched that of the adult population, there would be 1.9 times as many infected provinces (95% CI: 1.4–3.3), with a similar effect at the local Bangkok scale (RR: 1.5, 95% CI: 1.1–1.9) (Fig. 3E, F). For arboviruses such as Zika and chikungunya viruses, where limited immunity means most infections are in adults, we could therefore expect a more rapid dispersal of the virus compared to DENV^{16}. We observe consistent patterns across different effective reproductive numbers and for overdispersed transmission (Supplementary Figs. 16 and 17).
Summary
By explicitly characterising the mechanisms of individual transmission generations and integrating the mobility of populations, our framework brings inference to a tractable scale and allows unbiased inferences to be made despite minimal and heavily biased sequence availability. Individual transmission generations are also those most relevant for targeted interventions and can help predict future flows. While we have used this framework for DENV, it is applicable to other communicable pathogens where there exists a time-resolved phylogeny, the generation time distribution is known and is relatively short (days or weeks) and there exists spatial information or other discrete traits.
Methods
Data sources
Sequence data and associated metadata
We use all available full genome sequence data from all four serotypes from Thailand covering the period between 1994 and 2012 where province-level spatial information is available. All sequences are available from GenBank. The accession numbers are set out in Supplementary Data 1. For sequences from Bangkok, we also have household coordinates for a subset of 432 sequences. In Bangkok, the sequences come from Queen Sirikit National Institute of Child Health (QSNICH), a large children’s tertiary care hospital based in the centre of the city. Outside Bangkok, most sequences come from the national dengue surveillance system run by the Ministry of Public Health, which performs confirmatory testing, viral isolation and sequencing of samples from sentinel hospitals based around the country. The hospitals are based in the following five provinces: Lampang, Ratchaburi, Songkhla, Pathum Thani and Nakhon Ratchasima. In addition, there are sequences available on GenBank from Kamphaeng Phet province.
Case data
All cases of dengue are notifiable and are reported to the Thai Ministry of Public Health. We extract the number of cases per year for each of the 76 provinces in the country between 1994 and 2012. To estimate serotype-specific case data in each year, we used the serotype distributions of geocoded cases from QSNICH for Bangkok (N = 11,583 cases). For the rest of the country, we used serotype-specific case data from the Ministry of Public Health. Outside the five sentinel surveillance sites, these data mainly come from ad hoc samples sent to the ministry for testing. Altogether, this represents a serotype-specific database of 27,586 cases covering 67 of the 76 provinces. For each province and year, we calculated the proportion of cases that were caused by each serotype. Where there were no samples from that province-year, we used data from the closest province in that same time period where data were available.
Population data
In Bangkok, we initially placed a grid of 1 × 1-km cells over the central part of the city (337 grid cells) and estimated the population size of each cell using population-size estimates from LandScan for 2010^{17}. We also identified the grid cell for each of the Bangkok sequences. At the province level, we used population-size estimates from the 2010 national census. We note that the Thai population has been relatively stable over the study period (rising from 60 million to 64 million between 1995 and 2012).
Call detail records (CDR)
To estimate adult mobility across provinces and within Bangkok, we used the call detail records (CDR) of over 11 million mobile phone subscribers between August 1, 2017 and October 19, 2017 from the third-largest mobile phone operator in Thailand (N = 11.4 million subscribers, 26% market share). These data are described in more detail elsewhere^{18}. Briefly, each subscriber was assigned a daily home location based on their most frequently used cellular tower. Travel between locations was estimated by tabulating the subscriber’s home location on one day relative to the day before. The location-to-location transition probability matrix was estimated by using the average travel from location i to location j (weighted by the population at location i), and normalising travel such that the sum of travel from location i is equal to 1. We note that the mobile phone data were collected after our study period (2017 vs 1995–2012). Human mobility may have changed in the interim, which could help explain some of the differences between the mobility estimated by the fitted models and that implied by the mobile phone data.
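The normalisation step described above can be illustrated with a minimal sketch. The array `trips` and its values are hypothetical stand-ins for the averaged travel counts, and the function name `row_normalise` is ours; the population weighting used in the original analysis is not reproduced here.

```python
import numpy as np

def row_normalise(trips):
    """Convert average origin-to-destination travel counts into
    transition probabilities, so that each origin row sums to 1."""
    trips = np.asarray(trips, dtype=float)
    row_sums = trips.sum(axis=1, keepdims=True)
    if np.any(row_sums <= 0):
        raise ValueError("every origin needs at least one recorded trip")
    return trips / row_sums

# Hypothetical average counts for three locations
# (rows: home location i, columns: destination j).
trips = [[80, 15, 5],
         [10, 85, 5],
         [20, 20, 60]]
cdr = row_normalise(trips)
```

Each row of `cdr` can then be read as the probability that a subscriber living in location i is found in each location j.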
Aedes aegypti abundance estimates
We used previously published estimates of the probability of Aedes aegypti presence for 5 × 5km grid cells around the globe^{19}. These estimates were generated by incorporating information on temperature, rainfall, vegetation indices from satellite imagery and fitting models to a large dataset of Aedes occurrence records. The fitted models were then used to predict elsewhere. For Bangkok, we extracted the Aedes aegypti estimate using the centroid of each grid cell. For the province level, for each province, we used the simple average across all the raster cells from the Aedes map that were contained within that province.
Generation time distribution for dengue
To estimate the generation time distribution for dengue, we combined data on the incubation period, extrinsic incubation period and the lifespan of the Aedes aegypti mosquito as has previously been used for chikungunya^{20}.
Human incubation period (HI)
We used a truncated lognormal distribution with a mean of 5.6 days and a standard deviation of 1.41 days and a maximum time of two weeks^{21}. The stated mean and standard deviation are the values prior to truncation.
Human-to-mosquito transmission (HM)
Based on the estimated durations of viremia, we used a truncated exponential distribution with a mean of 4.5 days and a maximum period of 7 days^{22}.
Mosquito infectiousness (MI)
The period of mosquito infectiousness depends on the mosquito lifespan and the extrinsic incubation period. The average daily probability of survival for Aedes aegypti has been estimated at 0.87 for up to 30 days^{23}, equivalent to a mean lifespan of 7.2 days. The extrinsic incubation period has been estimated at 6.1 days^{24}. To calculate the period of mosquito infectiousness, we first draw the mosquito lifespan (MLS) from a truncated exponential distribution with a parameter of 7.2 days and a maximum value of 30 days. Next, we draw the age at which the mosquito gets infected (MAI) from a uniform distribution between 0 and the lifespan of the mosquito. Next, we draw the extrinsic incubation period (EIP) from an exponential distribution with a mean of 6.1 days. The total period of mosquito infectiousness (MI) is then equal to MLS − MAI − EIP. Values of MI <0 were considered unsuccessful onward infections.
Generation time distribution
We derived the empirical distribution of the generation time by simulating 10,000 values for HI, HM and MI and summing them. Individuals who are viremic for longer are more likely to infect mosquitoes. Similarly, mosquitoes that are infectious for longer are also more likely to infect more individuals. We, therefore, weighted the probability of each generation time by the length of HM multiplied by the length of MI. We obtained a mean generation time of 18.2 days and a standard deviation of 6.1 days, which we approximated using a gamma distribution with the same mean and standard deviation (Supplementary Fig. 1).
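The simulation steps above can be sketched as follows. This is an illustrative reading of the text rather than the authors' code, and several choices are assumptions: the incubation-period mean and standard deviation are interpreted on the natural (day) scale, truncation is done by rejection sampling, and all function names are ours. Because of these assumptions, the sketch is not expected to reproduce the reported mean and standard deviation exactly.

```python
import math
import random

def truncated_exponential(mean, upper, rng):
    """Exponential draw with the given mean, rejecting values above `upper`."""
    while True:
        x = rng.expovariate(1.0 / mean)
        if x <= upper:
            return x

def truncated_lognormal(mean, sd, upper, rng):
    """Lognormal draw; `mean` and `sd` are the pre-truncation values
    on the day scale (an assumption about the parameterisation)."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    mu = math.log(mean) - sigma2 / 2.0
    while True:
        x = rng.lognormvariate(mu, math.sqrt(sigma2))
        if x <= upper:
            return x

def simulate_generation_times(n, seed=1):
    """Return n simulated generation times and their HM x MI weights."""
    rng = random.Random(seed)
    times, weights = [], []
    while len(times) < n:
        hi = truncated_lognormal(5.6, 1.41, 14.0, rng)   # human incubation
        hm = truncated_exponential(4.5, 7.0, rng)        # human-to-mosquito
        mls = truncated_exponential(7.2, 30.0, rng)      # mosquito lifespan
        mai = rng.uniform(0.0, mls)                      # mosquito age at infection
        eip = rng.expovariate(1.0 / 6.1)                 # extrinsic incubation
        mi = mls - mai - eip                             # mosquito infectiousness
        if mi < 0:
            continue  # unsuccessful onward infection
        times.append(hi + hm + mi)
        weights.append(hm * mi)
    return times, weights

def weighted_mean_sd(values, weights):
    total = sum(weights)
    mean = sum(v * w for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, math.sqrt(var)

times, weights = simulate_generation_times(10_000)
mean, sd = weighted_mean_sd(times, weights)
# Moment-matched gamma approximation (shape and rate).
shape = (mean / sd) ** 2
rate = mean / sd ** 2
```

Matching the first two moments is one simple way to obtain a gamma distribution with the same mean and standard deviation as the weighted empirical distribution.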
Time-resolved phylogenetic trees
Taking each serotype in turn, we aligned the full genome sequences using the Muscle algorithm in MEGA^{25}. We built Bayesian time-resolved phylogenetic trees using BEAST 2.5.0^{11}. We used a strict clock, a General Time Reversible nucleotide substitution model, as determined by jModelTest2^{26}, and a Bayesian skyline prior. Similar coalescence times were found using a relaxed clock.
Probabilistic model
Overall inferential strategy
We use a likelihood-based approach to model the probability of the observed location of pairs of sequences in a time-resolved phylogeny. We initially use the time-resolved phylogenies and information on the generation time distribution to estimate the number of generations that separates each member of a pair of sequences in the phylogeny from their Most Recent Common Ancestor (MRCA). This then allows us to consider single-transmission generations rather than overall viral flow. We develop models of viral movement between each location in our study area (whether it was sampled or not) for each transmission step. These viral movement models incorporate estimates of human mobility with the potential for differences in the movement for cases compared to the susceptible population, as well as incorporating effects of time-varying serotype-specific immunity and vector distributions. We allow for infection to occur at the infector’s home location (which requires the infectee travelling to the infector’s home location), the infectee’s home location (which requires the infector to travel to the infectee’s home location) or in a tertiary location (where both parties would have to travel there).
We use the viral movement matrix for a single transmission to calculate the viral movement after G generations via matrix multiplication. This approach integrates over all possible pathways that link two locations. To specifically incorporate observation processes, we consider the probability of sequencing an infection at each space–time unit.
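As a minimal illustration of this step, the sketch below raises a single-generation matrix to the power G. It assumes, for simplicity, that the same matrix applies at every generation; in the full model the matrix varies with time and serotype, so a product of generation-specific matrices would be needed instead. The matrix values and the function name are hypothetical.

```python
import numpy as np

def movement_after_generations(single_step, generations):
    """Location-to-location probabilities after `generations` steps,
    assuming the same single-generation matrix applies at each step."""
    single_step = np.asarray(single_step, dtype=float)
    return np.linalg.matrix_power(single_step, generations)

# Hypothetical three-location single-generation matrix (rows sum to 1).
single_step = np.array([[0.90, 0.08, 0.02],
                        [0.10, 0.85, 0.05],
                        [0.05, 0.15, 0.80]])
after_20 = movement_after_generations(single_step, 20)
```

Element [a, b] of `after_20` sums the probability of every 20-step pathway from a to b, including pathways that pass through locations with no sampled sequences.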
To inform this model, we use an integrative approach that brings in detailed data from mobile phone operators that capture how people move and interact with each other, maps that estimate how populations are distributed, maps on vector suitability and the long-term spatiotemporal distribution of serotypes.
Notation
For a pair of cases, C_{A} and C_{B}, in a phylogenetic tree: C_{A} has home location L_{A}, was sick at time T_{A} and has sequence Seq_{A}; case C_{B} has home location L_{B}, was sick at time T_{B} and has sequence Seq_{B}. The time of the MRCA between C_{A} and C_{B} is T_{m} and the location of the MRCA is L_{m}. We denote G_{A} and G_{B} the number of transmission generations that separate C_{A} and C_{B} from their MRCA. Obs_{LiTA} is 1 if a sequence was observed at location Li at time T_{A} and 0 otherwise. Obs_{LiTB} is defined in a similar manner for time T_{B}.
The single-transmission generation matrix
Initially let us consider a single-transmission generation. The probability that two individuals, i and j, are in the same location and are in contact (via a mosquito) given i lives in location a and j lives in location b can be written down as
where P(V_{i} = k | L_{i} = a) is the probability of individual i, whose home location is in a, visiting location k, P(V_{j} = k | L_{j} = b) is the probability that individual j also visits location k, and β_{k} is the location-specific probability of transmission.
At time τ, one infector i that lives in location a is expected to transmit to the following number of persons living in location b:
where S_{b,τ,ser} is the number of people susceptible to serotype ser living in location b at time τ. The total expected number of persons infected by infector i is:
Conditional on transmission occurring, the probability that the infectee has a home location in cell b is the ratio of these terms:
We can create an N × N transmission matrix, ∏_{τ,ser,gen = 1}, where N is the total number of locations, that sets out the transmission probabilities between all pairs of locations at a point in time for a single transmission. The element [a,b] of the matrix is π_{a,b,τ,ser}.
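The construction of this matrix can be sketched as below. This is our reading of the verbal description (contact probability summed over visited locations k, scaled by the number of susceptibles in the infectee's home location, then normalised per infector location), not the authors' implementation; the function name, argument names and numerical values are hypothetical.

```python
import numpy as np

def transmission_matrix(move_infector, move_susceptible, beta, susceptibles):
    """Single-generation matrix: element [a, b] is the probability that the
    infectee lives in b, given an infector who lives in a."""
    move_infector = np.asarray(move_infector, dtype=float)        # P(V_i = k | L_i = a)
    move_susceptible = np.asarray(move_susceptible, dtype=float)  # P(V_j = k | L_j = b)
    beta = np.asarray(beta, dtype=float)                          # beta_k
    susceptibles = np.asarray(susceptibles, dtype=float)          # S_b
    # contact[a, b] = sum_k P(V_i = k | a) * P(V_j = k | b) * beta_k
    contact = (move_infector * beta) @ move_susceptible.T
    # Expected infectees living in b, for one infector living in a.
    expected = contact * susceptibles
    # Condition on transmission occurring: normalise each row.
    return expected / expected.sum(axis=1, keepdims=True)

# Hypothetical two-location example: infectors are at home 96% of the time,
# susceptible individuals 76% of the time.
move_infector = [[0.96, 0.04], [0.04, 0.96]]
move_susceptible = [[0.76, 0.24], [0.24, 0.76]]
pi = transmission_matrix(move_infector, move_susceptible,
                         beta=[1.0, 1.0], susceptibles=[1000, 1000])
```

In this toy example, the infectee lives outside the infector's home location far more often than the infector's own time away from home would suggest, because susceptible individuals can travel to the infector.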
Characterising human mobility
We use mobile phone data to characterise human mobility. Initially, we extract a matrix from the CDR data that sets out the probability that an individual who lives in location a visits location k.
CDRs come from adults, whereas dengue is concentrated in children, who are potentially more likely to spend more of their time at home. In addition, sick individuals may travel differently than healthy individuals and spend more time at home. In this way, the mobility of infectors may differ from that of the susceptible population.
To allow for different periods at home for susceptible individuals compared to those in the CDR data and to assess whether there exists differential mobility by illness status, we incorporate separate parameters for the probability of being at home for infectors and the susceptible population (θ_{infector.home}, θ_{population.home}) to reflect the additional time at home compared to that extracted from the CDR data.
For the home cells of infected individuals:
For the home cells of the rest of the population:
In each case, for non-home cells, we rescale the probabilities so that the sum of all movements remains equal to 1.
where CDR[a,k] reflects the movement probabilities from the CDR data.
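The rescaling of non-home cells can be illustrated with a simplified sketch. It assumes the home-cell probability is set directly to a hypothetical value `p_home`, whereas the model itself estimates parameters that express additional time at home relative to the CDR data; that exact parameterisation is not reproduced here, and the function name and matrix values are ours.

```python
import numpy as np

def set_home_probability(cdr, p_home):
    """Fix the probability of being in the home cell at `p_home` and rescale
    the non-home cells proportionally so that each row still sums to 1."""
    cdr = np.asarray(cdr, dtype=float)
    away = cdr.copy()
    np.fill_diagonal(away, 0.0)
    away_total = away.sum(axis=1, keepdims=True)
    if np.any(away_total <= 0):
        raise ValueError("each row needs some movement away from home")
    adjusted = away / away_total * (1.0 - p_home)
    np.fill_diagonal(adjusted, p_home)
    return adjusted

# Hypothetical CDR-derived matrix for three cells.
cdr = np.array([[0.42, 0.38, 0.20],
                [0.30, 0.50, 0.20],
                [0.25, 0.25, 0.50]])
infector_mobility = set_home_probability(cdr, 0.96)
```

The relative attractiveness of the non-home destinations is preserved; only the split between home and away changes.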
As the sum of movements to the destinations in the matrix is equal to 1, we are assuming that the spatial unit of analysis contains all possible mobility (of the virus and people). It has previously been shown that the dengue epidemic in Thailand is self-sustaining with few external introductions^{3}. Applications of this approach to small spatial units should consider that some mobility may be missed.
Dengue transmission is more likely to occur during daylight hours, due to the feeding behaviour of Aedes mosquitoes. Therefore, it would be optimal to use CDR data from daylight hours only. However, as is often the case, our CDR data represent an aggregate from all hours of the day. Nevertheless, as the majority of cell phone calls are made during daylight hours (it has been estimated that 67% of calls are between 8 am and 7 pm)^{27,28}, it is reasonable to assume that this estimate is largely representative of daytime mobility.
Factors affecting transmission
As there may be factors that allow for different probabilities of transmission across locations, we allowed for differential probability of infection by location based on the mosquito presence in that location
where parameter γ_{1} is to be estimated, mosq represents the estimated suitability for Aedes aegypti mosquitoes in location k. D_{1} is a proportionality constant (which gets cancelled out).
Factors affecting susceptibility
The number of susceptible individuals living in a location will depend on the level of historic infection in that location, in a potentially time-varying serotype-specific manner.
where het_{k,τ,ser} is the incidence of cases caused by different serotypes in the two prior years in location k, homo_{k,τ,ser} is the incidence of cases caused by the same serotype in the two prior years in location k, and incidence_{k} is the incidence of all cases over the study period in location k. We choose a window of 2 years to define recent immunity as serotype-specific incidence has previously been shown to be spatially correlated over this time range, presumably due to serotype-specific local herd immunity^{9}. For the Bangkok analyses, we use the serotype-specific geolocated case data. For the nationwide analysis, we use the national reporting system from the Ministry of Public Health (MOPH). The national MOPH system is not serotype-specific; however, a small number of cases from all around the country are serotyped each year by the national reference centre in Bangkok. To obtain serotype-specific incidence estimates for each year, we multiplied the proportion of cases that came from each serotype within each province-year by the overall number of cases for each province-year. Where there were no serotyped cases for a province-year, we used the closest province where there were serotyped cases.
Probability of virus being within each location after G transmission generations
To calculate the probability of the home location being within location k after G transmission generations, we can use matrix multiplication that integrates over all possible pathways connecting two locations
where t_{l} is the time of generation G_{l}.
Probability of observing a pair of cases in two specific locations
Conditional on sequences being observed in location L_{A} at time T_{A} and L_{B} at time T_{B}, the probability that C_{A} has home location L_{A} and C_{B} has home location L_{B} can be written down as
We can consider that the location of the two cases is dependent on the location of their MRCA and the number of transmission generations that separate them from the MRCA.
We consider that the observation processes across locations are independent of each other. In addition, each transmission event is considered independent of other transmission events. The probability of observing a case at location L_{i} at time T_{A} does not depend on the location of the MRCA or the number of generations separating the case from the MRCA. We can also substitute Eq. (12) into Eq. (11). Finally, we consider discretized space: either 337 1 × 1-km grid cells throughout central Bangkok or the 76 provinces of Thailand.
Equation (11) therefore becomes
Probability of G generations between the MRCA and a case
We can extract the joint probability that case C_{A} is separated from the MRCA by G_{A} transmission generations and case C_{B} is separated from the same MRCA by G_{B} transmission generations using the generation time distribution for dengue and the time-resolved phylogenetic tree.
If we assume that the generation time distribution is gamma distributed with parameters α_{G} and β_{G} and that all transmission events are independent of each other, the sum of g generation times is also gamma distributed with parameters gα_{G} and β_{G}. In addition, from a genealogy, R_{i}, we can extract the evolutionary time, E_{A}, separating C_{A} from the MRCA and E_{B}, separating C_{B} from the MRCA. We can therefore estimate the probability of g transmission events over many trees as follows:
This approach allows us to incorporate uncertainty in the phylogeny, including uncertainty in the evolutionary parameters and tree structure.
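The following sketch illustrates one way to turn an evolutionary time into a distribution over the number of transmission generations, using the fact that the sum of g gamma-distributed generation times is gamma distributed with shape gα_{G} and rate β_{G}. The flat prior over g, the cap `max_gen`, and the numerical values are assumptions made for illustration only.

```python
import math


def gamma_pdf(x, shape, rate):
    """Density of a gamma distribution with the given shape and rate at x > 0."""
    log_pdf = (shape * math.log(rate) - math.lgamma(shape)
               + (shape - 1) * math.log(x) - rate * x)
    return math.exp(log_pdf)


def generation_probs(evol_time, alpha_g, beta_g, max_gen=100):
    """P(g | evol_time) for g = 1..max_gen, assuming a flat prior on g."""
    weights = [gamma_pdf(evol_time, g * alpha_g, beta_g)
               for g in range(1, max_gen + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def average_over_trees(evol_times, alpha_g, beta_g, max_gen=100):
    """Average per-tree distributions to carry phylogenetic uncertainty forward."""
    per_tree = [generation_probs(e, alpha_g, beta_g, max_gen) for e in evol_times]
    return [sum(column) / len(per_tree) for column in zip(*per_tree)]


# Hypothetical values: evolutionary times in days from three posterior trees,
# and a generation time with mean alpha_g / beta_g = 20 days.
probs = average_over_trees([190.0, 200.0, 215.0], alpha_g=4.0, beta_g=0.2)
most_likely_g = 1 + max(range(len(probs)), key=probs.__getitem__)
```

Averaging over trees sampled from the posterior is what propagates uncertainty in branch lengths and topology into the generation counts.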
As any spatial signal will be heavily diluted after many transmission generations, and to optimise computational performance, we restrict our analyses to pairs where the mean estimated number of transmission generations is <25. We performed a sensitivity analysis where this was extended to 40 generations, with very similar results (Supplementary Fig. 3).
Observation probability (P(Obs_{L_{i},T_{A}}))
We cannot know the true number of infections occurring within each space–time unit. Given the long-term endemicity of dengue in the region, we assume that the number of infections will be approximately proportional to the size of the population within each location. Therefore, the probability of observation (the probability of sequencing the virus causing an infection event) at location k at time point t is approximately proportional to the number of sequenced viruses from that year and location for that serotype divided by the size of the population in that location.
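A minimal sketch of this per-capita weighting for a single year and serotype is shown below. The rescaling so that the weights sum to 1 is an assumption (the text only requires proportionality), and the location names and counts are hypothetical.

```python
def observation_weights(sequence_counts, population):
    """Relative observation probability per location.

    Computed as sequences per capita, then rescaled to sum to 1
    (the rescaling is an illustrative choice). Requires at least one sequence.
    """
    per_capita = {loc: sequence_counts.get(loc, 0) / population[loc]
                  for loc in population}
    total = sum(per_capita.values())
    return {loc: value / total for loc, value in per_capita.items()}


# Hypothetical counts of sequenced viruses and population sizes.
counts = {"A": 10, "B": 5}
pop = {"A": 100000, "B": 25000, "C": 50000}
weights = observation_weights(counts, pop)
```

Location B receives twice the weight of location A despite having fewer sequences, because its sequences were drawn from a population a quarter of the size.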
We conducted a subanalysis where we assumed unbiased observation. In this analysis, we assumed that the probability of observation was 1 across all space–time locations and obtained very different results (Supplementary Fig. 6).
We further assessed the performance of this approach using a simulation model that imposed a heavily biased observation process. Our inference framework was able to correctly identify all parameters (Supplementary Fig. 7, see below for simulation model details).
The location of the MRCA (P(L_{m}))
The location of the MRCA for each pair of cases will depend on the long-term history of dengue in the communities, which cannot be estimated using the presented approach. Instead, we assume that P(L_{m}) is proportional to the size of the population in that location. To assess the sensitivity of our results to this assumption, we conducted a separate analysis where P(L_{m}) was assumed to be the same across all locations, with identical results (Supplementary Fig. 4). This suggests that we do not need to probabilistically assess where the start point is for the MRCA that links two cases in a phylogeny.
Likelihood
We can calculate the likelihood using all pairs of available sequenced viruses as follows:
where n_{ser} is the number of sequences available from serotype ser.
Identifying the maximum likelihood estimate
We use a maximum likelihood approach to estimate the parameters linked to mobility, transmission and susceptibility (θ_{home.sick}, θ_{home.population}, β_{1}, β_{2}, β_{3}, γ_{1}). We identify the maximum likelihood estimate using an unconstrained nonlinear quasi-Newton optimisation approach^{29}.
In order to incorporate uncertainty, we use a bootstrapping approach where we randomly sample all the available sequences with replacement over 100 iterations and recalculate the maximum likelihood estimate for each parameter each time. The 95% confidence intervals are then calculated using the mean and the standard deviation of the resulting distribution, assuming that they follow a normal distribution.
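The bootstrap procedure can be sketched as follows. The `estimator` argument stands in for the full maximum likelihood fit, which is not reproduced here; the sample mean is used purely as a placeholder, and the function name and seed are hypothetical.

```python
import random
import statistics


def bootstrap_ci(data, estimator, n_boot=100, seed=0):
    """Normal-approximation 95% interval from bootstrap re-estimates.

    `estimator` maps a resampled dataset to a single parameter estimate.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # Resample the data with replacement, keeping the original size.
        resample = [rng.choice(data) for _ in data]
        estimates.append(estimator(resample))
    centre = statistics.mean(estimates)
    spread = statistics.stdev(estimates)
    return centre - 1.96 * spread, centre + 1.96 * spread


# Placeholder data and estimator (the sample mean), for illustration only.
values = list(range(1, 51))
lower, upper = bootstrap_ci(values, statistics.mean)
```

The interval is built from the mean and standard deviation of the bootstrap estimates, as described above, rather than from their empirical percentiles.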
Using fitted values to estimate patterns of viral flow at each transmission generation
Once we have fitted values for the parameters, we can calculate the \({\Pi}_{\tau ,{\mathrm{ser}},{\mathrm{gen}}}\) matrix for each month between 1994 and 2012, each serotype and each transmission generation. From these matrices, we can extract the probability that the virus is in each location given a specified serotype, location and time of introduction and number of generations. From these matrices, we calculate the cumulative distribution function of the distance between where the virus started and where it is after different numbers of generations, averaging over time and serotype. We compare this to the cumulative distribution function of how far cases are from their home at any time. This highlights that viral mobility requires both movement of cases and the susceptible population.
We also calculate the mean proportion of times that the most likely (non-home) destination is the same across serotypes. This allows us to assess whether viruses across the serotypes take the same routes or whether serotype-specific immunity changes the most likely pathways.
Model fit
In order to assess the model fit, we perform held-out validation. In Bangkok, we remove all sequences from a randomly chosen 10% of locations, refit the model and then estimate the probability of observing sequences in the locations not included in the model fitting process. For the nationwide analysis, as we have fewer locations with sequences available, we undertake the same process but hold out a single province in turn.
For incremental windows of probability between 0 and 1, we identify all location-years within the held-out locations where the predicted probability of observing a virus fell within that window. We then calculate the mean proportion of times a sequence was observed within those identified location-years.
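The windowed comparison of predicted and observed frequencies can be sketched as below. The number of windows, the function name and the example values are assumptions; the source does not state the window width.

```python
def calibration_by_bin(predicted, observed, n_bins=10):
    """Mean observed frequency within equal-width windows of predicted probability.

    Returns one value per window, or None where no location-year falls in it.
    """
    totals = [0] * n_bins
    hits = [0] * n_bins
    for p, seen in zip(predicted, observed):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last window
        totals[index] += 1
        hits[index] += int(seen)
    return [hits[i] / totals[i] if totals[i] else None for i in range(n_bins)]


# Hypothetical held-out location-years: predicted probability and whether
# a sequence was actually observed there.
predicted = [0.05, 0.08, 0.55, 0.52, 0.95]
observed = [False, False, True, False, True]
calibration = calibration_by_bin(predicted, observed)
```

For a well-calibrated model, the observed frequency in each window should lie close to the window's midpoint.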
Estimates of spread using a transmission simulation
Using the fitted parameter values, we conduct a forward simulation at both the Bangkok and province levels. Taking each month between 1994 and 2012 and each serotype in turn, we apply the following algorithm:

(i) Randomly introduce a single infection in one location, where all locations have the same probability of being the source.
(ii) Generate daughter infections from the index using a random draw from a Poisson distribution with mean R_{eff} (representing the effective reproductive number).
(iii) Identify the location for each daughter infection using a random draw where the probability of each location is taken from the Π_{τ,ser,gen} matrix.
(iv) Repeat (ii) and (iii) for 20 generations.
(v) Repeat (i)–(iv) 50 times.
For each iteration, we calculate the average number of locations that have had at least one infection at each generation, averaged over all time points and all serotypes.
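The forward simulation can be sketched as a branching process over locations. This is an illustrative simplification: it uses a single one-generation transition matrix for every generation, a hypothetical three-location matrix, and a `max_cases` cap (not part of the described algorithm) that truncates very large generations to bound run time.

```python
import math
import random


def poisson_draw(rng, mean):
    """Knuth's method for a Poisson draw; adequate for small means such as R_eff."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_spread(transition, r_eff, n_generations=20, rng=None, max_cases=10000):
    """Return the cumulative number of locations reached after each generation."""
    rng = rng or random.Random()
    n_locations = len(transition)
    current = [rng.randrange(n_locations)]  # uniform choice of index location
    visited = set(current)
    cumulative = []
    for _ in range(n_generations):
        offspring = []
        for source in current:
            n_daughters = poisson_draw(rng, r_eff)  # daughter infections
            if n_daughters:
                # Place each daughter using the row of the source location.
                offspring.extend(rng.choices(range(n_locations),
                                             weights=transition[source],
                                             k=n_daughters))
        current = offspring[:max_cases]
        visited.update(current)
        cumulative.append(len(visited))
    return cumulative


# Hypothetical three-location transition matrix; 50 repeats as in the text.
toy_matrix = [[0.90, 0.08, 0.02], [0.10, 0.85, 0.05], [0.05, 0.05, 0.90]]
rng = random.Random(1)
runs = [simulate_spread(toy_matrix, r_eff=1.3, rng=rng) for _ in range(50)]
mean_reached = [sum(column) / len(runs) for column in zip(*runs)]
```

Chains that die out simply stop contributing new locations, so the cumulative count for that run stays flat for the remaining generations.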
To assess the impact of mobility patterns and immunity on the number of locations affected, we repeated the analysis with the following adjustments:
Scenario a: All initial introductions were in the most connected location only (defined as the location with the lowest probability of staying within your home location).
Scenario b: All initial introductions were in the least connected location only (defined as the location with the highest probability of staying within your home location).
Scenario c: No difference in the mobility of the cases as compared to the susceptible population. This was achieved by forcing the \(\theta _{{\mathrm{infector.}}\,{\mathrm{home}}}\) parameter to be the same as the fitted \(\theta _{{\mathrm{population.}}\,{\mathrm{home}}}\) parameter.
Scenario d: Susceptible population mobility is equal to that of the adult population. This was achieved by forcing the \(\theta _{{\mathrm{population.}}\,{\mathrm{home}}}\) and \(\theta _{{\mathrm{infector.}}\,{\mathrm{home}}}\) parameters to be zero.
Scenario e: No impact of immunity. This was achieved by forcing the β_{1} and β_{2} parameters to be zero.
For each scenario, we calculated the proportion of locations affected at each generation and the relative number of locations affected compared to the base model. We also conducted sensitivity analyses where the R_{eff} was varied from 1.3 to 1.1 and 1.6.
Age-mixing model
We use an equivalent approach to characterise the age-dependent mixing of the population. Instead of considering the probability of the virus transitioning between two locations, we consider the probability of transitioning between individuals of two ages.
We can consider that the age of the two cases is dependent on the age of the MRCA and the number of transmission generations that separate them from the MRCA.
We consider that the observation processes across ages are independent of each other. In addition, each transmission event is considered independent of other transmission events. The probability of observing a case at age Age_{i} at time T_{A} does not depend on the age of the MRCA or the number of generations separating the case from the MRCA. We can also substitute Eq. (12) into Eq. (11). Finally, we consider discretized ages.
The age transition matrix
Initially, let us again consider a single transmission generation. We assume that the age-specific susceptibility of the population is stable over time and that the probability of exposure does not differ by age group or serotype. The probability that two individuals, i and j, of ages Age_{i} and Age_{j} are in contact (via a mosquito) can be written down as
where β_{a,b} is the age-specific probability of contact between individuals of ages a and b.
The expected number of infected persons coming from individuals of age b conditional on an infector, i, being of age a is
where S_{b} is the number of susceptible people of age b.
The expected number of infected persons coming from individuals of all ages, conditional on an infector, i, being of age a is
Conditional on one transmission generation where the infector i is of age a, the probability of the infectee with age b is, therefore, the ratio of these terms. We define this probability as ϕ_{a,b}.
We can create an N_{age} × N_{age} transmission matrix, Φ, that sets out the transmission probabilities between all ages for a single transmission where element [a,b] of the matrix is ϕ_{a,b}. We use a maximum age of 70 years.
Age contact matrix
To parametrically characterise the age contact matrix, we use a discretised exponential decay parameter, θ_{age}, that captures the probability that two people interact, as a function of the difference in their ages, such that \(\beta _{a,b} = f( {a - b;\theta _{{\mathrm{age}}}} )\).
Age-specific susceptibility
To characterise the susceptibility, we assume that the number of susceptible people of age a is equal to \(N_{{\mathrm{age}}_a} \cdot {\mathrm{exp}}( - \lambda \cdot {\mathrm{age}}_a)\), where \(N_{{\mathrm{age}}_a}\) is the number of people of age a in the national census and the force of infection, λ, is assumed to be 0.04^{30}. We conduct a sensitivity analysis where the force of infection is varied to 0.02 and 0.06 with unchanged results (Supplementary Fig. 5).
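The age contact matrix, the susceptible population and the resulting age transition matrix Φ can be sketched together as follows. The specific decay form exp(−θ_{age}·|a − b|), normalised by row, is an assumption standing in for the unspecified function f, and the flat census is hypothetical.

```python
import math


def contact_matrix(ages, theta_age):
    """Assumed form: contact weight decays exponentially with the absolute age gap."""
    raw = [[math.exp(-theta_age * abs(a - b)) for b in ages] for a in ages]
    return [[value / sum(row) for value in row] for row in raw]


def susceptible_counts(census, ages, foi=0.04):
    """Susceptibles of age a: census count times exp(-foi * a)."""
    return [census[a] * math.exp(-foi * a) for a in ages]


def age_transition_matrix(beta, susceptibles):
    """phi[a][b] = beta[a][b] * S[b] / sum over c of beta[a][c] * S[c]."""
    phi = []
    for row in beta:
        weighted = [b_ab * s_b for b_ab, s_b in zip(row, susceptibles)]
        total = sum(weighted)
        phi.append([w / total for w in weighted])
    return phi


# Hypothetical flat census of 1000 people per single year of age, ages 1 to 70.
ages = list(range(1, 71))
census = {a: 1000 for a in ages}
beta = contact_matrix(ages, theta_age=0.2)
phi = age_transition_matrix(beta, susceptible_counts(census, ages))
uniform_beta = contact_matrix(ages, theta_age=0.0)
```

Setting θ_{age} to zero in this sketch recovers the uniform-contact case in which every entry of the contact matrix equals 1/70.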
Observation probability (P(Obs_{Age_{i},T_{A}}))
We assume that the probability of observing a case of age a is proportional to the number of sequenced viruses of that age for that serotype divided by the estimated size of the susceptible population of that age (S_{a}).
Likelihood for the age model
We can calculate the likelihood using all pairs of available sequenced viruses as follows:
where n_{ser} is the number of sequences available from serotype ser.
We use a maximum likelihood approach to estimate the parameter θ_{age}. In order to incorporate uncertainty, we use a bootstrapping approach where we randomly sample all the available sequences with replacement over 100 iterations and recalculate the maximum likelihood estimate for each parameter each time. The 95% confidence intervals are then calculated using the mean and the standard deviation of the resulting distribution, assuming that they follow a normal distribution.
Once we have fitted the θ_{age} parameter, we calculate from the matrix of ϕ_{a,b} the proportion of transmissions that occur between individuals whose ages differ by <2 years. We use an equal weight for the age of the infector across all ages and only consider individuals between the ages of 1 and 15, as they represent the majority of the susceptible population. We compare this to the scenario where all individuals have the same probability of contact, irrespective of age (i.e., β_{a,b} = 1/70, for all a and all b), which is the minimum possible value.
Simulation study
In order to ensure that our model is able to correctly identify parameters, we built a simulation framework with known parameters using 50 randomly selected grid cells from the Bangkok dataset where the population size, the historic incidence and the probability of between-cell movement were taken from the observed data. As the observed heterogeneity in mosquito presence was limited in Bangkok (Supplementary Fig. 2), we simulated mosquito presence in each location using a uniform distribution between 0 and 1. For the recent heterotypic and homotypic cases, we used a randomly selected time point from the observed distributions of cases and assumed all cases came from serotype DENV1.
We fixed the parameter values as follows:

Additional time being at home for a susceptible population compared to adults (\(\theta _{{\mathrm{susceptible.}}\,{\mathrm{home}}}\)) (logit scale): −0.5

Additional time being at home for cases compared to adults (\(\theta _{{\mathrm{infector.}}\,{\mathrm{home}}}\)) (logit scale): −0.05

Mosquito exponent: 1.0

All incidence exponents: 0.5

Recent heterotypic incidence exponent: 0.3

Recent homotypic incidence exponent: −0.3
We then calculated the \({\Pi}_{\tau ,{\mathrm{ser}},{\mathrm{gen}}}\) transmission matrix using these known parameters and simulated transmission events using the following algorithm:

1. Randomly select a starting location (H_{0}) by randomly choosing a location, weighted by the population in that location. This will represent the MRCA case (C_{0}) between the two observed viruses.
2. Draw the number of generations (g) between the MRCA and one of the observed isolates, where the number of generations is between 15 and 19 generations and the probability of 15 generations is 0.1, 16 generations is 0.2, 17 generations is 0.4, 18 generations is 0.2, 19 generations is 0.1.
3. For case C_{0} identify where they will transmit to (H_{1}) using a random draw with the probabilities of each destination location coming from the H_{0} row of the \({\Pi}_{\tau ,{\mathrm{ser}},{\mathrm{gen}}}\) matrix.
4. Repeat step (3) g times using the destination of the previous step as the start location each time.
5. Repeat steps 1–4 2000 times to generate 2000 pairs of cases.
6. We assumed that the probability of observing (i.e., sequencing) the virus from a case was unequal across locations. The probability of observing a case at a location (ρ_{l}) is taken from a random uniform distribution (U(0,1)). We randomly select 500 pairs of cases where the probability of observation of each pair is the product of the probability of observation at each of the two locations.
Using the observed pairs, we then used our framework to estimate the parameters of the model. We repeated the simulation 50 times and report the mean and 2.5 and 97.5 percentiles of the distribution for each parameter estimate.
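A sketch of the pair-generation and biased-observation steps is given below. It assumes that each of the two cases in a pair receives its own independent draw of g and its own chain from the shared MRCA location, and it selects the observed pairs by weighted sampling without replacement (Efraimidis–Spirakis keys); both are illustrative choices, as is the three-location toy matrix.

```python
import random

# Generation-count distribution stated in step 2.
GENERATION_PROBS = {15: 0.1, 16: 0.2, 17: 0.4, 18: 0.2, 19: 0.1}


def walk(transition, start, n_steps, rng):
    """Follow a transmission chain for n_steps generations from `start`."""
    location = start
    for _ in range(n_steps):
        location = rng.choices(range(len(transition)),
                               weights=transition[location], k=1)[0]
    return location


def simulate_pairs(transition, population, n_pairs, rng):
    """Simulate home locations for pairs of cases that share an MRCA."""
    gens, gen_weights = zip(*GENERATION_PROBS.items())
    pairs = []
    for _ in range(n_pairs):
        # MRCA location drawn in proportion to population size.
        mrca = rng.choices(range(len(population)), weights=population, k=1)[0]
        ends = []
        for _ in range(2):  # one chain per case in the pair (assumption)
            g = rng.choices(gens, weights=gen_weights, k=1)[0]
            ends.append(walk(transition, mrca, g, rng))
        pairs.append(tuple(ends))
    return pairs


def observe_pairs(pairs, n_locations, n_observed, rng):
    """Keep n_observed pairs, favouring pairs from well-observed locations."""
    rho = [rng.random() for _ in range(n_locations)]  # rho_l ~ U(0, 1)
    keyed = []
    for pair in pairs:
        weight = rho[pair[0]] * rho[pair[1]]
        key = rng.random() ** (1.0 / weight) if weight > 0 else 0.0
        keyed.append((key, pair))
    keyed.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in keyed[:n_observed]]


# Hypothetical three-location setting.
toy_matrix = [[0.90, 0.08, 0.02], [0.10, 0.85, 0.05], [0.05, 0.05, 0.90]]
toy_population = [5000, 20000, 10000]
rng = random.Random(42)
pairs = simulate_pairs(toy_matrix, toy_population, 2000, rng)
observed = observe_pairs(pairs, 3, 500, rng)
```

The observed subset over-represents locations with high ρ_{l}, which is the bias that the inference framework has to correct for.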
To assess the importance of incorporating sampling bias in our estimates, we repeated the inference on our simulated data but assumed that all space–time locations had the same, equal chance of being observed.
SMILI data and analysis
In order to understand the consistency of our estimated differences in human mobility between susceptible individuals and adults, we used data from the Social Mixing for Influenza-Like Illness (SMILI) project. This project asked 2011 individuals about their mobility patterns. Here we used the responses to the question ‘what is the farthest distance you have travelled within the last 7 days?’. We dichotomised the results into those that had not travelled >1 km and those that had travelled >1 km.
To reconstruct the probability that a susceptible individual had not travelled more than 1 km within the last 7 days, we used data on the population size from the 2010 census and assumed a constant force of infection (foi) of 0.04 per year. Using a catalytic model, we can calculate the probability that an individual of age a has never been infected by dengue as pnaive_{a} = exp(−4*a*foi). The probability of being monotypically immune (i.e., being infected by one of the serotypes but still susceptible to one of the other ones) is pmono_{a} = exp(−3*foi*a)*(1 − exp(−foi*a)). The probability of being susceptible is the sum psus_{a} = pnaive_{a} + pmono_{a} (Supplementary Fig. 3B). To calculate the average probability that a susceptible individual has not travelled further than 1 km from their home within the last 7 days, we take a weighted average across all ages:
where pTravel_{a} is the proportion of individuals of age a that have not travelled more than 1 km within the last 7 days and pPop_{a} is the proportion of the population that is of age a.
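The catalytic-model quantities and the weighted average can be sketched as follows. The weighting of each age by pPop_{a} × psus_{a} is an assumption about the form of the average, and the two-age example values are hypothetical.

```python
import math


def p_susceptible(age, foi=0.04):
    """psus_a = pnaive_a + pmono_a under the catalytic model described above."""
    p_naive = math.exp(-4 * age * foi)
    p_mono = math.exp(-3 * foi * age) * (1 - math.exp(-foi * age))
    return p_naive + p_mono


def susceptible_weighted_mean(p_travel, p_pop, foi=0.04):
    """Average of pTravel_a, weighting each age by pPop_a * psus_a (assumed form)."""
    weights = {age: p_pop[age] * p_susceptible(age, foi) for age in p_pop}
    numerator = sum(p_travel[age] * w for age, w in weights.items())
    return numerator / sum(weights.values())


# Hypothetical two-age example: proportion not travelling >1 km, by age.
p_travel = {5: 0.8, 30: 0.4}
p_pop = {5: 0.5, 30: 0.5}
mean_not_travelled = susceptible_weighted_mean(p_travel, p_pop)
```

With these definitions, pnaive_{a} + pmono_{a} simplifies algebraically to exp(−3·foi·a), so the weighted average is dominated by the youngest ages.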
To calculate the average probability that an adult has not travelled further than 1 km from their home within the last 7 days, we use a similar approach where we calculate a weighted average across all individuals that are over 15 years of age (the mobility data are available in 5-year increments).
Ethical approval
This study was approved by the ethical review boards of Queen Sirikit National Institute of Child Health, and Walter Reed Army Institute of Research and the University of Florida. Case data were obtained from the results of standard confirmatory testing for dengue and therefore did not require informed consent.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
All data used in the analyses are available on Zenodo (https://doi.org/10.5281/zenodo.4543279). In addition, GenBank references for the sequences are available in Supplementary Data 1. LandScan data are available from https://landscan.ornl.gov/landscan-datasets.
Code availability
R code used for the analyses is available on Zenodo (https://doi.org/10.5281/zenodo.4543279).
References
1. Halstead, S. B. Dengue. Lancet 370, 1644–1652 (2007).
2. Van Panhuis, W. G. et al. Region-wide synchrony and traveling waves of dengue across eight countries in Southeast Asia. Proc. Natl Acad. Sci. USA 112, 13069–13074 (2015).
3. Salje, H. et al. Dengue diversity across spatial and temporal scales: local structure and the effect of host population size. Science 355, 1302–1306 (2017).
4. Shepard, D. S., Undurraga, E. A., Halasa, Y. A. & Stanaway, J. D. The global economic burden of dengue: a systematic analysis. Lancet Infect. Dis. 16, 935–941 (2016).
5. Dudas, G. et al. Virus genomes reveal factors that spread and sustained the Ebola epidemic. Nature 544, 309–315 (2017).
6. Faria, N. R. et al. Genomic and epidemiological monitoring of yellow fever virus transmission potential. Science 361, 894–899 (2018).
7. Harrington, L. C. et al. Dispersal of the dengue vector Aedes aegypti within and between rural communities. Am. J. Trop. Med. Hyg. 72, 209–220 (2005).
8. Kitron, U., Elder, J. P., Barker, C. M. & Perkins, T. A. Dengue illness impacts daily human mobility patterns in Iquitos, Peru. PLoS Negl. Trop. Dis. 13, e0007756 (2019).
9. Salje, H. et al. Revealing the microscale spatial signature of dengue transmission and immunity in an urban population. Proc. Natl Acad. Sci. USA 109, 9535–9538 (2012).
10. Nisalak, A. et al. Forty years of dengue surveillance at a Tertiary Pediatric Hospital in Bangkok, Thailand, 1973–2012. Am. J. Trop. Med. Hyg. 94, 1342–1347 (2016).
11. Bouckaert, R. et al. BEAST 2: a software platform for Bayesian evolutionary analysis. PLoS Comput. Biol. 10, e1003537 (2014).
12. Kraemer, M. U. G. et al. The global distribution of the arbovirus vectors Aedes aegypti and Ae. albopictus. eLife 4, e08347 (2015).
13. Perkins, T. A. et al. Calling in sick: impacts of fever on intra-urban human mobility. Proc. Biol. Sci. 283, 20160390 (2016).
14. Salje, H. et al. Nationally-representative serostudy of dengue in Bangladesh allows generalizable disease burden estimates. eLife 8, e42869 (2019).
15. Metcalf, C. J. E. et al. Use of serological surveys to generate key insights into the changing global landscape of infectious disease. Lancet 388, 728–730 (2016).
16. Ruchusatsawat, K. et al. Long-term circulation of Zika virus in Thailand: an observational study. Lancet Infect. Dis. https://doi.org/10.1016/S1473-3099(18)30718-7 (2019).
17. Dobson, J. E., Bright, E. A., Coleman, P. R., Durfee, R. C. & Worley, B. A. LandScan: a global population database for estimating populations at risk. Photogramm. Eng. Remote Sens. 66, 849–857 (2000).
18. Kiang, M. V. et al. Incorporating human mobility data improves forecasts of dengue fever in Thailand. Sci. Rep. 11, 923 (2021).
19. Kraemer, M. U. G. et al. Past and future spread of the arbovirus vectors Aedes aegypti and Aedes albopictus. Nat. Microbiol. 4, 854–863 (2019).
20. Salje, H. et al. How social structures, space, and behaviors shape the spread of infectious diseases using chikungunya as a case study. Proc. Natl Acad. Sci. USA 113, 13420–13425 (2016).
21. Rudolph, K. E., Lessler, J., Moloney, R. M., Kmush, B. & Cummings, D. A. T. Incubation periods of mosquito-borne viral infections: a systematic review. Am. J. Trop. Med. Hyg. 90, 882–891 (2014).
22. Vaughn, D. W. et al. Dengue viremia titer, antibody response pattern, and virus serotype correlate with disease severity. J. Infect. Dis. 181, 2–9 (2000).
23. Clements, A. N. & Paterson, G. D. The analysis of mortality and survival rates in wild populations of mosquitoes. J. Appl. Ecol. 18, 373–399 (1981).
24. Chan, M. & Johansson, M. A. The incubation periods of dengue viruses. PLoS ONE 7, e50972 (2012).
25. Sohpal, V. K., Dey, A. & Singh, A. MEGA biocentric software for sequence and phylogenetic analysis: a review. Int. J. Bioinform. Res. Appl. 6, 230–240 (2010).
26. Darriba, D., Taboada, G. L., Doallo, R. & Posada, D. jModelTest 2: more models, new heuristics and parallel computing. Nat. Methods 9, 772 (2012).
27. Aledavood, T. et al. Daily rhythms in mobile telephone communication. PLoS ONE 10, e0138098 (2015).
28. Liu, Z. et al. Mapping hourly dynamics of urban population using trajectories reconstructed from mobile phone records. Trans. GIS 22, 494–513 (2018).
29. Henningsen, A. & Toomet, O. maxLik: a package for maximum likelihood estimation in R. Comput. Stat. 22, 443–458 (2011).
30. Rodríguez-Barraquer, I. et al. Revisiting Rayong: shifting seroprofiles of dengue in Thailand and their implications for transmission and control. Am. J. Epidemiol. 179, 353–360 (2014).
Acknowledgements
H.S. is funded by the European Research Council (No. 804744). H.S. and D.A.T.C. would like to recognise funding by The National Institutes of Health (R01AI114703). A.P.W. is funded by a Career Award at the Scientific Interface by the Burroughs Wellcome Fund, by the National Library of Medicine of the National Institutes of Health under Award Number DP2LM013102 and the National Institute of Allergy and Infectious Diseases of the National Institutes of Health under Award Number R21AI151750. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Material has been reviewed by the Walter Reed Army Institute of Research. There is no objection to its presentation and/or publication. The opinions or assertions contained herein are the private views of the author, and are not to be construed as official, or as reflecting true views of the Department of the Army or the Department of Defense.
Author information
Contributions
H.S. developed the methods, conducted the analyses and wrote the first draft of the paper. S.C., A.W., N.L. and D.A.T.C. contributed to methods development. T.B., M.K., J.M.R., I.M.B., S.F., R.J., K.R., S.I., W.V., P.S., C.K., B.T., K.E.M. and C.B. worked on obtaining data for the analyses. All authors contributed to revising the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Communications thanks Tommy Lam and the other, anonymous reviewer(s) for their contribution to the peer review of this work. Peer review reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Salje, H., Wesolowski, A., Brown, T.S. et al. Reconstructing unseen transmission events to infer dengue dynamics from viral sequences. Nat Commun 12, 1810 (2021). https://doi.org/10.1038/s41467-021-21888-9