Effects of memory on the shapes of simple outbreak trees

Plazzotta, Giacomo; Kwan, Christopher; Boyd, Michael; Colijn, Caroline

doi:10.1038/srep21159


Published: 18 February 2016


Scientific Reports volume 6, Article number: 21159 (2016)


Abstract

Genomic tools, including phylogenetic trees derived from sequence data, are increasingly used to understand outbreaks of infectious diseases. One challenge is to link phylogenetic trees to patterns of transmission. Particularly in bacteria that cause chronic infections, this inference is affected by variable infectious periods and infectivity over time. It is known that non-exponential infectious periods can have substantial effects on pathogens’ transmission dynamics. Here we ask how this non-Markovian nature of an outbreak process affects the branching trees describing that process, with particular focus on tree shapes. We simulate Crump-Mode-Jagers branching processes and compare different patterns of infectivity over time. We find that memory (non-Markovian-ness) in the process can have a pronounced effect on the shapes of the outbreak’s branching pattern. However, memory also has a pronounced effect on the sizes of the trees, even when the duration of the simulation is fixed. When the sizes of the trees are constrained to a constant value, memory in our processes has little direct effect on tree shapes, but can bias inference of the birth rate from trees. We compare simulated branching trees to phylogenetic trees from an outbreak of tuberculosis in Canada, and discuss the relevance of memory to this dataset.


Introduction

Understanding outbreaks of an infectious disease is important for understanding how a pathogen spreads, and in determining the best steps to take to control it. Recently, the advent of next-generation sequencing has permitted the use of genomic data to assist in understanding outbreaks. Even small amounts of genetic variation within an outbreak can potentially be detected with whole-genome sequencing, and used to aid in reconstructing who infected whom^{1,2,3,4,5,6,7,8}. Genomic data are typically analysed by inferring phylogenetic trees (phylogenies), namely trees in which the tips correspond to pathogen isolates from infected cases, and the internal nodes correspond to inferred ancestors. Phylogenies are usually rooted, binary trees. An ongoing challenge in epidemiology is to make the best use of genomic data, usually with the help of inference and analysis of phylogenetic trees that carry information on parameters including the basic reproduction number (R₀)^{9,10}.

Phylogenetic trees are related to branching trees. In a branching process without multifurcations (i.e. where there is a positive time interval between successive branching events), the process defines a bifurcating tree moving forward in time; internal branch lengths reflect times between infection events, and pendant branch lengths reflect the time between an infection event and a sampling event. Under good conditions (where evolution is clock-like, within-host diversity is low and where sufficient diversity accrues across sampled individuals), a timed phylogenetic tree can be seen as an approximate representation of the true branching tree, though it does not include the information of who infected whom in a direct way. Indeed, the link between pairwise genetic diversity and who infected whom has been widely studied and discussed^{6,7,8,11,12,13,14,15,16,17}. These assumptions may break down for various reasons, but the study of branching trees remains a central tool for modelling phylogenetic trees.

Both the theory of branching processes and Kingman’s coalescent theory provide models for branching trees; these have been used to good effect in theoretical epidemiology^{11,14,15,18,19,20,21}. The constant rate birth-death and coalescent processes share the simplifying assumption that the distribution of times between branching events is exponential, due to the constant rates, mirroring the exponential time distribution in the susceptible-infectious-recovered (SIR)-type epidemic models that have been widely used to model the spread of infection²². However, exponentially distributed (memory-less) infectious periods are not very realistic for many infections^{23,24,25,26,27,28}. Non-exponential distributions in models of the spread of infection have been a topic of study for decades, and it is well established that incorporating memory in these processes can have large effects on the models’ dynamics^{24,26,29,30,31,32}. Non-exponential distributions, particularly in the infectious period, can also affect the estimation of R₀ and other parameters^{33,34,35}. The growing fields of phylodynamics and genomic epidemiology, however, have primarily used the constant rate assumption because of its tractability and the inherent additional complexities of estimation from sequence data. Recently there has been growing interest in non-Markovian processes in this context, particularly non-constant removal rates^{36,37,38}. Using multiple compartments, the epidemiological coalescent can account for non-exponential durations of infectiousness and variable infectivity^{15,19,21}, but in models with many compartments, the necessary inference becomes challenging due to large numbers of latent variables¹⁵.

Chronic bacterial infections such as tuberculosis have long and variable durations of infection. This can include a non-infectious latent period, as is the case for tuberculosis, and can also include infectiousness worsening over time. Furthermore, cases may not present clinically in the chronological order in which they were infected. An individual may be undiagnosed and infectious for months, and some infections may remain latent for variable time periods in infected individuals. Accordingly, the spreading processes of these complex infections are likely to depart substantially from the constant rate assumption, and reconstructing transmission events using the timing of case presentation is not always feasible.

Models have so far focused primarily on the branching times in phylogenetic trees, as these are natural quantities in branching processes and coalescent theory. However, it has been observed for species phylogenies that the tree shapes arising from the Yule or constant-rate birth-death processes do not fit trees from data particularly well^{39,40,41,42,43}. Several studies have suggested that tree shapes carry relevant information for epidemiology^{21,44,45,46,47}, and tree shapes have recently been shown to have applications to inference from phylogenies derived from both viral and macroevolutionary data^{21,48}. Frost and Volz²¹ noted that coalescent times are not sufficient to estimate epidemiological dynamics in complex models (such as structured populations), though they do very well in simple populations. They found that high transmission in the acute stage of HIV infection affected the asymmetry and the numbers of cherry configurations in phylogenies²¹. Recently, Hagen et al.⁴⁸ also found that variable speciation rates in macroevolutionary processes affect tree imbalance and produce trees that match the shapes of thousands of macroevolutionary trees better than trees from homogeneous processes. However, while asymmetry and cherry patterns capture aspects of tree shape, they do not describe it entirely. Incorporating tree shapes into the growing field of phylodynamics is an open challenge.

Here, we simulate and compare outbreaks using infectiousness functions that vary sharply over time. We compare a range of shape features of the resulting trees to each other, to trees from constant rate processes, and to trees from a tuberculosis dataset for which memory is likely relevant. We allow substantial delays between infection and infectiousness, so that the times are not exponentially distributed. This introduces memory into the process. We control either R₀ (the mean number of secondary infections) or the Malthusian parameter M (the mean rate of growth of the process), but vary the timing of infections, and explore how this affects the shapes of the outbreaks’ branching trees. We find that contrary to our expectations, memory in the process has very little direct effect on the shapes of branching trees. Rather, it strongly affects the number of infected individuals (tips in the tree), and affects tree shapes as a consequence. It also affects estimates of the birth rate derived from branching times, and it affects the comparison between the branch timing and that expected under a Yule process. We find that phylogenies derived from data do not match the shapes of the constant-rate birth-death models even when the birth and death rates were fit to data, and we discuss whether this match should be expected.

Methods

Formulation and notation

We use the Crump-Mode-Jagers generalised branching process. We follow Jagers’ setting⁴⁹: each individual is assigned a random variable λ modelling its life/recovery time and a point process ξ modelling the number of individuals that it infects and the respective infection times. The pairs (λ, ξ) assigned to different individuals are independent and identically distributed. This implies that the process z(t), defined as the number of individuals alive at time t, is indeed a branching process.

In our model, λ is an exponential random variable representing the life span of an individual; its expectation is 1/δ, where δ is the death rate. If δ = 0 we have trees with no death, which results in ultrametric trees, i.e. trees where the distance from an ancestor to any of the tips is the same. This removes the risk that sampling through time will bias the shape features²¹. The point process ξ (the new infections caused by each individual) is a non-homogeneous Poisson process with intensity I(t). The mean number of secondary infections caused by any individual (R₀) is given by $R_0 = \int_0^\infty I(t)\, e^{-\delta t}\, dt$, where t is the time since the individual became infected. The Malthusian parameter M is the finite positive solution to the equation $E\left[\int_0^\infty e^{-Mt}\, \xi(dt)\right] = 1$, which in our setting reduces to $\int_0^\infty e^{-(M+\delta)t}\, I(t)\, dt = 1$. The Malthusian parameter M exists if E[ξ(0)] < 1 < E[ξ(∞)] < ∞, which are reasonable properties in the context of transmission trees (and we assume them throughout). The Malthusian parameter captures the growth of the process because the expectation E[e^{−Mt} z(t)] converges to a constant as t → ∞⁵⁰.

To convert the branching process simulation to a rooted binary transmission tree with branch lengths in units of time, we begin at the source case, adding a node each time there is an infection event. One of the descending lineages from that node corresponds to the infector (say A), and the other to the infectee (say B). The length of the branch is either (1) the time between A’s infection and A infecting B (if B is the first case A infected), or (2) the time between A’s infecting the individual she infected just prior to B and the time A infects B. The pendant branch length to the tip labelled B is the time between either (1) B’s infection and B’s sampling (if B did not infect anyone) or (2) the time between B’s last infection of someone else and B’s sampling. In this way, the branching process defines a rooted, timed, bifurcating tree. While the focus of this paper is on how the shapes of these trees are affected by the intensity function describing when infection events occur, the motivation of the work makes the implicit assumption that these shapes are made relevant because under good conditions, timed phylogenetic trees are a reasonable approximation to these bifurcating timed transmission trees.

Simulations

Simulating stochastic branching trees under processes with memory is challenging. Gillespie-type methods do not fit the problem naturally, and steps must be taken to ensure that all events that can happen before the final simulation time have the appropriate probability of happening (affecting the conditions under which the process can be stopped). This can be very computationally time-consuming due to the variability of the tree size in branching processes. This complexity is part of the motivation for using simple intensity functions. To simulate non-homogeneous trees, we wrote a recursive function that takes as input the ancestor, the start and final times of the tree, the intensity function and the life-span distribution: SimNHTree(ancestor, startTime, finalTime, int, lambda). The branching property states that the subtree generated by each daughter of the ancestor is equal, in distribution, to the whole tree in every aspect but the starting time. To use this property we first find the number of daughters of the ancestor and their birth times by simulating a non-homogeneous Poisson process with intensity int. Then, for each of the ancestor’s daughters, the function SimNHTree calls itself with the daughter and its birth time as new input: SimNHTree(daughter_i, birthTime(daughter_i), finalTime, int, lambda). This generates the subtrees of the ancestor’s daughters, which can be merged because we track each subtree’s ancestor and start time.
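The recursion described above can be sketched as follows. This is a minimal illustration and not the authors' SimNHTree code: the function names, the nested-dictionary output, the explicit bound `int_max` and the use of thinning to draw the non-homogeneous Poisson times are all assumptions made for the example.

```python
import random

def nh_poisson_times(intensity, int_max, horizon, rng):
    """Event times in [0, horizon) of a Poisson process with rate intensity(t).

    Thinning: candidates arrive at constant rate int_max and each is kept with
    probability intensity(t) / int_max, so intensity(t) <= int_max is required.
    """
    times, t = [], 0.0
    while True:
        t += rng.expovariate(int_max)
        if t >= horizon:
            return times
        if rng.random() * int_max < intensity(t):
            times.append(t)

def sim_nh_tree(start_time, final_time, intensity, int_max, death_rate, rng):
    """Simulate one individual and, recursively, all of its descendants."""
    # Life span: exponential with rate death_rate, infinite when death_rate == 0.
    life = rng.expovariate(death_rate) if death_rate > 0 else float("inf")
    horizon = min(life, final_time - start_time)
    # Ages (time since infection) at which this individual infects others.
    ages = nh_poisson_times(intensity, int_max, horizon, rng)
    # Branching property: each daughter starts an independent copy of the process.
    daughters = [
        sim_nh_tree(start_time + a, final_time, intensity, int_max, death_rate, rng)
        for a in ages
    ]
    return {"start": start_time, "end": start_time + horizon, "daughters": daughters}

def count_individuals(node):
    return 1 + sum(count_individuals(d) for d in node["daughters"])

def step_intensity(n, k):
    """Step-like intensity of height k on the interval [n, n + 1)."""
    return lambda t: k if n <= t < n + 1 else 0.0

rng = random.Random(1)
tree = sim_nh_tree(0.0, 6.0, step_intensity(1, 2.0), 2.0, 0.0, rng)
print(count_individuals(tree))
```

Because each generation needs at least the delay n before it can infect, the recursion depth stays small for the delayed intensity functions; for intensities that start at age zero an explicit stack may be preferable.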

We use two different approaches to setting the stopping time: in Scenario 1 we fix the time for each intensity function so that when R₀ is the same, so is the time, and when M is the same, so is the time. In Scenario 2 we tune the time to obtain, on average, trees with the desired number of tips. To do this, we use a simple algorithm that simulates a group of trees and, if the average number of tips is too high/low, decreases/increases the final time and starts a new simulation. It stops once the average number of tips of each group is between 32 and 34 (to be comparable to our dataset) or between 98 and 102, in order to compare results in larger trees (Scenario 2-large). To exclude meaningless cases, we rejected trees with fewer than 5 tips. In each scenario, we vary the delay between becoming infected and infecting others, using different intensity functions (each labelled with a case number illustrated in Figs. 1 and 2(a)). In Scenario 3 we increase both the delay between becoming infected and infecting others (i.e. the location of the intensity function) and the height of the intensity function. This results in the basic reproduction number ranging more widely than in Scenarios 1 and 2. Here we also use a positive death rate, so trees are not ultrametric.
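The tuning loop used for Scenario 2 can be sketched as below. The text does not specify how the final time is adjusted, so the additive step that halves whenever the search overshoots is an assumption, as are all function names. A constant-rate Yule simulator stands in for the tree simulator so that the example is self-contained.

```python
import random

def tune_final_time(mean_tips, lo, hi, t0, step=0.4, max_iter=100):
    """Adjust the final time until mean_tips(t) falls inside [lo, hi].

    mean_tips is any callable returning the average number of tips of a
    group of trees simulated up to time t.
    """
    t, last_dir = t0, 0
    for _ in range(max_iter):
        m = mean_tips(t)
        if lo <= m <= hi:
            return t, m
        direction = 1 if m < lo else -1      # too few tips: run longer
        if last_dir and direction != last_dir:
            step /= 2                        # overshoot: take smaller steps
        t += direction * step
        last_dir = direction
    raise RuntimeError("no final time found within max_iter groups")

def yule_tips(t, rng=random, rate=1.0):
    """Stand-in simulator: number of tips of a Yule tree grown for time t."""
    n, clock = 1, rng.expovariate(rate)
    while clock < t:
        n += 1
        clock += rng.expovariate(rate * n)
    return n

def mean_tips_yule(t, rng=random, n_trees=200, min_tips=5):
    """Average size of n_trees trees, rejecting trees with fewer than min_tips."""
    sizes = []
    while len(sizes) < n_trees:
        n = yule_tips(t, rng)
        if n >= min_tips:
            sizes.append(n)
    return sum(sizes) / len(sizes)
```

With a stochastic simulator such as `mean_tips_yule`, each group average is noisy, so the number of groups needed before one lands in a narrow window such as 32 to 34 varies from run to run.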

**Figure 2: Intensity functions (a) and shapes of trees (b) from Scenario 3, in which we explore a more extreme example of memory with a high R₀, high M, and delayed but high intensity function.**

The parameters were chosen empirically, in order to explore and compare trees generated by different intensity functions, but sharing biologically-relevant measures such as R₀, M or the size. The parameter choice was not intended to fit a specific outbreak. However, R₀ values for most common pathogens including tuberculosis (R₀ = 1–1.5) are in the range 1–6, with some viruses having much higher values (measles for example at R₀ ≈ 20). Our choices of R₀ mirror these values. The sizes of our outbreaks mirror the dataset we have, and in order that the results not be restricted to processes of this small size we also explore larger trees. Table 1 and Figs 1 and 2(a) give details of the parameters in each simulation. We compare these sets of simulations with two sets of constant-rate birth-death trees, one with parameters matching the R₀ of the trees in Scenario 1, and one with parameters estimated from our TB data using BEAST. For each case in Scenarios 1, 2 and 3 we simulate 200 trees and for the two homogeneous cases we simulate 1000 trees.

Table 1 Details of the parameters that defined the simulations.


The number of secondary infections per infectious case has a mean of R₀ but of course it can be distributed in various ways. In the terminology of branching processes this distribution is called the offspring distribution; the constant-rate birth-death process has a geometric offspring distribution (a Poisson number of infections accumulated over an exponentially distributed lifespan). However our non-homogeneous Poisson processes, in which cases survive their infectious period, have a Poisson offspring distribution. To explore possible effects of this difference, we simulate a variant of our process in which the intensity functions varied as in Scenario 1, but we draw the numbers of secondary infections from a geometric distribution (see Supplement).
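To make the difference between the two offspring distributions concrete, the sketch below compares their means and variances at the same R₀. It assumes a constant infection rate `beta` and removal rate `delta` (names chosen here for illustration), in which case each event in a case's life is an infection with probability β/(β + δ) and the number of infections before removal is geometric with mean β/δ.

```python
import math

def poisson_pmf(j, r0):
    """P(N = j) for a Poisson offspring number with mean r0 (log form avoids overflow)."""
    return math.exp(-r0 + j * math.log(r0) - math.lgamma(j + 1))

def geometric_pmf(j, beta, delta):
    """P(N = j) infections before removal when both rates are constant."""
    p_removal = delta / (beta + delta)
    return p_removal * (1 - p_removal) ** j

def mean_and_variance(pmf, j_max=400):
    """Moments from a truncated sum; the neglected tail is negligible here."""
    mean = sum(j * pmf(j) for j in range(j_max))
    var = sum((j - mean) ** 2 * pmf(j) for j in range(j_max))
    return mean, var

beta, delta = 2.0, 1.0          # R0 = beta / delta = 2
print(mean_and_variance(lambda j: poisson_pmf(j, beta / delta)))
print(mean_and_variance(lambda j: geometric_pmf(j, beta, delta)))
```

Both distributions have mean R₀, but the geometric variance is R₀(1 + R₀) rather than R₀, so the constant-rate process produces more cases with either no or very many secondary infections.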

Intensity functions: the time between infection and infecting others

For Scenarios 1, 2 and 2-large we choose intensity functions such that we can introduce memory while fixing R₀ (cases 1–4 in Table 1), or fixing the Malthusian parameter (cases 5–8 in Table 1). It is not possible to fix both simultaneously while varying the intensity function independently. We vary the timing of infectiousness, from beginning immediately (cases 1 and 5 in Table 1) to beginning relatively late after a case was infected (cases 4 and 8 in Table 1); Fig. 1 illustrates the intensity functions for each case.

For a general step-like intensity function, R₀ is given by

$$R_0 = \int_n^{n+1} k\, e^{-\delta t}\, dt = \frac{k}{\delta}\left(e^{-\delta n} - e^{-\delta (n+1)}\right) \qquad (1)$$

where k is the height of the step, n defines the step interval [n, n + 1] and δ is the death rate (for δ = 0 this reduces to R₀ = k). The Malthusian parameter M cannot be written in closed form and is the solution of the following equation:

$$\frac{k}{M+\delta}\left(e^{-(M+\delta) n} - e^{-(M+\delta)(n+1)}\right) = 1 \qquad (2)$$

where k, n and δ are the same as in Equation (1). Given a fixed value of R₀, Eq. (1) can be used to obtain different intensity functions with the same R₀, varying the parameters n and k. In a similar way if M is fixed, from Eq. (2) one can derive the height k for different values of n, thus defining different intensity functions with the same Malthusian parameter. In this way we derive the intensity functions in Fig. 1.
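As a numerical illustration of how a step intensity of height k on [n, n + 1] determines R₀ and M, the sketch below integrates k·e^{−δt} over the step for R₀, solves the corresponding condition ∫ k·e^{−(M+δ)t} dt = 1 for M by bisection, and inverts it to obtain the height k that gives a chosen M. The function names and the choice of bisection are assumptions for this example; the parameter values are illustrative and are not those of Table 1.

```python
import math

def r0_step(k, n, delta):
    """R0 for a step of height k on [n, n + 1] with death rate delta."""
    if delta == 0:
        return k  # limit of the closed form as delta -> 0
    return k / delta * (math.exp(-delta * n) - math.exp(-delta * (n + 1)))

def malthusian_lhs(M, k, n, delta):
    """Integral of k * exp(-(M + delta) t) over the step; equals 1 at the Malthusian M."""
    s = M + delta
    if s == 0:
        return k
    return k / s * (math.exp(-s * n) - math.exp(-s * (n + 1)))

def solve_malthusian(k, n, delta):
    """Bisection for M > 0; the left-hand side decreases in M and equals R0 at M = 0."""
    if r0_step(k, n, delta) <= 1:
        raise ValueError("R0 must exceed 1 for a positive Malthusian parameter")
    lo, hi = 0.0, 1.0
    while malthusian_lhs(hi, k, n, delta) > 1:
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if malthusian_lhs(mid, k, n, delta) > 1:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def height_for_malthusian(M, n, delta):
    """Step height k that yields Malthusian parameter M for a given delay n."""
    s = M + delta
    return s / (math.exp(-s * n) - math.exp(-s * (n + 1)))

# Same M, increasing delay n: the required height k (and hence R0) grows.
for n in range(4):
    k = height_for_malthusian(0.5, n, 0.0)
    print(n, round(k, 3), round(solve_malthusian(k, n, 0.0), 6))
```

The loop shows why R₀ and M cannot both be held fixed while the delay varies: keeping M constant as n increases forces a larger k, and with δ = 0 that k is exactly R₀.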

Shape features

Many of the functions used in this paper have been collected into an R package called phyloTop. Its aim is to allow the calculation of topological properties of phylogenetic trees. It does this by allowing the calculation of certain basic properties. Three important examples of topological properties of the nodes of a tree are the number of descendants of each node (this generalises the concept of cherries), the imbalance in the number of descendants and the length of the ladder starting from that node. For a graphical representation of cherry, pitchfork and ladder we refer to Fig. 3. The package includes tools to calculate these and many others. Once these basic properties have been found it is easy to calculate whatever else may be needed. This approach is quite flexible in calculating other topological properties. phyloTop implements this practice for some common examples such as the Colless and Sackin imbalance. phyloTop is based on the R package phylobase.

We use the normalized Colless imbalance^{51,52}, Sackin imbalance^{16,53}, the number of cherries^{21,54}, the number of pitchforks, a “stairness” property (stairs2), the number of internal nodes with a single tip descendant (ILnumber), and an average “ladder length” (avgLadder). A cherry is two tips with a common ancestor. A pitchfork is a configuration of 3 tips: one cherry and an additional tip with a common ancestor. They can be counted in phyloTop with nConfig(tree, 3). The stairs feature is the second “stair-ness” shape defined by Norstrom⁵⁵, namely the average of $\min(T_{ri}, T_{li}) / \max(T_{ri}, T_{li})$ over the internal nodes of the tree. Here, T_ri and T_li are the numbers of tips descending from the right and left sides at internal node i. We define a ladder to be a series of connected internal nodes, each with a single leaf descendant. The avgLadder is the average length of ladders in the tree.
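The definitions above can be illustrated with a single traversal of a small tree. This sketch is not the phyloTop implementation: it represents a tree as nested 2-tuples with strings as tips, returns unnormalised Colless and Sackin values, takes stairs2 to be the average of min/max of the two subtree sizes, and omits avgLadder for brevity. All of these choices are assumptions made for the example.

```python
def shape_features(tree):
    """Shape features of a rooted binary tree given as nested 2-tuples."""
    stats = {"tips": 0, "cherries": 0, "pitchforks": 0,
             "colless": 0, "sackin": 0, "il_number": 0}
    ratios = []

    def visit(node, depth):
        # A non-tuple node is a tip; it contributes its depth to Sackin.
        if not isinstance(node, tuple):
            stats["tips"] += 1
            stats["sackin"] += depth
            return 1
        left, right = node
        n_left = visit(left, depth + 1)
        n_right = visit(right, depth + 1)
        size = n_left + n_right
        if size == 2:
            stats["cherries"] += 1            # two tips with a common ancestor
        if size == 3:
            stats["pitchforks"] += 1          # a cherry plus one more tip
        stats["colless"] += abs(n_left - n_right)
        if (n_left == 1) != (n_right == 1):
            stats["il_number"] += 1           # exactly one child is a tip
        ratios.append(min(n_left, n_right) / max(n_left, n_right))
        return size

    visit(tree, 0)
    stats["stairs2"] = sum(ratios) / len(ratios)
    return stats

caterpillar = ((("a", "b"), "c"), "d")
balanced = (("a", "b"), ("c", "d"))
print(shape_features(caterpillar))
print(shape_features(balanced))
```

On four tips the caterpillar tree has one cherry, one pitchfork and two internal nodes with a single tip child, while the balanced tree has two cherries and none of the others, which is the contrast the imbalance and ladder measures are designed to pick up.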

These can all be computed in a straightforward manner in the phyloTop package. The relevant phyloTop functions were then used with a function called treeListSummary. As inputs, this takes a list of functions (each of which return a topological property of a tree) and a list of trees. It then returns a data frame displaying the results of applying the input functions to the input trees. phyloTop has been made available on CRAN with a standard open source licence.

We normalise the shape features by comparing them to the maximum possible value in a tree of the given size. Normalization is performed by division by the maximum possible value, which is a function of the number of tips, n. While the expected value of any of the shape patterns will vary with the model under which the expectation is taken, and these averages are in general challenging to determine, the maximum possible value in a tree of size n is straightforward for all of these shape features. Normalization is as follows: Colless (already normalized); Sackin (normalized by dividing by n(n + 1)/2 − 1); cherries (normalized by dividing by n/2); pitchforks (normalized by dividing by n/3); Stairs2 (already normalized); ILnumber (normalized by dividing by n − 2); max height (normalized by dividing by n − 1); average ladder (normalized by dividing by n − 2).
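As a check on the largest of these divisors, the Sackin maximum can be derived by hand, assuming the Sackin index is the sum over tips of the number of edges between the tip and the root. The maximum is attained by the fully unbalanced (caterpillar) tree, whose n tips sit at depths 1, 2, …, n − 1, n − 1:

```latex
S_{\max}(n) \;=\; \sum_{d=1}^{n-1} d \;+\; (n-1)
          \;=\; \frac{n(n-1)}{2} + (n-1)
          \;=\; \frac{n(n+1)}{2} - 1 .
```

The other bounds follow from counting: a cherry uses two tips and a pitchfork three, so there can be at most n/2 and n/3 of them, and in a caterpillar tree every internal node except the lowest one has exactly one tip child, which gives n − 2.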

A linear regression was performed for each normalised shape feature versus the start of the intensity burst n. A t-test on the slope was used to infer whether the feature increased or decreased with n. The test and the corresponding p-value, i.e. the probability given the simulations that the shape feature considered is neither increasing nor decreasing, were found with the function t.test in R⁵⁶. In addition, in the supplement Spearman’s correlation between shape features was computed for each scenario and case.

Data

We use data previously described in⁸ and⁴⁷ (Outbreak A). Briefly, the outbreak included 33 M. tuberculosis isolates collected in British Columbia between 2006 and 2011. Isolates were sequenced using paired-end 75bp reads on the Illumina HiSeq. The outbreak, sequences and SNPs are presented in⁸. Reads were aligned against the reference genome M. tuberculosis CDC1551 (NC002755) using Burrows-Wheeler Aligner (BWA)⁵⁷. Single nucleotide variants were identified using samtools mpileup⁵⁸ and were filtered to remove any variant positions within 250bp of each other and any positions for which at least one isolate did not have a genotype quality score of 222. The remaining variants were manually reviewed for accuracy and were used to construct phylogenetic trees with BEAST^{59,60} and MrBayes⁶¹. BEAST was run with the tip dates, and with a birth-death serial sampling model⁶², an uncorrelated relaxed molecular clock with an exponential (1) prior on the rate, and a GTR substitution model. The MCMC chain length was 10000000 with every 1000th state stored. MrBayes was run with the following options: lset Ploidy = Haploid; prset Brlenspr = clock:uniform; prset Treeagepr = Gamma(7.5, 1); prset nodeagepr = calibrated and tip dates included.

Results

A linear model was fitted to the group of simulations in cases 1–4 and to the group in cases 5–8 in order to investigate how memory, in terms of different intensity functions, may affect processes with either the same basic reproduction number or the same Malthusian parameter. The result of the statistical analysis is shown in Table 2.

Table 2 Relation between shape topologies and memory, as included in our simulations.


From the simulations in Scenario 1, summarised in Fig. 4, we find that memory can affect many of the shape features we compared. In particular, as the start of the infectious period n moves further from the time of infection, the tree imbalance increases. A negative or null slope of both standardised Colless and Sackin imbalance is rejected with a p-value p < 10⁻⁸ for both cases 1–4 and cases 5–8. The frequency of cherries is unaffected by memory if the R₀ is kept constant (p = 0.48). For cases 5–8 the frequency of cherries is decreasing, having rejected the hypothesis of a null or positive slope (p < 10⁻¹⁵). The frequency of pitchforks shows a slight increase for cases 1–4 (p = 1.3 ⋅ 10⁻⁶) and a decrease for cases 5–8 (p = 1.5 ⋅ 10⁻⁴). Similarly the “stairs 2” feature⁵⁵ increases as n increases for cases 1–4 where the R₀ is kept constant (p = 1.4 ⋅ 10⁻⁴) and decreases for cases 5–8 where the Malthusian parameter is constant (p < 10⁻¹⁵). The normalised number of internal nodes with a single tip descendant (“ILnumber”), maximum heights and average ladder length increase both when the R₀ is constant (p = 1.9 ⋅ 10⁻⁵, p < 10⁻¹⁵ and p = 9.9 ⋅ 10⁻⁹ respectively) and when the Malthusian parameter is constant (all p < 10⁻¹⁵). Whereas it is difficult to visualise most topological differences, the increase in imbalance, standardised maximum height and average ladder can be appreciated from the example trees related to Scenario 1 in Fig. 5; for instance, the tree in (S1 case5) is more balanced and has smaller ladders, in proportion to its tips, than (S1 case7) or (S1 case8).

**Figure 4: Boxplots describing shape features of branching trees from Scenario 1.**

**Figure 5: Example of trees from each scenario and case.**

However, the most dramatic difference between the various cases in Fig. 4 is in the number of tips. This was against expectation because the combinations final time-R₀ and final time-M were kept the same in cases 1–4 and 5–8 respectively (see Table 1). Particularly in cases 5–8 which have the same Malthusian parameter, the net growth is the same up to a (usually unknown) constant, and here it seems that this constant is highly dependent on the specific intensity function. This led us to ask whether the impact of memory on tree shapes in this context is just a matter of the impact of memory on the number of tips. To explore this question, we adjusted the time periods of the simulations to allow the different cases to produce branching trees of comparable sizes (Scenario 2). Fig. 6 shows the result. We now see smaller differences between the processes. In the simulations where R₀ is kept constant, only the frequency of cherries, the “stairs 2” and the ILnumber show a statistically significant (p < 0.01) pattern. The cherry frequency and the “stairs 2” decrease (p of 0.2 ⋅ 10⁻³ and 0.6 ⋅ 10⁻³ respectively), whereas the ILnumber increases as n increases (p = 0.3 ⋅ 10⁻²). In the cases with constant M, more shape features showed statistically significant patterns. As the time between infection and the start of the infectious period grows, the cherry and pitchfork frequencies together with the “stairs 2” measure decrease (p of 10⁻¹⁵, 0.5 ⋅ 10⁻³, <10⁻¹⁵ respectively), whereas the ILnumber, the maximum height and the average ladder increase (p of <10⁻¹⁵, 0.3 ⋅ 10⁻² and 0.1 ⋅ 10⁻³ respectively). Compared to Scenario 1, adjusting for the size of the branching trees removes some effects of memory, particularly in the simulations where R₀ is kept constant. Example trees for Scenario 2 are shown in the second row of Fig. 5. The shape differences among them are not as evident as among the trees from Scenario 1, although there is a clear increase in the proportion of internal nodes with a single tip descendant (standardised ILnumber) from case 5 (11 nodes, 23 tips) to case 8 (14 nodes, 24 tips).

**Figure 6: Boxplots describing shape features of branching trees from Scenario 2, where the size of the tree is controlled by varying the times of the simulations.**

As in Scenario 2, Scenario 2-large shows an increased level of uncertainty (high p); see Fig. 7 and Table 2 for a summary of the results. Compared to Scenario 2, in Scenario 2-large the Colless and Sackin imbalance for cases 5–8 is decreasing (p of 3.5 ⋅ 10⁻⁴ and 6.0 ⋅ 10⁻⁴) instead of uncertain; the frequency of pitchforks for cases 1–4 increases (p = 7.7 ⋅ 10⁻⁴) and is unchanged in cases 5–8; the standardised maximum height has an opposite behaviour for cases 5–8; and the standardised average ladder for cases 5–8 does not show a statistically significant linear increase.

**Figure 7: Boxplots describing shape features of branching trees from Scenario 2-large, where the tree has a mean number of tips between 98 and 102, obtained varying the simulation time.**

We used two posterior collections of phylogenetic trees derived from the Kelowna TB outbreak in Canada (see Methods). From the collection of BEAST⁶⁰ estimates of the trees, birth and death rates were estimated. The estimated values were used to simulate the homogeneous trees in scenario BDFit; see Table 1 for the input values used. Data-derived trees had slightly lower imbalance and slightly lower normalised maximum height than the simulated trees once size was controlled (Fig. 6); the two inference methods differed more from each other in several shape parameters (cherries, IL number and stairs) than the data trees differed from the simulations. Compared to Scenario 1, the variability in the data-derived trees was typically much lower than that in the constant-rate models and more closely matched the tightly defined simulations in cases 3, 4, 7, and 8, but then the number of tips in the data is fixed, and the numbers of tips in cases 3, 4, 7 and 8 were the most narrowly distributed as well. Compared to Scenario 2, where we controlled the average size, all of the tree shape features from the data are consistent with the simulations except for imbalance, maximum height and the avgLadder feature. The BEAST trees were always closer to the constant-rate birth-death model than the MrBayes trees, consistent with our having used the birth-death prior in BEAST.

The fact that the number of tips varies dramatically while R₀ or M are fixed means that memory affects the number of lineages in the tree. This led us to wonder whether memory would affect the results of inference approaches that assume a memory-less model and use the timing of branching events. We estimated the birth rate using the pureBirth function in the laser package in R⁶³. We found that the estimates varied, and that memory resulted in some bias. Since the trees are ultrametric, the estimate of the birth rate should be equal to the Malthusian parameter. With reference to Fig. 8, the median estimate (shown by the horizontal bar in the boxplots) is too high in cases 1–4, correct in cases 5–8 in Scenario 1, but too high in cases 5 and 8 in Scenario 2. We also used Pybus’ γ⁶⁴ to analyse the timing of branching events; this is possible for the ultrametric trees (Scenarios 1 and 2). We found that both memory, as we have explored it, and the distribution of the number of secondary infections, can affect whether trees appear consistent with the Yule model (see Supplementary Material).
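For reference, the maximum-likelihood birth rate under a Yule (pure-birth) model can be computed directly from the branching times of an ultrametric tree, as the number of branching events after the root divided by the total branch length. The sketch below implements that standard estimator; whether the pureBirth function in laser uses exactly this formula is an assumption, and the function name here is hypothetical.

```python
def yule_birth_rate(branching_times):
    """Pure-birth ML estimate from node heights (time before the present).

    With heights x_1 > x_2 > ... > x_{n-1} for a tree with n tips, there are
    i + 1 lineages between x_i and x_{i+1}, and n lineages after x_{n-1}.
    """
    heights = sorted(branching_times, reverse=True)
    n_tips = len(heights) + 1
    if n_tips < 3:
        raise ValueError("need at least two branching times")
    total_length = 0.0
    for i, height in enumerate(heights):
        next_height = heights[i + 1] if i + 1 < len(heights) else 0.0
        total_length += (i + 2) * (height - next_height)
    # n_tips - 2 branching events occur after the root split.
    return (n_tips - 2) / total_length

# Root at height 3, further splits at heights 2 and 1: 4 tips, total length 9.
print(yule_birth_rate([3.0, 2.0, 1.0]))
```

Because the estimator depends only on how many lineages exist over time, any process that changes the timing of branching events, such as delayed infectiousness, changes the estimate even when the long-run growth rate is the same.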

**Figure 8: Pure birth rate fits to ultrametric simulated branching trees from Scenarios 1 (a) and 2 (b).**

We compared branching trees from a set of more extremely varying intensity functions such that the R₀ values differed greatly. Fig. 2(a) illustrates the intensity functions, which range from infectivity beginning immediately to infectivity beginning much later. We found that high R₀ values combined with a late intensity function resulted in marked differences in tree shapes, particularly in the numbers of cherries, pitchforks, the stairs feature, and the IL number, all with p < 10⁻¹⁵. The marked difference in shape is also evident in the examples in Fig. 5. Imbalance increases with R₀, with p = 8.06 ⋅ 10⁻¹⁰ and 4.1 ⋅ 10⁻³ for Colless and Sackin respectively, as does the maximum height (p = 2.0 ⋅ 10⁻⁹). The number of tips and average ladder length are the only two shape features with no statistically significant difference among the three cases (p = 0.87 and 0.06 respectively). A long delay between infection and the start of the infectious period causes the tree to have only a few individuals with numerous offspring, as in (S3 case2) and (S3 case3), compared to the more “normal” shape of (S3 case1). The extreme case (S3 case3) is composed of a handful of long caterpillars, which imply a low frequency of cherries and pitchforks and a high ILnumber, validating the results in Fig. 2.

Some of the shape features are naturally related to each other. For instance, connected “ladder” configurations will occur more frequently in imbalanced trees and cherries will be more numerous in balanced ones. Among measures of balance, Rogers⁶⁵ showed high correlation between Sackin and Colless imbalance, under the equal rates Markov model and the equal probability model. With a simulation approach, Shao and Sokal⁶⁶ evaluated the correlation matrix of nine indices of tree balance under the equal probability model. Similarly, they found that Colless and Sackin are highly correlated with each other.
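To make these relationships concrete, the sketch below computes the Sackin and Colless imbalance and the number of cherries for a rooted binary tree. The nested-tuple representation and the function name are assumptions chosen for brevity; they are not taken from the R package used in this study.

```python
def shape_stats(tree):
    """Sackin, Colless and cherry counts for a rooted binary tree.

    tree: nested 2-tuples for internal nodes; any non-tuple value is a tip
    (a hypothetical representation used only for this example).
    """
    stats = {"sackin": 0, "colless": 0, "cherries": 0}

    def visit(node, depth):
        # Returns the number of tips below `node`.
        if not isinstance(node, tuple):
            stats["sackin"] += depth  # Sackin: sum of tip depths
            return 1
        left, right = node
        n_left = visit(left, depth + 1)
        n_right = visit(right, depth + 1)
        # Colless: sum over internal nodes of |tips left - tips right|
        stats["colless"] += abs(n_left - n_right)
        if n_left == 1 and n_right == 1:
            stats["cherries"] += 1  # both children are tips
        return n_left + n_right

    stats["tips"] = visit(tree, 0)
    return stats


caterpillar = ((("a", "b"), "c"), "d")  # maximally imbalanced, 4 tips
balanced = (("a", "b"), ("c", "d"))     # fully balanced, 4 tips
print(shape_stats(caterpillar))
print(shape_stats(balanced))
```

For these two four-tip trees the caterpillar scores higher on both imbalance measures and has fewer cherries than the balanced tree, which is the pattern described in the paragraph above.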

We explored these correlations across our simulations and data (Supplementary Figures S3–S6). We found that the correlations are remarkably preserved across the simulations, but that Case 8 and the latter two cases from Scenario 3 (all with high R₀ and delayed transmission), as well as the data, had correlation patterns which differed from the rest. Colless and Sackin imbalances are highly correlated in every simulation case, scoring a minimum of 0.88 (p < 10⁻¹⁵) in Scenario 2 case 8. We did not find negative correlations between cherries and imbalance, though cherries indicate symmetry (near the tips) and imbalance indicates asymmetry (over the whole tree). Unlike other cases, case 8 consistently has a negative correlation between the ILnumber feature and the pitchforks: −0.82 (p < 10⁻¹⁵) in Scenario 1 and −0.52 (p = 3.5 ⋅ 10⁻¹²) in Scenario 2. This negative correlation also occurs in the high R₀ cases from Scenario 3. In both data-derived groups of trees, the ladder numbers were not correlated with the imbalance measures whereas in most simulations these were tightly correlated. In the data-derived trees there were significant negative correlations between the stairs feature and the imbalance, height and ladder number, which were not present in any of the simulations. The data-derived trees had a weaker correlation between the ladder number and the imbalance than the simulations. In Scenario 3, many small but consistent correlations amongst tree shapes are reversed compared to the rest of the simulations; in particular, imbalance and tree height are not negatively correlated with cherries, pitchforks and stairs. These trees can have a high imbalance simultaneously with high values of symmetric shapes such as cherries, for example. There are strong negative correlations between the number of ladder nodes (ILnumber) and the cherries and pitchfork numbers.

Overall, when we compare how shape features have depended, or not depended, on the variations we have explored, we note that there are several shapes whose distributions were quite tightly constrained by data compared to their variability in simulations. These were the maximum height of the tree, the avgLadder feature and both measures of imbalance. In contrast, several shape features emerge as being sharply determined by the combination of high R₀ and M: the numbers of cherries and pitchforks, the stairs feature, and the number of internal nodes with a single tip descendant (ILnumber).

Discussion

A “process with memory” is simply a process that is non-Markovian, and there are many ways that memory can be introduced. Here, introducing memory in a simple way affected many shape features of branching trees, but also affected the timing of new infections (via the intensity function), the offspring distribution, the Malthusian parameter and the tree size distribution. This makes disentangling the effects of non-exponential waiting times on tree shapes quite complex. We have attempted to construct our study to best explore these different effects, and within this context we have found that tree shapes are quite robust to the non-exponential waiting times we have used. However, the memory in our processes did affect estimates of the birth rates and the Pybus γ statistic.
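The coupling between the intensity function and the Malthusian parameter can be illustrated numerically. The sketch below assumes a step intensity function (constant rate `rate` from `delay` to `delay + width` after infection, so that R₀ = rate × width) and takes M to be the positive root of the Euler-Lotka equation ∫ e^(−Mt) β(t) dt = 1, the standard definition for a supercritical Crump-Mode-Jagers process. The function name, parameter names and bisection solver are illustrative assumptions, not the code used in the study.

```python
import math


def malthusian_parameter(rate, delay, width):
    """Solve the Euler-Lotka equation for a step intensity function.

    Assumed intensity: beta(t) = rate for delay <= t <= delay + width, else 0.
    """
    if rate * width <= 1.0:
        raise ValueError("a supercritical process (R0 > 1) is required")

    def euler_lotka(m):
        # Closed form of the integral of rate * exp(-m t) over the pulse, minus 1.
        return rate * math.exp(-m * delay) * (-math.expm1(-m * width)) / m - 1.0

    lo, hi = 1e-12, 1.0
    while euler_lotka(hi) > 0.0:  # expand the bracket until the sign changes
        hi *= 2.0
    for _ in range(200):          # plain bisection; the function decreases in m
        mid = 0.5 * (lo + hi)
        if euler_lotka(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Same R0 = 2 in both cases, but a longer delay gives slower exponential growth.
print(malthusian_parameter(rate=2.0, delay=0.0, width=1.0))
print(malthusian_parameter(rate=2.0, delay=3.0, width=1.0))
```

Holding R₀ fixed while moving the pulse later lowers M, which is one reason why fixing R₀ alone does not fix the size of a tree observed at a given cut-off time.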

We explored memory using a collection of simple step-like intensity functions with a delay between a host becoming infected, and infecting others. When the delay is much larger than the pulse width (the duration of infectiousness), the branching events of each individual occur in a short time compared to the individual’s life span, very much like a burst. In the limit where the “burst” becomes very short, each individual spawns offspring at a fixed time after infection. In this case, when we observe the tree, each individual has either reached this time, infected others and completed the infectious burst or not. In such a tree, shape features such as imbalance, the cherry-to-tips ratio and so on remain essentially unchanged over time. In contrast, if the delay is comparable to the pulse width, the first offspring of an individual may infect others before her “youngest sister” is born. This overlap can influence the shape because some part of the tree may not be born when the tree is observed at the cut-off time or when a chosen number of individuals is reached.
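A minimal simulation of this kind of process, assuming a step intensity function with a constant rate during the infectious pulse, might look as follows. All names and parameters (`rate`, `delay`, `width`, `t_end`, `max_cases`) are hypothetical and chosen for this example; the sketch records who infected whom and when, and leaves out the conversion to a branching tree.

```python
import random


def simulate_outbreak(rate, delay, width, t_end, seed=0, max_cases=100_000):
    """Simulate infection times under a step-like intensity function.

    Each case infects others as a Poisson process with rate `rate` during
    [t_inf + delay, t_inf + delay + width], truncated at the cut-off t_end.
    Returns a list of (infection_time, infector_index); the index case has
    infector None.
    """
    rng = random.Random(seed)
    cases = [(0.0, None)]
    i = 0
    # Cases are processed in list order; max_cases is a safety cap that is
    # checked between cases, so the final list may slightly exceed it.
    while i < len(cases) and len(cases) < max_cases:
        t_inf, _ = cases[i]
        t = t_inf + delay
        window_end = min(t_inf + delay + width, t_end)
        while True:
            t += rng.expovariate(rate)  # exponential gaps within the pulse
            if t > window_end:
                break
            cases.append((t, i))
        i += 1
    return cases


# Burst-like regime: the delay is long compared with the pulse width.
burst = simulate_outbreak(rate=4.0, delay=2.0, width=0.5, t_end=9.0, seed=1)
# Overlapping regime: the delay is comparable to the pulse width.
overlap = simulate_outbreak(rate=2.0, delay=1.0, width=1.0, t_end=9.0, seed=1)
print(len(burst), len(overlap))
```

In the burst-like regime every secondary infection falls in a narrow window after its infector's own infection, so generations remain well separated; in the overlapping regime early offspring can start transmitting before later siblings are infected.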

Like most other works in the field of phylodynamics, we have implicitly assumed that the true branching tree, or at least a good posterior distribution approximating it, can be estimated using pathogen genetic data. This is a limitation, as branching events in phylogenies may not always correspond to transmission events in the outbreak. However, when within-host diversity is low, the pathogen evolves in a clock-like manner and accrues sufficient genetic diversity, the branching points in a phylogeny are likely to be very close to transmission events. A related complication is that the “transmission bottleneck” is not typically known; hosts may initially be infected with more than one pathogen lineage. Finally, hosts may be re-infected and carry multiple lineages as a consequence. We have chosen not to add these additional complexities to our exploration of memory and tree shapes. Indeed, if under models that capture within-host diversity or other complex features, transmission trees can be inferred⁸,⁶⁷, then comparing the shapes of those transmission trees to what might be expected under different intensity functions, as presented here, remains relevant. Challenges in the next generation of phylodynamics have recently been discussed elsewhere¹⁰.

We have not focused on how sampling affects tree shapes, but the question of how sampling affects phylodynamic inference is a challenging one¹⁰. The density, timing and uniformity of sampling can be expected to affect shapes; in the limit of very low sampling density, we would expect the effects of non-exponential waiting times between infection events to be washed out by the fact that lineages in the sampled branching tree would change hosts many times between branching events (unless the low sampling is highly non-uniform). If sampling occurs through contact tracing, snowball sampling or respondent-driven sampling then this could have substantial effects on tree shapes, which are as yet uncharacterised.

In outbreak settings, many factors can impact tree shape. These include R₀, M and non-exponential distributions of waiting times (which we have explored here), but also selection, population structure, host behaviour, super-spreading, host contact network structure²¹,⁴⁴,⁴⁵,⁴⁷,⁶⁸ and other factors. The complexity of the underlying models and the computational challenges associated with large datasets make likelihood-free inference an appealing tool in this domain⁶⁹,⁷⁰,⁷¹. However, this approach requires informative summary features that can be compared to properties of sequence data or to trees inferred from these data.

Tree shapes are potentially an important source of such summary features. The number of possible tree shapes grows exponentially with the number of tips, so specifying a shape in principle specifies a lot of information. Using shapes as informative summary features will require much more finely-resolved shape statistics than the few that are currently in use, mainly imbalance and the number of cherries. We have found that tree shapes are quite robust to variations in the waiting times between the onset of infection and infecting others. However, in our study, some shapes were strongly constrained by data (tree height, the average length of connected “ladder” components and the imbalance) and some were strongly affected by high R₀ and M (cherries, pitchforks, stairs and ILnumber). If this robustness to memory, together with sensitivity to other aspects of the data, carries forward to an informative suite of tree shapes, shapes could provide an alternative approach to estimating epidemiological parameters such as R₀ and M using sequence data.
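The growth in the number of shapes can be checked directly. Assuming that “shape” means an unlabelled rooted binary tree, the counts are the Wedderburn-Etherington numbers, which the following sketch computes from their standard recurrence; the function name is an assumption made for this example.

```python
def count_tree_shapes(max_tips):
    """Number of unlabelled rooted binary tree shapes with 1..max_tips tips.

    Uses the Wedderburn-Etherington recurrence: a shape with n tips is an
    unordered pair of subtree shapes whose tip counts sum to n.
    """
    w = [0, 1]  # w[k] = number of shapes with k tips; w[0] is unused padding
    for n in range(2, max_tips + 1):
        # Pairs of subtrees with different tip counts i < n - i.
        total = sum(w[i] * w[n - i] for i in range(1, (n - 1) // 2 + 1))
        if n % 2 == 0:
            half = w[n // 2]
            total += half * (half + 1) // 2  # unordered pairs of equal size
        w.append(total)
    return w[1:]


print(count_tree_shapes(10))
```

The counts roughly double or more with each added tip once the tree has more than a handful of tips, so even moderately sized outbreak trees can take an enormous number of distinct shapes.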

Additional Information

How to cite this article: Plazzotta, G. et al. Effects of memory on the shapes of simple outbreak trees. Sci. Rep. 6, 21159; doi: 10.1038/srep21159 (2016).

References

Köser, C. U. et al. Rapid whole-genome sequencing for investigation of a neonatal MRSA outbreak. New Engl. J. Med. 366, 2267–2275 (2012).
Walker, T. M. et al. Whole-genome sequencing to delineate Mycobacterium tuberculosis outbreaks: a retrospective observational study. Lancet Infect. Dis. 13, 137–146 (2012).
Grad, Y. H. et al. Genomic epidemiology of the Escherichia coli O104:H4 outbreaks in Europe, 2011. Proc. Natl. Acad. Sci-Biol. 109, 3065–3070 (2012).
Török, M. E. et al. Rapid Whole-Genome Sequencing for Investigation of a Suspected Tuberculosis Outbreak. J. Clin. Microbiol. 51, 611–614 (2013).
Kato-Maeda, M. et al. Use of whole genome sequencing to determine the microevolution of Mycobacterium tuberculosis during an outbreak. Plos One 8, e58235 (2013).
Ypma, R. J. F., van Ballegooijen, W. M. & Wallinga, J. Relating Phylogenetic Trees to Transmission Trees of Infectious Disease Outbreaks. Genetics 195, 1055–1062 (2013).
Jombart, T. et al. Bayesian reconstruction of disease outbreaks by combining epidemiologic and genomic data. Plos Comput. Biol. 10, e1003457 (2014).
Didelot, X., Gardy, J. & Colijn, C. Bayesian inference of infectious disease transmission from whole-genome sequence data. Mol. Biol. Evol. 31, 1869–1879 (2014).
Grenfell, B. et al. Unifying the epidemiological and evolutionary dynamics of pathogens. Science 303, 327–332 (2004).
Frost, S. D. W. et al. Eight challenges in phylodynamic inference. Epidemics 10, 88–92 (2015).
Pybus, O. & Rambaut, A. Evolutionary analysis of the dynamics of viral infectious disease. Nat. Rev. Genet. 10, 540–550 (2009).
Jombart, T., Eggo, R., Dodd, P. & Balloux, F. Reconstructing disease outbreaks from genetic data: a graph approach. Heredity 106, 383–390 (2010).
Volz, E. Complex population dynamics and the coalescent under neutrality. Genetics 190, 187–201 (2012).
Stadler, T. et al. Estimating the basic reproductive number from viral sequence data. Mol. Biol. Evol. 29, 347–357 (2012).
Rasmussen, D. A., Volz, E. M. & Koelle, K. Phylodynamic inference for structured epidemiological models. Plos Comput. Biol. 10, e1003570 (2014).
Frost, S. D. W. & Volz, E. M. Modelling tree shape and structure in viral phylodynamics. Philos. Trans. R. Soc. Lond. B Biol. Sci. 368, 20120208 (2013).
Worby, C. J., Chang, H. H., Hanage, W. P. & Lipsitch, M. The distribution of pairwise genetic distances: a tool for investigating disease transmission. Genetics 198, 1395–1404 (2014).
Drummond, A., Rambaut, A., Shapiro, B. & Pybus, O. Bayesian coalescent inference of past population dynamics from molecular sequences. Mol. Biol. Evol. 22, 1185–1192 (2005).
Volz, E. M., Koopman, J. S., Ward, M. J., Brown, A. L. & Frost, S. D. W. Simple epidemiological dynamics explain phylogenetic clustering of HIV from patients with recent infection. Plos Comput. Biol. 8, e1002552 (2012).
Stadler, T. & Bonhoeffer, S. Uncovering epidemiological dynamics in heterogeneous host populations using phylogenetic methods. Philos. T. Roy. Soc. B 368, 20120198 (2013).
Frost, S. D. & Volz, E. M. Modelling tree shape and structure in viral phylodynamics. Philos. T. Roy. Soc. B 368, 20120208 (2013).
Anderson, R. M. Infectious Diseases of Humans: Dynamics and Control (Oxford Science Publications) (Oxford University Press, 1991), reprint edn.
Keeling, M. J. & Grenfell, B. T. Disease extinction and community size: modeling the persistence of measles. Science 275, 65–67 (1997).
Keeling, M. J. & Grenfell, B. T. Effect of variability in infection period on the persistence and spatial spread of infectious diseases. Math. Biosci. 147, 207–226 (1998).
Lloyd, A. L. The dependence of viral parameter estimates on the assumed viral life cycle: limitations of studies of viral load data. Proc. Biol. Sci. 268, 847–854 (2001).
Lloyd, A. L. Realistic distributions of infectious periods in epidemic models: changing patterns of persistence and dynamics. Theor. Popul. Biol. 60, 59–71 (2001).
Cauchemez, S. et al. Middle east respiratory syndrome coronavirus: quantification of the extent of the epidemic, surveillance biases, and transmissibility. Lancet Infect. Dis. 14, 50–56 (2014).
Grassly, N. C. & Fraser, C. Mathematical models of infectious disease transmission. Nat. Rev. Microbiol. 6, 477–487 (2008).
Grossman, Z. Oscillatory phenomena in a model of infectious diseases. Theor. Popul. Biol. 18, 204–243 (1980).
Lloyd, A. L. Destabilization of epidemic models with the inclusion of realistic distributions of infectious periods. Proc. Biol. Sci. 268, 985–993 (2001).
Heffernan, J. M. & Wahl, L. M. Improving estimates of the basic reproductive ratio: using both the mean and the dispersal of transition times. Theor. Popul. Biol. 70, 135–145 (2006).
Conlan, A. J. K., Rohani, P., Lloyd, A. L., Keeling, M. & Grenfell, B. T. Resolving the impact of waiting time distributions on the persistence of measles. J. R. Soc. Interface 7, 623–640 (2010).
Wearing, H. J., Rohani, P. & Keeling, M. J. Appropriate models for the management of infectious diseases. Plos Med. 2, e174 (2005).
Wallinga, J. & Lipsitch, M. How generation intervals shape the relationship between growth rates and reproductive numbers. Proc. Biol. Sci. 274, 599–604 (2007).
Nishiura, H., Chowell, G., Safan, M. & Castillo-Chavez, C. Pros and cons of estimating the reproduction number from early epidemic growth rate of influenza A (H1N1) 2009. Theor. Biol. Med. Model. 7, 1 (2010).
Lambert, A., Morlon, H. & Etienne, R. S. The reconstructed tree in the lineage-based model of protracted speciation. J. Math. Biol. 70, 367–397 (2015).
Lambert, A., Alexander, H. K. & Stadler, T. Phylogenetic analysis accounting for age-dependent death and sampling with applications to epidemics. J. Theor. Biol. 352, 60–70 (2014).
Lambert, A. The contour of splitting trees is a Lévy process. Ann. Probab. 38, 348–395 (2010).
Aldous, D. Probability distributions on cladograms. In Random Discrete Structures, The IMA Volumes in Mathematics and its Applications, 1–18 (Springer, New York, 1996).
Aldous, D. J. Stochastic models and descriptive statistics for phylogenetic trees, from Yule to today. Stat. Sci. 16, 23–34 (2001).
Mooers, A., Gascuel, O., Stadler, T., Li, H. & Steel, M. Branch lengths on birth-death trees and the expected loss of phylogenetic diversity. Syst. Biol. 61, 195–203 (2012).
Steel, M. & McKenzie, A. Properties of phylogenetic trees generated by Yule-type speciation models. Math. Biosci. 170, 91–112 (2001).
Blum, M. G. B. & François, O. Which random processes describe the tree of life? A large-scale study of phylogenetic tree imbalance. Syst. Biol. 55, 685–691 (2006).
Robinson, K., Cohen, T. & Colijn, C. The dynamics of sexual contact networks: Effects on disease spread and control. Theor. Popul. Biol. 81, 89–96 (2012).
Leventhal, G. et al. Inferring epidemic contact structure from phylogenetic trees. Plos Comput. Biol. 8, e1002413 (2012).
Poon, A. F. Y. et al. Mapping the shapes of phylogenetic trees from human and zoonotic RNA viruses. Plos One 8, e78122 (2013).
Colijn, C. & Gardy, J. Phylogenetic tree shapes resolve disease transmission patterns. Evol. Med. Public Health 2014, 96–108 (2014).
Hagen, O., Hartmann, K., Steel, M. & Stadler, T. Age-dependent speciation can explain the shape of empirical phylogenies. Syst. Biol. 64, 432–440 (2015).
Jagers, P. Branching processes with biological applications (Wiley, 1975).
Nerman, O. On the convergence of supercritical general (CMJ) branching processes. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 57, 365–395 (1981).
Colless, D. Review of phylogenetics: the theory and practice of phylogenetic systematics. Syst. Zool. 31, 100–104 (1982).
Rogers, J. S. Central moments and probability distributions of three measures of phylogenetic tree imbalance. Syst. Biol. 45, 99–110 (1996).
Sackin, M. “Good” and “bad” phenograms. Syst. Biol. 21, 225–226 (1972).
McKenzie, A. & Steel, M. Distributions of cherries for two models of trees. Math. Biosci. 164, 81–92 (2000).
Norström, M. M., Prosperi, M. C. F., Gray, R. R., Karlsson, A. C. & Salemi, M. PhyloTempo: A set of R scripts for assessing and visualizing temporal clustering in genealogies inferred from serially sampled viral sequences. Evol. Bioinform. Online 8, 261–269 (2012).
R Core Team R: A Language and Environment for Statistical Computing (2014) Available at: http://www.R-project.org/ (Accessed: 2nd December 2015).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Li, H. et al. The sequence alignment/map format and samtools. Bioinformatics 25, 2078–2079 (2009).
Drummond, A. J., Suchard, M. A., Xie, D. & Rambaut, A. Bayesian phylogenetics with BEAUti and the BEAST 1.7. Mol. Biol. Evol. 29, 1969–1973 (2012).
Drummond, A. J. & Rambaut, A. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol. Biol. 7, 214 (2007).
Ronquist, F. R. & Huelsenbeck, J. P. MRBAYES: Bayesian inference of phylogeny. Bioinformatics 19, 1572–1574 (2003).
Stadler, T. Sampling-through-time in birth-death trees. J. Theor. Biol. 267, 396–404 (2010).
Rabosky, D. L. LASER: a maximum likelihood toolkit for detecting temporal shifts in diversification rates from molecular phylogenies. Evol. Bioinform. Online 2, 273–276 (2006).
Pybus, O. G. & Harvey, P. H. Testing macro-evolutionary models using incomplete molecular phylogenies. Proc. Biol. Sci. 267, 2267–2272 (2000).
Rogers, J. S. Central moments and probability distributions of three measures of phylogenetic tree imbalance. Systematic biology 45, 99–110 (1996).
Shao, K.-T. & Sokal, R. R. Tree balance. Systematic Biology 39, 266–276 (1990).
Hall, M. & Rambaut, A. Epidemic reconstruction in a phylogenetics framework: transmission trees as partitions (2014). arXiv:1406.0428.
Romero-Severson, E., Skar, H., Bulla, I., Albert, J. & Leitner, T. Timing and order of transmission events is not directly reflected in a pathogen phylogeny. Mol. Biol. Evol. 31, 2472–2482 (2014).
Beaumont, M. A., Zhang, W. & Balding, D. J. Approximate Bayesian computation in population genetics. Genetics 162, 2025–2035 (2002).
Sunnåker, M. et al. Approximate Bayesian computation. Plos Comput. Biol. 9, e1002803 (2013).
Peters, G. W., Sisson, S. A. & Fan, Y. Likelihood-free Bayesian inference for α-stable models. Comput. Stat. Data Anal. 56, 3743–3756 (2012).

Acknowledgements

The authors would like to thank Gabriel Leventhal for comments regarding the difference between large and small pulse delays. This work was supported by the Engineering and Physical Sciences Research Council (EP/K026003/1).

Author information

Authors and Affiliations

Department of Mathematics, Imperial College London, London, UK
Giacomo Plazzotta & Caroline Colijn
Department of Electrical and Electronic Engineering, Imperial College London, London, UK
Christopher Kwan
Department of Mathematics, University of Cambridge, Cambridge, UK
Michael Boyd

Authors

Giacomo Plazzotta
Christopher Kwan
Michael Boyd
Caroline Colijn

Contributions

C.C. conceived and supervised the study at all stages; G.P. designed and performed the analysis, and wrote the manuscript; C.K. conducted a review of the literature and performed the analyses; M.B. wrote the R package.

Corresponding author

Correspondence to Giacomo Plazzotta.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Information (PDF 966 kb)

Rights and permissions

This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

About this article

Cite this article

Plazzotta, G., Kwan, C., Boyd, M. et al. Effects of memory on the shapes of simple outbreak trees. Sci Rep 6, 21159 (2016). https://doi.org/10.1038/srep21159

Received: 07 August 2015
Accepted: 07 January 2016
Published: 18 February 2016
DOI: https://doi.org/10.1038/srep21159
