Abstract
Investigating into the past history of an epidemic outbreak is a paramount problem in epidemiology. Based on observations about the state of individuals, on the knowledge of the network of contacts and on a mathematical model for the epidemic process, the problem consists in describing some features of the posterior distribution of unobserved past events, such as the source, potential transmissions, and undetected positive cases. Several methods have been proposed for the study of these inference problems on discretetime, synchronous epidemic models on networks, including naive Bayes, centrality measures, accelerated MonteCarlo approaches and Belief Propagation. However, most traced real networks consist of shorttime contacts on continuous time. A possibility that has been adopted is to discretize time line into identical intervals, a method that becomes more and more precise as the length of the intervals vanishes. Unfortunately, the computational time of the inference methods increase with the number of intervals, turning a sufficiently precise inference procedure often impractical. We show here an extension of the Belief Propagation method that is able to deal with a model of continuoustime events, without resorting to time discretization. We also investigate the effect of time discretization on the quality of the inference.
Introduction
Identifying past features of an epidemic outbreak remains a challenging problem even for simple stochastic epidemic models, such as the susceptibleinfected (SI) model and the susceptibleinfectedrecovered (SIR) model. In recent years, this problem has received considerable attention, especially on discrete time models^{1,2,3,4,5}. For these models, we recently proposed an approximate Bayesian method based on Belief Propagation (BP)^{6,7}, that gave the first exact tractable solution to a family of discrete time inference problems on acyclic graphs and an excellent approximation on general graphs, including real ones. The problem addressed ranged from the inference of the epidemic source (the patient zero or index case), inference of the infection times and the epidemic parameters, all from the knowledge of the network plus a (partial, noisy) snapshot of the infection state of the system at a given instant.
In the last years, several precise spatiotemporal information about contacts between individuals in a community have been collected, representing close proximity^{8,9}, social or sexual interactions^{10,11} and more. Each dataset consists of a timestamped list of pairs of individuals. Seeking to explore characteristics of potential outbreaks, many authors studied the disease propagation over those communities employing compartment infection models such as SI and SIR. Technically, a simple way to achieve this is by computing a weighted discrete time network. This can be done by subdividing the time line into subintervals of length Δ (timesteps), aggregating all contacts falling in a given interval [tΔ,(t + 1)Δ] into a timestep dependent weight equal to , where λ is the probability of transmission in a single quasiinstantaneous contact and k_{t} the number of contacts the interval^{8,6}. Once this discrete time network has been constructed, the spread of infectious diseases on the community can be described through a discrete time SIR model, in which the transition probabilities between states defining each of these models depend on the timestep t. However, these coarsening methods naturally lead a loss of timing information and precision, becoming exact only in the limit of small Δ and a large number of intervals. Unfortunately, the computational time of both simulations and various inference algorithms typically increase with the number of time steps, making a sufficiently precise analysis unpractical, if not impossible in most cases. In the following, we will describe a very simple semicontinuous time stochastic model of infection dynamics that does not require coarsening or binning and is naturally equivalent to the Δ → 0 limit. For simplicity, we will concentrate on the SIR model, but all methods here can be naturally generalized to other variants such as SEIR^{8}. We will then develop a semicontinuous time inference framework which is able to deal with contact network datasets without any discretization approximation, may it be implicit or explicit. The method will be shown to perform very well on two datasets of real contact networks, being able to reconstruct the epidemic source, the infection times and the infection causality tree with a great degree of accuracy in a wide range of parameters. In the concluding section, its performance will be compared to a discretized version of the model, showing that the inference performance under discretization approximations, although converging to the nondiscretized one, does so in a range of discretization precision that renders it extremely unpractical.
Methods
A static model to describe dynamics
Let us consider an evolving contact network G composed of N nodes. Each node i is equipped with a time dependent state variable x_{i}(t) ∈ {S, I, R} so that at time t it can be in one of three possible states: susceptible (S), infected (I), and recovered/removed (R). We will define our dynamical model as follows: let time be continuous, and contagion events be instantaneous with probability λ. Each pair of individuals (i, j) will be in contact in a discrete set of instants T_{ij}(0) < T_{ij}(1) < … < T_{ij}(n_{ij}) (given by the real network traced dataset), where we assume for simplicity T_{ij}(r) = ∞ for r > n_{ij}. As time advances, contagion to node j will happen at a time t_{j} with probability λ if i is infected, j is susceptible and there exists a contact between the two nodes with some index r_{i} such that T_{ij}(r_{i}) = t_{j}. Let us define t_{i} = min{t: x_{i}(t) = I} the time at which node i gets infected (infection time) and g_{i} = min{g: x_{i}(t_{i} + g) = R} the time passed before his recovery. Note that, as infections can only occur during contacts, each t_{i} will be necessarily equal to the time of a contact. Let us define the index of the first possible such contacts. Recovery of individuals will happen at a time t_{i} + g_{i}, where g_{i} follows a given recovery probability distribution with density G(g_{i}). This parametrization is reversible, i.e. given t_{i} and g_{i}, it is easy to compute the state of x_{i}(t) at any time t:
Since a node i has a finite probability to transmit the disease to a neighbor j in each of its contacts, one can compute the probability that the contagion will occur during the contact at time , assuming that node j is still susceptible, simply as . Note that after time t_{i} + g_{i}, contagion will not take place because node i will be recovered. Given {g_{k}} and {r_{ki}}, infection time t_{i} of a nonsource node i must satisfy deterministically the condition t_{i} = F_{i}({t_{k}}, {g_{k}}, {r_{ki}}) where
where ∂i denotes the set of neighbors of node i (i.e. nodes that share at least a contact with i) and it is conventionally assumed that the min is equal to +∞ if the set is empty. Eq. (2) must hold because each neighbor k will be in the infected state in the time interval , and will transmit at time ; the transmission that arrives first will be the one that succeeds. Suppose now an epidemic spreading is initiated by a spreader node i_{0}, which was infected at some time −ε < 0 before the first contact, which we conventionally fix at time t = 0. Our aim is to infer the initial spreader from just a single (possibly incomplete) observation of the state x(T) of the network at a later time T. The posterior distribution over the initial state of the networks can be easily written by means of Bayes theorem. In order to identify the initial spreader we should, in principle, maximize over the following posterior marginal probability:
we will give a very small prior probability γ to each initial seed, ensuring that configurations with more than one seed are overwhelmingly improbable (note that having at least one seed is necessary to explain any evidence of infection). The prior distribution S(t_{i}) for the node i then reads:
where denotes the Kroenecker delta.
It is easy to see that (3) does not depend on ε. Indeed, for any ε′ > ε, implying that the two posteriors only differ by a constant factor, that has no relevance in (3).
The posterior distribution of infection times can now be written as where
The patient zero problem can be recast as the one of computing single site marginals from the posterior distribution in equation (3). The problem of computing marginals over large dimensional probability distributions is in general intractable (NPhard). In analogy with a previously introduced approximation method^{6,7}, we will tackle this problem by means of Belief Propagation, a method which is exact on acyclic graphs, and that was shown to perform very well on random and real contact networks in the discrete time scenario.
For each node in the network, the BP algorithm provides an estimate of the posterior probability that the node got infected at a certain time, and thus also the probability that the node was the origin of the epidemics.
Graphical model formulation
In order to apply BP, we will first formulate an alternative expression with no continuous variables, as the numerical representation of their distributions is problematic. Note that variables t_{i} are already discrete, as they live in the finite subset of the real line formed by all incoming contact times . Let us consider the ordered sequence of values T_{i}(0) < … < T_{i}(n_{i}) in this subset, and define T_{i}(n_{i} + 1) = ∞. To cope with continuous variables g_{i}, we will define from g_{i} a discrete variable , by exploiting the fact that (here is the indicator function of the condition in its argument), and that X and F are constant for g_{i} inside an interval . We can recast (3) as
where . In order to simplify the structure of the probability distribution factorization, we will introduce the variables where
The introduction of s_{ji} variables allows to simplify the structure of equation (6):
where
Belief Propagation equations
To briefly describe the essence of the BP method, let us consider a probability distribution over the variables that has the following factorized form:
where is the subset of variables that F_{a} depends on. BP equations are a set of selfconsistent relations linking the socalled cavity messages (or beliefs), a set of singlesite probability distributions which are associated to each directed link in the graphical model defined by the joint distribution in equation (10). The general form of BP equations is the following:
where F_{a} is a factor, z_{i} is a variable, ∂a is the subset of indices of variables in factor F_{a} and ∂i is the subset of factors that depend on z_{i}. The terms Z_{ia}, Z_{ai} and Z_{i} are local partition function, serving as normalizations. To solve equations (11) and (12) an iterative procedure is typically used, where the cavity messages are initialized with homogeneous distributions and they are asynchronously updated until convergence to a fixed point^{12,13}. While the computation of equation (12) is straightforward, the summation in (11) often involves a number of steps growing exponentially with the size of ∂a. In a number of interesting contexts, though, it is possible to devise efficient methods for computing this sum, and so reducing the computational complexity of the BP updates.
The inference problem of equation (8) is interpreted as a partial marginalization of a factorized distribution as in equation (10). In this settings, there are only two types of BP message, namely and , and the corresponding updates are derived straightforwardly from equation (11). The nodetofactor BP messages, namely namely , can be computed very easily by virtue of equation (12).
Calling , the BP equations for ψ_{i} are
where (15) follows because
Similarly,
Note that values can be precomputed in a matrix as each t_{i} must be equal for some in {0, …, n_{i} + 1}. The calculation of 16, 20 can be performed more efficiently by observing that all terms
can be computed in time . Then equations (14)–(20) can be computed as
on a loop over the possible values of .
The pseudocode in Algorithm 1 illustrates the implementation of our BP algorithm in detail. Please note that at each iteration we store factortovariable messages and marginals for each variable, the variabletofactor messages being easily extracted at each iteration in view of the simple relation:
The time complexity of the computation of all messages exiting node i is . Assuming a bounded number of contacts per individual n_{i}, this scaling results in an update with a number of operations per BP iteration which is linear in the number of edges (i.e. the number of pairs of individuals that are in contact) in the full system. On a BP fixed point, equation (13) is used to compute the marginal probability m_{i}(t_{i} = −ε), which brings an estimation of the posterior probability for the node to be the active before the first contact. Note that, in the present case, the marginal m_{i}(t_{i}) is equal to the message . In the event that BP doesn’t converge, reliable information can be extracted equivalently from the average value over iterations of the marginals m_{i}(t_{i}), that we call .
Results
Patient zero detection in two large real contact time networks
We tested our methods on two large evolving networks: a database of timestamped sexual interactions and a network of facetoface contacts in a high school.
For each dataset, we simulated a large number of epidemic propagations, each one initiated from a unique random source (the patient zero, or seed). The nodes in the network are then ranked in decreasing order of their estimated posterior probability of being the origin of the observed epidemics: the position of the true origin in the ranking provided by the algorithm is a good measure of the efficacy of the method. In what follows, i_{0} stands for the index of the true origin of the epidemics, its rank being indicated by r_{0} = rank(i_{0}). For simplicity, we will always consider homogeneous epidemic parameters λ_{ij} = λ and G(g_{i}) an exponential distribution with rate constant μ.
The first dataset comes from a database of sexual encounters between clients and escorts on a Brazilian website, covering the beginning of the community, September 2002 through October 2008, and composed of a total of E = 50185 contacts between between 6642 escorts and 10106 sexbuyers. This kind of data are particularly relevant in the study of spreading of Sexually Transmitted Infections (STI), and have been previously used to model the diffusion of HIV by means of simple SISIR compartmental models^{11}. We build a bipartite evolving network, focusing on the last two available years of operation of the website (E = 29628 contacts, slightly over half of the dataset) in order to skip the initial period where reporting of encounters are very sparse and incomplete. For each value of λ, we simulate M = 1000 single source epidemic propagations, with a recovery rate equal to μ = 0.5/year. Concerning algorithmic efficiency, the efficient implementation of the BP updates presented in the previous section allows us to perform inference in largescale contact network with a remarkably small computational cost. As an example, a parallel C++ implementation with 5 concurrent processes took roughly 11 minutes to converge to a solution with rank(i_{0}) = 1 (perfect identification of the source) on a single instance with an epidemic size of N_{IR} = 1810 individuals, obtained with λ = 0.4 and μ = 0.5/year.
In Fig. 1 we show the average absolute rank r_{0} of the true first infected individual i_{0} as a function of the epidemic size N_{IR} = I + R (i.e. the number of infected and recovered sites), whose values are discretized in intervals of width equal to 0.1. The low values of the rank show the effectiveness of the method.
The second dataset consists of a collection of Close Proximity Interactions (CPIs) obtained by means of wireless sensor network technology (TelosB motes)^{8}. Data were collected in a US high school and provide an almost complete account of facetoface interactions during a whole day at school. All in all 798 individuals were monitored, corresponding to the 94% of the total school community, and 2148991 unique Close Proximity Records (CPR) were acquired. A single CPR corresponds to a close proximity detection event between two motes (max. 3 meters). The authors of the study perform an aggregation of the raw data in interactions, defined as continuous sequences of CPRs between the same two nodes. Our choice was to go back to the raw data and investigate the spreading process at the level of single CPRs, using the intensity signal as a proxy for the closeness of a facetoface contact (a detailed account is present in Salathe et al  Supplementary Information^{8}). We constructed a set of evolving networks by setting a threshold θ_{int} for the signal intensity of the motes, thus resulting in denser networks for smaller θ_{int}, where more weaker (and distant) contacts are taken into account. A second interesting possibility could be to allow for a probability of contagion that depends on the proximity (i.e. the strength of the signal). We did not persue this route as the dependence of probability cf contagion on the distance is hard to determine (to our knowledge, no study provides this information for known diseases), and moreover we don’t have information on the correspondence between signal strength and distance. Besides, a thresholdlike dependence on the distance could be adequate for contagion of many infectious diseases, such as nonairborne ones. In any case, the analysis technique presented here can be adapted directly to the case of contactdependent probability would the needed modelling information mentioned above be available in the future.
Three representative examples are show in Fig. 2, which displays the average rank of the true first infected individual i_{0} for different values of λ and threshold θ_{int}. Two values of λ = 0.3, 0.4 are explored for θ_{int} = 250. Then we attempted with the much denser graph resulting of considering θ_{int} = 245. Here the infection probability λ has been chosen to maintain the same average number of infections as the case θ_{int} = 250, λ = 0.3.
Reconstruction of causality
Consider the problem of inferring for each nonsusceptible individual i at time T, the individual k from which he contracted the infection. The probability p_{ktoi} of such transmission event corresponds to
Once p_{k→i} has been computed for every (ordered) pair (ki), a prediction will be formed by the subset of pairs with probability larger than a given threshold. A receiver operating charasteristic (ROC) curve can be computed by considering the performance for all possible thresholds. The ROC curve for an instance of an outbreak on the sexual contacts dataset is shown in Fig. 3, along with the inferred pairs in one single point of the curve. It is evident that a large fraction of the entire history of the propagation can be reconstructed with a high degree of reliability, despite the apparently limited amount of information available in a single observation of the nodes’ states.
For a comparison between the true infection times of each node and the ones that can be inferred by our method, we show in the left panel of Fig. 4 a scatter plot of versus for a single epidemic cascade in the network of sexual contacts, with epidemic parameters λ = 0.4 and μ = 0.5/year (in the resulting epidemics the number of infected individuals is I = 991, total epidemic size being N_{IR} = 1070). The infection times have been simply obtained by averaging over the marginal posterior distribution of singlenode infection times. In addition, we show in the right panel of Fig. 4 the Average Time Error (ATE) between t^{true} and , which we define as , for a set of 200 different samples of simulated epidemic outbreaks in the same network and with the same epidemic parameters as in the previous example. The reconstruction of the dynamical history is remarkably good over a wide range of epidemic sizes.
Partial and noisy observations
It is not difficult to extend the present model to account for observations affected by some kind of uncertainty. One simply introduces the Observational Transition Matrix (OTM) o_{i }(y_{i}x_{i}), containing the transition probabilities from the true state x_{i} to the observed state: in order to perform inference, one has to some over all the possible true unobserved states with a weight given by the corresponding entry in the OTM, a task which is easily accomplished in the BP inference scheme. The identity matrix o_{i}(y_{i}x_{i}) = δ[y_{i}; x_{i}] corresponds to the a case in which no noise enters the observations. Please note that, in this generalized scheme, we can simply take into account partial observability with a totally flat OTM for unobserved nodes.
Firstly, we consider the case of partial observations, i.e. the case in which only a subset of nodes are accessible for observation at time T: this is the standard realistic scenario in practical applications, when a complete monitoring of a full network is infeasible in the general case. We simulate a number of epidemic spreading in the contact network and model the partial observability by a fixed probability p_{ob} of observing a node at time T. In what follows, we will use the sexual encounters dataset, mostly because of its epidemiological relevance.
Results for decreasing values of p_{ob} in the network of sexual contacts are shown in the left panel of Fig. 5 (the complete observation case p_{ob} = 1 is shown in the dashed line for reference). The BP method happens to be highly resilient even with a high amount of hidden information at time of observation T.
Turning to the problem of uncertain observations, we use a very simple symmetric model for observational noise. Let us consider the following OTM:
This matrix describes a kind of symmetric noise, where a node that is in the state x has a probability 1 − ν of being correctly observed in its state, and a probability ν of being observed incorrectly in one of the other two states. Suppose, for example, node i is S (susceptible) at observation time T: for a given noise probability ν, there is an equal probability for the node i to be observed in the R (recovered) or I (infected) state  the same holds for the others two cases, equivalently. In Fig. 5, right panel, we show that our BP algorithm is highly robust even to a significant amount of noise, up to 30%.
Discretization and binning
In order to ascertain the eventual loss of inference precision due to timediscretization, we performed the following experiment on the sexual intercourse dataset^{10}. We generated M = 100 random epidemics with the semicontinuous time model with instantaneous probability of transmission λ.
Separately, for each value of T = 10, 15, 20, 25 we produced a discrete time temporal network from by subdividing the time interval [t_{0}, t_{1}] into T equal subintervals (timesteps) of length Δ = (t_{1} − t_{0})/T, aggregating all contacts falling in a given interval [tΔ; (t + 1)Δ] into a timestep dependent weight , where k_{t} is the number of contacts in the interval^{8,6}.
We then performed the inference analysis on the M previously generated epidemics both in the discretized network with the the discretetime BP algorithm^{6} and using the semicontinuous time inference BP algorithm presented here. We investigated the difference in performance for different combinations of the epidemic parameters, noting that the semicontinuous time method strikingly outperforms the discretization procedure in all cases. As an example, we show in Fig. 6 a comparison of the continuous time method versus the discretized version for λ = 0.4 and μ = 0.5/year with an increasing number of time bins.
Discussion
In this work, we developed an inference framework to analyze the dynamics of infection in temporal contact networks with continuous or very finegrained time resolution. We showed by means of simulations on real contact networks how the approach is able to reconstruct with great degree of accuracy both the source of the epidemics, the infection times and the underlying epidemic causal history from the mere observation of the state of the system (noisy or incomplete) at a single instant in a wide range of parameters.
Moreover, we were able to quantify the loss of information due to timediscretization, demonstrating a remarkable improvement with continuous time inference when compared with time discretized data even for a relatively large number of time subintervals.
It would be interesting to apply this technique on other relatively closed communities where the interactions can be monitored but the infections themselves are hidden, such as in hospital wards^{14} and for applications to computer virus forensics^{4}. The ability to reconstruct the epidemic history and causality of transmissions could prove to be helpful to devise better containment strategies. In those cases, generalizations of the method to epidemic models related to SIR such as SEIR and other distributions of recovery time different from exponential could be necessary.
Additional Information
How to cite this article: Braunstein, A. and Ingrosso, A. Inference of causality in epidemics on temporal contact networks. Sci. Rep. 6, 27538; doi: 10.1038/srep27538 (2016).
References
AntulovFantulin, N., Lancic, A., Stefancic, H., Sikic, M. & Smuc, T. Statistical Inference Framework for Source Detection of Contagion Processes on Arbitrary Network Structures. In 2014 IEEE Eighth International Conference on SelfAdaptive and SelfOrganizing Systems Workshops (SASOW), 78–83 (2014).
Lokhov, A. Y., Mézard, M., Ohta, H. & Zdeborová, L. Inferring the origin of an epidemic with a dynamic messagepassing algorithm. Phys. Rev. E 90, 012801 (2014).
Pinto, P. C., Thiran, P. & Vetterli, M. Locating the source of diffusion in largescale networks. Physical Review Letters 109, 068702 (2012).
Shah, D. & Zaman, T. Detecting sources of computer viruses in networks: theory and experiment. ACM SIGMETRICS Performance Evaluation Review 38, 203–214 (2010).
Shah, D. & Zaman, T. Rumors in a network: Who’s the culprit? Information Theory, IEEE Transactions on 57, 5163–5181 (2011).
Altarelli, F., Braunstein, A., Dall’Asta, L., LageCastellanos, A. & Zecchina, R. Bayesian inference of epidemics on networks via belief propagation. Physical Review Letters 112, 118701 (2014).
Altarelli, F., Braunstein, A., Dall’Asta, L., Ingrosso, A. & Zecchina, R. The patientzero problem with noisy observations. Journal of Statistical Mechanics: Theory and Experiment 2014, P10016 (2014).
Salathé, M. et al. A highresolution human contact network for infectious disease transmission. Proceedings of the National Academy of Sciences 107, 22020–22025 (2010).
Isella, L. et al. What’s in a crowd? analysis of facetoface behavioral networks. Journal of Theoretical Biology 271, 166–180 (2011).
Rocha, L. E. C., Liljeros, F. & Holme, P. Information dynamics shape the sexual networks of internetmediated prostitution. Proceedings of the National Academy of Sciences 107, 5706–5711 (2010).
Rocha, L. E. C., Liljeros, F. & Holme, P. Simulated epidemics in an empirical spatiotemporal network of 50,185 sexual contacts. PLoS Comput Biol 7, e1001109 (2011).
Yedidia, J. S., Freeman, W. T. & Weiss, Y. Bethe free energy, kikuchi approximations, and belief propagation algorithms. Advances in neural information processing systems 13 (2001).
Mézard, M. & Montanari, A. Information, Physics, and Computation (Oxford University Press, 2009).
Vanhems, P. et al. Estimating Potential Infection Transmission Routes in Hospital Wards Using Wearable Proximity Sensors. PLoS ONE 8, e73970 (2013).
Acknowledgements
AB acknowledges support by Fondazione CRT under the initiative “La Ricerca dei Talenti”.
Author information
Authors and Affiliations
Contributions
A.B. and A.I. conceived the experiments, conducted the experiments and analyzed the results. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Rights and permissions
This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
About this article
Cite this article
Braunstein, A., Ingrosso, A. Inference of causality in epidemics on temporal contact networks. Sci Rep 6, 27538 (2016). https://doi.org/10.1038/srep27538
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/srep27538
This article is cited by

Fast and accurate detection of spread source in large complex networks
Scientific Reports (2018)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.