A Bayesian generative neural network framework for epidemic inference problems

The reconstruction of missing information in epidemic spreading on contact networks can be essential in the prevention and containment strategies. The identification and warning of infectious but asymptomatic individuals (i.e., contact tracing), the well-known patient-zero problem, or the inference of the infectivity values in structured populations are examples of significant epidemic inference problems. As the number of possible epidemic cascades grows exponentially with the number of individuals involved and only an almost negligible subset of them is compatible with the observations (e.g., medical tests), epidemic inference in contact networks poses incredible computational challenges. We present a new generative neural networks framework that learns to generate the most probable infection cascades compatible with observations. The proposed method achieves better (in some cases, significantly better) or comparable results with existing methods in all problems considered both in synthetic and real contact networks. Given its generality, clear Bayesian and variational nature, the presented framework paves the way to solve fundamental inference epidemic problems with high precision in small and medium-sized real case scenarios such as the spread of infections in workplaces and hospitals.


Learning procedures and regularization
The learning procedure involves the minimization of the Kullback-Leibler divergence, eq.9 in the main text, between the posterior probability and the variational autoregressive neural networks. More precisely, the expression of the posterior can be factorized as a product over nodes (individuals) and times, p (x|O) gradient of the KL divergence with respect to the parameters of the trial distribution reads where we used that x q θ (x) ∇ θ log q θ (x) = x ∇ θ q θ (x) = 0 and log(Z) x ∇ θ q θ (x) = 0 due to the normalization x q θ (x) = 1. The presence of large negatives values in the derivatives (e.g. due to log terms) reduces the ability of gradient descent algorithms to explore all configurations compatible with the constraints. To overcome this issue, an annealing procedure is adopted, in which a fictitious inverse temperature β is introduced in the computation the gradient of the KL divergence, The minimization procedure starts with β = 0, where all the configurations are allowed with uniform probability, then the parameter β is slowly increased until it reaches β = 1, at which the original expression of the loss function is recovered.

Neural network architecture
In the proposed approach, each conditional probability function p i (x i |x <i ) has to be approximated with a neural network q θ i i (x i |x <i ). For monotonic models such as the SIR epidemic model, the state-space representation of temporal trajectories as a sequence x i ∈ X T +1 of T + 1 individual states is redundant, and it turns out to be more efficient to represent them by the time instant t I i ∈ (0, T +1) at which individual i becomes infected and time instant t R i ∈ (0, T +1) at which it recovers, the time T + 1 corresponding to the case of no infection or recovery occurred for i in the interval [0, T ]. More precisely, the case t I i = T + 1, t R i = T + 1 corresponds to individual i being susceptible at every time of the epidemic process, and t I i = T + 1, t R i = T + 1 corresponds to individual i being infected but not recovering in the time interval [0, T ].
This parametrization reduces the state space that is explored by our method, decreasing the number of incorrect time trajectories generated (for instances, time trajectory with transitions from state I to S).
In order to compute the probability of a trajectory of a single individual, we employ two neural networks, one for the time of infection, and one for the recovery time. Each of these networks is given as input a subset of the time trajectories of individuals with lower sorting index π i (see next section), and the one of the recovery time is also given as input the infection time of the same individual. The input of the networks are the infection and recover times of previous individuals, encoded in one hot scheme. Therefore, the (conditional) probability distributions represented by the networks are, for each individual i, q for the infection times, and q for the recovery times, where θ i,I and θ i,R are the weights of each network.
In our implementation, each neural network is a multi-layer perceptron (MLP) composed of three hidden layers plus one output layer; each layer is fully connected, and can be written as where L k is the input vector (output of layer k), W k ∈ θ is the matrix of the weights, b k ∈ θ is the bias vector, and σ is the activation function, which is Relu for the This last condition affects the size of the output of each network, but also the input size of the networks for the individuals with higher index, whose network depends on it: for example, if individual i is observed in state S at the final time (T ) with certainty, this implies that the infection and recovery times are fixed, t I i = T + 1 and t R i = T + 1, and the network will always give the same value for them. Thus, in this case, for all individuals j with π j > π i , the dependency on i will effectively disappear.

Graph approximations
We now discuss the approximation to the dependency of the conditional probabilities for the individuals' trajectories. The probability of a whole epidemic cascade can be written in the following autoregressive expression: The dependencies in the factors q θ i i (x i |x <i ) can be reduced, in the case of acyclic contact networks, to only those corresponding to the next-nearest neighbors with lower index (we relegate the derivation to section 3), but in the general case of contact networks with cycles it is just an approximation, which we employ for all contact networks analyzed in the present work.
In order to check this approximation, we have run several tests on the patient zero problem

Training procedure
To train the whole network, we do 10 000 steps of the annealing procedure described in the main text (except for the hospital case where we do 20 000 steps), increasing β linearly from 0 to 1. At each step, we generate 10 000 epidemic cascades before updating the weights using the ADAM optimizer [1] with learning rate lr = 0.001.
In the case of the risk inference problem, we also include a prior probability in the model, with strength decreasing linearly with β. This prior is computed from the probabilities of each individual being sampled as S, I or R at the last time instant t = T , and it is designed in such a way that, when β = 0, the probabilities for the three states are equal.
The running time of the ANN algorithm on a single instance ranges from one hour to two days on a single GPU (Nvidia TITAN RTX), depending on the structure of the observations O and that of the underlying contact network. Our implementation using the PyTorch framework [2] is publicly available in the repository [3]. The results presented in the main text and in SI can be reproduced using the following repository [4].
As both distributions are normalized wrt x I , they must be equal. This implies that where the last line follows from the derivation above.
Corollary 1 (Restricted autoregression). By calling I = {i}, J = {j < i : j ∈ ∂i} and K = {j < i : j / ∈ ∂i}, we obtain that for an ordering of nodes such that J separates i from K we get p(x i |{x j : j ∈ ∂i, j < i}) = p(x i |{x j : j < i}).
The last corollary is defined for a single node i. In considering the problem of approximating the posterior probability of an epidemic spreading process we distinguish two graphs: the first one, the contact graph, the nodes are the time trajectory of the states x i of each individuals and the edges are the contacts between them. The second, the factor graph, has as nodes, again, the time trajectory of individuals and as factors those ({Ψ i }) in the equation 6 in the main text. If the contact graph is acyclic and the nodes are topological ordered [5] then, thanks to the corollary 1, the following identity holds: where x ∂ 2 <i define the sets of nodes up to the next nearest neighbors (in the contacts graph) with index lower than i. The set of nodes x ∂ 2 <i in the contact graph correspond to the x ∂<i in the factor graph. To better visualize the two graphs, in Fig.3 we show two cases of contact acyclic graphs. The first one represents a linear chain defined by the contacts among individuals. The corresponding contact graph is acyclic but the factor graph contains cycles. In this case, for instance, considering the conditional probability of node 7 we have that P (x 7 |x <i ) = P (x 7 |x ∂ 2 <i ) = P (x 7 |x 4 , x 6 ). The node 4, 6 separated the nodes 7 from the previous nodes in the factor graph, as the corollary 1 require. Similarly, the same concept apply to the second case, where we consider a tree graph. In both cases the nodes are ordered topologically so the equation (7) holds for each conditional probability.   fig.4, right plot). In these cases, the inference problems analyzed become trivial or impossible to solve. On the other hand, our method has a computational cost that spans several hours on modern GPU to reach convergence for each instance analyzed, limiting the possibility of exploring the large dimensional space of the parameters of our inference problems. Nevertheless, we check how the performances of the methods analyzed vary, in the patient zero and the inference of parameters problems, with respect to the infectious parameters (λ or γ) to check the robustness of the results shown in the main text.
For the patient zero inference problem, we consider the work contact graph, and look at the performance of the different algorithms used, changing the infectivity γ. The figure 4 shows the area under the curves representing the fraction of times the patient zero is found, as in Figure 4 of the main text (right plots). The ANN method shows consistent performance with changing γ, giving the best score or on par with others algorithms.
Then we check the robustness of the inference of parameters in the case of of rrg interaction graph; the figure 5 shows that the inference of the infectivity parameter λ remains good as λ varies.

Soft margin estimator
We also use the Soft Margin estimator from [6] for finding the patient zero. This is a Monte Carlo based estimator, which applies Bayes formula to estimate the source probability P (s = i): given an epidemic x(i) which is the result of the simulation, starting from individual i, and the set of observation x O on the epidemic, the probability of node k being the source of the epidemic is  Figure 4: Accuracy in finding the patient zero while varying infectivity on the work contact graph. Here we look at the performance by the area under the curve (AUC) of the fraction of patient zeros found as a function of the fraction of considered individuals. We consider 20 different epidemic realizations for the γ values γ = 0.6 · 10 −3 , γ = 0.8 · 10 −3 , γ = 1.2 · 10 −3 , γ = 1.4 · 10 −3 , and 100 realizations for γ = 10 −3 , while the recovery rate is fixed µ = 0.02. On the right pane, the boxplot shows the number of infected individuals with the different values of γ, in a total population of 95 individuals.
relating how many individuals are in state X (either infected, X = I or recovered, X = R) in the generated cascade and the observations. Clearly, if no individuals are observed in state X, then φ X = 1 regardless of the realization. The probability of observing a certain configuration, given the source of the epidemic, can then be written as: where the coefficient a regulates the sharpness of the peak around the case of perfect matching of observations (φ I = 1 and φ R = 1), where the corresponding value in the sum is 1. For the patient zero problems, we select the alpha values that give better results (retrospectively) because they seem to be strongly dependent on the contact graphs and the number of samples considered.  Figure 5: Inference of infectiousness parameter λ in the rrg case (100 individuals interacting according a random regular graph with degree 10 for 15 days). The problem setting is equals to that explained in the subsection "Epidemic Parameters inference" in the main text. The plot shows the infectious parameter λ inferred by ANN with the perfect inference line λ est = λ highlighted. ANN correctly estimated them in error margin.