Network inference from population-level observation of epidemics

Using the continuous-time susceptible-infected-susceptible (SIS) model on networks, we investigate the problem of inferring the class of the underlying network when epidemic data is only available at population-level (i.e., the number of infected individuals at a finite set of discrete times of a single realisation of the epidemic), the only information likely to be available in real world settings. To tackle this, epidemics on networks are approximated by a Birth-and-Death process which keeps track of the number of infected nodes at population level. The rates of this surrogate model encode both the structure of the underlying network and disease dynamics. We use extensive simulations over Regular, Erdős–Rényi and Barabási–Albert networks to build network class-specific priors for these rates. We then use Bayesian model selection to recover the most likely underlying network class, based only on a single realisation of the epidemic. We show that the proposed methodology yields good results on both synthetic and real-world networks.


Introduction
Networks are a fundamental tool for modelling complex systems such as epidemics or neuronal activity in the brain.Indeed, the intricate interplay of many individual well-defined units can be captured by the links of a network, and this can be done with an unprecedented level of detail [31,18,1,36,20].For instance, directed, weighted or temporal links can all be considered within this modelling paradigm.The power of this approach is particularly clear in the study of spreading processes, with epidemics on networks having been extensively studied as a diffusion process between nodes mediated by links of the network.It is well established that the structure of the network has a profound impact on how diseases invade, spread and how to best control them.This is particularly well understood for degree heterogeneity and assortativity/disassortativity, and to a lesser extent, for clustering, the propensity of nodes that share a common neighbour to be connected [34,20].
However, depending on the field of application, the precision to which the underlying network is known can vary greatly, from absolute (when full description is available) to absent (when a description is entirely lacking).For example, whereas technological networks can be mapped out to a great degree of detail, social networks can be challenging to query [3].This has resulted in a significant amount of research aimed to develop methods for link prediction (for a survey, see [3]).Instead of assuming the availability of explicit information about nodes and edges, these methods rely on 'observables' from dynamical processes taking place on the network, offering a latent connection to the properties of the underlying network.In the framework of epidemics on networks this suggests that it is possible to get insights about the structure of the network by observing quantities of interest at node and perhaps population level.Indeed, the inverse problem of inferring networks from epidemic data has been the subject of great scrutiny.
In [13], the authors propose a link inference scheme based on a large number (i.e.proportional to the number of nodes) of independent spreading cascades using the so-called Independent Cascade Model, a simplified version of the SI (susceptible-infected) dynamics.The inference is based on extremely detailed data about the precise time of a node becoming infected and its infectious period, and it is mapped onto a likelihood optimisation problem.All entries of the adjacency matrix of the network are subject to inference but the per-node infection parameter is assumed to be known.However, the computational time needed to perform such inference does not scale favourably with network size and this has motivated the development of more efficient algorithms [29].In more realistic models for epidemics dynamics on networks, such as SIS (susceptible-infected-susceptible) and SIR (susceptible-infected-recovered), it has been proven that the likelihood optimisation problem is convex [27].In later works, inference of epidemic parameters is considered in addition to that of the adjacency matrix inference [9,12].
A different approach is to use Bayesian inference to identify the most likely parameters of a known network model based on epidemic data such as the final sizes of many realisations of the same epidemic or the times of infection and/or recovery of all the infected nodes [32,2].In this case, the focus is not on inferring each and every possible link but rather on inferring some parameters of the network model.For example, in [32,2,14], it is assumed that the underlying network is an Erdős-Rényi random network and therefore only the per-link probability needs to be inferred, in addition to the rate of infection and recovery.Within this framework, the inference is extended to Barabási-Albert networks [10].Furthermore, the authors of [42] use the basic reproductive ratio, the peak size of the epidemic, the total number of infected individuals and the epidemic duration to construct a prior and this is then used to infer the degree heterogeneity of a new network.Compared to maximum likelihood optimisation methods (e.g.independent cascade model), the Bayesian inference is usually based on a smaller number of observations of the epidemic [2,14,10].However, both network inference approaches (explicit link inference and inferring parameters of a known network model) lead to good estimates for the network and parameters of the epidemic dynamics.Moreover, there is an interesting tradeoff between them.The former is able to identify the adjacency matrix, but requires the observation of a large number of cascades, whereas the latter can only infer some structural parameters (such as the probability of a link between two nodes), but relies on fewer observations.One major potential limitation of both approaches is their reliance on the availability of detailed data at node level, such as the complete temporal knowledge of all cascade trees in [13] or the observation of all the removal/infection times in the Bayesian framework of [2].As far as we are aware (for a survey, see [7]), there is no research that specifically addresses the problem of network inference based on observations at population-level (such as the total number of infected nodes at true or fixed/regular times without the explicit knowledge of which nodes are the infected ones), the quantity which is the most likely one to be available in a real world scenario.Whilst inferring detailed network properties is extremely unlikely, it may be possible to at least infer the parameters of a known network model.However, another challenge remains.When it comes to real epidemics, even the real-time or continuous collection of population level temporal data might be hard to achieve (if not impossible), and the inference may have to rely on partial, discrete-time, imprecise observations of such quantities.Thus, it would be desirable to have an inference model that is able to provide some information about the structure of the underlying contact network based on sparse, discrete data only.
The need for detailed data at the microscopic level for inference purposes is due to the highdimensionality of the exact epidemic model on networks, which typically scales exponentially with the number of nodes.Here, we tackle this major challenge by approximating the exact epidemic model by a birth and death (BD) process.This allows us to break down the inference problem into a few intermediate steps as follows: (a) The likelihood for the rates of the approximate BD process, when sufficient statistics is available, has an analytical expression and gives the maximum likelihood estimator for all infection rates as a function of the number of infected nodes.We draw these curves for different network families.(b) In a crucial step of our approach, observing the properties and common features of these curves, we propose a three-parameter model.Through systematic simulations of epidemics on some well-known network families, we are able to show that different network families map onto non-overlapping regions in the 3-dimensional parameter space of our model.(c) This separation between regions is key, as it allows us to calibrate a Bayesian classifier that is able to identify the most likely network family by only using discrete observations of the epidemic at population level.Finally, the classifier is successfully validated on synthetic networks and stress-tested on real-world networks.
Our methodology is described in detail in Section 2, the numerical results are reported in Section 3 and we conclude with a discussion and further research directions in Section 4.

Model and methodology
The objective of this paper is to provide a methodology to infer, based on population-level incidence data, the network family most likely to have led to the observed epidemic.Three distinct families of random networks are considered, namely Regular, Erdős-Rényi and Barabási-Albert.To our knowledge, a direct attempt to identify which family generated population-level epidemic data is yet to be achieved.One of the main stumbling blocks is the high-dimensionality of the exact epidemic model on a network, i.e. of size 2 N for the susceptible-infected-susceptible (SIS) epidemic on a network with N nodes.To overcome this issue, an approximation will be introduced.This is based on a Birth-Death process with state space {0, . . ., N }.This alternative model will provide a simpler likelihood and will be calibrated (through its rates) to reproduce the behaviour of the infected nodes count of the exact SIS model on a network.Furthermore, a simple parametrisation of these rates will be provided.Specifically, the 3 random network families considered here will be characterised using this parametric form and a kernel density estimator.We will show that the combination of all these tools greatly simplify the inference task in the form of a Bayesian classifier.

SIS model on a network
A population of N individuals is considered with the contact structure between individuals described by an undirected network with adjacency matrix G = (g ij ) i,j=1,2,...,N where g ij = 1 if nodes i and j are connected and zero otherwise.Self-loops are excluded, so g ii = 0 and g ij = g ji for all i, j = 1, 2, . . .N .The standard SIS epidemic dynamics on a network is considered.The dynamics is driven by two processes: (a) infection and (b) recovery from infection.Infection can spread from an infected and infectious node (I) to any of its susceptible neighbours (S) and this is modelled as a Poisson point process with per-link infection rate τ .Infectious nodes recover from infection at constant rate γ, independently of the network, and become susceptible again.The resulting model is a continuous-time Markov Chain over a state space with 2 N elements.This consists of all arrangements of length N with each entry being either S or I independently.While this is easy to formalise and write down theoretically, the numerical integration of the system becomes intractable even for modest values of N [41,1,40,20].
As a result, the focus is often on capturing the epidemic via some expected quantity, such as the expected number of infected nodes.Unfortunately, the evolution equations at node level require information about the pairs and a hierarchy of dependencies emerges where more of the network structure needs to be taken into account.To curtail this trend closures are used whereby, for example, the expected number of triples is approximated by the expected number of singles and pairs, taking also into account some information about the structure of the network, for a review see [20].The most wide-spread mean-field models include the degree-heterogeneous [34], pairwise [17], NIMFA [43] and effective-degree [22] models.Most notably, the edge-based compartmental model [26] uses the probability generating function of the degree distribution of the network and leads to the most compact mean-field model for SIR epidemics on configuration networks [44].This model happens to be exact in the limit of network size going to infinity [8,16].Many of the models above are equivalent in some sense [25,20] and differ mainly by the quantity over which the averaging is done.
Here, inspired by the simplicity of the SIS epidemic model on a fully connected network, where the 2 N -dimensional model is reduced to a (N + 1)-dimensional one, in line with [28] we propose that the SIS epidemic on a network can be approximated by a Birth-and-Death process resulting in a master equation of (N + 1) equations as described below.

Birth-and-Death approximation of SIS epidemics
To motivate our methodology, let us consider first a fully connected network, where one can easily characterise the evolution of the number of infected nodes, and even their distribution.Indeed, in this case, the model takes the form of a Markov chain with state space {0, . . ., N } and transitions of size 1, that is a Birth-and-Death process.More precisely, given that the number of infected nodes is k ∈ {0, . . ., N } in the network, a new infection arises at rate a k = τ k(N − k), while the total rate of recovery is c k = γk.Hence, the probabilities of observing k infected nodes at time t are given by the following forward Kolmogorov equation: together with a −1 = c N +1 = 0 and an initial condition k 0 ∈ {0, . . ., N }.We will denote the solutions of Eq. ( 1), p θ k 0 ,k when the dependence on additional parameters needs to be enforced.In other words, the structure of this particular network is encoded in the infection rates a k and, as a result, the analysis of the epidemic requires a significantly smaller number of equations, going from 2 N to N + 1.Unfortunately, such a simple description may not be available for all networks.Indeed, assuming a single rate of infection when there are k infected nodes in a network is too simplistic, different numbers of S-I links are possible according to the nodes precise position in the network.
Nevertheless, Birth-and-Death processes can provide a powerful alternative, at least as an approximation, as long as one is able to find network-and epidemic-dynamics-specific infection rates (the a k rates).In [28], this idea was explored and a k /τ was found directly from rigorous stochastic network simulations by simply working out the expected number of S-I links when exactly k nodes on the network are infected.The authors obtained an excellent agreement between the average number of infected nodes given by the Birth-and-Death approximation with inferred rates on the one hand, and the exact average number of infected nodes based on simulated epidemics on the other.In this work, rather than relying on S-I link counts, we recast the problem of finding the best a k 's as the well-known problem of rate inference for a general Birth-Death process.In other words, the a k 's are estimated from simulations of the SIS epidemic on a network.It is known [37,45] that in order to infer the rates of a Birth-Death process, a sufficient statistic is given by the set [45] (called continuous data): where u k is the number of observed transitions from state k → k + 1, d k is the number of observed transitions from state k → k − 1 and τ k is the time duration of the state k, when observed over a time window, say t = 0 to time t = T .Given such statistic, the log-likelihood for the a k rates is obtained analytically [45]: and the MLE estimators are âk = u k τ k .Typical (k, âk ) curves are shown in Fig. 1.Similarly to results in [28], we notice that, although (k, âk ) curves are distinct for different networks, they all share some common features, i.e. they all pass through the points (k, âk ) = (0, 0) and (k, âk ) = (N, 0) and exhibit a single maximum in (0, N ).Perhaps the most important feature that changes between the three distinct network families is the flatness and skewness of the parabola.In particular, higher heterogeneity in the degree distribution (i.e.Regular→Erdős-Rényi → Barabási-Albert displaying no→medium→high heterogenity, respectively) leads to a left skew.This encourages a further simplification, that is, providing a parametric shape of the a k curves.The departure from the fundamental assumption of homogeneous random mixing in epidemiological and ecological models has led to a myriad of models where bi-linear transmission terms proportional to ∼ I × S or ∼ I × (N − I) have been replaced by non-linear infection terms such as I p S q [23,15,38].In particular it is noted that, in the context of classical compartmental and mean-field models, such terms can be inferred from the number of S-I links taken from simulation and that such terms can lead to more exotic model behaviours.In the same spirit, we put forward the following model for the rates: where the three parameters C, a and p offer flexibility to adapt to various networks and epidemics of different severity.This choice is well-grounded in the literature and is well motivated by the heuristic thinking of how the epidemic unfolds on the network.The parameter C > 0 gives a our motivation to choose a least-square approach rather than a maximum likelihood estimator (MLE) lies in the desire to reproduce the whole curve as precisely as possible.Indeed, the MLE estimator tends to over-penalise states with high values of t k (which are likely to be around the steady-state) and at least empirically, it provides a poor overall fit.However, if the number of different states where the process is observed is not sufficient, the model is unidentifiable.Indeed, different values of the parameters (C, a, p) can lead to a k 's that are locally similar.In this case, the data are obtained by simulation, which makes it possible to start the epidemics with different values of initial infected individuals (I 0 = 5 and I 0 = 1000), and thus avoids this problem.Figure 1 showcases the flexibility of the model in fitting (k, âk ) curves coming from different network families.
Through a systematic simulation of various networks with different disease parameters, one obtains a set of values for (C, a, p) for each network family.This is used to build a Bayesian classifier, based on a small number of discrete observations.

Bayesian classification of network families from population-level data
The final step in our methodology is to build a Bayesian classifier, using our Birth-and-Death process approximation with parsimonious rates, which provides the most likely network family from population-level epidemic data.To be more precise, these data D are counts of infected nodes at n different times during a single epidemic: which is considered to be a realisation of the random vector (I(t 0 ), . . ., I(t n−1 )) with (t 0 , . . ., t n−1 ) ∈ [0, T ] n .The likelihood of a set of parameters (C, a, p) given the data D and the infection rate γ is obtained using Eq. ( 1).More precisely, it is defined as follows where a k = a C,a,p k and c k = γk.These probabilities have no closed form.The efficient numerical evaluation of general Birth-and-Death likelihoods under discrete (in time) observations has been subject of intense work [6,5,4].Here, we used the method from [6], based on a numerical inversion of an explicit Laplace transform representation for transition probabilities.Indeed, compared to standard matrix exponentiation method or finite differences, this approach enjoys both increased stability and efficiency, making the computations robust to state space dimension.
Based on the previous estimation of (C, a, p) parameters from simulated continuous data, we build three different prior distributions over the (C, a, p) space, using a distinct kernel density estimator for each network family.These prior densities are denoted π Reg , π E−R , π B−A for Regular, Erdős-Rényi and Barabási-Albert networks, respectively.Now, assuming a non-informative uniform prior on the potential network families, one can derive the posterior probability for each of them by simple application of the Bayes formula: Again, these integrals are not analytically tractable and are estimated using a Monte-Carlo approach, based on the Corrected Arithmetic Mean Estimator from [33], obtained from the following identity: where A is a given subset of the parameter space.Specifically, we choose the set A to be a paraboloid centred at the MAP location, shaped with the log-likelihood Hessian matrix and scaled such that all posterior samples obtained from a Metropolis-Hastings MCMC run are included (meaning that we assume P F i (A|D)).P F i (A i ) is then obtained by importance sampling.

Results
First, we heuristically show that it is possible to approximate the SIS epidemic on a network with a carefully tuned Birth-and-Death process.Then we characterise three network families, showing that the (C, a, p) model distinguishes between them.Finally, we assess the quality of our Bayesian classifier by a cross-validation approach, measuring sensitivity and specificity as well as stresstesting it on real-world and simulated networks.The validation of BD process and of the (C, a, p) model is carried out numerically, on all three different network families and for a large number of parameter combinations, where both network and disease parameters vary.Simulations are based on the Gillespie [11] algorithm.With this method, we can keep track of the precise timing of events, including the time spent in different states, whether a new event is an infection or recovery, and the precise population-level count of infected nodes at all times.Everything has been implemented using the Python language and the codes are available online1 .

Numerical validation of the Birth-and-Death approximation
The first assumption we made is that the BD process is able to approximate the exact epidemic model on a network.To validate this claim, at least numerically, we solve the master equation ( 1) with rates c k = γk and a k = âk , where âk are the MLE of the rates via Eq.( 2).The expected number of infected nodes k kp k (t) from the numerical solution of the master equation is then compared to the average number of infected nodes based on a large number of explicit stochastic epidemic simulations on networks.Fig. 2 shows remarkable agreement between the two and this agreement holds for each and every network and parameter values we tested.This gives us confidence that using the BD process approximation is a viable way to proceed.We now turn our attention to validating the assumption that the rates of infection in the BD process can be parametrised by the (C, a, p)−model.These tests will also be crucial in showing that the learned parameters from the parsimonious model allow us to distinguish between different network families.For each network family, we vary the average degree (5 < k < 20) and the infection and recovery rates ((τ, γ) ∈ (0, 10] × (0, 10]) .The parameter space is sampled uniformly via Latin hypercube sampling [24].However, not all triples are reasonable, as there may be combinations such that the epidemic does not spread.Indeed, the behaviour of the epidemic is determined by the parameters of both the network and the epidemic dynamics.The former includes quantities such as the average degree and higher-order moments, the latter includes the transmission and recovery rates.All of this is well captured by the reproduction number [20], R 0 , which is the number of secondary infections caused by a typical infectious individual introduced into a fully susceptible population.If R 0 ≤ 1 the infection will die out, however, if R 0 > 1, then an epidemic is expected.R 0 is given by: Since R 0 depends directly on the triples that we sampled, we accept only triples such that 1 < R 0 ≤ 10.In the end we worked with 120 valid triples, ( k , τ, γ), per network family.For each triple, multiple realisations of the epidemic were run across different realisations of the network.All such output generates the {u k , d k , τ k } data which is used to find the MLE of the rates of the BD process, âk s.Note that the d k 's are not needed at any point in this method, as we always assume c k = γk, and they do not enter in the BD approximation nor in the (C, a, p) model.Finally, the model based on (C, a, p) is fitted to the (k, âk ) curves using a least-squares procedure, i.e. minimising the function given in Eq. ( 4).
The first task is to confirm that the (C, a, p) model is able to capture the shape of the (k, âk ) curves for all network families and epidemic parameters.We would like to explore the entire (k, âk ) curve, that is from k = 0 to k = N .Hence, the initial condition, k 0 can take values in (0, N ), so we get u k values for all k.Practically, we chose two different initial conditions, namely k 0 = 5 and k 0 = N .The former allows us to explore the curve up to the steady state, while the latter, although artificial in real-life situations, allows us to explore the curve from the steady state to N .Then C, a and p are inferred directly from {u k , τ k }.To find the triple that minimises the functional (4), we used a particle swarm algorithm [19], with bounds C ∈ [0, 1] , a ∈ − 1 N , 1 N , p ∈ 1 2 , 2 .Curves based on { Ĉ, â, p} are compared to the (k, âk ) curves both in Fig. 1 and in the insets of Fig. 2. Systematic numerical investigations (not all plots shown) show that the proposed parsimonious three-parameter model fits the (k, âk ) curves well for all networks, particularly the Regular and Erdős-Rényi ones.However, in the case of Barabási-Albert networks there are a few parameter combinations where the agreement between the master equation with the rates given by the (C, a, p) model is poorer.This is despite the seemingly small discrepancy between (k, âk ) and the (C, a, p) curves.However, the master equation with the âk -rates still leads to excellent agreement with simulations.

Statistical characterisation of the three different random network families
Upon accepting the above model and before presenting the results of the Bayesian analysis we aim to describe as fully as possible how the network structure and epidemic dynamics map onto the (C, a, p) parameter space.Remarkably, different network families lead to distinct regions in this space, as shown by Fig. 3.The three distinct clouds of points correspond to the three network families, suggesting that the proposed model is able to distinguish different network families and can be used to classify networks.
The intuitive reason for this separation is that epidemics on such different networks spread with distinct enough characteristics.In scale-free networks for example, the most exposed nodes are the hubs, so they get infected early on.This skews the (k, âk ) curve to the left, because once infected these hubs generate a disproportionately large number of S − I links.On the contrary, when all nodes have similar degrees, the (k, âk ) curves are more symmetric.Concerning Erdős-Rényi and Regular networks, the most important difference is that the former allow for some degree heterogeneity, whereas the latter do not.Degree heterogeneity plays an important role when it comes to disease transmission so it is no surprise that epidemics on Erdős-Rényi networks are slightly stronger in their initial stage than epidemics on Regular networks [20].
The impact of the network and epidemics on C, a and p is also worth investigating, although not directly relevant to the bayesian inference scheme.We report a brief selection of the most important features that emerge from our analysis of simulation results.Of the three parameters of the model, the most important one in terms of classification is a, as it encodes all the information about the heterogeneity of the underlying network, as evidenced by Fig. 3, with values from below zero for regular structures, through a = 0 for Erdős-Rényi to positive values for Barabási-Albert networks.Concerning C, Fig. 4(a) -(b) reveals similar behaviour across the three network families, with the major difference being in the magnitude of its value for homogeneous and heterogenous networks.With respect to τ , we observe a quasi-linear dependence.Where the random mixing assumption holds, i.e. a k ∼ τ k N k(N − k), it is clear that increasing τ from a small value will lead to more infected nodes and higher values of a k , although at some point saturation effects occur.The dependence on the average degree (not reported) is even more subtle, as the average degree also has an important impact on p.
The dependence of p on k , see Fig. 4-(c), has a natural interpretation.In the homogenous cases, as k increases, we observe that p → 1.This can be easily explained by noting that both the Regular and Erdős-Rényi networks tend towards the Complete network (p = 1), as their average degree increases.In contrast, the Barabási-Albert network cannot become a Complete network by construction, which drastically changes the behaviour of p.Note that we always observe smaller values of p for the Barabási-Albert networks, meaning that degree heterogeneity leads to flatter (k, âk ) curves, around the maximum.This is partially explained by the natural constraints on those curves.If the curve is left skewed and and has a high peak (manifested by large values of C) then the requirement that it reach the (N, 0) point leads to a more pronounced flatness or faster decay.The impact of τ and γ (not reported) on p suggests that p is not correlated with epidemic parameters, meaning that it is almost uniquely influenced by the structure of the network.
To summarise our analysis, we note that the network family plays the most important role in determining the position of the corresponding (C, a, p) values in the cloud, with the most pronounced impact on the sign of a. Secondly, τ has an important role in determining the magnitude of C, while k , together with the network family, has a major impact on p. Finally, γ plays a marginal role on the values of (C, a, p).As previously mentioned, the fact that different network families map onto different regions in the (C, a, p) space is going to be fundamental in inferring networks based on population-level epidemic data taken at distinct times.

Classification of simulated random networks from population level data
First, we assess the quality of our classifier, based on a simulated test dataset.In this work, we considered 360 scenarios, each defined by an average degree < k >, an infection rate τ , a recovery rate γ and a network family.Our previously mentioned simulations enabled to assign each of them to a single point in the (C, a, p) space.Next, these points were split into training and testing subsets at random but ensuring 20 testing scenarios for each network family.The (C, a, p) values associated with the training scenarios (100 per network family) were used to configure three Gaussian kernel density estimators [35] to be used as prior distributions.The bandwidth of these estimators was obtained by 10-fold cross-validation.This exercise was first performed for two different settings: (a) discrete population-level data from the BD process with rates given by the (C,a,p) fitted curved, and (b) discrete population-level data from stochastic network-based simulations on networks.Finally, we considered data based on epidemics simulated on synthetic network models (other than the three we studied) as well as real-world networks.

Discrete data based on the Birth-and-Death process
In this first test, we used discrete observations directly taken from Birth-and-Death simulations where rates are given by the fitted (C, a, p) curves.The purpose here was to validate our Bayesian classifier, in a (simpler) context where the likelihood is well-specified, that is, when no approximation is being done with respect to the data-generating process.In each of our 60 test (C, a, p) values, a single trajectory starting from k = 5 was generated and recorded at n = 21 equally-spaced times, up to the quasi-steady state.These data were then used for classification, yielding the confusion matrix in Table 1.
True \ Predicted (C, a, p) type Erdős-Rényi Regular Barabási-Albert Erdős-Rényi 80.0% 20.0% 0.0% Regular 5.0% 95.0% 0.0% Barabási-Albert 0.0% 0.0% 100.0%These results show that both Barabási-Albert and Regular families are clearly distinguished by our classifier.However, even though Erdős-Rényi networks have a sensitivity of 80% on our test set, 20% of them are classified as Regular, suggesting a potential non-identifiability problem.This confusion matrix is not symmetric because Regular networks are classified as Erdős-Rényi in only 5% of the tests.This may not come as a surprise since Fig. 3 already indicated some similarity between these two network families in the (C, a, p), i.e., the clouds corresponding to these two network families share similar features.
A secondary, but important, finding resulting from this test is that 21 observations equally spread in the transient phase appear sufficient to recognise the different network families.We believe that the ability to distinguish between different network families is not due to the frequency of the observations but rather to the fact that these observations provide a good coverage of the (k, âk ) curves or rates.Indeed, numerical tests (not reported here) showed that when observations were limited to the early stages of an epidemic (i.e., where I(t) ≤ 50), then the classifier struggled to recover the network type.
Summarising, our classifier gives excellent results when there is no model error, that is, when the likelihood and data generation are based on the same model and involve no approximations.

Discrete data based on simulations of the SIS model on networks
We now turn to the second test, when the data were taken from actual simulations of epidemics on networks.For each scenario in our testing set, we simulated 10 network realisations and ran a single SIS epidemic on each of them, see Fig. 5 starting from 5 initial infectious nodes (taken uniformly at random).Similarly to our previous tests, the observations were taken regularly spaced in time up to the epidemic steady-state (red dots in Fig. 5).Depending on both scenario and network type, the behaviour of the epidemic curves was found to vary significantly.
For a given testing scenario, our classifier returns the network family with the highest posterior probability, numerically computed using the Corrected Arithmetic Mean estimator.To quantify uncertainty on this classification, this process was repeated for all 10 realisations.Averaging the posterior probabilities over all realisations, we found that the dominant posterior probability was always the correct one, as shown in Fig. 6.Table 2 provides the confusion matrix, with estimated standard deviations.True \ Predicted Erdős-Rényi Regular Barabási-Albert Erdős-Rényi 77.5% (11.0) 22.5% (11.0) 0.0% (0.0) Regular 14.5% (8.5) 85.5% (8.5) 0.0% (0.0) Barabási-Albert 10.0% (3.9) 0.0% (0.0) 90.0% (3.9) Comparing with the first test, the effect of approximating the SIS model with the Birth-and-Death process does not significantly change the sensitivity.However, we observe an increased level of confusion between Regular and Erdős-Rényi networks, and a small degree of misclassification with the Barabási-Albert network family.These two phenomena are of a very different nature.The first source of errors results from posterior probabilities competing around a mean level across the 10 realisations, suggesting a potential issue of non-identifiability, see Fig. 7a.The second type of potential error, instead, stems from a mismatch between the SIS model and the Birth-and-Death approximation.This is characterised by posterior probabilities that are either 1 or 0 between both families, as seen in Fig. 7b.Nevertheless, the overall level of sensitivity and specificity is excellent across all network families, as shown by Fig. 8.

Discrete data based on simulations of the SIS model on other networks
We test the proposed classifier on three synthetic networks of size N = 1000, generated using the configuration model [30].Specifically, the degree distribution of the three networks is based on the negative binomial distribution with different parameters (p, n): where p is the probability of success and n is the number of failures.This is motivated by both the simplicity and flexibility of this distribution.We choose to keep the average degree fixed (i.e.k = 6) while tuning the value of the variance, see Fig. 9.To avoid the possibility of having more than one connected components, the degree distributions is shifted to the right so that the minimum degree is greater than or equal to 3.
Figure 9: Degree distributions, ordered by variance, of three single synthetic networks with degree distributions drawn from the negative binomial given in equation (10).The average degree k = 6 for all networks.
From left to right, the variance is σ = 8 (neg bin 0), σ = 40 (neg bin 1), σ = 120 (neg bin 2).The values of (p, n) are reported in the legends.Note that n is not necessarily an integer.
The reason for our choice of synthetic networks is twofold: (a) to stress-test the classifier by using networks whose degree distribution does not come from the models used to build the priors, and (b) to study the extent to which the classifier can distinguish between different levels of heterogeneity in the degree distribution of the underlying networks, while ensuring the absence of higher order structures, such as communities or high clustering coefficient.The degree distributions are chosen to exhibit different levels of heterogeneity, from small values up to values comparable to those seen in Barabási-Albert networks.For each single network realisation we ran 10 separate epidemics with parameters γ = 1 and τ = 0.5, starting from 5 infected nodes.As in Section 3.3.1, the inference was based on 21 equally-spaced time-points where the number of infected nodes was recorded.The classification results are shown in Fig. 10, and confirm that the classifier is able to distinguish between networks with high/low levels of degree heterogeneity (see Fig 10(a) -(c)).In particular, by looking at Fig. 9 it is reasonable to expect that the networks shown in Fig. 9(a) and (c) are going to be classified as Erdős-Rényi and Barabási-Albert networks, respectively.Indeed, Fig 10(a) shows that the classifier identifies the network in Fig. 9(a) as Erdős-Rényi 80% of the time, whereas the network in Fig. 9(c) is classified correctly as Barabási-Albert for every single epidemic realisation of the epidemic.When the degree distribution of the test network is such that its variance falls between typical variances observed in Erdős-Rényi and Barabási-Albert networks, see Fig. 9(b), the output of the classifier is more sensitive to the individual realisation of the epidemic.However, even in this case, the classifier correctly identifies the network as being of the type that is closest, in terms of distribution, to that of the test network.Moreover, heuristically at least, the Barabási-Albert network is the outright winner which seems reasonable upon inspecting the degree distribution of the test network.

Discrete data based on simulations of the SIS model on real networks
Finally, we stress-test the classifier on real world networks, which may exhibit higher order structure beyond degree heterogeneity We chose three real networks.The first is labeled euroroads, and it is part of the KONECT collection [21].The second and third, bio-grid-mouse and fb-messages, are part of the network data repository Networkrepository [39].Euroroads is an infrastructure network, bio-grid-mouse is a protein-protein network while fb-messages is based on the interactions of an online community of students at University of California.In Fig. 11 the degree distributions of these networks are shown.To keep the number of nodes equal to N = 1000, we only consider the largest connected component, and then, where necessary, remove peripheral, low-degree nodes such that the resulting network is still connected.In line with subsection 3.4, we fix γ = 1, and run 10 epidemics on each network.Values for the infection parameter are τ = {1.5, 2.5, 0.4} for euroroads, bio-grid-mouse and fb-messages, respectively.Results from our classifier are reported in Fig. 12 and are in line with our expectations based on the inspection of the respective degree distributions: the infrastructure network is regular, while the other two are scale-free, and hence are correctly classified as Barabási-Albert.
Taken altogether, the tests above show that the classifier is both robust and reliable, and can distinguish between networks that have sufficiently different degree distributions.

Discussion
In this paper, we proposed a new method that uses population-level incidence data at discrete times to infer the most likely family of the network over which the epidemic has initially spread.This is a challenging task because the exact epidemic model on a given network is forbiddingly high-dimensional meaning that even a numerical solution is out of reach.The key to carry out the inference is the approximation of the exact epidemic model by a birth-and-death (BD) process, whose rates not only encode the structure of the networks but also allow us to distinguish between the different network families through a parsimonious three-parameter model.
Our analysis has focused on three well-known random network families: Regular, Erdős-Rényi and Barabási-Albert.For each network family, the rate of infections in the corresponding BD approximation was obtained by a maximum likelihood estimation from continuous data.Despite these rates being network family-dependent, they all share some common features.This in turn allowed us to propose a parsimonious three-parameter model (C, a, p) that works across all network families and, at the same time, can capture the differences in the rates of the approximating BD process.In addition to being robust to different values of τ , γ and average degree, these parameters exhibit a clear distinction between the three different network families when plotted in the 3-dimensional (C, a, p) space.This knowledge is then encoded into prior distributions, constructed using kernel density estimators over the (C, a, p) space.Our classifier then consists in the numerical estimation of the relative marginal probabilities in a Bayesian framework.Our results show that the classifier has excellent specificity and sensitivity, despite the simplicity of the model.
These encouraging results lead to a multitude of questions and remarks.First of all, our choice of random network families means that the main feature of the networks is their degree heterogeneity.We have not yet considered in detail more complex networks, for example networks exhibiting clustering or community structure.This would certainly lead to (k, âk ) curves of different shapes, having potentially other features such as multiple peaks for networks with multiple communities, and thus requiring either a more sophisticated or non-parametric model.Nevertheless, considering epidemics in terms of an approximate BD process appears to be a powerful approach if a tractable likelihood is desired.Moreover, once the most likely network family has been identified, on could carry one with the estimation of τ , γ and average degree.
We have used a fixed number of nodes (N = 1000) in all our numerical experiments.We do not expect major changes when the number of nodes changes.Preliminary numerical tests suggest that there is a good degree of universality, such that the (k, âk ) curves only differ by a scaling factor when the number of nodes changes while all other parameters being fixed.In this respect, our methodology could easily be adapted by directly considering the scaled epidemic (on [0, 1]) and repeating our tests for different values of N .Fortunately, our numerical method [6] scales well with N , since the transition probabilities in the likelihood are computed individually (with deeper continued fractions).The question of the limiting behaviour in the limit of large N can also be further investigated.
So far, we have used discrete data taken on a regular time grid covering the epidemic from its early stage (a few infectious nodes) up to its quasi-steady state.Increasing the frequency of data or restricting data to the very beginning of the epidemic are of significant practical interest.In the former case, one expects the discrete likelihood to converge to the simpler continuous one, enabling faster and easier analysis.In the latter case, it would lead to a model that does not require describing the whole epidemic as we currently do.Focusing on the initial stages of the epidemic, the most critical part in many cases, and upon solving a potential identifiability problem, such an approach could have important real-world impact, making it possible to predict and control more accurately yet-to-be epidemics.
To conclude, we introduced a powerful and versatile way to approximate exact epidemic models on networks which allows for a more straightforward and efficient inference of the most likely network family based solely on population-level discrete epidemic data.This framework is rather general and can be extended to different epidemic models, different network types and can serve as a new approach when it comes to network inference.The model itself could be improved by using more sophisticated models for the infection rates and by learning a large number of different networks or network families, leading to a wide portfolio of data which then can be used for estimation.Of course, there is a trade-off in terms of what we can infer about networks using population-level discrete data.We cannot infer individual links for example but this is to be expected since the data we use for inference is not at the link-or node-level.Nevertheless, our approach addresses an important real-world need whereby we can now infer relevant properties about the underlying network when only population-level data, i.e., the most likely form of data available in the real world, are available.This could lead to a better understanding of the underlying contact structure and may serve as input to develop and/or target control measures.

Figure 2 :
Figure2: Comparison between the time-evolution of the average number of infected nodes from simulations (blue dots) and the deterministic numerical solutions (continuous red lines) of system (1) with rates âk given by the maximum likelihood estimation (2), with initial condition p k (0) = δ k k0 , where k 0 = 5 is the initial number of infected nodes used in the simulations.Three network families are reported, each with N = 1000 nodes, from left to right, ordered by increasing heterogeneity, from Regular (a) and Erdős-Rényi (b) to Barabási-Albert (c) networks.In each figure, from bottom to top, the parameters used to generate epidemics are ( k, τ ) = {(6, 1), (8, 1.2), (10, 1.5)} and γ = 2.In the insets of each figure, we show the (k, âk ) curves inferred from epidemics via MLE (2) and the proposed function (3) with parameters found by minimising the functional (4).

Figure 3 :
Figure3: (C, a, p) values resulting from the continuous observation of simulated SIS epidemics on networks with different values of (γ, τ, k , Family).A total of 120 triples (γ, τ, k ) per network family were explored, and each triple is the result of the concatenation of a total of 2500 epidemics on 50 realisations of the network.Red points come from Regular networks (Reg) , blue points from Erdős-Rényi networks (E-R) and green points from Barabási-Albert networks (B-A).In order to fully explore the curve (a, âk ) curves, half of the epidemics where started from I(t = 0) = 5 and half from I(t = 0) = N ; all networks have N = 1000 nodes.

Figure 5 :Figure 6 :
Figure 5: Gillespie simulations of SIS epidemics on 10 different network realisations of each random family.The red dots indicate the discrete measurements in each repetition.

Figure 8 :
Figure 8: Global and per-network classifier sensitivity and specificity over the 10 realisations.Each network family has 10 points.

Figure 10 :
Figure 10: Classification results for 10 epidemics on each of three different synthetic networks, presented in increasing order of heterogeneity.

Figure 12 :
Figure12: Classification results for 10 epidemics on each of three real-world networks, presented in increasing order of heterogeneity: euroroads, bio-grid-mouse and fb-messages.

Table 1 :
Confusion matrix computed out of 60 test points using discrete data simulated from Birth-and-Death processes.

Table 2 :
Confusion matrix based on average probabilities over 10 realisations with discrete data based on simulations of the SIS model on networks.Standard deviations are provided in brackets.