Introduction

The study of natural, social and technological phenomena in complex systems invariably requires approximations that coarse-grain and simplify, so that insights can be obtained about the causal mechanisms at work. A case in point, and our focus, is the study of dynamical processes on complex networks1, such as models of epidemics2,3, opinion dynamics4,5,6, the diffusion of innovations7,8,9,10, the evolution of languages11,12,13 and cultural polarisation14,15. The standard approach to analyse dynamics on networks is via mean-field approximations, which range in accuracy and complexity2,16,17,18,19. While such methods have provided important insights, the assumptions that underpin mean-field approximations—the absence of clustering (‘a friend of a friend is my friend’), modularity (community structure) and dynamical correlations (‘I’m similar to my neighbours’)—are routinely violated by dynamical processes on real-world complex networks and it is generally difficult to quantify how well a particular approximation will do a priori, given the network or dynamical process20. Mean-field approximation has also resulted in controversy concerning the critical epidemic threshold in scale-free networks17,21,22,23. Because of these issues, the quantification of approximation error has been recognised as one of the key challenges for network epidemic modellers24.

In this article we address these critical issues by presenting a foundation for mean-field approximations of dynamics on networks, which builds from the micro-scale description of the dynamics and facilitates the quantification of approximation error. We use approximate lumping to derive low-dimensional mean-field equations for a broad class of Markov chain dynamics on networks which includes models of epidemics and opinion dynamics. The coarse-grained states are based on the number of each type of ‘vertex-state’, such as the number of susceptible and infected vertices in the susceptible–infected–susceptible (SIS) model of epidemics. In contrast to standard mean-field approximations, the transition rates between these coarse-grained states are derived directly from the exact evolution of the probability distribution over states—known as the master equation or forward Kolmogorov equation—and are shown to minimise approximation error, in the sense that they are closest to an exact lumping. This provides a theoretical underpinning that simplifies and standardises the process of deriving mean-field approximations for practitioners: the microscopic formulation of a model can be easily translated into a mean-field approximation using the formulae we have obtained. Furthermore, this approach enables us to derive a bound on the approximate lumping error and compare this to errors computed from stochastic simulation of epidemic dynamics on several benchmark real-world networks.

Results and discussion

We consider Markov chain dynamics on finite, connected networks with undirected, unweighted edges and no self-loops, where each vertex in the network can be in one of a finite number of “vertex-states”. For example, in models of epidemics the vertex-states correspond to individuals’ disease status, which could be susceptible to infection, infected, recovered, etc. In models of voting behaviour, the vertex-states correspond to the party that each person plans to vote for. If M is the number of vertex-states and N is the number of vertices, then there are \(M^N\) possible states, i.e. configurations of vertex-states on the network. Thus the size of the full state-space for Markov chain dynamics on networks is extremely large, even for moderate N, and consequently, unless the network contains significant symmetry25,26, approximation is essential. Despite this, the state-space is finite, so we denote the probability distribution at time t over state-space by \(X(t)={({X}_{1}(t),{X}_{2}(t),\ldots ,{X}_{{M}^{N}}(t))}^{{\mathrm{T}}}\), where Xk(t) is the probability of being in the kth state. Variables related to the full state-space will be upper-case Latin letters and the indices k and l indicate that the index is over the full state-space. In continuous time t, the evolution of X(t) is described by the forward Kolmogorov or master equation27,

$$\dot{X}={{{{{{{{\bf{Q}}}}}}}}}^{{{{{{{{\rm{T}}}}}}}}}X,$$

where Q is the infinitesimal generator, an \(M^N\times M^N\) matrix in which each off-diagonal component Qkl gives the transition rate from state S[k] to state S[l], and the diagonal components ensure that rows sum to zero. Bold variables indicate matrices. Our approach can also be adapted to discrete-time models.

In the “Methods” section we describe how the components of the infinitesimal generator relate to the microscopic dynamics, i.e. the transition rates of individual vertices between vertex-states. We assume that the positive entries of the infinitesimal generator are affine (i.e. constant plus linear) functions of the number of neighbouring vertices in each vertex-state. For example, in epidemic models, a susceptible vertex typically becomes infected at a rate proportional to the number of infected neighbours. We also focus on ‘homogeneous’ models where the micro-scale transition rates are identical for all vertices with the same number of neighbours in each vertex-state. These features define a class of network dynamics that we call ‘homogeneous single-vertex transition models’ (homogeneous SVTs) with ‘affine vertex-state transition matrices’ (affine VSTMs). Specifically, if a model has an affine VSTM then a vertex in vertex-state \({{{{{{{\mathcal{A}}}}}}}}\), with nm neighbours in the mth vertex-state, transitions to vertex-state \({{{{{{{\mathcal{B}}}}}}}}\) with rate

$${f}_{{{{{{{{\mathcal{A}}}}}}}},{{{{{{{\mathcal{B}}}}}}}}}({n}_{1},{n}_{2},\ldots ,{n}_{M})={\delta }_{0}^{{{{{{{{\mathcal{A}}}}}}}},{{{{{{{\mathcal{B}}}}}}}}}+\mathop{\sum }\limits_{m=1}^{M}{\delta }_{m}^{{{{{{{{\mathcal{A}}}}}}}},{{{{{{{\mathcal{B}}}}}}}}}{n}_{m},$$
(1)

where the \({\delta }_{m}^{{{{{{{{\mathcal{A}}}}}}}},{{{{{{{\mathcal{B}}}}}}}}}\) are arbitrary non-negative constants. This covers a broad range of dynamical processes on networks28, but in Supplementary Note 5 we also consider generalisations to heterogeneous and nonlinear network dynamics with quadratic VSTMs.
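
To make Eq. (1) concrete, the short sketch below (ours, not from the paper) encodes an affine VSTM as a table of δ constants keyed by ordered pairs of vertex-states; the SIS-style values (an infection rate β and a recovery rate γ) are assumed for illustration.

```python
# Minimal sketch of an affine vertex-state transition matrix (VSTM), Eq. (1).
# The delta constants and the SIS-style parameter values are assumptions.

def affine_rate(delta0, delta, n):
    """Rate f_{A,B}(n_1, ..., n_M) = delta_0 + sum_m delta_m * n_m."""
    return delta0 + sum(d * nm for d, nm in zip(delta, n))

beta, gamma = 0.5, 1.0
# Vertex-states ordered as (S, I); entries are (delta_0, (delta_S, delta_I)).
vstm = {
    ("S", "I"): (0.0, (0.0, beta)),    # infection: proportional to infected neighbours
    ("I", "S"): (gamma, (0.0, 0.0)),   # recovery: constant rate
}

d0, d = vstm[("S", "I")]
print(affine_rate(d0, d, (3, 2)))      # susceptible vertex with 2 infected neighbours: 2*beta
```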

We coarse-grain such Markov chain network dynamics using a method called approximate lumping, in which states are grouped together (lumped) according to a pre-defined partition of state-space29. We consider approximate lumping partitions based on sets of states that have the same total number of vertices in each vertex-state, i.e. the number of susceptible and infected vertices in the SIS model. We refer to this type of approximate lumping as a population model approximation30. To make this precise, let \(s\in {{\mathbb{Z}}}_{\ge 0}^{M}\) be a lumped state, which is a vector of length M whose mth component, sm, denotes the number of vertices in the mth vertex-state. Lumped variables will be lower-case Latin letters and m will index vertex-states. It follows that there are \(r=\left(\begin{array}{c}{N+M-1}\\ {N}\end{array}\right)\) possible lumped states, since a lumped state is a combination of N vertex-states drawn from M possibilities with repetition. Thus we number the lumped states in the lumped state-space s[1], s[2], …, s[r] and we use Π = {Π1, Π2, …, Πr} to denote the corresponding lumping partition. Let \(x(t)={({x}_{1}(t),\ldots ,{x}_{r}(t))}^{{\mathrm {T}}}\) denote the time-dependent Markov chain probability distribution over Π, where xi(t) is the probability of being in the lumped state s[i]. We use indices i and j to indicate that the index is over the lumped state-space. The evolution of x(t) is then the solution to

$$\dot{x}={{{{{{{{\bf{q}}}}}}}}}^{{\mathrm {T}}}x,$$
(2)

where q is the approximate lumping generator, which needs to be determined.

The idea here is to use the coarse-grained generator q = DQC, where \({\bf{C}}\in {\{0,1\}}^{{M}^{N}\times r}\) is the collector matrix29, whose kjth component is one if S[k] ∈ Πj and zero otherwise, and \({\bf{D}}\in {{\mathbb{R}}}^{r\times {M}^{N}}\) is the distributor matrix, whose ilth component is 1/|Πi| if S[l] ∈ Πi and zero otherwise. The effect of using q = DQC is to average the sum of rates out of states in one lumping partition cell and into another. This approach has the following advantages. Firstly, it minimises approximation error, in the sense that it is closest to an exact lumping where QC = Cq (details in the Methods section), which is made precise in the following theorem.

Theorem 2.1

The lumped infinitesimal generator q = DQC minimises \(\parallel {\bf{QC}}-{\bf{Cq}}{\parallel }_{{\rm{F}}}\) (the Frobenius norm).

Secondly, the matrix q can be explicitly derived for affine network dynamics, leading to the following theorem.

Theorem 2.2

Let Ω be the state-space of a homogeneous SVT with affine VSTM on a network with mean degree z, and let q = DQC be the lumped infinitesimal generator corresponding to the population model approximation Π = {Π1, Π2, …, Πr} with lumped states s[1], s[2], …, s[r]. If s[i] and s[j] correspond to a single vertex changing from vertex-state \({{{{{{{\mathcal{A}}}}}}}}\) to \({{{{{{{\mathcal{B}}}}}}}}\) and \({s}_{1}^{[i]}\) is the number of vertices in vertex-state \({{{{{{{\mathcal{A}}}}}}}}\), then

$${{{{{{{{\bf{q}}}}}}}}}_{ij}={\delta }_{0}^{{{{{{{{\mathcal{A}}}}}}}},{{{{{{{\mathcal{B}}}}}}}}}{s}_{1}^{[i]}+\frac{z}{N-1}{s}_{1}^{[i]}\left[{\delta }_{1}^{{{{{{{{\mathcal{A}}}}}}}},{{{{{{{\mathcal{B}}}}}}}}}\left({s}_{1}^{[i]}-1\right)+\mathop{\sum }\limits_{m=2}^{M}{\delta }_{m}^{{{{{{{{\mathcal{A}}}}}}}},{{{{{{{\mathcal{B}}}}}}}}}{s}_{m}^{[i]}\right].$$
(3)

These are the main theoretical results of the paper. Outlines of the proofs are given in the “Methods” section and further details are provided in the Supplementary Methods.
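
To illustrate both theorems numerically, the following sketch (ours) builds the full SIS generator Q on a small assumed graph (a four-vertex path, not the ‘coat hanger’ network of Fig. 1), forms the collector and distributor matrices C and D, and checks that q = DQC reproduces the closed-form rates of Eq. (3).

```python
import itertools

import numpy as np

# SIS dynamics on an assumed small graph; vertex-state 0 = susceptible, 1 = infected.
beta, gamma = 4.0, 1.0
edges = [(0, 1), (1, 2), (2, 3)]           # a 4-vertex path (illustrative only)
N = 4
neighbours = {v: set() for v in range(N)}
for u, w in edges:
    neighbours[u].add(w)
    neighbours[w].add(u)
z = 2 * len(edges) / N                     # mean degree

# Full state-space Omega = {0,1}^N and infinitesimal generator Q.
states = list(itertools.product((0, 1), repeat=N))
index = {s: k for k, s in enumerate(states)}
Q = np.zeros((len(states), len(states)))
for k, s in enumerate(states):
    for v in range(N):
        n_inf = sum(s[u] for u in neighbours[v])
        rate = beta * n_inf if s[v] == 0 else gamma    # infection or recovery of vertex v
        if rate > 0:
            t = list(s)
            t[v] = 1 - s[v]
            Q[k, index[tuple(t)]] = rate
    Q[k, k] = -Q[k].sum()

# Population-model lumping: level i = number of infected vertices.
C = np.zeros((len(states), N + 1))
for k, s in enumerate(states):
    C[k, sum(s)] = 1.0
D = (C / C.sum(axis=0)).T                  # rows of D are uniform over each cell

q = D @ Q @ C                              # lumped generator of Theorem 2.1

# Off-diagonal rates should match the closed form of Theorem 2.2 / Eq. (3).
for i in range(N):
    print(q[i, i + 1], beta * z / (N - 1) * i * (N - i))   # infection rates
for i in range(1, N + 1):
    print(q[i, i - 1], gamma * i)                          # recovery rates
```

The printed pairs agree to floating-point precision, in line with Theorem 2.2 for homogeneous SVTs with affine VSTMs.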

For concreteness, we illustrate the approximate lumping approach in Fig. 1 using the SIS model of epidemic dynamics, which has M = 2 and is an example of “binary-state dynamics”31. The vertex-states of the SIS model are referred to as susceptible (\({\mathcal{S}}\)) and infected (\({\mathcal{I}}\)). A susceptible vertex with n1 infected neighbours becomes infected with rate βn1 and an infected vertex recovers with rate γ, where β, γ > 0 are model parameters. In relation to our notation for affine VSTMs introduced in Eq. (1), we have \({\delta }_{1}^{{\mathcal{S}},{\mathcal{I}}}=\beta\), \({\delta }_{0}^{{\mathcal{I}},{\mathcal{S}}}=\gamma\) and all other \({\delta }_{m}^{{\mathcal{A}},{\mathcal{B}}}\) are zero. Our approach partitions state-space into “levels”, so that the ith level, Πi, contains all states that have i infected vertices, and this reduces the size of state-space from \(2^N\) to N + 1. For SIS dynamics, we obtain a mean-field birth–death process with infection rates given by

$${{{{{{{{\bf{q}}}}}}}}}_{i,i+1}=\beta \frac{z}{N-1}i(N-i),$$

and recovery rates

$${{{{{{{{\bf{q}}}}}}}}}_{i,i-1}=\gamma i.$$

These rates will be unsurprising to those familiar with mean-field approximations of network dynamics, but note that here we have derived these directly from the full Markov chain description rather than via moment closures based on non-rigorous probabilistic arguments, as is typical2. For the SIS model and other binary-state dynamics, this approach gives rise to a birth–death process; for network dynamics with M > 2, it yields a Markov population model30.
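
As a sketch of how these rates are used in practice, the snippet below (ours, with assumed N, z, β, γ and initial condition) assembles the lumped tridiagonal generator and integrates the lumped master equation (2).

```python
import numpy as np
from scipy.integrate import solve_ivp

# Lumped SIS birth-death generator; N, z, beta, gamma and the initial condition
# are assumed values for illustration.
N, z = 100, 8.0
beta, gamma = 0.3, 1.0

q = np.zeros((N + 1, N + 1))
for i in range(N + 1):
    if i < N:
        q[i, i + 1] = beta * z / (N - 1) * i * (N - i)   # lumped infection rate
    if i > 0:
        q[i, i - 1] = gamma * i                          # lumped recovery rate
    q[i, i] = -q[i].sum()

x0 = np.zeros(N + 1)
x0[5] = 1.0                                              # start in the lumped state with 5 infected

sol = solve_ivp(lambda t, x: q.T @ x, (0.0, 20.0), x0, rtol=1e-8, atol=1e-10)
mean_infected = np.arange(N + 1) @ sol.y                 # expected number infected over time
print(mean_infected[-1])
```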

Fig. 1: Illustration of approximate lumping for a small four-vertex ‘coat hanger’ network with SIS (susceptible–infected–susceptible) dynamics.
figure 1

a Illustrates the matrix multiplication DQC = q that lumps the infinitesimal generator Q of the full Markov chain using the collector and distributor matrices, C and D, respectively, to produce the tridiagonal approximate lumping infinitesimal generator q. Colour indicates the value of the corresponding matrix entry for the infection rate β = 4 and recovery rate γ = 1; zero entries are white. The horizontal and vertical lines indicate the different groupings of states by level; level 0 is on the left/top and level 4 is on the right/bottom. b Illustrates transitions from a state with two infected vertices that are accounted for by the full Markov chain. Blue vertices indicate susceptible and red vertices indicate infected. The transition rates are given next to the corresponding arrows. The vertical dots indicate that there are more states with two infected vertices. c Illustrates the corresponding transition rates for the approximate lumping from level two, i.e. two infected vertices. In general the lumped recovery rate is γi and the lumped infection rate is βzi(N−i)/(N−1), where i is the level (number of infected individuals); for the case illustrated N = 4, z = 2 and i = 2. d Illustrates the average number of infected vertices from solutions to the master equation for the full Markov chain (exact) and the approximate lumping (approximate). Note the log scale on the horizontal time axis.

In the lumped state-space, the error of our approximation is \(y(t)={{\bf{C}}}^{{\rm{T}}}X(t)-x(t)\) and so

$$\dot{y}={{{{{{{{\bf{q}}}}}}}}}^{{{{{{{{\rm{T}}}}}}}}}y+{\left({{{{{{{\bf{QC}}}}}}}}-{{{{{{{\bf{Cq}}}}}}}}\right)}^{{{{{{{{\rm{T}}}}}}}}}X(t).$$
(4)

This is an inhomogeneous linear system of ODEs, thus applying the variation of constants formula yields

$$y(t)=\int\nolimits_{0}^{t}\exp \left({{{{{{{{\bf{q}}}}}}}}}^{{{{{{{{\rm{T}}}}}}}}}s\right){\left({{{{{{{\bf{QC}}}}}}}}-{{{{{{{\bf{Cq}}}}}}}}\right)}^{{{{{{{{\rm{T}}}}}}}}}X(t-s)\,{{\mbox{d}}}\,s,$$
(5)

where we have assumed that y(0) = 0, i.e. the lumped initial state \({{\bf{C}}}^{{\rm{T}}}X(0)\) is known. To simplify the error computation we assume that the initial distribution of the full Markov chain is stationary so that X(t) = X*. Quasi-stationary distributions can also be handled in an analogous way and are discussed in Supplementary Note 4. In the “Methods” section, we derive a bound on the stationary absolute mean error

$$| {\bar{y}}^{* }| =\mathop{\lim }\limits_{t\to \infty }\left|\mathop{\sum }\limits_{i=0}^{N}i{y}_{i}(t)\right|,$$
(6)

for binary-state dynamics. However, this involves terms that depend on the full Markov chain, so we must resort to approximations to make further progress.

We focus on the SISa model32, which is similar to the SIS model but has an additional ‘ambient’ infection rate α, so a susceptible vertex with n1 infected neighbours becomes infected with rate α + βn1. Recovery is the same as in the SIS model. Unlike the SIS model, where the state with all susceptible vertices is absorbing, the SISa model has a stationary distribution. In the “Methods” section we obtain a bound on the stationary absolute mean error of the SISa model that depends on \({a}_{i}^{+}\), which is a constant related to the state that has the largest or smallest number of edges between susceptible and infected vertices in the ith level. Unfortunately, computing \({a}_{i}^{+}\) is computationally difficult (an algorithm that did so would provide a solution to the Max-Cut problem, which is NP-complete33). Thus we settle for an estimate, \({\widetilde{a}}_{i}^{+} > \,0\), obtained from a tractable greedy algorithm, described in detail in the “Methods” section, that sequentially picks susceptible vertices to become infected which introduce the largest or smallest number of edges between susceptible and infected vertices. Our numerically tractable bound depends on an assumption about \({\widetilde{a}}_{i}^{+}{x}_{i}^{* }\) and the full system, which is made precise in the “Methods” section. In Supplementary Note 3 we show that while this assumption does not always hold, we typically obtain an informative bound regardless. We also propose an approximation \({a}_{i}^{* }{x}_{i}^{* }\) based on averaging the minimum and maximum number of edges between susceptible and infected vertices at each level, although this approximation does not have a rigorous foundation.

Application to real-world networks

To illustrate the application of our results on a topical example, we use the SIR model of epidemics on a real-world contact network derived from GPS data. There are three vertex-states in the SIR model, namely susceptible, infected and recovered, which we denote by \({\mathcal{S}}\), \({\mathcal{I}}\) and \({\mathcal{R}}\) respectively. A susceptible vertex with n1 infected neighbours becomes infected at a rate βn1, and an infected vertex recovers at a rate γ. There are \(3^N\) states in the full Markov chain and (N + 2)(N + 1)/2 lumped states, corresponding to distinct numbers of vertices in each of the vertex-states. The lumped transition rate qij from the ith lumped state with \({s}_{{\mathcal{S}}}^{[i]}\) susceptible vertices and \({s}_{{\mathcal{I}}}^{[i]}\) infected vertices, to the jth lumped state in which a susceptible vertex has become infected is

$${{{{{{{{\bf{q}}}}}}}}}_{ij}=\beta \frac{z}{N-1}{s}_{{{{{{{{\mathcal{S}}}}}}}}}^{[i]}{s}_{{{{{{{{\mathcal{I}}}}}}}}}^{[i]}.$$

(Note that here it is convenient to use the vertex-states \({{{{{{{\mathcal{I}}}}}}}}\) and \({{{{{{{\mathcal{S}}}}}}}}\) rather than an integer to index the lumped state s[i]). Similarly, if an infected vertex recovers then the lumped transition rate is \({{{{{{{{\bf{q}}}}}}}}}_{ij}=\gamma {s}_{{{{{{{{\mathcal{I}}}}}}}}}^{[i]}.\) There are N + 1 lumped absorbing states in which there are no infected vertices and the number of recovered vertices ranges from zero to N.
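
A sketch of this lumped SIR construction is given below (ours); the network enters only through its mean degree z, and the values of N, z, β and γ are assumptions.

```python
import numpy as np
from scipy.sparse import lil_matrix

# Lumped SIR generator over states (s_S, s_I), with s_R = N - s_S - s_I implied.
# N, z, beta and gamma are illustrative values.
N, z = 50, 5.0
beta, gamma = 0.75, 1.0

lumped = [(sS, sI) for sS in range(N + 1) for sI in range(N + 1 - sS)]
idx = {s: i for i, s in enumerate(lumped)}
r = len(lumped)                       # equals (N + 1) * (N + 2) // 2

q = lil_matrix((r, r))
for (sS, sI), i in idx.items():
    if sS > 0 and sI > 0:             # infection: (s_S, s_I) -> (s_S - 1, s_I + 1)
        q[i, idx[(sS - 1, sI + 1)]] = beta * z / (N - 1) * sS * sI
    if sI > 0:                        # recovery: (s_S, s_I) -> (s_S, s_I - 1)
        q[i, idx[(sS, sI - 1)]] = gamma * sI
    q[i, i] = -q[i].sum()

# Lumped states with s_I == 0 are absorbing: their rows are identically zero.
absorbing = [i for (sS, sI), i in idx.items() if sI == 0]
print(r, len(absorbing))              # (N+1)(N+2)/2 lumped states, N+1 of them absorbing
```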

We used a real-world contact network derived from data collected as part of the BBC documentary ‘Contagion! The BBC Four Pandemic’34,35. This study collected GPS traces of people who downloaded the ‘BBC Pandemic’ smart phone application. Data made publicly available from this study consists of timestamped anonymised pairwise distances within 50 m between 469 participants around the town of Haselmere, UK. We aggregated these data to create a static network between participants that came within 1 m of each other. We used the largest connected component of this network, which consists of N = 369 people and has mean degree z = 5.53. We refer to this as the ‘Haselmere 1m’ network. We used parameters γ = 1 and β = γR0(N−1)/(zN), where R0 = 3, since this would give a reproduction number of R0 in the corresponding compartmental model equations. Initially five vertices were selected uniformly at random to be infected.
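
For reference, the kind of stochastic simulation used for comparison below can be produced with a standard Gillespie algorithm. The sketch below is ours and runs on an Erdős–Rényi surrogate graph with an assumed edge count; it does not use the BBC Pandemic data, which we do not reproduce here.

```python
import random

import networkx as nx

# Gillespie (event-driven) simulation of SIR dynamics on a network.
# The surrogate graph, seed and parameter values are assumptions.
G = nx.gnm_random_graph(369, 1020, seed=1)
N = G.number_of_nodes()
z = 2 * G.number_of_edges() / N
gamma, R0 = 1.0, 3.0
beta = gamma * R0 * (N - 1) / (z * N)

state = {v: "S" for v in G}
for v in random.sample(list(G), 5):             # five initial infections, chosen at random
    state[v] = "I"

t, history = 0.0, []
while True:
    # Per-vertex event rates: infection for susceptibles, recovery for infecteds.
    rates = {}
    for v in G:
        if state[v] == "S":
            k = sum(state[u] == "I" for u in G[v])
            if k:
                rates[v] = beta * k
        elif state[v] == "I":
            rates[v] = gamma
    total = sum(rates.values())
    if total == 0:                              # epidemic has died out
        break
    t += random.expovariate(total)              # exponential waiting time
    x, acc = random.uniform(0, total), 0.0
    for v, rv in rates.items():                 # choose the event proportionally to its rate
        acc += rv
        if acc >= x:
            state[v] = "I" if state[v] == "S" else "R"
            break
    history.append((t, sum(s == "I" for s in state.values())))

print(history[-1])                              # final event time and infected count (zero)
```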

In Fig. 2a, b we compare stochastic simulations (red) of the SIR model on the Haselmere 1m network with the corresponding approximate lumping (blue). Figure 2a illustrates the mean number of infected vertices over time (thick solid lines) and the corresponding 90-percentile of the simulated and approximate lumping distributions (shading). We also include, for comparison, results from homogeneous, heterogeneous and individual-based mean-field approximations (dashed, dot, and dash-dot lines respectively—see Supplementary Note 1 and Kiss et al. 2 for details), illustrating that the accuracy of our approach is comparable. However, our approach also produces a full probability distribution over the lumped states, which we use to compute the percentiles in Fig. 2a. This distribution could also be used for Bayesian parameter estimation and even data assimilation. Furthermore, with our approach we are able to compute absorption statistics and in Fig. 2b we compare the absorption probability into each absorbing state (i.e. the total number of infected individuals) of stochastic simulations (grey) and our approximate lumping (blue).

Fig. 2: Comparison of stochastic simulations and approximate lumping of the susceptible–infected–recovered (SIR) model of epidemics.
figure 2

a Illustrates the evolution of the mean number of infected vertices from 3000 stochastic simulations (thick red line) and the approximate lumping (thick blue line) for the SIR model on the Haselmere 1m network. The red and blue shading illustrate the 90-percentile of the corresponding distributions. The light blue dash, yellow dot, and grey dash-dot lines indicate the mean number of infected vertices for homogeneous, heterogeneous and individual based mean-field approximations respectively. b Illustrates the probability distribution of the total number of infections computed from 100,000 stochastic simulations, each run until t = 1000 (grey shading). The corresponding probability distribution computed from the approximate lumping is illustrated in blue. c and d Illustrate the same as a and b, respectively, but for an Erdős–Rényi graph with N = 369 vertices and mean degree z = 20.

Low-dimensional mean-field approximations can perform poorly on networks with heterogeneous structure (e.g. when hubs, clustering or communities are present), and Fig. 2a, b illustrate this. By way of contrast, we also present results for an Erdős–Rényi graph where the accuracy of mean-field approximations is better. Specifically, we chose a network uniformly at random from those with N = 369 vertices (the same size as the Haselmere 1m network) and mean degree z = 20 (i.e. selecting 3690 edges uniformly at random—note this is the less common G(N, M) type of Erdős–Rényi graph), and in Fig. 2c, d we illustrate results corresponding to those in Fig. 2a, b, respectively. In this case, the accuracy is significantly improved and our approach even appears marginally better than the other comparable mean-field theories illustrated. We obtain similar results if we average over many graphs sampled at random.

In Fig. 3 we compare our error bound with the error produced via stochastic simulations of the SISa model on four benchmark real-world networks, including the Haselmere 1m network34,35, a protein interaction network36,37,38, an autonomous-systems Internet network39 and a US power grid network40. For each network in Fig. 3, we compute stochastic simulations of SISa dynamics on the network with ambient infection rate α = 0.01, infection transmission rate β = 2(γ − α)(N−1)/(zN) and recovery rate γ = 1, which would give a stationary infected fraction of 0.5 in the corresponding SISa compartmental model. Half of the vertices are chosen uniformly at random to be initially infected and the number of infected vertices is computed after the process is approximately stationary. For each network, we compute the mean fraction of infected vertices from multiple realisations of the stochastic simulations. We also numerically compute solutions of the lumped system to find the lumped probability distribution x(t) with initial condition corresponding to the average number of infected vertices of the stationary stochastic simulations. The stochastic simulation error (solid black lines in Fig. 3) is the absolute magnitude of the difference between the mean fraction of infected vertices in the stochastic simulations and the approximate lumping. We compare this with our bound on the approximate lumping error (red dashed lines in Fig. 3) by numerically integrating Eq. (5) using \({\widetilde{a}}_{i}^{+}{x}_{i}^{* }\). The long-term behaviour of the bound is comparable, i.e. the bound over-estimates the error by a similar margin, across networks of different sizes and error magnitudes. The results for these examples are representative of other real-world networks. To illustrate this, in Fig. 4 we compare the errors computed from stochastic simulations (horizontal axis) with the corresponding errors computed using our approximation and bound (vertical axis) for 18 real-world networks, including the four used in Fig. 3. These networks constitute a standard benchmark test-set, including networks with heterogeneous topology on which mean-field approximations vary in accuracy20. The circular and triangular markers correspond to the approximation and bound, respectively. The SISa parameter values used are the same as in Fig. 3, i.e. α = 0.01, β = 2(γ − α)(N−1)/(zN) and γ = 1. The legend indicates which network has been used and these are ordered from the smallest simulation error at the top (furthest left in the figure) to the largest at the bottom (furthest right in the figure). References for each network, as well as information about size and mean degree, are included in Supplementary Note 3. Figure 4 shows that for a range of benchmark real-world networks our approximation gives a good estimate of the magnitude of the mean error and our bound is informative, i.e. these are correlated with the error (Pearson correlation coefficient: 0.62, p-value < 0.01 [without karate: 0.86, p-value ≪ 0.01]) and in all cases give a value < 1.

Fig. 3: Comparison of susceptible–infected–susceptible with ambient infections (SISa) mean-field approximation error with theoretical upper bound on four real-world networks.
figure 3

Comparison of the evolution of the mean-field approximation error y(t) over time t for the SISa model (solid black line), computed using stochastic simulations, with our theoretical bound (dashed red line) for four real-world networks. a Uses the Haselmere 1m network34,35, b uses a protein interaction network36,37,38, c uses an autonomous-systems Internet network39 and d uses a US power grid network40.

Fig. 4: Comparison of susceptible–infected–susceptible with ambient infections (SISa) error with estimate and theoretical bound for benchmark real-world networks.
figure 4

Comparison of the absolute value of the mean error computed via simulations on the horizontal axis with our theoretically derived approximation (circular markers) and bound (triangular markers) on the vertical axis for a selection of benchmark real-world networks.

Conclusion

In summary, we have presented a mathematical foundation for mean-field approximations of a wide class of dynamical processes on networks that facilitates the quantification of approximation error. We have used approximate lumping to derive low-dimensional systems of equations directly from the exact master equation description, whose approximation error is minimal, in the sense that it is closest to an exact lumping, and can be quantified.

Our approximation results in a ‘density dependent’ system from which even lower dimensional ODE approximations can be rigorously derived in the large N limit41,42,43. Note that the lumped transition rates which we have derived only characterise network structure in terms of the mean degree, so do not account for variations in topology that may affect the dynamics. However, there is scope to extend our approach to more accurate degree-based mean-field17 and high-accuracy approximate master equations18,31 through more fine-grained lumpings based on finer partitions of vertices and states30. There may also be alternative methods to bound the error44, potentially making use of theory developed for operator semi-groups43. While we extend our approach to quadratic VSTMs in Supplementary Note 5, further generalisations to arbitrary nonlinear VSTMs, e.g. via their power series expansions, may be possible. For non-smooth VSTMs, such as threshold models, consideration of the averaging process of the infinitesimal generator may also facilitate the derivation of approximations. The approach developed in this paper could also be applied to other complex systems, e.g. a natural generalisation is to multilayer network structures45,46 via the supra-adjacency matrix representation. However, the details of the specific application are likely to be crucial and will inevitably influence the structure of the Markov chain state-space and hence how much our approach needs to be adapted to deal with these considerations.

The COVID-19 pandemic has brought epidemic modelling into the spotlight and variants of compartmental models have influenced policy: for example, the UK’s Scientific Advisory Group for Emergencies47 at the time of writing lists stochastic transmission models48,49,50,51 as modelling inputs. Such models incorporate realistic features such as age structure and geography. However, the underlying contact network is difficult to obtain and we should consider the consequences of not accounting for this in our models. For example, Fig. 2a shows that mean-field approximations (which include compartmental models) are a poor representation of the true dynamics. Thus varying infection rates to fit such models to data could distort their interpretation and hence the consequences of policy interventions.

Methods

Mathematical formulation

Let G = (V, E) denote a network with vertex set V and edge set \(E\subseteq V\times V\), where the number of vertices is N = |V|. We consider dynamical processes on finite connected simple networks (i.e. undirected, unweighted and with no self-loops) described by continuous-time Markov chains where each vertex can be in one of a finite number M of vertex-states and the set of possible vertex-states is \({\mathcal{W}}=\{{{\mathcal{W}}}_{1},{{\mathcal{W}}}_{2},\ldots ,{{\mathcal{W}}}_{M}\}\). We use calligraphic variables to indicate vertex-states. The state-space of the Markov chain is the set of all permutations of N vertex-states chosen from \({\mathcal{W}}\) with repetition. This is equivalent to \({{\Omega }}={{\mathcal{W}}}^{V}\), i.e. the set of all functions from V to \({\mathcal{W}}\), and so if the network is in state S ∈ Ω then the vertex-state of vertex v ∈ V is S(v). Since the number of states in Ω is \(M^N\), we can enumerate the states in state-space so that \({{\Omega }}=\{{S}^{[1]},{S}^{[2]},\ldots ,{S}^{[{M}^{N}]}\}\).
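
For a toy example, the state-space can be enumerated directly; the sketch below (ours) assumes M = 2 vertex-states and N = 3 vertices.

```python
import itertools

# Enumerate Omega = W^V for an assumed toy example: each state is a tuple
# assigning one vertex-state to each vertex.
W = ("S", "I")              # vertex-states W_1, W_2
V = (0, 1, 2)               # vertices
Omega = list(itertools.product(W, repeat=len(V)))
print(len(Omega))           # M**N = 2**3 = 8 states
print(Omega[3])             # one of the states, here ('S', 'I', 'I')
```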

We assume that the dynamics are governed by homogeneous SVT models, which include models of spin systems, epidemics, opinion dynamics, diffusion of innovation and a variety of other social dynamics28,52. In a homogeneous SVT model, a vertex changes vertex-state at a rate that is a function of only the number of its neighbours in each vertex-state and the rate function is the same for all vertices. Furthermore, transitions only occur between pairs of states that differ in at most one vertex-state. We call such pairs of states transition pairs and use the notation \({S}^{[k]}\mathop{ \sim }\limits^{v}{S}^{[l]}\) to indicate that the states S[k] and S[l] form a transition pair with transition vertex v, i.e. if \({S}^{[k]}\mathop{ \sim }\limits^{v}{S}^{[l]}\) then S[k](v) ≠ S[l](v) and S[k](u) = S[l](u) for all u ≠ v. For vertex v and state S[k] let \({n}^{[k]}(v)=({n}_{1}^{[k]}(v),{n}_{2}^{[k]}(v),\ldots ,{n}_{M}^{[k]}(v))\), where \({n}_{m}^{[k]}(v)\) is the number of neighbours of v with vertex-state \({{\mathcal{W}}}_{m}\). For k ≠ l, the transition rate between states S[k] and S[l] in homogeneous single-vertex transition models is then given by

$${{{{{{{{\bf{Q}}}}}}}}}_{kl}=\left\{\begin{array}{cc}{f}_{{S}^{[k]}(v),{S}^{[l]}(v)}({n}^{[k]}(v))&{{\mbox{if}}}\,{S}^{[k]}\mathop{ \sim }\limits^{v}{S}^{[l]}{{\mbox{}}}\\ 0&{{\mbox{otherwise}}}\end{array}\right.,$$

where \({f}_{{{{{{{{\mathcal{A}}}}}}}},{{{{{{{\mathcal{B}}}}}}}}}({n}_{1},{n}_{2},\ldots ,{n}_{M})\ge 0\) is the VSTM, i.e. the rate that a vertex in vertex-state \({{{{{{{\mathcal{A}}}}}}}}\) changes to vertex-state \({{{{{{{\mathcal{B}}}}}}}}\) if it has n1 neighbours in vertex-state \({{{{{{{{\mathcal{W}}}}}}}}}_{1}\), n2 neighbours in vertex-state \({{{{{{{{\mathcal{W}}}}}}}}}_{2}\), etc. We focus on VSTMs that are affine functions of n[k](v), given by (1). Most SVTs have VSTMs of this form28, although notable exceptions include non-zero temperature Ising-Glauber dynamics53, the nonlinear q-voter model54 and threshold models10. Nonlinear VSTMs are discussed further in Supplementary Note 5, where we present results for the quadratic case.

Approximate lumping

To coarse-grain the network dynamics, we consider lumping of Markov chains55. An exact lumping Π = {Π1, Π2, …, Πr} is a partition of state-space that preserves the Markov property, a necessary and sufficient condition for which is that the sum of transition rates out of a state S[k] ∈ Πi into the cell Πj is the same for all states in the cell Πi. In matrix notation, this is equivalent to the existence of an r × r matrix q such that

$${{{{{{{\bf{QC}}}}}}}}={{{{{{{\bf{Cq}}}}}}}},$$
(7)

where \({{{{{{{\bf{C}}}}}}}}\in {\{0,1\}}^{{M}^{N}\times r}\) is the collector matrix29 whose kjth component is

$${{{{{{{{\bf{C}}}}}}}}}_{kj}=\left\{\begin{array}{cc}1&{{\mbox{if}}}\,{S}^{[k]}\in {{{\Pi }}}_{j}{{\mbox{}}},\\ 0&{{\mbox{otherwise}}}\end{array}\right.\,.$$

We call Eq. (7) the lumpability condition.

Note that q can be given explicitly by introducing the distributor matrix29\({{{{{{{\bf{D}}}}}}}}\in {{\mathbb{R}}}^{r\times {M}^{N}}\), whose ilth component is

$${{{{{{{{\bf{D}}}}}}}}}_{il}=\left\{\begin{array}{cc}\frac{1}{| {{{\Pi }}}_{i}| }&{{\mbox{if}}}\,{S}^{[l]}\in {{{\Pi }}}_{i}{{\mbox{}}},\\ 0&{{\mbox{otherwise}}}\end{array}\right.\,.$$

Specifically, q = DQC satisfies the lumpability condition when Q commutes with CD28.

A lumping that does not satisfy the lumpability condition, and hence does not preserve the Markov property, is an approximate lumping29. Recall that we consider approximate lumping partitions based on sets of states that have the same number of vertices in each vertex-state and use the generator q = DQC even when the lumpability condition is violated. Motivated by the condition for an exact lumping (7), for a given matrix norm we define the approximate lumping discrepancy as ∥QC − Cq∥. Note that QC − Cq is a matrix of size \(M^N\times r\), which in the case of an exact lumping has all zero entries, thus the approximate lumping discrepancy measures how far (in terms of the specific norm used) the approximate lumping is from being an exact lumping. For this reason, we choose q to minimise the approximate lumping discrepancy.

We now give an outline of the proof of Theorem 2.1, i.e. that q = DQC minimises the approximate lumping discrepancy using the Frobenius norm. With the Frobenius norm ∥⋅∥F we have

$$\parallel {{{{{{{\bf{QC}}}}}}}}-{{{{{{{\bf{Cq}}}}}}}}{\parallel }_{{{{{{{{\rm{F}}}}}}}}}^{2}=\mathop{\sum }\limits_{i=1}^{r}\mathop{\sum}\limits_{{S}^{[k]}\in {{{\Pi }}}_{i}}\mathop{\sum }\limits_{j=1}^{r}{[{({{{{{{{\bf{QC}}}}}}}})}_{kj}-{{{{{{{{\bf{q}}}}}}}}}_{ij}]}^{2}.$$

Consequently \(\parallel {{{{{{{\bf{QC}}}}}}}}-{{{{{{{\bf{Cq}}}}}}}}{\parallel }_{{{{{{{{\rm{F}}}}}}}}}^{2}\) can be minimised by choosing qij to be the average of the sum of rates out of states in the ith level and into the jth level, i.e.

$${{{{{{{{\bf{q}}}}}}}}}_{ij}=\frac{1}{\left({{N}\atop{{s}^{[i]}}}\right)}\mathop{\sum}\limits_{{S}^{[k]}\in {{{\Pi }}}_{i}}{({{{{{{{\bf{QC}}}}}}}})}_{kj},$$
(8)

where \({\left({{N}\atop{{s}^{[i]}}}\right)}\) is short for the multinomial \({\left({{N}\atop{{s}_{1}^{[i]},{s}_{2}^{[i]},\ldots ,{s}_{M}^{[i]}}}\right)}\). This is exactly what is obtained if one uses the definitions of the collector and distributor matrices to compute (DQC)ij. A detailed proof of Theorem 2.1 is provided in the Supplementary Methods. Note that the q that minimises the approximate lumping discrepancy depends on the particular norm used; the Frobenius norm is advantageous because it results in an intuitive averaging process that is also analytically tractable.

For \({{{{{{{\mathcal{A}}}}}}}}\in {{{{{{{\mathcal{W}}}}}}}}\), let \({\nu }_{{{{{{{{\mathcal{A}}}}}}}}}\) be a vector of length M whose mth component is \({\nu }_{{{{{{{{\mathcal{A}}}}}}}}m}=0\) if \({{{{{{{\mathcal{A}}}}}}}}\ne {{{{{{{{\mathcal{W}}}}}}}}}_{m}\) and \({\nu }_{{{{{{{{\mathcal{A}}}}}}}}m}=1\) if \({{{{{{{\mathcal{A}}}}}}}}={{{{{{{{\mathcal{W}}}}}}}}}_{m}\). Then for SVT models, the only possible non-zero rates are between pairs of lumped states that satisfy \({s}^{[j]}={s}^{[i]}+{\nu }_{{{{{{{{\mathcal{B}}}}}}}}}-{\nu }_{{{{{{{{\mathcal{A}}}}}}}}}\), with \({{{{{{{\mathcal{A}}}}}}}},{{{{{{{\mathcal{B}}}}}}}}\in {{{{{{{\mathcal{W}}}}}}}}\) and \({{{{{{{\mathcal{A}}}}}}}}\ne {{{{{{{\mathcal{B}}}}}}}}\), i.e. a vertex switches from vertex-state \({{{{{{{\mathcal{A}}}}}}}}\) to \({{{{{{{\mathcal{B}}}}}}}}\). It follows that the lumped states can also be ordered so that q is a quasi-birth–death process and hence q is tridiagonal by blocks.

We now give an outline of the proof of Theorem 2.2 by illustrating how we derive the elements of q from the full Markov chain description. Consider the case where qij corresponds to a vertex changing from vertex-state \({{{{{{{\mathcal{A}}}}}}}}\) to \({{{{{{{\mathcal{B}}}}}}}}\), so \({s}^{[j]}={s}^{[i]}+{\nu }_{{{{{{{{\mathcal{B}}}}}}}}}-{\nu }_{{{{{{{{\mathcal{A}}}}}}}}}\). In Eq. (8), for each state S[k] Πi we sum the rates into Πj to get (QC)kj. As assumed, these non-zero rates are associated with vertices in vertex-state \({{{{{{{\mathcal{A}}}}}}}}\) changing to \({{{{{{{\mathcal{B}}}}}}}}\). Thus we can go through each vertex in S[k] that is in vertex-state \({{{{{{{\mathcal{A}}}}}}}}\), count the number of its neighbours that are in each of the vertex-states to compute the transition rate (1), and sum these up. Equation (8) then averages these over all states in Πi. Our key insight is that rather than summing over states as Eq. (8) suggests, we can achieve the same total by summing over vertices and the possible states of neighbours.

For a vertex v with degree dv, the number of states in Πi where vertex v is in vertex-state \({{{{{{{\mathcal{A}}}}}}}}\) and has n = (n1, n2, …, nM) neighbours in each of the vertex-states is

$${\left({{{d}_{v}}\atop{n}}\right)}{\left({{N-1-{d}_{v}}\atop{{s}^{[i]}-{\nu }_{{{{{{{{\mathcal{A}}}}}}}}}-n}}\right)},$$

where we have used our generalised multinomial notation, indicated by the presence of vectors in the denominators, e.g. \({\left({{{d}_{v}}\atop{n}}\right)}={\left({{{d}_{v}}\atop{{n}_{1},{n}_{2},\ldots ,{n}_{m}}}\right)}\). The transition rate of a vertex from vertex-state \({{{{{{{\mathcal{A}}}}}}}}\) to \({{{{{{{\mathcal{B}}}}}}}}\) is given by Eq. (1). To compute qij we sum these rates over all N vertices and all possible values of n, and divide by the number of states to get

$${{{{{{{{\bf{q}}}}}}}}}_{i,j}=\frac{1}{{\left({{N}\atop{{s}^{[i]}}}\right)}}\mathop{\sum}\limits_{v\in V}\mathop{\sum}\limits_{n| {d}_{v}}\left({\delta }_{0}^{{{{{{{{\mathcal{A}}}}}}}},{{{{{{{\mathcal{B}}}}}}}}}+\mathop{\sum }\limits_{m=1}^{M}{\delta }_{m}^{{{{{{{{\mathcal{A}}}}}}}},{{{{{{{\mathcal{B}}}}}}}}}{n}_{m}\right){\left({{{d}_{v}}\atop{n}}\right)}{\left({{N-1-{d}_{v}}\atop{{s}^{[i]}-{\nu }_{{{{{{{{\mathcal{A}}}}}}}}}-n}}\right)},$$
(9)

where the sum over \(n\,|\,{d}_{v}\) denotes a sum over all possible values of n such that \({n}_{1}+{n}_{2}+\cdots +{n}_{M}={d}_{v}\).

We deal with the \({\delta }_{0}^{{\mathcal{A}},{\mathcal{B}}}\) and \({\delta }_{m}^{{\mathcal{A}},{\mathcal{B}}}{n}_{m}\) terms separately. Using a generalisation of the Vandermonde identity (see the Supplementary Methods for details), the sum with the constant term \({\delta }_{0}^{{\mathcal{A}},{\mathcal{B}}}\) is

$$\frac{1}{\left({{N}\atop{{s}^{[i]}}}\right)}\mathop{\sum}\limits_{v\in V}\mathop{\sum}\limits_{n| {d}_{v}}{\delta }_{0}^{{{{{{{{\mathcal{A}}}}}}}},{{{{{{{\mathcal{B}}}}}}}}}{\left({{{d}_{v}}\atop{n}}\right)}{\left({{N-1-{d}_{v}}\atop{{s}^{[i]}-{\nu }_{{{{{{{{\mathcal{A}}}}}}}}}-n}}\right)}={\delta }_{0}^{{{{{{{{\mathcal{A}}}}}}}},{{{{{{{\mathcal{B}}}}}}}}}{s}_{1}^{[i]},$$
(10)

where we have assumed, without loss of generality, that the first index of the lumped state, \({s}_{1}^{[i]}\), corresponds to the vertex-state \({{{{{{{\mathcal{A}}}}}}}}\). For the \({\delta }_{m}^{{{{{{{{\mathcal{A}}}}}}}},{{{{{{{\mathcal{B}}}}}}}}}{n}_{m}\) terms, again using the generalised Vandermonde identity, we have

$$\mathop{\sum}\limits_{v\in V}\mathop{\sum}\limits_{n| {d}_{v}}{\delta }_{m}^{{{{{{{{\mathcal{A}}}}}}}},{{{{{{{\mathcal{B}}}}}}}}}{n}_{m}{\left({{{d}_{v}}\atop{n}}\right)}{\left({{N-1-{d}_{v}}\atop{{s}^{[i]}-{\nu }_{{{{{{{{\mathcal{A}}}}}}}}}-n}}\right)}={\left({{N-2}\atop{{s}^{[i]}-{\nu }_{{{{{{{{\mathcal{A}}}}}}}}}-{\nu }_{m}}}\right)}\mathop{\sum}\limits_{v\in V}{d}_{v}.$$
(11)

Substituting Eqs. (10) and (11) into Eq. (9), after some cancellation, yields Eq. (3). A detailed proof of Theorem 2.2 is included in the Supplementary Methods.
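
The collapse of the combinatorial sum (9) to the closed form (3) can also be checked numerically. The sketch below (ours) does so for an SIS-type infection transition (δ0 = 0, rate β per infected neighbour) on an assumed degree sequence; only the degrees enter Eq. (9).

```python
from math import comb

# Check that Eq. (9) reduces to Eq. (3) for an SIS-type infection transition.
# The degree sequence, N and beta are assumptions used only for the check.
def comb0(n, k):
    return comb(n, k) if 0 <= k <= n else 0

N, beta = 20, 0.7
degrees = [3, 5, 2, 4] * 5          # assumed degree sequence of length N
z = sum(degrees) / N                # mean degree

for i in range(N):                  # level i: N - i susceptible and i infected vertices
    lhs = 0.0
    for dv in degrees:
        for nI in range(dv + 1):    # nI infected neighbours, dv - nI susceptible neighbours
            lhs += beta * nI * comb0(dv, nI) * comb0(N - 1 - dv, i - nI)
    lhs /= comb(N, i)               # Eq. (9)
    rhs = beta * z / (N - 1) * (N - i) * i   # Eq. (3)
    assert abs(lhs - rhs) < 1e-9 * max(rhs, 1.0)
print("Eq. (9) agrees with Eq. (3) at every level")
```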

Error analysis of binary-state dynamics with a stationary distribution

We now focus on binary-state dynamics where there are two vertex-states, hence M = 2. Examples of binary-state dynamics include the SIS and voter models28 and in Supplementary Note 2 we provide a classification of the different types of binary-state dynamics. Consequently, we suppose that the set of vertex states is \({\mathcal{W}}=\{{\mathcal{S}},{\mathcal{I}}\}\) and refer to vertex-state \({\mathcal{S}}\) as ‘susceptible’ and vertex-state \({\mathcal{I}}\) as ‘infected’; an infection corresponds to a susceptible vertex becoming infected and a recovery corresponds to an infected vertex becoming susceptible. We can partition the state-space of binary-state dynamics into levels so that the ith level, Πi, contains all states that have i infected vertices, for i = 0, 1, …, N, i.e. Π = {Π0, Π1, …, ΠN}. It follows that the approximate lumping generator q is tridiagonal and QC − Cq is tridiagonal by blocks of column vectors of varying size. For 0 ≤ i < N, the column vectors of QC − Cq just above the diagonal correspond to infections and we denote these by

$${A}_{{{{\Pi }}}_{i}}={\left({({{{{{{{\bf{QC}}}}}}}})}_{k,i+1}-{{{{{{{{\bf{q}}}}}}}}}_{i,i+1}\right)}_{{S}^{[k]}\in {{{\Pi }}}_{i}}^{{{{{{{{\rm{T}}}}}}}}}.$$

Thus \({A}_{{{{\Pi }}}_{i}}\) captures the difference between the sum of infection rates out of states in level i into level i + 1, and the mean qi,i+1. Note that we use the subscript Πi to illustrate that the variable is a vector over the states in Πi. Similarly, for 0 < i ≤ N, the column vectors of QC − Cq just below the diagonal correspond to recoveries and we denote these by

$${B}_{{{{\Pi }}}_{i}}={\left({({{{{{{{\bf{QC}}}}}}}})}_{k,i-1}-{{{{{{{{\bf{q}}}}}}}}}_{i,i-1}\right)}_{{S}^{[k]}\in {{{\Pi }}}_{i}}^{{{{{{{{\rm{T}}}}}}}}},$$

so \({B}_{{{{\Pi }}}_{i}}\) captures the differences between the recovery rates out of level i into level i−1, and the mean. We then have

$$\left({{{{{{{\bf{QC}}}}}}}}-{{{{{{{\bf{Cq}}}}}}}}\right)=\left(\begin{array}{ccccc}-{A}_{{{{\Pi }}}_{0}}&{A}_{{{{\Pi }}}_{0}}&0&\ldots &0\\ {B}_{{{{\Pi }}}_{1}}&-{B}_{{{{\Pi }}}_{1}}-{A}_{{{{\Pi }}}_{1}}&{A}_{{{{\Pi }}}_{1}}&&0\\ \vdots &\ddots &\ddots &\ddots &\vdots \\ 0&\ldots &{B}_{{{{\Pi }}}_{N-1}}&-{B}_{{{{\Pi }}}_{N-1}}-{A}_{{{{\Pi }}}_{N-1}}&{A}_{{{{\Pi }}}_{N-1}}\\ 0&\ldots &0&{B}_{{{{\Pi }}}_{N}}&-{B}_{{{{\Pi }}}_{N}}\end{array}\right),$$

where the zero entries indicate appropriately sized vectors of zeroes.

To simplify the error computation we assume that the initial distribution of the full Markov chain is stationary so that X(t) = X*, whose kth component is \({X}_{k}^{* }\). We also use \({X}_{{{{\Pi }}}_{i}}^{* {{{{{{{\rm{T}}}}}}}}}={({X}_{k}^{* })}_{{S}^{[k]}\in {{{\Pi }}}_{i}}\) to denote the vector of stationary probabilities of states in Πi. Hence we find that

$${\left({{{{{{{\bf{QC}}}}}}}}-{{{{{{{\bf{Cq}}}}}}}}\right)}^{{{{{{{{\rm{T}}}}}}}}}{X}^{* }=\left(\begin{array}{c}-{\sigma }_{0}\\ {\sigma }_{0}-{\sigma }_{1}\\ \vdots \\ {\sigma }_{N-2}-{\sigma }_{N-1}\\ {\sigma }_{N-1}\end{array}\right),$$

where

$${\sigma }_{i}={A}_{{{{\Pi }}}_{i}}^{{{{{{{{\rm{T}}}}}}}}}{X}_{{{{\Pi }}}_{i}}^{* }-{B}_{{{{\Pi }}}_{i+1}}^{{{{{{{{\rm{T}}}}}}}}}{X}_{{{{\Pi }}}_{i+1}}^{* }.$$

The σi contain information about the full system and therefore cannot be directly computed for typical systems of interest, i.e. when the size of the full state-space is beyond what can be stored in computer memory.

We now consider the equilibrium solutions of Eqs. (2) and (4) in turn. For binary-state dynamics, our lumped approximation is a birth–death process, where a birth corresponds to an infection and a death corresponds to a recovery. Thus we can write

$${{{{{{{\bf{q}}}}}}}}=\left(\begin{array}{ccccc}-{\lambda}_{0}&{\lambda}_{0}&{0}&{\cdots} &{0}\\ {\mu}_{1}&-{\mu}_{1}-{\lambda}_{1}&{\lambda}_{1}&&{0}\\ {\vdots} &{\ddots} &{\ddots} &{\ddots} &{\vdots} \\ {0}&{\cdots} &{\mu}_{N-1}&-{\mu}_{N-1}-{\lambda}_{N-1}&{\lambda}_{N-1}\\ {0}&{\cdots} &{0}&{\mu}_{N}&-{\mu}_{N}\end{array}\right),$$

where the rates λi and μi are finite and positive. The analytical expression for the stationary distribution \({x}^{* }={({x}_{0}^{* },{x}_{1}^{* },\ldots ,{x}_{N}^{* })}^{{{{{{{{\rm{T}}}}}}}}}\) of such a birth–death process can be found in standard texts, e.g. Kijima27, but we reproduce it here in order to introduce notation that we will use when we derive the equilibrium of the error ODEs (4). The stationary distribution x* solves the recursion relation

$${x}_{i+1}^{* }=\frac{{\lambda }_{i}}{{\mu }_{i+1}}{x}_{i}^{*},$$

which has solution

$${x}_{i}^{* }=\frac{{\phi }_{i}}{{{\Phi }}},$$
(12)

where ϕ0 = 1, for i > 0

$${\phi }_{i}=\frac{{\lambda }_{i-1}{\lambda }_{i-2}\cdots {\lambda }_{0}}{{\mu }_{i}{\mu }_{i-1}\cdots {\mu }_{1}},$$

and

$${{\Phi }}=\mathop{\sum }\limits_{i=0}^{N}{\phi }_{i}.$$
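
As a sketch, the stationary distribution (12) can be evaluated in log-space to avoid overflow in the products defining φi; the rates below are the lumped SISa rates with assumed parameter values.

```python
import numpy as np

# Stationary distribution of the lumped birth-death chain via Eq. (12).
# The lumped SISa rates and parameter values are assumptions for illustration.
N, z = 200, 6.0
alpha, gamma = 0.01, 1.0
beta = 2 * (gamma - alpha) * (N - 1) / (z * N)

lam = np.array([alpha * (N - i) + beta * z / (N - 1) * i * (N - i) for i in range(N)])
mu = np.array([gamma * i for i in range(1, N + 1)])

# phi_0 = 1 and phi_i = prod_{j < i} lam_j / mu_{j+1}; x*_i = phi_i / Phi.
log_phi = np.concatenate(([0.0], np.cumsum(np.log(lam) - np.log(mu))))
phi = np.exp(log_phi - log_phi.max())      # common rescaling cancels in the ratio
x_star = phi / phi.sum()

print(np.arange(N + 1) @ x_star / N)       # stationary mean infected fraction, close to 0.5 here
```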

Similar to the lumped dynamics, the equilibrium of the error ODEs (4), \({y}^{* }={({y}_{0}^{* },{y}_{1}^{* },\ldots ,{y}_{N}^{* })}^{{{{{{{{\rm{T}}}}}}}}}\), satisfies the system of equations

$$0 =-{\lambda }_{0}{y}_{0}^{* }+{\mu }_{1}{y}_{1}^{* }-{\sigma }_{0},\\ 0 ={\lambda }_{i-1}{y}_{i-1}^{* }-({\lambda }_{i}+{\mu }_{i}){y}_{i}^{* }+{\mu }_{i+1}{y}_{i+1}^{* }+{\sigma }_{i-1}-{\sigma }_{i},\,\,{{\mbox{and}}}\\ 0 ={\lambda }_{N-1}{y}_{N-1}^{* }-{\mu }_{N}{y}_{N}^{* }+{\sigma }_{N-1},$$

where 0 < i < N. It follows that the solution solves the recursion

$${y}_{i}^{* }=\frac{1}{{\mu }_{i}}\left({\lambda }_{i-1}{y}_{i-1}^{* }+{\sigma }_{i-1}\right).$$

Since both X* and x* are probability distributions, their elements sum to one and thus the sum of \({y}_{i}^{* }\) is zero. Consequently for i > 0 we find

$${y}_{i}^{* }={\phi }_{i}{\psi }_{i}-{x}_{i}^{* }{{\Psi }},$$
(13)

where ψ0 = 0, for i > 0

$${\psi }_{i}=\mathop{\sum }\limits_{j=0}^{i-1}\frac{{\sigma }_{j}}{{\phi }_{j+1}{\mu }_{j+1}},$$

and

$${{\Psi }}=\mathop{\sum }\limits_{i=0}^{N}{\phi }_{i}{\psi }_{i}.$$

By substituting Eq. (13) into the definition of the mean error \({\bar{y}}^{* }\), given by Eq. (6), we find

$${\bar{y}}^{* }=\mathop{\sum }\limits_{i=0}^{N-1}{\rho }_{i}{\sigma }_{i},$$
(14)

where

$${\rho }_{i}=\frac{1}{{\phi }_{i+1}{\mu }_{i+1}}\mathop{\sum}\limits_{j = i+1}^{N}(j-{\bar{x}}^{* }){\phi }_{j}$$

and \({\bar{x}}^{* }=\mathop{\sum }\nolimits_{i = 0}^{N}i{x}_{i}^{* }\) is the stationary mean number of infected vertices. Thus we have split the calculation of \({\bar{y}}^{* }\) into terms σi, which depend on the full Markov chain (and hence must be approximated), and terms ρi, which depend on the lumped system (and hence can be computed). Moreover, using the definition of \({\bar{x}}^{* }\) and Φ, it is straightforward to prove that ρi > 0 for all i, which suggests an intuitive bound on the absolute value of the stationary mean error given by

$$| {\bar{y}}^{* }| \le \mathop{\sum }\limits_{i=0}^{N-1}{\rho }_{i}| {\sigma }_{i}| .$$
(15)
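
The weights ρi in Eqs. (14) and (15) only involve the lumped rates and can be computed directly, as in the sketch below (ours); the parameter values are assumed and the |σi| values are placeholders standing in for an estimate such as \({\widetilde{a}}_{i}^{+}{x}_{i}^{* }\).

```python
import numpy as np

# Compute the weights rho_i of Eq. (14) and the bound of Eq. (15) for the
# lumped SISa chain. Parameters are assumed; sigma_abs is a placeholder.
N, z = 100, 6.0
alpha, gamma = 0.01, 1.0
beta = 2 * (gamma - alpha) * (N - 1) / (z * N)

lam = np.array([alpha * (N - i) + beta * z / (N - 1) * i * (N - i) for i in range(N)])
mu = np.array([gamma * i for i in range(1, N + 1)])

log_phi = np.concatenate(([0.0], np.cumsum(np.log(lam) - np.log(mu))))
phi = np.exp(log_phi - log_phi.max())
x_star = phi / phi.sum()
x_bar = np.arange(N + 1) @ x_star                 # stationary mean of the lumped chain

# rho_i = (1 / (phi_{i+1} mu_{i+1})) * sum_{j = i+1}^{N} (j - x_bar) phi_j.
tail = np.cumsum(((np.arange(N + 1) - x_bar) * phi)[::-1])[::-1]   # tail[i] = sum over j >= i
rho = tail[1:] / (phi[1:] * mu)

sigma_abs = np.full(N, 1e-3)                      # placeholder |sigma_i| estimates
print(rho @ sigma_abs)                            # right-hand side of the bound (15)
```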

Example: error approximation for the SISa model

We now consider results for the SISa model32, where the VSTM has infection rate \({f}_{{{{{{{{\mathcal{S}}}}}}}},{{{{{{{\mathcal{I}}}}}}}}}({n}_{1},{n}_{2})=\alpha +\beta {n}_{1}\), recovery rate \({f}_{{{{{{{{\mathcal{I}}}}}}}},{{{{{{{\mathcal{S}}}}}}}}}({n}_{1},{n}_{2})=\gamma\) and \({f}_{{{{{{{{\mathcal{S}}}}}}}},{{{{{{{\mathcal{S}}}}}}}}}={f}_{{{{{{{{\mathcal{I}}}}}}}},{{{{{{{\mathcal{I}}}}}}}}}=0\). We derive bounds on the σi terms for the SISa model, which with Eq. (15) allow us to bound \(| {\bar{y}}^{* }|\). We also consider approximations of the σi terms, which with Eq. (14) allow us to approximate \({\bar{y}}^{* }\). Using Eq. (8), for the SISa model we find for S[k] Πi that

$${\left({{{{{{{\bf{QC}}}}}}}}\right)}_{k,i+1}=\alpha (N-i)+\beta {n}_{{{\mbox{SI}}}}^{[k]},$$

where

$${n}_{{{\mbox{SI}}}}^{[k]}=\mathop{\sum}\limits_{v\in V}{{{{{{{{\bf{1}}}}}}}}}_{\{{S}^{[k]}(v) = {{{{{{{\mathcal{S}}}}}}}}\}}(v){n}_{1}^{[k]}(v).$$

Note that \({n}_{{{\mbox{SI}}}}^{[k]}\) is the number of edges that connect susceptible vertices with infected vertices (hereon referred to as SI edges) in the state S[k]. The formula for \({\left({{{{{{{\bf{QC}}}}}}}}\right)}_{k,i+1}\) above follows, for the SISa model, from the fact that there are N − i susceptible vertices, and summing how many infected neighbours each has is equivalent to counting the number of SI edges. It follows that

$${A}_{{{{\Pi }}}_{i}}=\beta {\left({n}_{{{\mbox{SI}}}}^{[k]}-\frac{z}{N-1}i(N-i)\right)}_{{S}^{[k]}\in {{{\Pi }}}_{i}}^{{{{{{{{\rm{T}}}}}}}}},$$
(16)

so the entry in \({A}_{{{{\Pi }}}_{i}}\) corresponding to the state S[k] is proportional to the difference between the number of SI edges in state S[k] and the average of the number of SI edges in states in the ith level. A similar calculation shows that (QC)k,i−1 = γi and hence \({B}_{{{{\Pi }}}_{i}}=0\) for all i, i.e. the total recovery rate of a state in the SISa model is the same for all states in the same level. Thus for the SISa model \({\sigma }_{i}={A}_{{{{\Pi }}}_{i}}^{{{{{{{{\rm{T}}}}}}}}}{X}_{{{{\Pi }}}_{i}}^{* }\), hence if \({a}_{i}^{+}=\mathop{\max }\nolimits_{{S}^{[k]}\in {{{\Pi }}}_{i}}| {A}_{{{{\Pi }}}_{i}}|\) then

$$| {\sigma }_{i}| \le {a}_{i}^{+}\mathop{\sum}\limits_{{S}^{[k]}\in {{{\Pi }}}_{i}}{X}_{k}^{* }.$$

Determining \({a}_{i}^{+}\) and the sum of probabilities in the ith level would allow us to bound the absolute value of the mean error, but this may be intractable in practice because it requires knowledge of the full Markov chain. Thus to obtain a bound on the stationary absolute mean error of the SISa model, we use an approximation for \({a}_{i}^{+}\), denoted by \({\widetilde{a}}_{i}^{+}\), and then assume that \({\widetilde{a}}_{i}^{+}{x}_{i}^{* }\ge | {A}_{i}^{{{{{{{{\rm{T}}}}}}}}}{X}_{{{{\Pi }}}_{i}}^{* }|\). In Supplementary Note 3 we show that while this assumption does not always hold, we typically obtain an informative bound regardless.

We now describe how we obtain \({\widetilde{a}}_{i}^{+}\). Note that \({a}_{i}^{+}\) arises from the state in level i with either the largest or smallest number of SI edges. We refer to these states as the max and min SI states respectively. Finding the max SI states is equivalent to the Max-Cut problem, which is NP-complete33. Finding the min SI states is also difficult because one needs to identify maximal cliques, which is also NP-complete56. Because of this, we settle instead for estimates based on a greedy algorithm that starts from the state with all susceptible vertices and sequentially chooses a susceptible vertex to become infected that introduces the largest or smallest number of SI edges.

The algorithm is as follows. For binary-state dynamics in which vertices are either susceptible or infected, we iterate from level 0 to \(\lfloor \frac{N}{2}\rfloor\), picking a new vertex at each level to switch from susceptible to infected. There is only one state in level 0, in which all vertices are susceptible, so this is the state identified by the algorithm at the 0th level. Suppose that at the ith level the state S[k] is identified by the algorithm, then for each susceptible vertex v in S[k], we compute the number of infected neighbours \({n}_{1}^{[k]}(v)\) and the number of susceptible neighbours \({n}_{2}^{[k]}(v)\). We then pick the vertex with the largest difference \({n}_{1}^{[k]}(v)-{n}_{2}^{[k]}(v)\) (which may be negative) to be infected, and this is the state that the algorithm identifies for the i + 1th level. If there are multiple such vertices then we pick the one with the lowest index. This last step ensures our algorithm is deterministic, although to destroy possible correlations between vertex degrees and their labels, it may be necessary initially to randomise the vertex labelling. In binary-state dynamics there is a symmetry about \(\lfloor \frac{N}{2}\rfloor\), by switching susceptible vertices to infected and infected to susceptible, which preserves the number of SI edges. We apply this symmetry to the states selected so far to determine the states in levels above \(\lfloor \frac{N}{2}\rfloor\). Clearly one could perform a more extensive search, but our goal is to have an algorithm that scales well with the number of vertices. A nearly identical process can be used to identify a state in each level with a low number of SI edges by selecting the vertex with the smallest difference \({n}_{1}^{[k]}(v)-{n}_{2}^{[k]}(v)\) to become infected.

For level i, we use \({\widetilde{n}}_{i}^{+}\) and \({\widetilde{n}}_{i}^{-}\) to denote the maximum and minimum number of SI edges found by this algorithm, respectively. We also attempt to approximate σi with \({a}_{i}^{* }{x}_{i}^{* }\), where

$${a}_{i}^{* }=\beta \left(\frac{{\widetilde{n}}_{i}^{+}+{\widetilde{n}}_{i}^{-}}{2}-\frac{z}{N-1}i(N-i)\right).$$

This gives a measure of the skew of the distribution of the number of SI edges in each state in the same level.
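
One possible realisation of the greedy search described above, together with the resulting \({a}_{i}^{* }\) values, is sketched below; it is our own reading of the procedure, and the Erdős–Rényi graph, seed and value of β are illustrative assumptions.

```python
import networkx as nx

# Greedy construction of states with many (sign=+1) or few (sign=-1) SI edges
# at each level, used for the estimates n_tilde_i^+/- and a_i^*.
def greedy_si_counts(G, sign=+1):
    N = G.number_of_nodes()
    infected = set()
    counts = [0]                                  # level 0: all susceptible, no SI edges
    for _ in range(N // 2):
        best_v, best_delta = None, None
        for v in sorted(G):                       # sorted: deterministic tie-breaking
            if v in infected:
                continue
            n_inf = sum(u in infected for u in G[v])
            delta = G.degree(v) - 2 * n_inf       # change in SI edges if v becomes infected
            if best_delta is None or sign * delta > sign * best_delta:
                best_v, best_delta = v, delta
        infected.add(best_v)
        counts.append(counts[-1] + best_delta)
    # S <-> I symmetry: the SI count at level N - i equals that at level i.
    return [counts[i] if i <= N // 2 else counts[N - i] for i in range(N + 1)]

G = nx.gnm_random_graph(60, 180, seed=0)          # assumed test graph
N, z = G.number_of_nodes(), 2 * G.number_of_edges() / G.number_of_nodes()
n_plus = greedy_si_counts(G, +1)                  # greedy estimates of the max SI counts
n_minus = greedy_si_counts(G, -1)                 # greedy estimates of the min SI counts

beta = 0.5                                        # illustrative infection rate
a_star = [beta * (0.5 * (n_plus[i] + n_minus[i]) - z / (N - 1) * i * (N - i))
          for i in range(N + 1)]
print(n_plus[N // 2], n_minus[N // 2], a_star[N // 2])
```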