Introduction

The worldwide simulation of diseases spreading is a challenge involving many scales of complexity1,2,3,4. These scales are progressively taken into account by introducing, every time, more realistic elements in the description of the whole scenario. Relatively recent advances on this specific goal came from the consideration of the specific topology of mobility networks that represent main carriers of epidemics at the global scale4,5,6. The worldwide airports network has been identified as the fastest interaction mechanism between humans living very far apart. In this context, metapopulation models constitute a natural approach for the analysis of epidemic spreading processes, since they combine in a single framework the local contagions in the so-called reaction phase, and the mobility of the individuals in the diffusion stage7,8,9,10,11,12.

Although metapopulation models have shown its predictive power in recent outbreaks, it has been recognized the crucial role played by the mobility patterns13,14,15. A common hypothesis in metapopulation simulations is that the diffusion through the links of the network is markovian, i.e., each individual willing to move to a neighboring node will choose the destination according to certain fixed probabilities assigned to each of the links9,11,12,16. Usually this method is complemented with memory of the origin node, thus making the individuals come back to their respective residences in a home-to-work travel pattern15,17. Another possible extension is the selection of a random destination instead of a neighbor, e.g. with probability proportional to the strength of the node, thus allowing for longer trips18. This hypothesis is a shortcut on the difficult problem of assessing how many people travel from a certain origin to a certain destination. Although this information is probably registered by every air carrier, its recollection at world wide scale is not easy. Data scientists usually have at disposal the amounts of (usually annual) accumulated number of passengers between two connected places, that is the weight of the link, but no information about initial origin or final destination is provided.

In this work we present a methodology to assess origin-destination flows in the air transportation networks, and after we analyze the results obtained with these more realistic flows of passengers on the incidence of epidemic spreading considering metapopulation models7,10. Our results reveal that the consideration of real flow patterns affect the microscopic description of the incidence at the level of cities, while the statistical aggregated incidence is similar to those predicted without considering origin-destination flows.

Essentially, we could summarize that the hard problem of estimating origin-destination patterns can be avoided if the interest is focused on macroscopic details of the epidemic spreading, however it should be considered when microscopic details about the epidemic spreading are needed. In the following we present the method used to estimate the data-driven origin-destination patterns, the metapopulation model, and the comparison of the different routing flow strategies compared to the data-driven scenario. To conclude we present a discussion about the convenience of using origin-destination data-driven flows in the assessment of epidemic spreading.

Results

Estimation of an Origin-Destination matrix

For most kind of analyses in transportation networks, there is a need for origin-destination (O-D) matrices, which specify the travel demands between the origin and destination nodes in the network. Here we wonder up to which point the estimation of O-D matrices are essential factors to determine the outcome of a certain epidemic process that uses the transportation network as the substrate for the carriers. Let us first determine what is the mathematical problem we are facing.

Given a transport network with n nodes and m links, an Origin-Destination (O-D) matrix T is a two-dimensional trip table whose entries tij represent the number of trips going from origin node i to destination node j. Let va be the observed trip volume on link a of the network, and let \({p}_{ij}^{a}\) be the proportion of trips going from origin node i to destination node j that use link a. For the latter proportions we assume that travelers follow shortest-path routes and, in case of several alternatives, any of them is selected at random with equal probability.

The base constraints to be satisfied for an O-D matrix estimated from link counts state that the sum of all the trips crossing a given link must be equal to the link counts observed on that link,

$$\sum _{i=1}^{n}\,\sum _{j=1}^{n}\,{p}_{ij}^{a}{t}_{ij}={v}^{a},\,a=1,\ldots ,m,$$
(1)

and we have to add the lower bounds for the number of trips,

$${t}_{ij}\ge 0,\,i,j=1,\ldots ,n.$$
(2)

The system of equations formed by Eqs (1) and (2) has always more unknowns to estimate (n × n O-D cells) than base constraints (m links), the only exception being the complete networks (cliques); in fact, most real networks are sparse, thus the number of unknows is much larger than the number of equations, \({n}^{2}\gg m\). Given the possibility of existence of multiple solutions, additional considerations to select a preferred O-D matrix are needed. Several approaches have been traditionally used to solve this problem, from which we have selected a maximum entropy model that estimates the most likely trip matrix consistent with observed link counts. The application of the entropy maximization principles to the O-D estimation problem was initially proposed by Willumsen19 and Van Zuylen20. In the maximum entropy approach, the most likely trip matrix is the one having the greatest number of microstates associated with it, what is equivalent to estimate an O-D matrix that adds as little information as possible to the knowledge contained in the link counts.

Let t be the total number of O-D trips traversing a network. Then the number of ways, defined as entropy, in which t trips can be divided into groups of tij trips can be computed as

$$S=\frac{t!}{{\prod }_{i=1}^{n}\,{\prod }_{j=1}^{n}\,{t}_{ij}!}.$$
(3)

To be able to compute the derivatives needed to optimize the objective function (3), we take first the natural logarithm of the function and make use of Stirling’s approximation of the factorials, ln(t!) ≈ t ln(t) − t, thus obtaining a more computationally convenient objective function:

$$\mathrm{ln}(S)\approx t\,\mathrm{ln}(t)-t-\sum _{i=1}^{n}\,\sum _{j=1}^{n}\,({t}_{ij}\,\mathrm{ln}({t}_{ij})-{t}_{ij}).$$
(4)

Assuming that t is constant and changing the sign of the function, the first two terms can be dropped and the goal becomes to minimize the function

$$E=\sum _{i=1}^{n}\,\sum _{j=1}^{n}\,({t}_{ij}\,\mathrm{ln}({t}_{ij})-{t}_{ij})$$
(5)

subject to the constraints in Eqs (1) and (2).

Epidemic spreading model

The epidemic spreading model we use is based on the well-known reaction-diffusion11 and metapopulation12 models, but incorporating a routing scheme for the moving individuals similar to those in18,21. The main ingredients are the sites or nodes (e.g. cities or airports), where the individuals of the global population are located, and the links between these nodes, which represent the communication channels (e.g. roads or airways). Each link has an associated weight which accounts for the traffic, the number of individuals moving between nodes in a certain amount of time. The contagion model we have chosen is the standard susceptible-infected-removed (SIR) one, a good proxy for diseases in which the individuals acquire immunization after being infected. In this model individuals may be in three different states: infected (I), those who have got the disease; susceptible (S), healthy but which may become ill by contacts with infected individuals; and removed (R), once they have recovered from the disease and become immune to it. The dynamics of the epidemic spreading consists of two alternating phases, a reaction in which the individuals in the nodes’ subpopulations merge, and a diffusion of some individuals through the communication channels. In the reaction step the subpopulation is considered as well-mixed, i.e., every individual is in contact with the rest in the same node, and here is where the contagions and recoveries take place. There are no contagions between individuals at different nodes, but the disease spreads due to the mobility in the diffusion step.

In the standard metapopulation model12 the diffusion is just a Markovian process, where individuals move to neighboring nodes with a probability proportional to the weight of the link; we call this a Random Diffusion (RD) process. A more realistic mobility is obtained if every individual has a “home” node, it travels to a selected destination, and finally comes back home after spending some time in the destination. It would be possible to choose uniformly Random Destinations (RU), however it is more plausible a selection proportional to the strength of each possible target subpopulation as in21, what we call here the Strength Proportional (SP) scheme. Finally, the information in the O-D matrix defines a new diffusion model (OD), since it directly tells the number of trips from any origin to any destination.

In all the diffusion models where the destinations are not bound to be neighbors of the origin, it is necessary to establish the paths followed by the travelers, i.e., a routing algorithm. We have decided just to select the shortest path (in number of hops) and, if it is not unique, one of them is chosen at random with equal probability. It is possible to improve this routing algorithm taking into account geographic and economic constraints that restrict the feasible available paths, e.g. travel times and costs; however, this would only affect the availability and priority of paths, not the results and conclusions from this work.

Microscopic analysis

First, we analyze the results obtained from our metapopulation spreading models at the microscopic level of nodes and links. We make use of the World Air Transportation Network (ATN)22, considered as a directed and weighted network. For the calculation of its O-D matrix, we have only considered trips of length lower or equal to three, both for computational reasons and also because longer trips are rarely observed in real life (see Methods). All the presented results come from averages over 150 randomly chosen initial conditions in the Monte Carlo simulations. In all the cases, the mobility parameters have been chosen to ensure that at every time step the average number of circulating individuals is the same for all the mobility strategies. See Methods for a full description of the implementation details of our metapopulation epidemic spreading dynamics.

In Fig. 1 we plot the outgoing fluxes obtained with Random Diffusion and Strength Proportional schemes, comparing them to the ones obtained with O-D flows. At the level of nodes, both the Random Diffusion and the Strength Proportional strategies show fluxes clearly related to the O-D strategy ones. However, when we take into consideration the link fluxes, we can see that the behavior of the two strategies is different. In the case of the Random Diffusion process, link fluxes also show a linear relation with the O-D fluxes. On the contrary, the link fluxes obtained with the Strength Proportional strategy are not related at all with the O-D ones. The different behavior between the two strategies is explained because both the O-D and the Random Diffusion strategies satisfy constraints (1): the O-D matrix satisfies them explicitly as a result of a constrained optimization problem, and the Random Diffusion strategy satisfies them implicitly by the Markovian diffusion process that takes into account the weights of the links. On the contrary, the routing strategy used in the Strength Proportional approach sends travelers through shortest paths without taking into account the weights of the links. Therefore, according to the observed fluxes, we can conclude that travelers pass through the same airports independently of the diffusion strategy used, albeit the routes followed are completely different.

Figure 1
figure 1

Node and link output fluxes. Histogram of the registered outgoing fluxes at node level (left) and at link level (right), comparing the two strategies Random Diffusion (top) and Strength Proportional (bottom) with respect to the O-D strategy. The color accounts for the number of nodes (left) or links (right) in each bin of the histogram. The values of the Pearson correlation are 0.998 (RD) and 0.952 (SP) for node flux, and 0.979 (RD) and 0.304 (SP) for link flux. The network is the World ATN, and the measured normalized fluxes do not depend on the details of the epidemic process.

Although the Random Diffusion and the Strength Proportional strategies show different routing dynamics, it should be analyzed whether these differences are relevant for the epidemic spreading process. With this purpose, we compute the fraction of recovered individuals R at each node (remember that a node is a population) for both diffusion strategies, and we compare them to the ones obtained with the O-D flows. We can see in Fig. 2 the distribution of R for the three mobility schemes. We observe that, the larger the population size in the node (which is proportional to its strength, see Methods), the larger the incidence of the epidemics, with a non-linear dependency and an important dispersion between nodes of similar strength. The three schemes show a similar distribution profile, thus we need a more direct comparison of incidences node to node. Since the estimated errors on the mean incidence values at each node are mostly below 0.03 (see Fig. 2), differences in the incidence about or larger than 0.1 (in absolute value) can be considered as statistically significant. In the left plots of Fig. 3 we can observe an increasing difference as a function of the out strength of the node up to values about 104, and then the differences start to decrease. The right plots of Fig. 3 show the fraction of recovered individuals R from the Random Diffusion and the Strength Proportional strategies comparing them with respect to the ones obtained with the O-D flows. Again, we do not observe any difference between the patterns followed by the Random Diffusion and the Strength Proportional approaches, but it has become evident that the routing strategies greatly affect the epidemic spreading at the level of nodes. Therefore, the availability of O-D information is completely necessary to be able to have good predictions of the microscopic incidence of the epidemics.

Figure 2
figure 2

Incidence of the epidemics on the nodes. Left: distribution of the incidence of the epidemics R across the nodes, as a function of the out strength of the node, for the different routing schemes, O-D strategy (top), Random Diffusion (middle), and Strength Proportional (bottom). The value of the incidence for each node has been obtained by averaging over 150 Monte Carlo simulations. Right: the corresponding estimated errors on the previous mean values. The color encodes the number of nodes in each bin (left) and the average error in the epidemic incidence for the nodes in each bin (right). The network is the World ATN, and the parameters of epidemic spreading process are μ = 0.04, R0 = 2.0, and \(\lambda =5\cdot {10}^{-6}\).

Figure 3
figure 3

Comparison of the incidence of the epidemics on the nodes. Left: Differences in the fraction of recovered individuals per node, as a function of the out strength of the node, comparing the two strategies, Random Diffusion (top) and Strength Proportional (bottom), with respect to the O-D strategy. Right: Histogram of the fraction of recovered individuals R at \(t\to \infty \) comparing the two strategies, Random Diffusion (top) and Strength Proportional (bottom), with respect to the O-D strategy. The network is the World ATN, and the parameters of epidemic spreading process are μ = 0.04, R0 = 2.0, and \(\lambda =5\cdot {10}^{-6}\).

Macroscopic analysis

According to the results presented in Figs 1 and 3, the different routes that travelers follow in the Random Diffusion and the Strength Proportional strategies show the same patterns of discrepancies when we compare them to the O-D flows in epidemic spreading terms at the microscopic level. Next, we analyze whether these microscopic differences have any significant effect at the macroscopic level.

In Fig. 4 we show both the fraction of recovered individuals R and the fraction of subpopulations that experienced and outbreak D/V as a function of the traffic generation rate λ. The results show that, especially for the fraction of recovered individuals, the deviations observed between the O-D matrices and the Random Diffusion and Strength Proportional strategies are minimal at the macroscopic level. However, the Strength Proportional strategy seems to predict a slightly lower value of D/V than the other two strategies, with a deviation which increases for high values of the mobility.

Figure 4
figure 4

Global incidence of the epidemics for varying mobility. Fraction of recovered individuals R (top) and fraction of subpopulations that experienced an outbreak D/V (bottom) as a function of the traffic generation rate λ. The network is the World ATN subnetwork, and the epidemic spreading process has a recovery rate μ = 0.04.

Analyzing the effect of the reproductive number at the macroscopic level gives similar results. In Fig. 5 we show the fraction of recovered individuals R and the fraction of subpopulations that experienced and outbreak D/V as a function of the reproductive number R0. Taken together the results of Figs 4 and 5, they confirm that only minimal differences are present between the three diffusion strategies in terms of epidemic spreading at the macroscopic level. In the same figure we also show that this behavior observed for the Random Diffusion and the Strength Proportional strategies cannot be generalized to the Random Destination strategy, which does not take into account the weights of the links in the network. The Random Destination strategy shows a slight reduction in the fraction of recovered individuals, but an enormous overestimation of the fraction of infected subpopulations, thus discarding it as a feasible mobility method.

Figure 5
figure 5

Global incidence of the epidemics for varying reproductive number. Fraction of recovered individuals R (top) and fraction of subpopulations that experienced an outbreak D/V (bottom) as a function of the reproductive number R0 for different routing strategies. The network is the World ATN, and the parameters of epidemic spreading process are μ = 0.04 and \(\lambda =5\cdot {10}^{-6}\).

Discussion

We have analyzed the influence of the three different mobility models in a metapopulation model of epidemic spreading. Two of them, the Random Diffusion and Strength Proportional routing strategies, have been traditionally used in the simulation of epidemic spreading processes. Here, we have compared them against the more realistic O-D matrix scheme, which is able to capture not only the observed fluxes through the links of the network, but also the fraction of individuals moving between each origin-destination pair. We have shown how the origin-destination matrix can be calculated, and the differences between using these three different models.

Although the three methods show equivalent fluxes per node, the Strength Proportional clearly deviates at the level of fluxes per link, thus indicating the routes that travelers follow are completely different. These differences have an important impact on the epidemic spreading at the microscopic level of nodes, with significant departures in the incidence of the epidemics in the subpopulations at each node, and being especially important for nodes with intermediate values of the output strength. The observed deviations affect both the Strength Proportional and Random Diffusion models when compared with the more reasonable O-D matrix scheme, and with a very similar deviation pattern.

When we consider the epidemic spreading at the macroscopic level of the whole World ATN network, we realize the differences almost vanish, thus making the Random Diffusion, the Strength Proportional and the O-D matrix approaches basically equivalent. This means it is safe to use any of the three mobility models if the interest just lays on the global incidence of the epidemics, but care must be taken when the details are needed at the level of nodes, where only origin-destination information is able to provide the desired quality of the predictions.

Methods

O-D matrix of air transportation network

The World Air Transportation Network (ATN) data set used in this work is composed of passenger flights operating between November 1, 2000 and October 31, 2001 as compiled by the OAG Worldwide (Downers Grove, IL) and analyzed previously in22. The network is formed by 3618 different airports, that means the existence of a total of 13086306 possible O-D trips. The analysis of all the shortest paths using the Floyd-Warshall algorithm23 shows that there are 27028 (0.21%) direct connections (shortest paths of length one), 440038 (3.36%) connections with one transfer (length two), and 2463230 (18.82%) with two transfers (length three). These are the only trips considered here, since longer air trips are rarely observed in real life. Additional cost factors cij (equal to 1, 2, or 3), accounting for the length of the route from origin node i to destination node j, have been added to the objective function E in Eq. (5) in order to avoid the dominance of longer trips:

$${E}_{c}=\sum _{i=1}^{n}\,\sum _{j=1}^{n}\,{c}_{ij}\,({t}_{ij}\,\mathrm{ln}({t}_{ij})-{t}_{ij}).$$
(6)

With this final formulation of the objective function Ec, the O-D matrix estimation problem becomes a separable convex optimization problem that we have solved using MOSEK24, a software especially designed for large-scale mathematical optimization problems.

Epidemic spreading model implementation

The epidemic model is implemented as follows. Each simulation starts with a small fraction of infected individuals, I0. Namely, we randomly choose a small fraction of sites (less than 1%) and within these subpopulations only the 1% of the individuals is infected, ensuring that the condition \({I}_{0} > \tfrac{\mu }{\beta }\) is fulfilled, where β and μ are the infection and recovery rates, respectively, and \({R}_{0}=\tfrac{\beta }{\mu }\) is the so-called reproductive number. In the simulations the diffusion and reaction dynamics have the same time scale so, at each time step, first a movement step is performed and then the SIR dynamics takes place.

In the diffusive phase, for each possible pair of origin-destination nodes (i, j), the number of individuals starting a trip from node i towards node j is given by a binomial distribution with parameters the subpopulation size Ni(t) and the probability \({p}_{ij}=\lambda \,\tfrac{{t}_{ij}}{{\sum }_{k}\,{t}_{ik}}\), where 0 ≤ λ ≤ 1 is a mobility rate to distinguish between different traffic regimes, and tij is the (i, j)-th entry of the O-D matrix. To simulate the Strength Proportional and Random Destinations schemes, the O-D matrix is substituted respectively by a Strength-Destination matrix S whose elements are equal to the input strength of the destination node, \({s}_{ij}={w}_{j}^{\,(in)\,}={\sum }_{k}\,{w}_{kj}\), and by an uniform Random-Destination matrix U in which all the elements equal the unity, uij = 1. Once the number of traveling individuals and their respective destinations have been chosen, they start a shortest path trip to their destination, making one hop to a neighboring node at each time step. When the individuals arrive to their destinations, they come back to their home nodes, also following a shortest path, with one hop per time step. This travel pattern yields and almost stationary size of the subpopulations for all the mobility schemes, provided the mobility rate is not too large.

After the diffusive step, the reaction dynamics is evaluated. As a proxy of the initial population of each node Ni(0) we use node’s strength wi calculated in the original ATN data22, which amounts a total of \(5.82\cdot {10}^{7}\) individuals in the network. The distribution of subpopulation sizes (i.e., strengths) approximately follows a power-law distribution, with sizes ranging between 10 and \(3.05\cdot {10}^{6}\) individuals. When time goes on, Ni(t) changes according to the chosen destination selection scheme and routing strategy. Within the nodes, one step of a SIR process takes place, supposing a well-mixed population. The state of every individual inside a node i is modified according to the following probabilities: a susceptible individual becomes infected with probability \({p}^{(S\to I)}=1-{(1-\tfrac{\beta }{{N}_{i}})}^{{I}_{i}}\), and an infected one recovers with probability \({p}^{(I\to R)}=\mu \). Specifically, the exact number of individuals that change their state is determined by binomial distributions with probabilities \({p}^{(S\to I)}\) and \({p}^{(I\to R)}\), and population sizes Si(t) and Ii(t) of susceptible and infected individuals, respectively. Note that in this scenario, R0 only participates in the internal nodes’ dynamics; individuals traveling through node i are involved in the epidemic dynamics and thus they can change their state while at node i. Finally, simulations end when the stationary state I(t) = 0 is reached.