Identifying influential spreaders by gravity model

Identifying influential spreaders in complex networks is crucial in understanding, controlling and accelerating spreading processes for diseases, information, innovations, behaviors, and so on. Inspired by the gravity law, we propose a gravity model that utilizes both neighborhood information and path information to measure a node’s importance in spreading dynamics. In order to reduce the accumulated errors caused by interactions at distance and to lower the computational complexity, a local version of the gravity model is further proposed by introducing a truncation radius. Empirical analyses of the Susceptible-Infected-Recovered (SIR) spreading dynamics on fourteen real networks show that the gravity model and the local gravity model perform very competitively in comparison with well-known state-of-the-art methods. For the local gravity model, the empirical results suggest an approximately linear relation between the optimal truncation radius and the average distance of the network.


Identifying influential spreaders by gravity model
Zhe Li 1 , Tao Ren 1 , Xiaoqi Ma 2 , Simiao Liu 1 , Yixin Zhang 1 & Tao Zhou 3 Identifying influential spreaders in complex networks is crucial in understanding, controlling and accelerating spreading processes for diseases, information, innovations, behaviors, and so on. Inspired by the gravity law, we propose a gravity model that utilizes both neighborhood information and path information to measure a node's importance in spreading dynamics. In order to reduce the accumulated errors caused by interactions at distance and to lower the computational complexity, a local version of the gravity model is further proposed by introducing a truncation radius. Empirical analyses of the Susceptible-Infected-Recovered (SIR) spreading dynamics on fourteen real networks show that the gravity model and the local gravity model perform very competitively in comparison with well-known state-of-the-art methods. For the local gravity model, the empirical results suggest an approximately linear relation between the optimal truncation radius and the average distance of the network.
Network science is playing an increasingly significant role in many domains including physics, sociology, engineering, biology, management, and so on 1 . The heterogeneous nature of real networks 2 asks for a crucial question: How to quantitatively measure a node's importance in a dynamical process? Taking spreading dynamics as an example, a popular star in Twitter may remarkably accelerate a rumor and a few superspreaders could largely expand the epidemic prevalence of a disease 3 . Therefore, a good answer to the above question, namely an efficient algorithm to identify influential spreaders in complex networks, can help to better control the outbreak of an epidemic 4 , optimize the use of limited resources to facilitate the dissemination of information 5 , prevent catastrophic disruptions of power grid or the Internet 6 , discover the candidates of drug target and essential proteins 7 , and so on. Till far, most known methods only make use of the structural information 8 , which can be roughly classified into neighborhood-based centralities and path-based centralities.
Typical representatives of the neighborhood-based centralities are degree centrality 9 (DC), H-index 10 and k-shell decomposition method 11 (KS). For DC, nodes with larger degrees are more influential. For H-index, nodes connecting with many large-degree neighbors are more influential. KS assigns a k-shell index to each node based on its topological location, where nodes closer to the core of the network will get higher k-shell indices, and nodes in the periphery will get lower k-shell indices. The nodes with higher k-shell indices are considered to be more influential. Besides, PageRank 12 and LeaderRank 13 are two representative neighborhood-based iterative methods, both suggesting that the influence of a node is determined by the influences of its neighbors. Two well-studied path-based centralities are closeness centrality 14 (CC) and betweenness centrality 15 (BC). CC claims that a node averagely closer to other nodes is more influential while BC assumes that a node locating in many shortest paths is of high influence.
Inspired by the gravity law, recently, Ma et al. 16 proposed two gravity-law-based algorithms by considering both neighborhood information and path information (see Methods for the details of algorithms). Analogously, we proposed a variant algorithm named gravity model (GM), which also takes into account both neighborhood information and path information, where a node with larger degrees (neighborhood information) and averagely shorter distances to other nodes (path information) is more influential. Furthermore, we propose a local version of the gravity model (named as local gravity model, LGM for short) to lower the computational complexity and reduce the possible noise caused by interactions at distance. Such local model only accounts for pairwise interactions within a truncation radius. Empirical results show that GM and LGM perform very competitively in comparison with well-known state-of-the-art methods. In particular, for LGM, an empirically linear relation between the optimal truncation radius and the average distance of the network is observed.

Results
Algorithms. Individually speaking, nodes with large degrees are likely to be more influential. In addition, a node is of higher impacts on nearby nodes 17 . According to the above issues and inspired by the gravity law, we regard the degree of a node as its mass, and the shortest distance between two nodes as their distance. Hence a node i's influence can be estimated as where k i is the degree of node i, d ij is the shortest distance between node i and node j, and j runs over all nodes other than i. Obviously, a node with many neighbors and be close to most nodes is more influential according to Eq. 1. Such method is named as gravity model as it adopts the formula of the gravity law. Although GM can identify the nodes averagely closer to other nodes and with larger degrees, it has two shortcomings. Firstly, to calculate shortest distances between all node pairs is time-consuming for large-scale networks 18 . Secondly, in real propagation a node is hard to impact other nodes at distance and to estimate the interacting strength between distant nodes is usually inaccurate since the step-by-step decaying influence will be disturbed by accumulated noise 19 . Therefore, by introducing a truncation radius, we only consider the pairwise interactions within the truncation radius. Hence a node i's influence can be estimated as where R is the truncation radius. Such method (Eq. 2) is named as local gravity model as it only takes into account local information of the network.
Data description. In this paper, fourteen real networks from disparate fields are used to test the performance of GM and LGM, including three collaboration networks (Jazz, NS and GrQc), four communication networks (EEC, Email, PG and Enron), four social networks (PB, Facebook, WV and Sex), one transportation network (USAir), one infrastructure network (Power) and one technological network (Router). Jazz 20 is a collaboration network of jazz musicians. NS 21 is a co-authorship network of scientists working on network science. GrQc 22 is a collaboration network of eprint articles in arXiv categories General Relativity and Quantum Cosmology. EEC 23  www.nature.com/scientificreports www.nature.com/scientificreports/ Empirical results. We apply the well-known SIR model 36 to compare the rankings of influences produced by algorithms and simulations. Initially, one node (called seed) in the network is in the infected state (I) and the others are in the susceptible state (S). Each of the infected nodes can infect its susceptible neighbors with probability β. And in each step, every infected node changes to be recovered and will never participate in the dynamics with probability λ. The spreading process repeats until there are no more infected nodes in the network. The influence of any node i can be estimated by r where N r is the number of recovered nodes at the end of the dynamics. For simplicity, we set λ = 1, and the corresponding epidemic threshold 34 is where 〈k〉 and 〈k 2 〉 denote the average degree and the second-order moment of the degree distribution. Given a network and the transmission probability β, to obtain the standard ranking of nodes' influences, we implement 1000 independent runs, in each run every node is selected once as the seed once. The accuracy of an algorithm is measured by the Kendall's Tau (τ) 37 between the standard ranking and the ranking by the algorithm (see details in Methods). A larger value of τ means a stronger correlation between the two sequences and thus a better performance. Table 2 compares the accuracies of the two proposed algorithms (i.e., GM and LGM) and seven benchmark algorithms (see details about the benchmark algorithms in Methods). The transmission probability for each case is fixed as β = β c (for more values of β, see Fig. 1) and the parameters in relevant algorithms are all adjusted to their optimal values subject to the largest τ.
As shown in Table 2, both GM and LGM are very competitive. In particular, G+ and LGM perform best among the nine algorithms. Notice that, G+ also adopts the gravity formula 16 (see Methods) but a node's mass in G+ is defined as its k-shell index so G+ is indeed a global index. The results reported in Table 2 demonstrate the advantage of gravity models (e.g., G, G+, GM, LGM) and show that a local index (LGM) can outperform most benchmark algorithms including some global indices. As shown in Fig. 1, results for other values of β not too far from the threshold are consistent to the one at β c , suggesting the robustness of our findings.
Since to determine the optimal truncation radius, denoted by R * , asks for more computation, we want to see whether topological information can be used to profile R * . As shown in Fig. 2, R * approximately scales linearly with the average distance, as at β = β c . Such approximately linear relation also holds for other values of β not so far from β c . This empirical relation can save computational cost in practice.

Discussion
To measure influences of nodes in a certain networked dynamics, a straightforward method is to estimate the interacting strengths between node pairs in advance. The gravity law is a simple, elegant and representative formula that estimates the interacting strength between two nodes by simultaneously considering the intrinsic influences of the two nodes themselves and the distance between them. In this paper, the gravity model (Eq. 1) makes use of both the neighborhood information and the path information, which were separately adopted in many previous methods. Furthermore, to reduce the computational complexity and to avoid the accumulated noises  www.nature.com/scientificreports www.nature.com/scientificreports/  The relation between R * and 〈d〉 for β = β c . Fourteen pentagrams represent fourteen networks and the slope of the blue line is 1/2. The pentagram in black is the outlier -the Enron network. Although the optimal truncation radius R * = 7 is much different from what Eq. 5 predicts (i.e., R = 2), the algorithmic accuracy at R = 2 (τ = 0.4949) is very close to the best accuracy at R * = 7 (τ = 0.5075) in comparison with the traditional methods (e.g., about 0.34 for BC, 0.42 for CC and 0.46 for DC, KS and H-index). That is to say, to apply Eq. 5 can still achieve much better algorithmic performance than the traditional methods. www.nature.com/scientificreports www.nature.com/scientificreports/ through long paths, we proposed a local version of the gravity model (LGM, see Eq. 2). Both GM and LGM are very competitive, and of particular interests, the LGM requires less computation yet performs even better. Indeed, LGM is one of the two best-performed methods among many well-known benchmark algorithms.
A potential disadvantage of LGM is that it has a free parameter, namely the truncation radius R. The negative effects of the existence of R are twofold. Firstly, it asks for more computation to determine the optimal value of R. Secondly, if the optimal value, say R * , is very large, the computational complexity of LGM will be more or less the same to GM. Fortunately, as shown in Fig. 2, we found an empirical relation between R * and the average distance 〈d〉, so that if the computational resource is highly limited, we can use the relation (see Eq. 5) to approximate R * . In addition, since most real networks are of small-world property 31,38 , R * should be small and thus it requires much less computation than GM. Fortunately, the difference between two rankings of nodes produced by neighboring R will quickly converge to a very small value, so that to choose a small value of R will probably perform very well. In Table 3, we show the values of τ(R), which is the Kendall's tau between two rankings of nodes' influences with truncation radius being R and R + 1. One can observe that after R = 5, all networks are of τ(R) > 0.97 and a half of them are of τ(R) > 0.99. This indicates a strong saturation, namely the increasing of R will produce almost the same rankings if the value of R is already large.
Another similar model (named G+, see Eq. 11) shows very close performance to LGM. In comparison, LGM is more efficient since it completely depends on the local topological structure and thus can be calculated not only faster but also under the case where the global topology is not known. In the absence of global topology, G+ cannot be obtained since it sets a node's k-shell index as its mass, and to determine the k-shell index needs the knowledge of the whole network. In despite the difference between G+ and LGM, the very good performance of G+ and LGM strongly suggest the validity and advantage of the usage of the gravity law to estimate the interacting strength. Of course, both G+ and LGM are very simple and general, which can be further improved by the following aspects (also leaving as open issues for future studies). Firstly, by introducing a few tunable parameters that can adjust the relative importance of mass and distance (e.g., to replace d 2 by some d a and/or to replace k by some k b ) may result in more accurate predictions as indicated by known variants of the gravity law in other applications 39 . Secondly, we should explore how the topological features and dynamical processes affect the prediction accuracy and thus improve the original methods by introducing some topology-dependent and/ or dynamics-sensitivity items 40,41 . Thirdly, the original gravity law is symmetric, while due to the different roles of different nodes or the essentially asymmetric nature of the dynamics 42,43 , the influence from node i onto node j could be different from the influence from node j onto node i, where an asymmetric form of the gravity law may be relevant.

Methods
The Kendall's Tau. The Kendall's Tau 37 is an index measuring the correlation strength between two sequences. Considering two sequences with N elements, X = (x 1 , x 2 , …, x N ) and Y = (y 1 , y 2 , …, y N ). Any pair of two-tuples (x 1 , y 1 ) and (x j , y j ) (i ≠ j) are concordant if both x i > x j and y i > y j or both x i < x j and y i < y j . They are discordant if x i > x j and y i < y j or x i < x j and y i > y j . If x i = x j or y i = y j , the pair is neither concordant nor discordant. The Kendall's Tau of two sequences X and Y can be calculated as where n + and n − denote the number of concordant and discordant pairs, respectively. It can be seen that the extent to which τ exceeds zero indicates the strength of the correlation.
Benchmark centralities. Degree Centrality 9 of node i is defined as  Table 3. The Kendall's Tau between two rankings of nodes' influences produced by the LGM with truncation radius R and R + 1.
www.nature.com/scientificreports www.nature.com/scientificreports/ ∑ = DC i a ( ) , j ij where A = {a ij } is the adjacency matrix, that is, a ij = 1 if i and j are connected and 0 otherwise. H-index 10 of node i, denoted by H(i), is defined as the maximal integer satisfying that there are at least H(i) neighbors of node i whose degrees are all no less than H(i). Such index is an extension of the famous H-index in scientific evaluation 44   where g st is the number of shortest paths between nodes s and t, and g st (i) is the number of shortest paths between nodes s and t that pass through node i. Gravity Centrality 16 (G) of node i is defined as where k s (i) is the k-shell index of node i, and ψ i is the set of nodes whose distance to node i is less than or equal to 3. Extended Gravity Centrality 16 (G+) of node i is defined as where Λ i is the set of neighbors of node i.

Data Availability
All relevant data are available at https://github.com/MLIF/Network-Data.