Identifying influential spreaders in complex networks by an improved gravity model

Identification of influential spreaders is still a challenging issue in network science. Therefore, it attracts increasing attention from both computer science and physical societies, and many algorithms to identify influential spreaders have been proposed so far. Degree centrality, as the most widely used neighborhood-based centrality, was introduced into the network world to evaluate the spreading ability of nodes. However, degree centrality always assigns too many nodes with the same value, so it leads to the problem of resolution limitation in distinguishing the real influences of these nodes, which further affects the ranking efficiency of the algorithm. The k-shell decomposition method also faces the same problem. In order to solve the resolution limit problem, we propose a high-resolution index combining both degree centrality and the k-shell decomposition method. Furthermore, based on the proposed index and the well-known gravity law, we propose an improved gravity model to measure the importance of nodes in propagation dynamics. Experiments on ten real networks show that our model outperforms most of the state-of-the-art methods. It has a better performance in terms of ranking performance as measured by the Kendall’s rank correlation, and in terms of ranking efficiency as measured by the monotonicity value.


Results
Algorithms. Firstly, we take a toy network shown in Fig. 1 to illustrate the resolution limit problem for DC and KS. The degree and k-shell values of each node in the toy network are shown in Table 1. Obviously, k(1) = k(8) = k(9) = 1 , k(2) = k(3) = 3 , k(4) = k(5) = k(6) = 4 , k s (1) = k s (8) = k s (9) = 1 , k s (2) = k s (3) = 2 , k s (4) = k s (5) = k s (6) = k s (7) = 3 , where k(i) and k s (i) are the degree and k-shell value of node i, respectively. DC and KS always assigns too many nodes with the same value, which leads to the problem of resolution limitation in distinguishing the real influences of these nodes.
A simple solution is to consider both DC and KS, that is, to estimate the influence of node i by k(i) + k s (i) . However, the problem has not been completely solved. Take node 2 and node 3 as an example, compared with node 2, node 3 is closer to the center of the network, so node 3 may be more conducive to propagation. However, we cannot distinguish the two nodes by the above proposed method. Although both node 2 and node 3 are in the 2-shell, node 3 is removed later than node 2, that is, the 2-shell decomposition process includes two stages, node 2 is removed in the first stage and node 3 is removed in the second stage. So we introduce the stage number at which the node is removed from the network while performing the k-shell decomposition.
Given a network G, during the process of k-shell decomposition for the k-degree iteration, the total number of stages is q(k), and node i is removed in the p(i) stage. The improved k-shell index of node i , denoted by k * s (i) , can be calculated by The process of k-shell decomposition and the k * s value of each node in the toy network are shown in Table 2  and Table 3, respectively. Take node 3 as an example, q(1) = 1 , q(2) = 2 , q(3) = 1 , and then max k q(k) = 2 , so k * s (3) = k s (3) + p(3)/(max k q(k) + 1) = 2 + 2/(2 + 1) ≈ 2.667. The index combining degree and k-shell of node i, denoted by DK(i), can be defined by Such index is named as degree k-shell (DK) index. The DK value of each node in the toy network are shown in Table 4. As shown in Table 4, node 2 and node 3 can be distinguished (DC, KS, DC+KS failed), node 7 can (2) DK(i) = k(i) + k * s (i).  www.nature.com/scientificreports/ be distinguished from nodes 4-6 (KS failed), so DK index is a high-resolution index. Furthermore, DK carries both the local and global information of nodes. Inspired by the gravity law, we regard DK value of a node as its mass and the shortest distance between two nodes in the network as their distance. Hence the influence of node i can be estimated as follows where d(i, j) is the shortest distance from node i to node j and R is the truncation radius 29 . Such method is named as DK-based gravity model (DKGM). The algorithmic description of the DKGM is provided in Algorithm 1.     www.nature.com/scientificreports/ The result of DKGM with R = 2 of the toy network is shown in Table 5. Take node 3 as an example, the 1-order neighbors of node 3 are node 2, node 4 and node 7, the 2-order neighbors of node 3 are node 1, node 5 and node 6, so By Algorithm 1, we can find that calculating the improved k-shell index needs the following times operations, N ks1 �k� + N ks2 �k� + · + N ksmax �k� = (N ks1 + N ks2 + · + N ksmax )�k� = N k = M, so the computational complexity of this part is O(M), where N ks1 is the number of 1-shell nodes, ksmax is the max k-shell value and k is the average degree. The part with the highest computational complexity in our model is computing the R-order neighbors of each node, it needs N k R times operations, so the computational complexity of this part is O(N k R ) . Therefore, the computational complexity of our model is O(N k R ) . Fortunately, since most real networks are of small-world property, R is usually set to 2 or 3 to obtain the optimal result. So the computational complexity of our model in real-life applications is generally not more than O(N k 3 ) , where �k� ≪ N.

Data description.
In this paper, we use ten real networks from different fields to test the performance of DKGM, including four social networks (PB 41 , Facebook 42 , WV 43 and Sex 44 ), two collaboration networks (Jazz 45 and NS 46 ), one transportation network (USAir 47 ), one communication network (Email 48 ), one infrastructure network (Power 49 ) and one technological network (Router 50 ). These networks' topological features are shown in Table 6, including the number of nodes, denoted by N, the number of links, denoted by M, the average degree, denoted by k , the average distance, denoted by d , the clustering coefficient 49 , denoted by C, the assortative coefficient 51 , denoted by r, the degree heterogeneity 52 , denoted by H, and the epidemic threshold 53 of the SIR model 54 , denoted by β c .
Empirical results. In this paper, we apply the famous SIR model 54 to compare the influential rankings produced by algorithms and simulations. Given the network and infection rate β , 1000 independent implementations are performed and averaged in order to obtain the standard ranking of the influences of nodes (see details about SIR model in Methods). In each implementation every node is selected once as the seed once. The accuracy of an algorithm is measured by Kendall's Tau ( τ) 55 (see details about the Kendall's Tau in Methods) between the standard ranking and the ranking produced by the algorithm. The larger the value of τ , the better the perfor-  Table 7, and the accuracies of different β values are shown in Fig. 2.
As shown in Table 7, compared with the five classic methods (DC, KS, H-index, BC, CC), GC, LGM and DKGM are very competitive. Especially in the NS, Power and Router networks, the advantage of the gravity-based methods are extremely obvious. It can be seen from Table 6 that NS, Power and Router are extremely sparse (with very few links). In this tree-like networks, there are very few cycles, that is, most paths have no alternative paths, so propagation is very difficult. In this case, neither the neighborhood-based methods (DC, KS and H-index) nor the path-based methods (BC and CC) can work well. Furthermore, compared with GC and LGM, DKGM Table 7. The algorithms' accuracies measured by Kendall's Tau for β = β c . The parameters in the related algorithms (i.e., LGM and DKGM) are adjusted to their optimal values according to the largest τ . The best algorithm for each network is emphasized by bold. www.nature.com/scientificreports/ always performs best. As shown in Figure 2, DKGM also performs very competitive compared with the seven benchmark algorithms for different β not too far from β c . The optimal truncation radius R * of LGM can be estimated by at β = β c 29 . As shown in Figure 3, DKGM still keeps this property. Furthermore, the accuracies of GC, LGM with R = �d�/2 and DKGM with R = �d�/2 for β = β c are compared in Table 8. As shown in Table 8, although the truncation radius is set heuristically, DKGM still performs best among the three algorithms.  Figure 3. The relation between R * of DKGM and d for β = β c . Ten circles represent ten real networks and the slope of the blue line is 1/2. The black circle is the Power network. Although the optimal truncation radius R * = 6 in the Power network is slightly different from what Eq. 4 predicts (i.e., R = 9 ), the algorithmic accuracy at R = 9 ( τ = 0.7366 ) is very close to the best accuracy at R * = 6 ( τ = 0.7575).  www.nature.com/scientificreports/ Finally, we apply the monotonicity 56 , denoted by M r , to measure the ranking efficiency of algorithms. This metric is used to measure the uniqueness of the elements in a ranking list and it can be computed by where L is the ranking list, and N t (r) is the number of ties with the same rank r.
The monotonicity of node ranking list produced by different algorithms is shown in Table 9. As shown in Table 9, except the PB network, DKGM always performs best among the eight algorithms. In the PB network, the reason why GC narrowly defeated DKGM is that DKGM just considers 1-order neighbors while GC considers 3-order neighbors. The results reported in Table 9 demonstrate DKGM is a remarkably high-resolution algorithm.

Discussion
Degree centrality and the k-shell decomposition method, as the most widely used neighborhood-based centralities, were introduced to the network world to evaluate the spreading ability of the nodes. However, the two methods always assign too many nodes with the same value, which leads to the problem of resolution limitation in distinguishing the real influences of these nodes. To solve the above problem, combining the two methods (i.e., DC and KS), we propose a high-resolution index (DK) that can simultaneously reflect the local and global information of nodes. Furthermore, we propose an improved gravity model (DKGM) that combining DK index and the gravity law to evaluate the spreading ability of nodes. The empirical results show that DKGM performs best in comparison with seven well-known benchmark methods and DKGM is a remarkably high-resolution algorithm.
A potential disadvantage of DKGM is how to set truncation radius R. Fortunately, as shown in Fig. 3, we find an empirical relation between R * and the average distance d , so we can use the relation (see Eq. 4) to approximate R * . In addition, since most real networks are of small-world property 49,57 , R * should be small, it can be set to 2 or 3 generally.
There are still some potential problems in the future. First of all, the original law of gravity is symmetrical, but due to the different effects of different nodes or the inherent asymmetry of dynamics 58,59 , the influence of node i on node j may be different from that of node j on node i, in which the asymmetric form of gravity law may be involved. Secondly, as the heterogeneity of the links greatly change their importance 60 , how to use gravity model in the weighted networks is still an open issue. We will also develop some other better methods based on the gravity law to identify influential spreaders.

Methods
Benchmark centralities. We denote an undirected and unweighted network as G =< V , E > , where V and E are the sets of nodes and links, respectively, denote |V | = N and |E| = M , so the network has N nodes and M links. The adjacent matrix of G is represented by A = (a ij ) N×N , if there is a link from node i to node j, a ij = 1 , otherwise, a ij = 0. DC 17 of node i can be calculated by where k(i) = j a ij . KS 18 works by iterative decomposition of the network into different shells. The first step of KS is to remove all the nodes in the network whose degree k = 1 . Then it remove nodes whose degree k ≤ 1 after one round removal because this step may lead to the reduction of the degree values during the process of removal. Until there are no nodes in the network with degree k ≤ 1 , all the nodes which have been removed in this step create 1-shell and their k-shell values are equal to one. Then repeat this process to obtain 2-shell, 3-shell, ... , and so on. Finally all nodes are divided into different shells and the k-shell value of each node can be obtained.
The H-index 19 of node i, represented by H(i), is defined as the maximal integer value satisfying that there are at least H(i) neighbors of node i and degrees of these neighbors are all no less than H(i).
BC 20 of node i can be calculated by where g st is the number of shortest paths from node s to node t, and g st (i) is the number of shortest paths from node s to node t that pass through node i. CC 21 of node i can be calculated by GC 28 of node i can be calculated by where ψ i is the neighborhood set whose distance to node i is less than or equal to 3.  54 initially considers all nodes as susceptible (S) except the source node in the infected (I) state. Each infected node can infect its susceptible neighbors with probability β . In each subsequent step, all infected nodes change their own states to recovered (R). A node in the recovered state will never participate in the propagation dynamic process with the probability . The propagation process continues until there are no nodes in the infected state. The influence of node i can be estimated by where N r is the number of recovered nodes when dynamic process achieving steady state. is set to 1 for simplicity, and the corresponding epidemic threshold 53 is where k 2 is the second-order moment of the degree distribution.
The Kendall's Tau. The Kendall's Tau 55 is a measure of the strength of correlation between two sequences. X = (x 1 , x 2 , ..., x N ) and Y = (y 1 , y 2 , ..., y N ) are two sequences with N elements. For any pair of two-tuples (x i , y i ) and (x j , y j ) (i = j) , if x i > x j and y i > y j or x i < x j and y i < y j , the pair is concordant. If x i > x j and y i < y j or x i < x j and y i > y j , the pair is inconsistent. If x i = x j or y i = y j , the pair is neither concordant nor inconsistent.
Kendall's Tau of X and Y can be defined as where n + is the number of concordant pairs and n − is the number of discordant pairs.

Data availability
All relevant data are available at https:// github. com/ MLIF/ Netwo rk-Data. (11) F(i) = N r /N, www.nature.com/scientificreports/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.