Identifying influential spreaders by gravity model considering multi-characteristics of nodes

How to identify influential spreaders in complex networks is a topic of general interest in the field of network science. Therefore, it wins an increasing attention and many influential spreaders identification methods have been proposed so far. A significant number of experiments indicate that depending on a single characteristic of nodes to reliably identify influential spreaders is inadequate. As a result, a series of methods integrating multi-characteristics of nodes have been proposed. In this paper, we propose a gravity model that effectively integrates multi-characteristics of nodes. The number of neighbors, the influence of neighbors, the location of nodes, and the path information between nodes are all taken into consideration in our model. Compared with well-known state-of-the-art methods, empirical analyses of the Susceptible-Infected-Recovered (SIR) spreading dynamics on ten real networks suggest that our model generally performs best. Furthermore, the empirical results suggest that even if our model only considers the second-order neighborhood of nodes, it still performs very competitively.

The focus of network science has been shifting from discovering macroscopic statistical regularities to microscopic elements, vital nodes identification has received a huge amount of attention from researchers of network science in recent years. Vital nodes identification can be widely used in disease analysis 1,2 , rumor analysis 3 , information propagation 4 , power grid protection 5 , discovery of candidate drug targets and essential proteins 6 , discovery of important species 7,8 , and so on.
So far, most known methods only use structural information 9 , which can be classified into neighborhoodbased centralities and path-based centralities roughly. Typical representatives of neighborhood-based centralities are degree centrality 10 (DC), H-index 11 and k-shell decomposition method 12 (KS). For DC, the more neighbors a node has, the greater its influence. For H-index, the more large-degree neighbors a node has, the greater its influence. For KS, the more central a node locates in the network, the greater its influence. Besides, eigenvector centrality 13 (EC) is the representative neighborhood-based iterative method, suggesting that the influence of a node is not only determined by the number of its neighbors, but also determined by the influence of each neighbor. Typical representatives of path-based centralities are betweenness centrality 14 (BC) and closeness centrality 15 (CC). For BC, the more a node is located in shortest paths, the greater its influence. For CC, the closer a node is to other nodes, the greater its influence.
However, a significant number of experiments indicate that depending on a single characteristic of nodes to reliably identify influential spreaders is inadequate 9 . As a result, the methods integrating multi-characteristics of nodes have been proposed. In particular, the methods based on gravity law seem very promising. As several laws behind phenomena in life are similar to the gravity law, the gravity model, which derives from the gravity law, is also favored and exhibited in many real-life scenarios. Representative examples include predicting the population migration between regions in demography 16 and forecasting the trade flows throughout countries in economics 17 . In network science, the gravity model is utilized to evaluate the influence [18][19][20] of nodes, and so on. Recently, a series of gravity-law-based algorithms [18][19][20][21][22][23][24][25][26][27][28][29][30] considering both neighborhood information and path information have been proposed, and their performance is much better than the above well-known stateof-the-art methods. Typical representatives are gravity centrality 18 (GC), improved gravity centrality 19 (IGC) and local gravity model 20 (LGM). For GC, the k-shell value of a node is regarded as its mass. For IGC, the focal node uses the k-shell value as its mass while its neighbors view the degree value as their masses. For LGM, the degree value of a node is regarded as its mass. However, whether the degree or k-shell is regarded as mass, the influence of neighbors is not taken into consideration. In view of this, we propose a gravity model that effectively www.nature.com/scientificreports/ integrates multi-characteristics of nodes to measure the influence of nodes in spreading dynamics. In our model, the number of neighbors, the influence of neighbors, the location of nodes, and the path information between nodes are all taken into consideration.

Preliminaries
Well-known state-of-the-art methods. Denote G =< V , E > an undirected and unweighted simple network, where V and E are the sets of nodes and links. Denote |V | = N and |E| = M , then the network has N nodes and M links. The adjacent matrix of G is denoted by A = (a ij ) N×N , if node i links to node j, a ij = 1 , otherwise, a ij = 0. The degree centrality 10 (DC) of node i is measured by where k(i) = N j=1 a ij . The H-index 11 of node i, denoted by H(i), is defined as the maximal integer satisfying that there are at least H(i) neighbors of node i whose degrees are all greater than or equal to H(i).
The k-shell decomposition method 12 (KS) works by iterative decomposition of the network into different shells. The first step of KS is to remove the nodes whose degrees are equal to 1 from the network, which will cause a reduction of the degree value to the remaining nodes. Continually remove all the nodes whose residual degrees are less than or equal to 1, until all the remaining nodes' residual degrees are greater than 1. All the removed nodes in the first step form the 1-shell and their k-shell values are all equal to 1. Repeat this process to obtain 2-shell, 3-shell, . . . , and so on. The decomposition process will continue until there are no more nodes in the network.
The eigenvector centrality 13 (EC) of node i is measured by where c is a constant, generally speaking, c is set to the reciprocal of the largest eigenvalue of A. The betweenness centrality 14 (BC) of node i is measured by where g st is the number of shortest paths between node s and node t, and g st (i) is the number of shortest paths via node i between node s and node t. The closeness centrality 15 (CC) of node i is measured by where d(i, j) is the shortest distance from node i to node j. The gravity centrality 18 (GC) of node i is measured by where k s (i) is the k-shell value of node i, and ψ i is the neighborhood set whose distance to node i is not greater than 3. An extended version of GC, denoted by GC+, GC+ of node i is measured by where i is the neighborhood set whose distance to node i equals to 1. The improved gravity centrality 19 (IGC) of node i is measured by An extended version of IGC, denoted by IGC+, IGC+ of node i is measured by The local gravity model 20 (LGM) of node i is measured by , . www.nature.com/scientificreports/ where R is the truncation radius, and the optimal truncation radius R * can be estimated by where d is the average distance of the network.
The SIR model. The SIR model 31 initially considers all the nodes as in the susceptible (S) state except the source node in the infected (I) state. At each time step, each infected node can infect its susceptible neighbors with probability β . Then, each infected node enters the recovered (R) state with probability . The propagation process continues until there are no more nodes in the infected state. The influence of node i can be estimated by where N r is the number of recovered nodes when dynamic process achieves steady state. For simplicity, is set to 1, then the corresponding epidemic threshold 32 can be calculated by where k is the average degree, and k 2 is the second-order moment of the degree distribution.

The Kendall's Tau. The Kendall's Tau 33 is an index describing the strength of correlation between two
sequences. Denote X = (x 1 , x 2 , . . . , x N ) and Y = (y 1 , y 2 , . . . , y N ) are two sequences with N elements. For any pair of two-tuples (x i , y i ) and (x j , y j ) (i = j) , if both x i > x j and y i > y j or both x i < x j and y i < y j , the pair is concordant. If both x i > x j and y i < y j or both x i < x j and y i > y j , the pair is discordant. If x i = x j or y i = y j , the pair is neither concordant nor discordant. The Kendall's Tau of X and Y can be calculated by where n + is the number of concordant pairs, and n − is the number of discordant pairs.
The monotonicity. The monotonicity 34 M of ranking list L is used to quantitatively measure the resolution of different indices, and it can be calculated by where U is the size of L, and U r is the number of ties with the same rank r.

Results
Algorithms. According to previous studies, the degree value of a node indicates the number of its neighbors, the k-shell value of a node reflects where it locates in the network, the eigenvector centrality value of a node can reflect both the number of its neighbors and the influence of each neighbor, and the distance between two nodes can describe the path information. Individually speaking, nodes with large degree value, k-shell value and eigenvector centrality value are likely to be more influential. Furthermore, a node is of higher impacts on nearby nodes. According to the above issues and inspired by the gravity law, we regard the sum of degree value, k-shell value and eigenvector centrality value of a node as its mass, and the shortest distance between two nodes as their distance. Therefore, the influence of node i can be estimated as Such method is named as multi-characteristics gravity model (MCGM) as it considers multi-characteristics of nodes and adopts the gravity law.
It is not difficult to find that these three indices (DC, KS, EC) are not in the same order of magnitude, so normalization is required. As a result, Eq. (15) can be rewritten as where k max , k smax and x max denote the maximum of degree value, k-shell value and eigenvector centrality value, respectively.
However, since the k-shell index has smaller value space, the normalized k-shell index is still larger than the other two indices. Therefore, it is necessary to lower the impact of the k-shell index. Given an index, due to the (9) LGM , www.nature.com/scientificreports/ scale-free property of networks, the index values of most nodes are relatively small. Therefore, the index with larger value space generally has a smaller ratio between the median and the maximum. In our model, it is obvious that the value space of degree centrality and eigenvector centrality is larger than that of k-shell index. In view of this, we can lower the impact of k-shell index by where k mid , k smid and x mid denote the median of degree value, k-shell value and eigenvector centrality value, respectively. The purpose of taking the maximum value of k mid k max , x mid x max is to prevent the function of k-shell index from being excessively weakened.
Finally, Eq. (15) can be rewritten as The Algorithmic description of MCGM is provided in Algorithm 1. We take a toy network shown in Fig. 1 to illustrate the calculation process of Algorithm 1.
Firstly, calculate the degree value, k-shell value and eigenvector centrality value of each node in the toy network, the results are shown in Table 1.
Finally, the result of MCGM with R = 2 of the toy network is shown in Table 2. Take node 3 as an example, the 1-order neighbors of node 3 are node 2, node 4 and node 7, the 2-order neighbors of node 3 are node 1, node 5 and node 6, so MCGM(3) = 16.9320. Data description. In this paper, we apply ten real networks from six fields to test the performance of MCGM, including one transportation network (USAir 35 ), one communication network (Email 36 ), one infra-  Table 3 shows these networks' topological features, including the number of nodes, the number of links, the average degree, the average distance, the clustering coefficient 37 , denoted by C, the assortative coefficient 45 , denoted by r, the degree heterogeneity 46 , denoted by H, and the epidemic threshold 32 of SIR model 31 .
Empirical results. Based on the above real networks, the well-known SIR model 31 is used to compare the influential rankings produced by algorithms and simulations. Given the network and the transmission probability β , in order to guarantee the reliability of the results, 1000 independent realizations are executed and averaged to obtain the standard ranking of the influence of nodes (see details about SIR model in Preliminaries). In each realization, every node is selected once as the seed once. We apply the Kendall's Tau ( τ ) between the standard ranking and the ranking produced by the algorithm to measure the accuracy of an algorithm. Since τ ∈ [−1, 1] , the closer the τ is to 1, the better the performance of the algorithm. The benchmark algorithms include degree centrality 10    www.nature.com/scientificreports/ (GC+), the extended version of improved gravity centrality 19 (IGC+) and local gravity model 20 (LGM). Table 4 compares the accuracies of MCGM and the ten benchmark algorithms for β = β c . Furthermore, the accuracies of different β values (not too far from β c ) are shown in Fig. 2.
As shown in Table 4, the methods based on gravity law (GC+, IGC+, LGM and MCGM) show great advantages over the classic methods (DC, H-index, KS, EC, BC, CC), especially in Power, Router and NS, the advantage of the methods based on gravity law are extremely obvious. Notice that, except the above three networks, the performance of EC is significantly better than other classic methods, and even performs competitively in comparison with the methods based on gravity law, which indirectly shows that the stability of the method based on the gravity law is better and their performance will not decline precipitously due to the differences of networks. Furthermore, for the methods based on gravity law, MCGM generally performs best since it effectively considers more characteristics of nodes. As shown in Fig. 2, MCGM still performs very competitively compared with the ten benchmark algorithms for different β not too far from β c , suggesting the robustness of our findings. Figure 3 shows the optimal truncation radius of MCGM in the ten real networks. It is not difficult to find that the optimal truncation radius of most networks is concentrated at R = 2 . Therefore, we may simply set R = 2 to test the performance of MCGM. Table 5 compares the accuracies of MCGM with R = 2 and the benchmark algorithms.
As shown in Table 5, MCGM with R = 2 generally performs best in comparison with the benchmark algorithms, it still obtains the best results in six of the ten real networks. Since the optimal truncation radius approximately scales linearly with the average distance 20 , if the average distance of the network is relatively large, setting R = 2 will have a significant impact on the performance of MCGM, such as Power whose average distance is 18.9892. Fortunately, most real networks have small-world property, R * tends to be small in most cases.
Furthermore, we need to compare MCGM and MCGM without normalization to illustrate the importance of normalization. Table 6 compares the accuracies of MCGM using Eq. (15), MCGM using Eq. (16) and MCGM using Eq. (18). As shown in Table 6, MCGM has been gradually improved by normalization, suggesting the importance of normalization and the effectiveness of our normalization strategy.
Finally, we apply the monotonicity 34 to measure the resolution of different algorithms. As shown in Table 7, MCGM generally performs best even if it only considers 1-order neighbors or 2-order neighbors in most cases. The results reported in Table 7 demonstrate MCGM is a remarkably high-resolution algorithm.
Computational complexity. The computational complexity of the methods used in this paper is shown in  O(N + M) , respectively. Therefore, it is obvious that the part with the highest computational complexity of MCGM is computing the R-order neighbors of each node, it needs N k R times operations. Hence the computational complexity of MCGM is O(N k R ) . Since most real networks have small-world property, R * = 2 in most cases (see Fig. 3), so the computational complexity of MCGM is generally not more than O(N k 2 ) , where �k� ≪ N.

Discussion
In summary, we propose a novel gravity model that effectively integrates multi-characteristics of nodes, named as multi-characteristics gravity model (MCGM). The number of neighbors, the influence of neighbors, the location of nodes, and the path information between nodes are all taken into consideration in our model. In addition, we propose a normalization strategy to solve the problem that different indices are not in the same order of magnitude, Table 6 suggests the importance of normalization and the effectiveness of our normalization strategy. Compared with well-known state-of-the-art methods, empirical analyses of the SIR spreading dynamics on ten real networks suggest that our model always performs very competitively, as shown in Table 4. Table 4. The algorithms' accuracies of MCGM and the benchmark algorithms measured by Kendall's Tau for β = β c . The parameters in the related algorithms (i.e., LGM and MCGM) are adjusted to their optimal values subject to the largest τ , that is, we need to search the optimal truncation radius which can maximize τ by traversing the truncation radius. Obviously, searching the optimal truncation radius in this way is very timeconsuming, fortunately, in subsequent experiments, we find that MCGM still performs very competitively even if the truncation radius is just set to 2. For each network, the best algorithm is emphasized by bold.   www.nature.com/scientificreports/ However, MCGM needs to find the optimal truncation radius by traversing the truncation radius and it is very time-consuming. Fortunately, the optimal truncation radius approximately scales linearly with the average distance 20 , and most real networks have small-world property 37,48 , so even if the truncation radius is just set to 2, MCGM still performs very competitively in most cases, as shown in Table 5. Therefore, without increasing the computational complexity, MCGM effectively considers more characteristics of nodes and obtains more accurate results.
Although the computational complexity of MCGM is not high, it needs the global topological structure, same as GC+ and IGC+. While LGM can work under the case where the global topology is not known. As a result, our suggestions for practical use are as follows: if the network's global topology is known, apply MCGM and set R to 2, otherwise, apply LGM and set R to 2 or 3.
Of course, there are still some potential problems in the future. First of all, the gravity law is symmetrical, but due to the different effects of different nodes or the inherent asymmetry of dynamics 49,50 , an asymmetric form of the gravity law may be relevant. Secondly, in weighted complex networks, the heterogeneity of links greatly changes nodes' importance 51 , a weighted form of the gravity law may be relevant. Finally, in order to establish a unified research framework, a unified gravity model is needed to be proposed. Although GC+, IGC+ and LGM are proposed from different perspectives, a unified form of expression exists. We propose a rough model which intends to start further discussion on this issue. The rough unified gravity model is described as where a and b are adjustable parameters. If a = 1 and b = 1 , the unified gravity model degenerates to LGM. If a = 0 and b = 0 , the unified gravity model degenerates to GC (GC+ can be obtained by Eq. (6)). If a = 0 and b = 1 , the unified gravity model degenerates to IGC (IGC+ can be obtained by Eq. (8)).

Data availability
All relevant data are available at https:// github. com/ MLIF/ Netwo rk-Data.