Identification of nodes influence based on global structure model in complex networks

Identification of influential nodes in complex networks is challenging due to the large scale of the data and networks and the frequently changing behavior of current topologies. Application scenarios such as disease transmission and immunization, software virus infection and disinfection, increased product exposure, and rumor suppression are domains where identifying influential nodes is crucial. Although many approaches have been proposed to address these challenges, most of the relevant research concentrates on single, limited aspects of the problem. Therefore, we propose the Global Structure Model (GSM) for influential node identification, which considers the self-influence of a node and emphasizes its global influence in the network. We applied GSM and used the Susceptible-Infected-Recovered (SIR) model to evaluate its efficiency. Moreover, standard algorithms such as Betweenness Centrality, Profit Leader, H-Index, Closeness Centrality, Hyperlink Induced Topic Search, Improved K-shell Hybrid, Density Centrality, Extended Cluster Coefficient Ranking Measure, and Gravity Index Centrality are employed as baseline benchmarks to evaluate the performance of GSM. We used seven real-world and two synthetic multi-typed complex networks, along with different well-known datasets, for the experiments. Analysis of the results indicates that GSM outperformed the baseline algorithms in identifying influential nodes.


Scientific Reports | (2021) 11:6173 | https://doi.org/10.1038/s41598-021-84684-x | www.nature.com/scientificreports/

of propagation probability, Ma et al. 31 proposed hybrid degree centrality (HC), which combines local indicators and degree centrality measures. All in all, these approaches have their own shortcomings and limitations, and the identification of influential nodes remains a challenge. To address these challenging problems, and inspired by the literature 10,29,30, in this study we design a new mechanism called GSM that considers not only the self-influence of a node in the network but also the global influence of nodes. To analyze its performance, we applied GSM to different kinds of real and synthetic networks and used the susceptible-infected-recovered (SIR) model and Kendall's τ coefficient to examine its effectiveness. In addition, we compared the results with those of the baseline algorithms and recently proposed approaches; simulation results on seven different types of real networks and two synthetic networks showed that GSM effectively identifies influential nodes.
The rest of the paper is organized as follows. The Preliminaries section presents preliminaries and a brief introduction to the baseline algorithms, including BC, PL, GIC, HI, CC, ECRM, DNC, IKH and HITS. The Proposed method section presents the GSM model. The Results and discussion section illustrates the effectiveness of GSM, and the Conclusion and future recommendations section gives conclusions and recommendations for future work.
1. Betweenness Centrality (BC): BC calculates influential nodes based on global information 32. BC(i) is defined as:
$$BC(i) = \sum_{j \neq i \neq k} \frac{g_{jk}(i)}{g_{jk}}$$
where $g_{jk}$ is the number of shortest paths between nodes j and k, and $g_{jk}(i)$ is the number of those shortest paths that pass through node i.
2. Closeness Centrality (CC): CC also calculates influential nodes based on global information. It uses the shortest distance between each pair of nodes to identify the influence of each node 33. In its standard form, CC of node i is defined as:
$$CC(i) = \frac{1}{\sum_{j \neq i} d_{ij}}$$
where $d_{ij}$ is the shortest distance between nodes i and j.
3. Hyperlink Induced Topic Search (HITS): This algorithm is based on two factors, i.e., Authority Update and Hub Update. The authority update is computed from the number of hub edges associated with the authority website, and the hub update is computed from the number of authority websites linked by the hub website 34.
4. H-Index (HI): This algorithm identifies influential nodes by taking into account a node's neighbors and using the H-index notion. A high H-index indicates that the node is more important than other connected nodes 35.
5. Profit Leader (PL): This algorithm is based on profit leader concept analysis and is suitable for any network, i.e., directed or undirected 28.
6. Improved K-shell Hybrid (IKH): This algorithm considers the k-shell, the shortest distance between nodes, and a parameter (in the range between 0 and 1) to identify the most influential nodes 12.
7. Gravity Index Centrality (GIC): This algorithm is based on the concept of universal gravity and considers both the influence of neighboring nodes and path information 30. In the definition of GIC(i), $\theta_i$ is the set of neighbors of node i.
8. Extended Cluster Coefficient Ranking Measure (ECRM): This algorithm works on the basis of local clustering coefficients and uses link similarity between adjacent nodes 36.
9. Density Centrality (DNC): This algorithm is inspired by the area density formula and identifies the influence of nodes in spreading dynamics 37.
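To make the two distance-based baselines concrete, the following minimal sketch computes BC and CC on an unweighted, undirected graph stored as an adjacency dictionary. It is an illustration under stated assumptions, not the authors' implementation: BC is accumulated with Brandes' algorithm (a technique not named in the text), CC uses the unnormalised reciprocal-of-distance-sum form, and the graph `toy` and all function names are hypothetical.

```python
from collections import deque

def hop_distances(adj, source):
    """Breadth-first hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def closeness(adj, i):
    """Reciprocal of the summed shortest distances from i (unnormalised)."""
    total = sum(d for j, d in hop_distances(adj, i).items() if j != i)
    return 1.0 / total if total > 0 else 0.0

def betweenness(adj):
    """BC(i) = sum over pairs (j, k) of g_jk(i) / g_jk, via Brandes' accumulation."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        order = []                          # nodes in non-decreasing distance from s
        preds = {v: [] for v in adj}        # predecessors on shortest paths
        sigma = dict.fromkeys(adj, 0)       # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # In an undirected graph every (j, k) pair is visited from both ends.
    return {v: b / 2 for v, b in bc.items()}

# Hypothetical toy graph: the path a - b - c - d.
toy = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
bc_scores = betweenness(toy)
cc_scores = {v: closeness(toy, v) for v in toy}
```

On the path graph, the two interior nodes lie on every shortest path between the ends, so both measures rank them above the end nodes.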

Proposed method
Several approaches that use the global structure of the network to identify node influence have been developed and deployed, but making better use of both self-influence and global structural influence remains a challenge that needs to be addressed. Inspired by the literature 37-39, we propose a Global Structure Model (GSM), which consists of self-influence and global influence.

Self-influence.
In this context, we use e (the base of the natural logarithm) and take the k-shell $Ks(v_i)$ and the number of nodes (N) in the network as power parameters to minimize the overestimation of the self-influence:
$$S(v_i) = e^{Ks(v_i)/N}$$
where N is the number of all nodes in the network.
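The self-influence term can be sketched in a few lines. The snippet below assumes the form e^(Ks/N), which is consistent with the worked value e^(3/13) used later for node V4, and obtains the k-shell index by the usual iterative pruning of low-degree nodes. The graph `toy` and the function names are hypothetical and serve only as an illustration.

```python
import math

def k_shell(adj):
    """k-shell index of every node, obtained by repeatedly pruning nodes of degree <= k."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    shell = {}
    k = 0
    while remaining:
        peel = [v for v in remaining if degree[v] <= k]
        if not peel:
            k += 1          # nothing left to prune at this level, move to the next shell
            continue
        for v in peel:
            shell[v] = k
            remaining.discard(v)
            for u in adj[v]:
                if u in remaining:
                    degree[u] -= 1
    return shell

def self_influence(adj):
    """Assumed form of the self-influence: exp(Ks(v) / N) for every node v."""
    shell = k_shell(adj)
    n = len(adj)
    return {v: math.exp(shell[v] / n) for v in adj}

# Hypothetical toy graph: triangle 1-2-3 with a pendant node 4 attached to node 3.
toy = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
toy_shell = k_shell(toy)
toy_self = self_influence(toy)
```

Because the exponent Ks/N is at most 1 for any connected simple graph, the term stays between 1 and e, which is how it limits overestimation of a node's own contribution.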
Global-influence. The influence of a node also depends on the influence of the other nodes connected to it. Normally, the influence of a node increases if its neighbors have high k-shell values; however, the contact distance between two nodes cannot be ignored, since influence is inversely proportional to it:
$$GI(v_i) = \sum_{j \neq i} \frac{Ks(v_j)}{d_{ij}}$$
where $d_{ij}$ is the shortest distance between node i and node j.
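As an illustration of the global-influence term, the sketch below assumes it is the sum of Ks(v_j) / d_ij over all other reachable nodes j, with d_ij measured in hops by breadth-first search. The toy graph, its hand-supplied k-shell values, and the function names are hypothetical.

```python
from collections import deque

def hop_distances(adj, source):
    """Breadth-first hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def global_influence(adj, shell, i):
    """Assumed form: sum of Ks(v_j) / d_ij over every node j != i reachable from i."""
    dist = hop_distances(adj, i)
    return sum(shell[j] / d for j, d in dist.items() if j != i)

# Hypothetical toy graph: triangle 1-2-3 with a pendant node 4 attached to node 3.
toy = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
# k-shell values of the toy graph, supplied by hand for this illustration.
toy_shell = {1: 2, 2: 2, 3: 2, 4: 1}
toy_gi = {v: global_influence(toy, toy_shell, v) for v in toy}
```

Node 3 touches every other node directly, so all three contributions enter at full weight; node 1 reaches node 4 only at distance 2, which halves that contribution.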
Node influence. The influence of node $v_i$ depends not only on its own influence but also on the nodes around it. Therefore, the proposed GSM simultaneously considers these two aspects, self-influence and global influence, and can be defined as
$$GSM(v_i) = S(v_i) \times GI(v_i)$$
where $S(v_i)$ is the self-influence and $GI(v_i)$ is the global influence of node $v_i$. We can also express the GSM of node $v_i$ as
$$GSM(v_i) = e^{Ks(v_i)/N} \times \sum_{j \neq i} \frac{Ks(v_j)}{d_{ij}}$$
where $Ks(v_i)$ and $Ks(v_j)$ denote the k-shell of node i and node j, respectively.

Computation process. The proposed GSM model is divided into four parts: first, construction of the corresponding network; second, calculation of the global influence from the k-shell of each node and the distances between nodes; third, computation of the self-influence of each node; and finally, calculation of the influence of each node on the entire network. To further demonstrate the GSM method (Fig. 1), we walk through a specific calculation on a simple network. The network in Fig. 2 consists of 13 nodes and 17 edges, and we take the influence of node V4 as an example. First, we calculate the k-shell of each node and the shortest distances between nodes; for node V4 we have d4-2 = 1, d4-3 = 1, d4-5 = 2, d4-6 = 3, d4-7 = 1, d4-8 = 1, d4-9 = 2, d4-10 = 1, d4-11 = 2, d4-12 = 2, d4-13 = 2.
To calculate the self-influence and the global influence, we apply Eqs. (6) and (7). The self-influence is S(4) = e^{3/13} ≈ 1.25956, and summing the pairwise terms (GI(4 − 1) and so on) gives a global influence of 17.333333. Finally, the influence of node V4 is GSM(4) = 1.25956 × 17.333333 ≈ 21.833. Table 1 shows the influence ranking of each node in the given simple network.
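The arithmetic of this worked example can be checked directly. The short snippet below takes the reported inputs as given (k-shell 3 for node V4, N = 13, and a global influence of 17.333333) and assumes the product form of GSM; the variable names are illustrative only.

```python
import math

ks_v4 = 3                      # k-shell of node V4, as used in the worked example
n_nodes = 13                   # number of nodes in the example network
global_v4 = 17.333333          # global influence of V4, as reported in the text

self_v4 = math.exp(ks_v4 / n_nodes)   # self-influence e^(3/13)
gsm_v4 = self_v4 * global_v4          # assumed product form of GSM

print(f"self-influence = {self_v4:.5f}, GSM = {gsm_v4:.3f}")
```

The self-influence agrees with the reported 1.25956 to within rounding, and the product rounds to the reported 21.833.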
Evaluation metrics.

SIR model. We used the SIR model to investigate the spreading dynamics of each node 40,41 and to quantify the performance of GSM and the other benchmark centralities. In the SIR model, there are three states: (i) Susceptible (S), (ii) Infected (I), and (iii) Recovered (R). Susceptible (S) refers to a healthy state that can be infected by others. Infected (I) refers to an infected state that can infect other individuals. Recovered (R) denotes a recovered state that cannot be infected again. Initially, all nodes except the seed node are in the susceptible state. At each time step, the seed node can infect its nearest and next-nearest neighbor nodes (those in the susceptible state) with probability β; each infected node then enters the recovered state with probability µ. This process continues until there are no more infected nodes. Finally, the number of recovered nodes is used to quantify the actual impact of the seed node. Here, S(t), I(t), and R(t) indicate the numbers of nodes in the susceptible, infected, and recovered states, respectively. Therefore,
$$\frac{dS(t)}{dt} = -\beta S(t) I(t), \qquad \frac{dI(t)}{dt} = \beta S(t) I(t) - \mu I(t), \qquad \frac{dR(t)}{dt} = \mu I(t)$$

Kendall's Tau (τ). We used Kendall's τ 42,43 to further assess the performance of GSM. Suppose two sequences X and Y each contain n nodes, $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_n)$. A pair of observations $(x_i, y_i)$ and $(x_j, y_j)$ with $i \neq j$ is said to be concordant if the rankings of both components agree, i.e., if both $x_i > x_j$ and $y_i > y_j$, or both $x_i < x_j$ and $y_i < y_j$. The pair is said to be discordant if $x_i > x_j$ and $y_i < y_j$, or $x_i < x_j$ and $y_i > y_j$. If $x_i = x_j$ or $y_i = y_j$, the pair is neither concordant nor discordant. Kendall's τ is defined as:
$$\tau = \frac{n_c - n_d}{0.5\, n (n - 1)}$$
where $n_c$ and $n_d$ denote the number of concordant and discordant pairs, respectively.
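The concordant and discordant pair counting behind Kendall's τ can be sketched as follows. This is a minimal illustration that assumes the normalisation by 0.5 n (n − 1) and leaves tied pairs out of both counts, as described above; the function name and the example score lists are hypothetical.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau: (n_c - n_d) / (0.5 * n * (n - 1)); tied pairs count for neither."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n_c = n_d = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            n_c += 1        # both sequences order the pair the same way
        elif sign < 0:
            n_d += 1        # the sequences disagree on the pair
    return (n_c - n_d) / (0.5 * n * (n - 1))

# Hypothetical scores: a centrality ranking versus a simulated spreading ranking.
centrality_scores = [4.0, 3.0, 2.0, 1.0]
spreading_scores = [9.0, 5.0, 7.0, 1.0]
tau = kendall_tau(centrality_scores, spreading_scores)
```

In this example one of the six pairs is ordered differently by the two lists, so five pairs are concordant and one is discordant, giving τ = 4/6.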

Datasets description.
Real-world networks. We evaluated GSM on seven different real-world networks to validate its efficiency. The seven real networks are publicly available and can be obtained from (http://networkrepository.com). The datasets are, e.g., (i) Jazz 44 Table 2.
Synthetic networks. Many exemplary complex networks exist in the real world, but the ground truth is not known for all of them, because it is not feasible to collect such a large amount of information about something so widely exploited. Therefore, to evaluate GSM and the baseline methods, we applied the benchmark generator model 51,52 to generate different synthetic networks for the experiments.

Results and discussion
To measure the influence of nodes in different real and synthetic networks and to validate the applicability and effectiveness of GSM, we used two evaluation metrics, i.e., the SIR model and Kendall's τ. First, we used a simple graph containing 13 nodes and 17 edges, as shown in Fig. 2, applied GSM to find the influential nodes, and compared the results with the outcomes of the benchmark algorithms BC, CC, HITS, HI, GIC, DNC, IKH, ECRM and PL. Kendall's τ of the proposed GSM and the other algorithms is shown in Fig. 3. As can be seen, in terms of Kendall's τ, GSM achieves higher values, in the range from 0.9 to 1 for β between 0.01 and 0.1.

To further examine the propagation effect of GSM, we analyzed the spreading impact of the ranked nodes in the SIR model. To better distinguish the influential nodes, the infection probability needs to be set in the range between 0.01 and 0.1. For big networks (Astroph-e, Web-spam, BA and H-friendship), we set β = 0.01 because, with larger values, propagation occurs across the whole network 24 and it is not easy to differentiate the importance of distinct nodes. For small networks (Dolphin, Jazz, Crime, Random, and E-mail), we set β = 0.1. We also set the recovery probability to 1 and the time t = 1000. First, the influence of each node is computed using the different algorithms and then sorted in descending order. Tables 3 and 4 show the top ten ranked nodes; due to limited space, we present the top ten nodes of only two networks, Dolphin and Crime. We observed that most of the top-10 nodes of our algorithm also appear in the rankings of the other algorithms, which supports the validity of the proposed GSM. Second, each ranked node is treated as a seed node that infects other nodes. Finally, we computed the number of infected nodes for each seed node as an average over 1000 runs.
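The simulation procedure described here can be sketched as a discrete-time Monte Carlo loop. The version below is a simplified illustration, not the authors' code: it assumes infection spreads only to direct neighbors, uses a recovery probability of 1 by default, and averages the final number of recovered nodes over repeated runs. The star graph, the function names, and the parameter defaults are hypothetical.

```python
import random

def sir_spread(adj, seed_node, beta, mu=1.0, rng=None, max_steps=1000):
    """One SIR run from a single seed; returns the number of recovered nodes."""
    rng = rng or random.Random()
    infected = {seed_node}
    recovered = set()
    for _ in range(max_steps):
        if not infected:
            break
        newly_infected = set()
        for u in infected:
            for v in adj[u]:
                susceptible = (v not in infected and v not in recovered
                               and v not in newly_infected)
                if susceptible and rng.random() < beta:
                    newly_infected.add(v)
        # Each currently infected node recovers with probability mu.
        healed = {u for u in infected if rng.random() < mu}
        recovered |= healed
        infected = (infected - healed) | newly_infected
    return len(recovered)

def average_spread(adj, seed_node, beta, runs=1000, seed=0):
    """Mean outbreak size over independent runs with a fixed random seed."""
    rng = random.Random(seed)
    total = sum(sir_spread(adj, seed_node, beta, rng=rng) for _ in range(runs))
    return total / runs

# Hypothetical toy graph: a star with hub 0 and leaves 1 to 4.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
hub_spread = average_spread(star, 0, beta=0.5, runs=2000)
leaf_spread = average_spread(star, 1, beta=0.5, runs=2000)
```

Seeding the hub reaches every leaf in one step, whereas a leaf must first infect the hub, so the hub yields the larger average outbreak. Ranking nodes by this average is what the centrality scores are compared against.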
Figure 4 shows the average number of infected nodes obtained with the ten algorithms. In general, more influential nodes infect more nodes, so an effective ranking method produces a curve that decreases from left to right. As shown in Fig. 4, the proposed GSM achieves a better infection effect than the other methods on the different networks.
Moreover, we compared the effects of the top ten nodes selected by the proposed GSM and by the corresponding baseline centrality measures for different networks. All top ten nodes are considered as seed nodes, with the time t in the range between 1 and 25. Figure 5 illustrates the influence of the top ten nodes in nine different networks; as can be seen, the proposed GSM achieves higher spreading efficiency than the other centralities. In addition, it clearly shows that the infection F(t) increases as time t increases and finally reaches a steady value. Since there are ten seed nodes, most network propagation reaches a steady state by t = 25, which is where we analyzed the spreading effects of GSM and the other centrality measures.

Computational complexity of GSM
The proposed GSM has two main components. In the first stage, the global influence of each node is calculated. We used Dijkstra's algorithm to calculate the shortest distances, and its complexity is $O(n^2)$. In the second stage, the time complexity is $O(n)$. Therefore, the total computational complexity of GSM is $O(n^2)$. Table 5 lists the computational complexity of the proposed GSM and the other benchmarks. Although the computational complexity of GSM is not very low, its accuracy is better than that of the other benchmarks, and GSM measures node influence automatically, without any parameters (see Figs. 3, 4 and 5). In future work, we plan to speed up GSM through parallel computation.

Conclusion and Future Recommendations
We studied the problem of identifying node influence in complex networks. Several approaches have been developed and deployed in this area, but it remains a major open issue for scientists and researchers. In this regard, we proposed an algorithm called GSM to identify influential nodes, which considers both the self-influence and the global influence of nodes in the network. We applied the proposed GSM to different real and synthetic networks and employed two evaluation metrics (SIR and Kendall's τ) to verify its efficiency. Experimental results demonstrated that our algorithm performed better than the benchmarks. In future work, the proposed GSM algorithm can be extended in several ways, for instance by adding parameters that control the intensity among various nodes to yield better performance. Furthermore, we also plan to combine the profit leader concept with the proposed algorithm to enhance its performance.

Table 3. Top-10 ranking nodes of the Dolphin network using ten different methods. Each row gives the rank followed by the node selected by each of the ten methods.

 1   37  37  15  15  15  15  15  15  15  38
 2    2  41  38  17  46  38  38  38  38  15
 3   41  38  46  19  38  46  46  46  46  46
 4   38  21  34  21  34  34  21  21  21  21
 5    8  15  51  22  52  52  34  34  41  34
 6   18   2  30  25  58  21  41  30  34  51
 7   21   8  52  30  21  30  37  41  37  41
 8   55  29  17  34  14  41  30  52  51  30
 9   52  34  41  38  30  18  52  51  30  37
10   58   9  22  41  18  58  51  39   2  52

Table 4. Top-10 ranking nodes of the Crime network using ten different methods.