Identifying influential nodes in complex networks using a gravity model based on the H-index method

Identifying influential spreaders in complex networks is a widely discussed topic in the field of network science. Numerous methods have been proposed to rank key nodes in the network, and while gravity-based models often perform well, most existing gravity-based methods either rely on node degree, k-shell values, or a combination of both to differentiate node importance without considering the overall impact of neighboring nodes. Relying solely on a node's individual characteristics to identify influential spreaders has proven to be insufficient. To address this issue, we propose a new gravity centrality method called HVGC, based on the H-index. Our approach considers the impact of neighboring nodes, path information between nodes, and the positional information of nodes within the network. Additionally, it is better able to identify nodes with smaller k-shell values that act as bridges between different parts of the network, making it a more reasonable measure compared to previous gravity centrality methods. We conducted several experiments on 10 real networks and observed that our method outperformed previously proposed methods in evaluating the importance of nodes in complex networks.


Preliminaries

Centrality measures
Consider an undirected, unweighted simple network $G = \langle V, E \rangle$, where $V$ and $E$ are the sets of nodes and links, respectively. Their cardinalities are $|V| = N$ and $|E| = M$, so the network has $N$ nodes and $M$ links. The connectivity structure is captured by the adjacency matrix $A = (a_{ij})_{N \times N}$, where $a_{ij} = 1$ if nodes $i$ and $j$ are linked and $a_{ij} = 0$ otherwise. Degree centrality 17 (DC) of node $i$ is defined from its degree, $k(i) = \sum_{j} a_{ij}$. The H-index 18 of node $i$, denoted $H(i)$, is the largest integer such that node $i$ has at least $H(i)$ neighbors whose degrees are all at least $H(i)$.
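The H-index definition above can be illustrated with a short sketch. The adjacency representation (a dict mapping each node to the set of its neighbors), the function name `h_index`, and the toy graph are assumptions made for illustration; they are not taken from the paper.

```python
def h_index(adj, i):
    """Largest h such that node i has at least h neighbours of degree >= h."""
    # Degrees of the neighbours of i, largest first.
    degs = sorted((len(adj[j]) for j in adj[i]), reverse=True)
    h = 0
    for rank, d in enumerate(degs, start=1):
        if d >= rank:
            h = rank      # the top `rank` neighbours all have degree >= rank
        else:
            break
    return h


# Hypothetical toy graph: a triangle (1, 2, 3) with a pendant node 4 on node 1.
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
h_values = {v: h_index(adj, v) for v in adj}
```

In this toy graph nodes 1, 2 and 3 all receive the same value, which is the kind of tie the later sections aim to break.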
The k-shell decomposition method 24 (KS) iteratively decomposes the network into distinct shells. Initially, KS removes all nodes with degree 1, which lowers the degrees of the remaining nodes. Removal is repeated for nodes whose residual degree is less than or equal to 1 until every remaining node has a residual degree greater than 1. The nodes removed in this first stage constitute the 1-shell and are assigned a k-shell value of 1. The same procedure is then applied to obtain the 2-shell, 3-shell, and so on, until every node in the network has been assigned to a shell.
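The pruning procedure described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments; the dict-of-sets graph format and the name `k_shell` are assumptions.

```python
def k_shell(adj):
    """Return {node: k-shell value} by iteratively pruning low-degree nodes."""
    deg = {v: len(nb) for v, nb in adj.items()}   # residual degrees
    remaining = set(adj)
    shell = {}
    k = 0
    while remaining:
        # Next shell index: the smallest residual degree still present.
        k = max(k, min(deg[v] for v in remaining))
        queue = [v for v in remaining if deg[v] <= k]
        while queue:
            v = queue.pop()
            if v not in remaining:
                continue                          # already removed via a duplicate entry
            shell[v] = k
            remaining.discard(v)
            for u in adj[v]:
                if u in remaining:
                    deg[u] -= 1
                    if deg[u] <= k:
                        queue.append(u)           # u now belongs to the current shell
    return shell
```

For a triangle with one pendant node attached, the pendant node falls in the 1-shell and the three triangle nodes in the 2-shell.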
Gravity centrality 32 (G) of node $i$ is defined as

$$G(i) = \sum_{j \in \psi_i} \frac{k_s(i)\, k_s(j)}{d^2(i, j)},$$

where $k_s(i)$ is the k-shell value of node $i$, $d(i, j)$ is the shortest path distance from node $i$ to node $j$, and $\psi_i$ is the set of nodes whose distance from node $i$ does not exceed 3. Extended gravity centrality 32 (G+) of node $i$ is

$$G_{+}(i) = \sum_{j \in \Gamma_i} G(j),$$

where $\Gamma_i$ is the nearest neighborhood of node $i$. The improved gravity centrality 33 (IGC) of node $i$ depends on a truncation radius $R$, whose optimal value $R^*$ can be estimated from $\langle d \rangle$, the average distance of the network. Extended improved gravity centrality 33 (IGC+) of node $i$ is computed over $\Gamma_i$, the nearest neighborhood of node $i$. The local gravity model 34 (LGM) of node $i$ is determined by

$$LGM(i) = \sum_{j \ne i,\; d(i, j) \le R} \frac{k(i)\, k(j)}{d^2(i, j)}.$$

The generalized gravity centrality 35 (GGC) of node $i$ additionally uses the local clustering coefficient $C_i = 2 n_i / \big(k(i)(k(i) - 1)\big)$, where $n_i$ denotes the number of edges between neighbors of node $i$, together with a parameter $\alpha = 2$.
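A truncated gravity sum of the kind used by these measures can be sketched as below, assuming the common form mass(i) · mass(j) / d(i, j)² summed over nodes within a given radius. The function names and the generic `mass` argument are illustrative choices, not the paper's notation: passing k-shell values with `radius=3` corresponds to the setting described for G, and passing degrees with a truncation radius R corresponds to a local gravity model.

```python
from collections import deque


def distances_within(adj, src, radius):
    """Breadth-first search: {node: distance} for nodes within `radius` of src."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        if dist[v] == radius:
            continue                    # do not expand beyond the truncation radius
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    del dist[src]                       # a node exerts no gravity on itself
    return dist


def gravity(adj, mass, radius=3):
    """Sum of mass[i] * mass[j] / d(i, j)^2 over nodes j within `radius` of i."""
    return {
        i: sum(mass[i] * mass[j] / d ** 2
               for j, d in distances_within(adj, i, radius).items())
        for i in adj
    }
```

Because the search stops at the truncation radius, the cost per node grows with the size of its radius-R neighbourhood rather than with the whole network.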
The k-shell based gravity centrality 36 (KSGC) is defined using the following quantities: $c_{ij}$ is the coefficient of attraction exerted by node $i$ on node $j$; $k_s(i)$ and $k_s(j)$ denote the k-shell values of node $i$ and node $j$, respectively; $ks_{max}$ and $ks_{min}$ refer to the largest and smallest k-shell values present in the network; and $d(i, j)$ is the shortest path distance from node $i$ to node $j$.
The DK-based gravity model 37 (DKGM) uses an improved k-shell index. Assume that the k-shell value of node $i$ is $k_s(i)$. In the k-degree iteration process, the total number of iterations is $q(k)$, and node $i$ is removed in iteration $p(i)$. The resulting quantity $k^*_s(i)$ is called the improved k-shell index of node $i$.
The multi-characteristics gravity model 38 (MCGM) is defined using $k_{mid}$, $k_{s\,mid}$ and $x_{mid}$, the medians of the degree, k-shell and eigenvector centrality values, respectively, and $k_{max}$, $k_{s\,max}$ and $x_{max}$, the corresponding maximum values.
The entropy-based gravity model 39 (SEGM) is defined using $E(i)$, the information entropy of node $i$; $\Gamma(i)$, the set of neighboring nodes of node $i$; and $I(i)$, the importance of node $i$.

The SIR model used in this paper
To compare the rankings produced by each algorithm with those obtained from simulation, we employed the widely used SIR model 40. Initially, a single node, the "source node", is in the infected state (I), while all remaining nodes are in the susceptible state (S). An infected node infects each of its susceptible neighbors with probability $\beta$ and enters the recovered state (R) with a fixed recovery probability, after which it no longer participates in the dynamics. The propagation continues until no infected nodes remain in the network. The impact of a given node $i$ is estimated by $N_r$, the number of recovered nodes once the diffusion process has stabilized. For simplicity, the recovery probability is set to 1. The corresponding epidemic threshold 41, $\beta_c$, is computed from $\langle k \rangle$ and $\langle k^2 \rangle$, the first and second moments of the degree distribution.
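The simulation described above can be sketched as follows, with the recovery probability fixed at 1 as in the text. The synchronous update scheme, the function names, and the default of 1000 runs per source are assumptions for illustration; the paper's own code is not reproduced here.

```python
import random


def sir_outbreak_size(adj, source, beta, rng):
    """One SIR run from `source`; returns N_r, the final number of recovered nodes."""
    infected = {source}
    recovered = set()
    while infected:
        newly = set()
        for v in infected:
            for u in adj[v]:
                susceptible = u not in infected and u not in recovered and u not in newly
                if susceptible and rng.random() < beta:
                    newly.add(u)
        recovered |= infected          # recovery probability 1: recover after one step
        infected = newly
    return len(recovered)


def spreading_capability(adj, source, beta, runs=1000, seed=0):
    """Average outbreak size over independent runs started from `source`."""
    rng = random.Random(seed)
    total = sum(sir_outbreak_size(adj, source, beta, rng) for _ in range(runs))
    return total / runs
```

Ranking all nodes by `spreading_capability` gives the reference ordering against which a centrality measure can be compared.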

Measures

Kendall's tau coefficient
Kendall's tau coefficient 42 measures the correlation between two sequences, with a larger value indicating greater similarity between them. Given two sequences $X$ and $Y$ of the same length, whose $i$th values are $x_i$ and $y_i$, respectively, let each pair of elements $x_i$ and $y_i$ form a tuple $(x_i, y_i)$. If $x_i > x_j$ and $y_i > y_j$, or $x_i < x_j$ and $y_i < y_j$, the tuples $(x_i, y_i)$ and $(x_j, y_j)$ are considered concordant. They are considered discordant if $x_i > x_j$ and $y_i < y_j$, or $x_i < x_j$ and $y_i > y_j$. If $x_i = x_j$ or $y_i = y_j$, the pair is neither concordant nor discordant. Kendall's tau coefficient $\tau$ is then computed from $n_+$, the number of concordant pairs, and $n_-$, the number of discordant pairs.
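The pair counting behind Kendall's tau can be sketched as follows. The normalisation by the total number of pairs, n(n − 1)/2, is an assumption of this sketch (the variant often called tau-a), as is the choice to count a pair tied in either sequence as neither concordant nor discordant.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau from concordant/discordant pair counts (tau-a normalisation)."""
    n_plus = n_minus = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        s = (xi - xj) * (yi - yj)
        if s > 0:
            n_plus += 1        # both sequences order the pair the same way
        elif s < 0:
            n_minus += 1       # the sequences disagree on the pair
    n = len(x)
    return 2 * (n_plus - n_minus) / (n * (n - 1))
```

Here `x` would hold the scores assigned by a centrality measure and `y` the spreading capabilities from the SIR simulation, listed in the same node order.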

Jaccard similarity coefficient
In some applications, it may be appropriate to concentrate on the top-ranked nodes rather than on all nodes. In contrast to the Kendall correlation coefficient, the Jaccard similarity coefficient assesses the similarity between the top-k nodes of two ranking lists 25,43. It is calculated by dividing the number of common nodes by the number of unique nodes in the two lists:

$$J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|},$$

where $X$ and $Y$ represent the top-k nodes with the highest influence as determined by two different methods. In our experiments, $X$ is the set of top-k nodes identified by HVGC or by a baseline method, while $Y$ is the set of top-k nodes obtained through the SIR simulation. The Jaccard similarity coefficient ranges from 0 to 1, where a higher value indicates greater similarity between the two ranking results: a value of 0 indicates completely distinct results, while a value of 1 indicates that the two sets of top-k nodes are identical.
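The top-k comparison described above can be sketched as follows. The helper names and the score dictionaries in the usage note are hypothetical; ties at the cut-off are broken by insertion order here, which is an arbitrary choice of this sketch.

```python
def top_k(scores, k):
    """Set of the k nodes with the highest score in a {node: score} mapping."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def jaccard(x, y):
    """Number of common nodes divided by the number of unique nodes in x and y."""
    union = x | y
    return len(x & y) / len(union) if union else 1.0
```

For example, with `method = {'a': 3, 'b': 2, 'c': 1}` and `sir = {'a': 1, 'b': 3, 'c': 2}`, the top-2 sets are `{'a', 'b'}` and `{'b', 'c'}`, which share one node out of three unique nodes.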

The monotonicity index
The monotonicity 25 $M$ quantitatively measures the resolution of an index through its ranking list $X$, and can be calculated by

$$M(X) = \left[ 1 - \frac{\sum_{c \in X} N_c (N_c - 1)}{N (N - 1)} \right]^2,$$

where $N$ is the size of the network, and $N_c$ is the number of nodes with the same index value $c$.
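A minimal sketch of the monotonicity computation, assuming the widely used form M = [1 − Σ_c N_c(N_c − 1) / (N(N − 1))]², in which the sum runs over the distinct index values c. The function name and the {node: score} input format are illustrative.

```python
from collections import Counter


def monotonicity(scores):
    """Monotonicity of a {node: score} mapping: 1 if all scores differ, 0 if all tie."""
    values = list(scores.values())
    n = len(values)
    # For each distinct value c, N_c * (N_c - 1) counts ordered pairs of tied nodes.
    tied_pairs = sum(c * (c - 1) for c in Counter(values).values())
    return (1 - tied_pairs / (n * (n - 1))) ** 2
```

An index that assigns many nodes the same value, as the H-index or k-shell often does, yields a low M under this definition.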

Algorithms
Previous research has used the gravity model to analyze node importance in complex networks. Degree and k-shell values are commonly used to capture the number of neighbors a node has and its position within the network, respectively. However, these metrics alone do not capture the overall influence of a node's neighbors. The H-index does consider the importance of a node's neighbors, but it may overlook part of their information and thus fail to account for the collective impact of all neighbors. We use the toy network shown in Fig. 1 to illustrate this problem for the H-index; the spreading capacity of each node, derived from 1000 independent runs of the SIR model, is labeled numerically in Fig. 1. In this network, several distinct nodes share the same value of $H(i)$, where $H(i)$ represents the H-index of node $i$. The H-index frequently assigns the same value to different nodes, which limits its ability to differentiate the influence of nodes.
The same issue exists in DC 17 and KS 24. Additionally, Fig. 1 shows that node 3 has a higher propagation capability than node 9 but a lower H-index, which indicates that the H-index overlooks some information from the neighbors of a node. We therefore take all neighbors of node $i$ whose degree is greater than or equal to $H(i)$ and sum their degrees to measure the overall influence of the neighboring nodes on node $i$. The resulting value is denoted $HV(i)$:

$$HV(i) = \sum_{j \in \Gamma_i,\; k(j) \ge H(i)} k(j),$$

where $\Gamma_i$ is the nearest neighborhood of node $i$, $k(j)$ is the degree of node $j$, and $H(i)$ represents the H-index of node $i$.
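The HV value described above can be sketched directly from its verbal definition: compute the H-index of a node, then sum the degrees of those neighbors whose degree reaches it. The dict-of-sets graph format, the function names, and the toy graph are assumptions made for illustration.

```python
def h_index(adj, i):
    """Largest h such that node i has at least h neighbours of degree >= h."""
    degs = sorted((len(adj[j]) for j in adj[i]), reverse=True)
    return max((h for h in range(1, len(degs) + 1) if degs[h - 1] >= h), default=0)


def hv(adj, i):
    """Sum of the degrees of the neighbours of i whose degree is >= H(i)."""
    h = h_index(adj, i)
    return sum(len(adj[j]) for j in adj[i] if len(adj[j]) >= h)


# Hypothetical toy graph: a triangle (1, 2, 3) with a pendant node 4 on node 1.
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
hv_values = {v: hv(adj, v) for v in adj}
```

In this toy graph nodes 1, 2 and 3 share the same H-index, while their HV values are 4, 5 and 5, so HV separates node 1 from the other two.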
Incorporating the overall influence of a node's neighbors gives HV greater discriminative power than the H-index. However, it is still insufficient to accurately distinguish cluster-like nodes: because of their close connections, these nodes can more easily attain large HV values, yet their actual influence may not exceed that of nodes with lower HV values. As shown in Fig. 1, $HV(6) = 8$, $HV(9) = 7$, and $HV(3) = 4$, whereas the actual propagation capacity, from high to low, is nodes 3, 9, and 6. A similar problem with the k-shell approach was noted by Liu et al. 44. In other words, removing node 3 from the network would cause nodes 1, 2, and 4 to lose their interactions with the core nodes, while removing node 6 has a minimal impact on information transmission in the network. This finding demonstrates that nodes serving as bridges between different clusters are more important than nodes within individual clusters.
Based on this, we considered the structural hole position of nodes to enhance the algorithm's ability to identify nodes within community networks. This allows us to identify bridge nodes that may not have high HV values but play a crucial role in facilitating information flow across different parts of the network. The network constraint coefficient measures the level of constraint imposed on a node forming a structural hole (SH) in a network 45, and it can be calculated as follows:

$$c(i) = \sum_{j \in \Gamma(i)} \Big( p_{ij} + \sum_{w \in \Gamma(i) \cap \Gamma(j)} p_{iw}\, p_{wj} \Big)^2,$$

where $\Gamma(i)$ represents the set of neighboring nodes of node $i$, and $w \in \Gamma(i) \cap \Gamma(j)$ ranges over the common neighbors of node $i$ and node $j$. $p_{ij}$ represents the proportion of energy invested by node $i$ to maintain its relationship with node $j$:

$$p_{ij} = \frac{z_{ij}}{\sum_{w \in \Gamma(i)} z_{iw}},$$

where $z_{ij} = 1$ ($i \ne j$) if there is a link between nodes $i$ and $j$, and $z_{ij} = 0$ otherwise. Based on the above discussion, the gravity centrality based on the H-index (HVGC) proposed in this paper combines the HV value with $c(i)$, the structural hole constraint coefficient defined above. A smaller value of $c(i)$ indicates that the node occupies more structural holes and has a stronger ability to bridge different parts of the network. Finally, HVGC, H-index, HV, DC, and KS were computed for each node in the toy network and compared with the node's spreading capability (SC). The results are presented in Table 1 and show that HVGC produces a ranking nearly identical to that of SC. The algorithmic description of HVGC is provided in Algorithm 1.
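The network constraint coefficient can be sketched as below, assuming the standard form of the constraint in which the direct investment p(i, j) and the indirect investment through common neighbors are summed and squared for each neighbor j. In an unweighted network the investment p(i, j) reduces to 1 / k(i) for every neighbor j. The function name and graph format are illustrative.

```python
def constraint(adj, i):
    """Structural-hole constraint of node i in an unweighted, undirected graph."""

    def p(a, b):
        # Proportion of a's effort invested in b: 1/k(a) if linked, else 0.
        return 1.0 / len(adj[a]) if b in adj[a] else 0.0

    total = 0.0
    for j in adj[i]:
        # Indirect investment in j routed through common neighbours w of i and j.
        indirect = sum(p(i, w) * p(w, j) for w in adj[i] & adj[j])
        total += (p(i, j) + indirect) ** 2
    return total
```

The centre of a three-leaf star obtains a constraint of 1/3, whereas each node of a triangle obtains 1.125: the bridging node is far less constrained than a node embedded in a closed cluster.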
In addition, Fig. 2 depicts a network with a clear community structure, where the four nodes with the strongest propagation capabilities are marked in green. The propagation capabilities of these nodes were determined through 1000 independent experiments using the SIR model.

Data description
This paper evaluates the efficacy of HVGC by analyzing ten real networks from six distinct domains: a transportation network (USAir 46), an infrastructure network (Power 47), a communication network (Email 48), a technology network (Router 49), two collaborative networks (Jazz 50 and NS 51), and four social networks (Facebook 52, PB 53, WV 54, and Sex 55). Table 3 presents the fundamental topological properties of these networks. $N$ represents the number of nodes in the network, and $M$ represents the number of links. The average degree of nodes is denoted by $\langle k \rangle$, and the average distance between pairs of nodes by $\langle d \rangle$. The clustering coefficient 47 of the network is denoted by $C$, while $r$ represents the assortativity coefficient 56. The degree heterogeneity 57 of the network is denoted by $H$. Additionally, $\beta_c$ represents the epidemic threshold 58 of the SIR model 40 used to simulate the diffusion process.

Empirical results
Using the real networks described above, we ran simulations and compared the influence rankings produced by the various algorithms with those obtained from the SIR model. To ensure credible results and a reliable reference ranking of node influence, we conducted 1000 independent experiments for each network and transmission probability $\beta$, with every node chosen as the seed node once during each run. The calculations were run on an i7-12700H processor with Python 3; the development platform was Anaconda 3, and the code was executed in Jupyter Notebook. Kendall's tau ($\tau$) was used to evaluate the accuracy of the algorithms, with a higher value indicating a stronger correlation between the two rankings and hence better algorithm performance. Table 4 compares the accuracy of the proposed algorithm (HVGC) with ten benchmark algorithms: degree centrality 17 (DC), the k-shell decomposition method 24 (KS), the extended version of gravity centrality 32 (G+), the extended version of improved gravity centrality 33 (IGC+), the local gravity model 34 (LGM), generalized gravity centrality 35 (GGC), the improved gravitational centrality based on k-shell values 36 (KSGC), the DK-based gravity model 37 (DKGM), the multi-characteristics gravity model 38 (MCGM), and the entropy-based gravity model 39 (SEGM). Additionally, Fig. 3 displays the accuracy of the different algorithms for values of $\beta$ in the range $0.5\beta_c$ to $1.5\beta_c$.
As shown in Table 4, the methods that utilise the gravitational formula (G+, IGC+, LGM, GGC, KSGC, DKGM, MCGM, SEGM, and HVGC) exhibit significant advantages over the classical methods (DC and KS). These advantages are especially prominent in the Power, Router, NS, and Sex networks. Furthermore, among all gravity-based algorithms tested on the ten networks, HVGC exhibited the best overall performance. Its Kendall coefficient ranked first in six of the ten networks and within the top two in seven of them (70%). Specifically, HVGC ranked first in the Jazz, Email, Facebook, PB, WV, and USAir networks and second in the Router network. Additionally, as shown in Fig. 3, although HVGC did not perform best in the NS, Power, and Sex networks when $\beta = \beta_c$, its performance becomes very close to, or even surpasses, that of the previous best-performing algorithm as $\beta$ increases. Together with HVGC's superior performance in community-type networks discussed earlier, this demonstrates a stronger overall performance and supports the robustness of our findings. Furthermore, Fig. 4 displays the optimal truncation radius of HVGC in the ten real networks, revealing that for the majority of networks the optimal truncation radius is $R = 1$. This indicates that HVGC achieves high accuracy by considering only the influence of a node's first-order neighbouring nodes, while most other gravity model methods require information from second- or third-order neighbouring nodes. In other words, HVGC achieves a high level of accuracy at a lower time cost.

Discussion
This paper introduces a novel method, HVGC, for identifying influential nodes in a network. While the original gravity model considered both neighbourhood and path information, the new method enhances existing gravity centrality approaches by taking into account the overall influence of a node's neighbourhood, the structural hole position of nodes, and the differences in interactions between nodes. It addresses limitations of existing gravity centrality methods and strengthens the ability to identify important nodes in networks with clear community structures, and it therefore shows strong all-round performance. We analysed the SIR dynamic propagation process in 10 real networks to compare the performance of HVGC with previous state-of-the-art methods. The results in Table 4 indicate that our method is strongly competitive.
In certain scenarios, it is necessary to identify the top-k influential nodes in order to control information propagation. Therefore, in addition to evaluating how the different methods rank individual nodes, we also assessed their performance in identifying the top-k influential spreaders. In other words, we compared the ranked lists of node influence obtained from the ranking methods with those obtained from the SIR simulation, both sorted in descending order, and analysed the similarity between the two lists by considering the top-k nodes. Figure 5 illustrates the Jaccard coefficient for identifying the top-k influential spreaders, with k ranging from 5 to 100 in steps of 5. The X-axis shows the number of top influential spreaders, and the Y-axis shows the Jaccard similarity coefficient. Except for the Sex, Power, PB, and Router networks, HVGC exhibits the best and most stable overall performance in identifying the top-k influential spreaders. Specifically, across all networks, as the number of selected top-k nodes increases, HVGC consistently maintains a high or steadily increasing Jaccard coefficient, while other methods display varying degrees of fluctuation. Furthermore, we provide detailed plots for the top-25 nodes, which show that, except in the Sex network, HVGC consistently ranks among the top three methods in identifying the top-25 influential spreaders and in some cases ranks first. We therefore conclude that HVGC not only accurately ranks the influence of all nodes in the network but also successfully identifies the top-k nodes with the highest impact.
We used monotonicity 25 to assess the resolution of the various algorithms. Table 5 shows that HVGC and MCGM perform similarly in terms of monotonicity. However, HVGC excels in the majority of networks while considering only the first-order neighbour information of nodes, whereas MCGM, even with second-order neighbour information included, does not necessarily outperform HVGC and incurs higher computational complexity. Furthermore, HVGC performs significantly better than MCGM in identifying important nodes in networks with community structure. According to Table 5, HVGC consistently ranks either at the top or very close to the best-performing algorithm in terms of monotonicity. Overall, therefore, HVGC surpasses the other gravity model algorithms.
Based on the above discussion, centrality based on the gravitational model is more accurate than classical centrality. However, many of these models tend to identify false core nodes in the network and do not take into account the influence of neighbouring nodes. The proposed HVGC (H-index-based gravity centrality) addresses this limitation by comprehensively considering the overall impact of a node's neighbours and its position within the network's structural holes. This approach overcomes these drawbacks of gravity-based methods and demonstrates superior performance compared to the other algorithms.
Fig. 4 presents the optimal truncation radius $R^*$ of HVGC at $\beta = \beta_c$. Each pentagram in the graph corresponds to one of the ten networks, and the blue line corresponds to $R^* = 1$. Specifically, for HVGC, the value of $R^*$ is 1 in the Email, Facebook, Jazz, PB, USAir, NS and WV networks, 2 in the Router and Sex networks, and 4 in the Power network. The majority of the networks have an optimal truncation radius of 1, with the next most common radius being 2. This outcome aligns with the characteristics of domain centrality, which typically considers first-order and second-order neighbor nodes. HVGC builds on the H-index, which is itself a domain centrality, and is therefore consistent with this characteristic. This does not impede its competitiveness relative to other algorithms, as it achieves both simplicity and accuracy.
Despite the excellent performance exhibited by HVGC, it shares a common limitation with other gravity-based methods, namely the need to determine the optimal truncation radius $R$. However, this disadvantage is mitigated by the fact that most real networks exhibit small-world characteristics 47,59, and the optimal truncation radius is approximately linearly related to the average distance 34. Furthermore, since HVGC is derived from the domain centrality method, considering only the first-order neighbor nodes already yields high accuracy in the ten real networks studied. In conclusion, while HVGC demonstrates better overall performance than other gravity-based methods and improves on existing gravity models, some areas still require further refinement. For example, the current approach does not consider weight factors for the different indicators; instead, it operates directly on the indicator values of the nodes. The weights of HV and of the structural hole constraint coefficient $c(i)$ in the computation may affect the accuracy of the algorithm. In networks with clear community structures, a higher weight for $c(i)$ may lead to better performance, while in other types of networks a lower weight may yield better results. Future work may therefore incorporate adjustable parameters to balance the weights of the different indicators. Additionally, these algorithms have not been evaluated in weighted networks, where the impact of the path from node $i$ to node $j$ may differ from that of the path from node $j$ to node $i$, and the link heterogeneity 60 in a weighted network may result in varying node impact. Lastly, future research may introduce adjustable parameters that modify the interplay of gravitational forces among nodes in order to improve the performance of the algorithm.

Figure 1. A toy network. The red node is ranked first in terms of H-index, while green and yellow represent second and third, respectively.

Figure 3. Accuracy of the algorithms at various $\beta$ values, measured by Kendall's tau. Symbols of different colours represent different methods; the red symbol represents HVGC.

Figure 5. The Jaccard similarity coefficients on the top-k influential spreaders.

We compared HVGC with other gravity model-based methods in identifying the top 5 nodes in the community-structured network of Fig. 2, and the results are presented in Table 2.

Table 2. Comparison of the rankings of the top-5 nodes identified by different methods and the rankings based on the SIR propagation ability in the sample network.

Table 3. The topological features of ten real networks.

Table 4. The algorithms' accuracies for $\beta = \beta_c$, measured by Kendall's tau ($\tau$). The top-ranked value in each row of the table is marked in italics, the second in bold.

Table 5. Monotonicity of the various algorithms, with the best algorithm for each network highlighted in bold.