Introduction

Networks are ubiquitous in the real world, and complex systems with numerous entities can be represented as networks1. A complex network can be regarded as the abstract representation of a complex system2: the nodes represent the entities in the system and the edges represent the relationships between them. In computer networks, computers can be abstracted as nodes and the cables between them as edges. In social networks, people can be abstracted as nodes and the relationships between them as edges. In a complex network, the small number of nodes that play a key role in its structure and operation are called important nodes. Protecting and utilizing important nodes helps ensure the security and functional effectiveness of the complex system. In computer networks, redundant backups of the links to important equipment provide additional security for network communication. In social networks, news posted by important people spreads faster. In biological networks, important nodes play an essential role in disease discovery and drug development. Excavating important nodes is therefore crucial for various real-world applications.

In recent decades, many achievements have emerged in the study of important node excavation, and the results can be classified into node-based, edge-based, and node-edge fusion algorithms. The better-known node-based algorithms are degree centrality (DC)3 and K-shell4. DC measures the importance of a node by its number of neighbors5, which considers only the most local information; it is fast to compute but poor in accuracy. Integrating the information of second-order or third-order neighbors improves the accuracy6 but increases the time complexity. K-shell measures the importance of nodes by their location in the network, recursively deleting nodes with the same degree value; the greater the degree value at which a node is deleted, the more important the node is7. K-shell considers only the degree of a node: it is fast to compute, but its results are much more coarse-grained. The classical edge-based algorithms are betweenness centrality (BC)8 and closeness centrality (CC)9. In BC, the more shortest paths pass through a node, the more important the node is, whereas in CC, the fewer edges a node must traverse to reach the other nodes, the more important it is. Both algorithms introduce global information but are less computationally efficient. Scholars also fuse the attributes of nodes and edges to excavate important nodes. Newer algorithms include identification of node influence based on the global structure model (GSM)10, identification of node influence based on global structure influence (GSI)11, the K-shell based key node recognition method (KBKNR)12, influential node identification by aggregating local structure information (ALSI)1, and others. GSM calculates the importance of nodes from K-shell values and shortest paths, taking importance to be proportional to the K-shell value and inversely proportional to the length of the shortest path. GSI considers that the degree and K-shell value are strongly related to the network structure and uses them, together with the number of nodes, to determine the importance of nodes. KBKNR improves the K-shell algorithm by differentiating nodes within the same layer through their neighbors and second-order neighbors on top of the K-shell hierarchy, which makes K-shell more fine-grained. ALSI uses different formulas to calculate the importance of nodes depending on how their K-shell values compare, considering that a node's own degree, its neighbors' degrees, and the K-shell values jointly determine its importance13. Drawing inspiration from real-world physics formulas, researchers have proposed the gravity model and continuously improved it. Recently, researchers have addressed the issue of focusing only on the local static geographical distances between nodes while neglecting the dynamic interactions between nodes in real networks. They introduced the effective distance gravity model14, which considers both the global and local information of complex networks. By utilizing effective distance to merge static and dynamic information15, this method can uncover hidden topological structures in real-world networks and obtain more accurate results. To tackle the gravity model's neglect of the surrounding environment of nodes, researchers have proposed a method based on an adaptive truncation radius and omni-channel paths16.
This method integrates multiple node attributes and accurately describes the distance of node interactions, demonstrating good stability on networks of different scales and structural features. These studies have provided valuable insights for the development of this work.

Real-world networks have numerous stochastic characteristics. The importance of different nodes is closely related to the characteristics of the network, and excavating important nodes through multiple attributes is far more effective than through a single attribute17. On this basis, this paper considers the importance of nodes from the perspective of how much contribution they make, using node, edge, and structural characteristics as indicators within a heat conduction model (HCM). The fundamental concepts of the HCM are described below.

Basic idea

Person A interacts socially with person B, who may be A's colleague, superior, or friend. Which of these contacts is the most important? For A, whoever helps A more is more important; in other words, the more help a person provides, the more important that person is perceived to be18. The amount of help provided is influenced by several factors. First, the more resources A possesses, the more help A can offer19. Second, the amount of help also depends on the ability gap between A and B20: if A has greater capabilities than B, A can offer more assistance21,22. Third, the closer the relationship between A and B, the more willing A is to help B23. Fourth, people who know A directly are more likely to accept A's help than those who know A only indirectly24,25. Fifth, the greater B's influence, the more motivated A is to help, since A may need B's help in the future26,27,28. These factors provide new ideas for excavating important nodes in network analysis. Based on these five influencing factors, this paper evaluates the importance of nodes in complex networks through five indicators: degree, eigenvector centrality, distance, network density, and degree density. Leveraging the heat conduction model from the physical world, the paper calculates the output capacity of nodes: the larger the value, the more important the node. The main contributions of this paper are summarized as follows29:

  1. A new algorithm for excavating important nodes, the HCM, is proposed. This algorithm measures the importance of nodes from the perspective of how much contribution the nodes provide to other nodes. In other words, the importance of nodes is measured by their output capacity in complex networks.

  2. The difference between nodes is considered when determining their output capacity values, which enhances the differentiation of output capacity and makes the evaluation of node importance more accurate. Meanwhile, the HCM is more in line with reality.

  3. Real-world networks have numerous stochastic characteristics. The HCM considers the network density and the degree density of other nodes, which reduces the influence of the network structure on its accuracy and makes it more universal.

The remainder of this paper is organized as follows. Section “Preliminaries” describes the definitions involved in the HCM. In Section “Proposed algorithm”, the HCM process is specified. Simulation experiments are conducted in Section “Experimental results” and the experimental results are analyzed. Finally, Section “Conclusion” concludes this work.

Preliminaries

The network used in this paper is an undirected, unweighted network, denoted by \(G = (Vertex, Edge)\), where Vertex denotes the set of nodes and Edge denotes the set of edges. In this section, the relevant concepts and theoretical models are described.

  1. HCM: This model describes the process of heat conduction in a solid and gives the amount of heat conducted from a high-temperature object to a low-temperature object. The HCM is defined as follows:

    $$Q = \frac{\Delta T * K * A}{\Delta L}$$
    (1)

    where Q is the amount of conducted heat, ΔT is the temperature difference between the objects, K is the heat conduction coefficient, ΔL is the distance the heat travels, and A is the contact area between the objects.

  2. DC: G is denoted by the adjacency matrix A = (aij)N×N, where the value aij is located in the ith row and jth column of the matrix A30. When aij = 1, there is an edge between nodes vi and vj, while aij = 0 indicates that there is no edge between them. The degree of node vi is defined as follows:

    $$D\left( v_i \right) = \sum\limits_{j = 1}^{N} a_{ij} = \sum\limits_{j = 1}^{N} a_{ji}$$
    (2)

    where vi denotes node i, D(vi) denotes the degree value of node vi, and N denotes the number of nodes.

    To facilitate comparison of node degrees across different networks, the degree values are normalized5. DC is defined as follows:

    $$DC\left( v_i \right) = \frac{D\left( v_i \right)}{N - 1}$$
    (3)
  3. Eigenvector centrality (EC)31: G is denoted by the adjacency matrix A = (aij)N×N, a square matrix of dimensions N × N. An eigenvalue \(\lambda_i\) of the square matrix A is a scalar, and the corresponding eigenvector \(x_i\) is a non-zero vector, which satisfies the following relationship:

    $$A * x_i = \lambda_i * x_i$$
    (4)

    Therefore,

    $$x_i = \frac{1}{\lambda_i} * A * x_i = \frac{1}{\lambda_i} * \sum\limits_{j = 1}^{N} a_{ij} * x_j$$
    (5)

    In general, multiple eigenvalues \(\lambda\) satisfy Eq. (4), each with corresponding eigenvectors \(x\). When \(\lambda\) takes its maximum value \(\lambda_{\max}\), the corresponding eigenvector \(x_{\max}\) is the principal eigenvector. EC is defined as follows:

    $$EC\left( v_i \right) = \frac{1}{\lambda_{\max}} * \sum\limits_{j = 1}^{N} a_{ij} * x_j$$
    (6)
  4. CC: If node vi of G is connected to vj, there is at least one path \(path\left( v_i, v_j \right)\) between them, and the path containing the fewest edges is the shortest path \(spath\left( v_i, v_j \right)\). The distance \(R\left( v_i, v_j \right)\) between nodes vi and vj is defined as32:

    $$R\left( v_i, v_j \right) = \left| spath\left( v_i, v_j \right) \right|$$
    (7)

    where \(\left| spath\left( v_i, v_j \right) \right|\) is the number of edges that the shortest path contains.

    The smaller the distance between a node and other nodes, the closer it is to the network center. CC is defined as:

    $$CC\left( v_i \right) = \frac{N - 1}{\sum\limits_{j = 1}^{N} R\left( v_i, v_j \right)}$$
    (8)
  5. Network density: Network density measures the closeness of the connections between nodes33. A larger value indicates that nodes are more closely connected, while a smaller value indicates that they are more loosely connected. It is defined as:

    $$Density\left( G \right) = \frac{2 * \left| Edge \right|}{N * \left( N - 1 \right)}$$
    (9)

    where \(\left| {Edge} \right|\) denotes the actual number of edges and N is the number of nodes.

  6. Degree density: Consider the circle centered at node vj whose radius is the distance \(R\left( v_j, v_i \right)\). The ratio of \(D\left( v_j \right)\) to the area of this circle is called the degree density from vj to vi, and is defined as:

    $$Dd\left( v_i, v_j \right) = \frac{D\left( v_j \right)}{\pi * R\left( v_i, v_j \right)^{2}}$$
    (10)

The degree density is used to adjust the influence of the receiving node on the output node.
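To make these definitions concrete, the following minimal Python sketch computes each preliminary quantity with the networkx library. It is our illustration rather than code from this paper; the example graph (networkx's built-in karate club network) and all identifiers are placeholders, and any undirected, unweighted graph can be substituted.

```python
import math
import networkx as nx

# Placeholder network: any undirected, unweighted graph can be used here.
G = nx.karate_club_graph()
N = G.number_of_nodes()

deg = dict(G.degree())                          # D(vi), Eq. (2)
dc = {v: d / (N - 1) for v, d in deg.items()}   # DC(vi), Eq. (3)
ec = nx.eigenvector_centrality_numpy(G)         # EC(vi), Eq. (6)
R = dict(nx.all_pairs_shortest_path_length(G))  # R(vi, vj), Eq. (7)
cc = nx.closeness_centrality(G)                 # CC(vi), Eq. (8)
density = nx.density(G)                         # Density(G), Eq. (9)

def degree_density(vi, vj):
    """Degree density from vj to vi, Eq. (10); assumes vi != vj and connected."""
    return deg[vj] / (math.pi * R[vi][vj] ** 2)
```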

Proposed algorithm

The HCM incorporates five factors. Eigenvector centrality is used to calculate the eigenvector value of each node, while degree centrality is used to calculate the degree value of each node. The degree value is used as a coefficient, and the difference in eigenvector values between two nodes is treated as the temperature difference ΔT: the greater the degree and the larger the difference between the eigenvector values, the larger the output value34. The network density is used as the thermal conductivity coefficient K: the higher the network density, the closer the connections between nodes and the larger the output value35. The degree density from a node to the target node is treated as the contact area A: the higher the degree density of the receiving node, the higher its influence and the larger the output value. The distance between two nodes is used as ΔL: the greater the distance, the smaller the output value36. With the help of the heat conduction formula, the value of Q is calculated as the output value for the target node. Based on Eq. (1), the output value of node vi for vj is defined as follows:

$$Q\left( v_i, v_j \right) = \frac{D\left( v_i \right) * e^{EC\left( v_i \right) - EC\left( v_j \right)} * Density\left( G \right) * Dd\left( v_i, v_j \right)}{R\left( v_i, v_j \right)}$$
(11)

In this algorithm, the output capacity of a node is measured by its output values. As the number of nodes differs across networks, the output value is normalized. The output capacity of vi is defined as:

$$I\left( v_i \right) = \frac{1}{N - 1} * \sum\limits_{j = 1, j \ne i}^{N} Q\left( v_i, v_j \right)$$
(12)

Algorithm process description

First, the capacity difference between nodes is calculated from their degree and eigenvector values. The network density, the degree density, and the distance between nodes are then computed. Finally, the output value is calculated using the HCM formula. The pseudo-code for this algorithm is shown in Table 1, and a minimal Python sketch follows.

Table 1 Pseudo-code of the HCM algorithm.
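As a companion to the pseudo-code in Table 1, the sketch below is our illustrative reconstruction of Eqs. (9) to (12), not the authors' released code. The function name is ours, and pairs of mutually unreachable nodes are simply skipped, an assumption the paper does not spell out.

```python
import math
import networkx as nx

def hcm_output_capacity(G):
    """Output capacity I(vi) of every node, following Eqs. (9)-(12)."""
    N = G.number_of_nodes()
    density = nx.density(G)                         # K in Eq. (11), via Eq. (9)
    ec = nx.eigenvector_centrality_numpy(G)         # eigenvector value of each node
    deg = dict(G.degree())
    R = dict(nx.all_pairs_shortest_path_length(G))  # distances R(vi, vj)
    capacity = {}
    for vi in G:
        total = 0.0
        for vj, r in R[vi].items():
            if vj == vi:
                continue  # no self-contribution; unreachable pairs never appear
            dd = deg[vj] / (math.pi * r ** 2)       # degree density, Eq. (10)
            total += deg[vi] * math.exp(ec[vi] - ec[vj]) * density * dd / r  # Eq. (11)
        capacity[vi] = total / (N - 1)              # normalization, Eq. (12)
    return capacity

# Usage: rank nodes by output capacity, most important first.
# capacities = hcm_output_capacity(G)
# ranking = sorted(capacities, key=capacities.get, reverse=True)
```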

Example description

In Fig. 1a, node v1 plays an important role in the topology of the entire network. If node v1 is removed, as shown in Fig. 1b, the entire network splits into two disconnected subnetworks.

Figure 1

An example of a network. (a) The original network, and (b) the comparison after node v1 is removed. The network consists of 11 nodes and 16 edges, and the yellow node is an example node.

Therefore, node v1 is an important node in the network shown in Fig. 1a.

  1. Calculate the degree value, the eigenvector value, and the distance between nodes.

    Figure 1a is taken as an example to illustrate the calculation process of the HCM. The degree value, the eigenvector value, and the distance between each pair of nodes are first determined. The results are shown in Table 2.

    Table 2 The degree value, the eigenvector value, and the distance between nodes.
  2. Calculate the network density.

    The network density can be determined according to Eq. (9): Density(G) = (2 * 16)/(11 * 10) ≈ 0.29091.

  3. Calculate the degree density.

    The degree density from each node to v1 is computed according to Eq. (10) and the results are shown in Table 3.

    Table 3 The degree density from each node to v1.
  4. Output value of node v1.

The output value of v1 for each node is calculated according to Eq. (11), and the results are shown in Table 4.

Table 4 The output value of v1 for each node.

The average of the output values from v1 to all other nodes is calculated using Eq. (12) to measure the output capacity of v1, giving I(v1) = 0.615762.

According to the above calculation process, the output capacity of each node is calculated and then sorted in descending order. The sorting results are shown in Table 5.

Table 5 The output capacity of each node.

As can be seen from Table 5, node v1 has the highest output capacity, so it is the most important node, followed by v9 and v4. Table 2 shows that the degree values of nodes v1, v4, and v9 are all 4. The values of EC(v1) and EC(v9) differ only slightly, but the values of CC(v1) and CC(v9) differ significantly, which indicates that node v1 is closer to the network center; therefore, the output capacity of node v1 is stronger. The values of CC(v4) and CC(v9) are the same, but EC(v4) and EC(v9) differ significantly, so the output capacity of node v9 is stronger. Because many influencing factors are considered, the HCM can effectively distinguish the output capacity of nodes.

Time complexity analysis

The HCM algorithm consists of four stages, and the time complexity of each is analyzed below. In the first stage, counting the edges and nodes of the network has time complexities of O(|Edge|) and O(N), respectively. In the second stage, using Dijkstra's method to calculate the distance between any two nodes in the network has a time complexity of O(N²). In the third stage, calculating the degree, eigenvector centrality, and degree density of each node has time complexities of O(N⟨d⟩), O(N²), and O(N), respectively. In the fourth stage, outputting the nodes in descending order of importance has a time complexity of O(N²). In summary, the HCM algorithm has a time complexity of O(N²).

Experimental results

Evaluation index

  1. SIR infectious disease model37.

    The SIR model is a mathematical model applied in information transmission research and an essential standard for evaluating important nodes in complex networks38. It splits the population into susceptible, infective, and removed categories, with the respective numbers at time t denoted by S(t), I(t), and R(t)39. In disease transmission, susceptible individuals become infective with infection probability α, and infective individuals become removed with recovery probability β. The mathematical model of the SIR is defined as follows:

    $$\left\{ {\begin{array}{*{20}l} {\Delta S\left( t \right) = - S\left( t \right) * \alpha * \Delta t} \hfill \\ {\Delta I\left( t \right) = S\left( t \right) * \alpha * \Delta t - I\left( t \right) * \beta * \Delta t} \hfill \\ {\Delta R\left( t \right) = I\left( t \right) * \beta * \Delta t} \hfill \\ \end{array} } \right.$$
    (13)

    where Δt represents the time interval.

    In the experiment, one node is selected as the infective node, and all other nodes are susceptible. Infective nodes infect their susceptible neighbor nodes with probability α. The number of susceptible nodes that turn into infective nodes is used as the infection value to measure the infection capacity of the node.

  2. IC model40.

    The independent cascade model is an information propagation model that provides an abstract description of the process by which information spreads. In this model, a node is designated as a seed node, and each edge in the network is assigned a propagation probability P. The seed node attempts to activate its neighboring nodes with probability P41. Each node has only one opportunity to activate another node; if it fails, it makes no further attempts on that particular node42. This propagation process is iterated until no more nodes in the network can be activated. The IC model was originally used to describe the dissemination of commodity information in marketing and is now widely used to analyze influence spreading in various fields43,44.

  3. Kendall \(\tau\) coefficient45.

The Kendall \(\tau\) coefficient is a statistic that measures the similarity between two ranked sequences. First, the infection value si of each node is determined through the SIR model; the infection values of all nodes form a sequence S = (s1, s2, …, sn), with n being the number of nodes. The HCM calculates the output capacity sequence of all nodes, H = (Q1, Q2, …, Qn), with Qi being the output capacity of node vi. For a pair of nodes vi and vj, when Qi > Qj and si > sj, or Qi < Qj and si < sj, the pairs (Qi, Qj) and (si, sj) are regarded as concordant; otherwise, they are discordant. The Kendall \(\tau\) coefficient measures the similarity between the two sequences S and H and is calculated as follows46:

$$\tau (S,H) = \frac{{2 * \left( {n_{c} - n_{d} } \right)}}{n(n - 1)}$$
(14)

where nc and nd denote the numbers of concordant and discordant pairs, respectively. Higher \(\tau\) values indicate greater similarity between H and S, while lower values indicate greater dissimilarity.
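The following sketch, assuming the graph G and the hcm_output_capacity function from the earlier sketches, illustrates how an infection value can be estimated with a discrete-time network SIR simulation and compared against the HCM via scipy's kendalltau. The simulation details (synchronous updates, per-edge infection trials) and the α value are common conventions and placeholders, not necessarily the authors' exact implementation.

```python
import random
from scipy.stats import kendalltau

def sir_infection_value(G, seed, alpha, beta=1.0, runs=1000):
    """Average number of nodes ever infected when `seed` is the only
    initial infective node (discrete-time network SIR)."""
    total = 0
    for _ in range(runs):
        susceptible = set(G.nodes()) - {seed}
        infected, removed = {seed}, set()
        while infected:
            # Each infective node tries to infect each susceptible neighbor.
            newly = {w for u in infected for w in G.neighbors(u)
                     if w in susceptible and random.random() < alpha}
            susceptible -= newly
            # Each infective node recovers with probability beta.
            recovered = {u for u in infected if random.random() < beta}
            infected = (infected | newly) - recovered
            removed |= recovered
        total += len(removed)  # everyone ever infected ends up removed
    return total / runs

# Kendall tau between the HCM output capacities and SIR infection values, Eq. (14).
nodes = list(G.nodes())
capacities = hcm_output_capacity(G)   # from the HCM sketch above
H = [capacities[v] for v in nodes]
S = [sir_infection_value(G, v, alpha=0.05) for v in nodes]
tau, _ = kendalltau(H, S)
```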

Data description

To evaluate the accuracy and applicability of the HCM, nine real networks of three sizes (large, medium, and small) are selected, with details shown in Table 6.

Table 6 Statistical characteristics of nine actual networks.

All network data are available at https://github.com/hhf602/HCM.

Contrast algorithm description

To verify the effectiveness of the HCM, eight algorithms for excavating important nodes are selected for comparison, including four well-known algorithms and four more recent ones. The eight algorithms are described in Table 7.

Table 7 Metrics for the eight comparison algorithms.

Experimental results

To evaluate the effectiveness of the HCM more comprehensively, the infection probability α was taken at ten values on the interval [0.01, 0.1] with a step size of 0.01 in the SIR model, and the recovery probability was set to β = 1 (refs. 5,10,54,55,56,57). The experimental equipment is a desktop computer with an Intel i5-10100 @ 3.10 GHz CPU and 32 GB of memory, and the software environment is Spyder (Python 3.7.3).
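As an illustration of this protocol, the helpers from the previous sketch can be swept over the ten α values; this is a sketch under the same assumptions, with β left at its default of 1.

```python
# Kendall tau for alpha = 0.01, 0.02, ..., 0.10 (beta defaults to 1).
taus = {}
for k in range(1, 11):
    alpha = k / 100
    S = [sir_infection_value(G, v, alpha) for v in nodes]
    taus[alpha] = kendalltau(H, S)[0]
print(taus)
```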

  1. Kendall \(\tau\) value comparison.

The output capacities calculated by the nine algorithms are sorted according to node number. Similarly, the infection values calculated by the SIR model under different probabilities are sorted as well. According to Eq. (14), the sorting results of each algorithm are compared with those of the SIR model under the ten probabilities, and the Kendall \(\tau\) values are obtained. The comparison results are shown in Fig. 2.

Figure 2

Kendall τ values of the different algorithms. Among the ten τ values obtained by comparing the HCM's results with the SIR model, nine are higher than those of the other algorithms in Lastfm, Ca-Astroph, and EmailEU; eight in Hamsterster and Dblp; seven in David and Ca-GrQc; and five in Netscience and AS. Generally speaking, the HCM has a clear overall advantage.

To further test the performance, we compare the HCM with the other eight methods on three additional small real networks. The basic statistics of these networks are summarized in Supplementary Table S10. The results (Supplementary Tables S11 to S13) suggest that the HCM is still very competitive (in-coreness performs best overall).

For the networks above, the τ values obtained by each algorithm are compared as the infection probability α takes different values. The results show that the HCM achieves the best effect under most infection probabilities but only ordinary performance under some. To verify the effectiveness of the HCM more comprehensively, the average Kendall \(\tau\) value obtained by each algorithm over the different infection probabilities is further compared, and the results are shown in Fig. 3.

Figure 3

Average Kendall τ values of nine algorithms under ten infection probabilities. The average τ value of each algorithm in EmailEU is small, but the HCM is still better than the other eight algorithms. In the other eight networks, the results of the HCM are almost a horizontal line and are in the highest position. This indicates that the HCM has the best overall effect, and is suitable for various networks. Other algorithms show different performances in different networks, and the results fluctuate greatly.

As the HCM takes into account factors such as degree, eigenvector, and distance, its performance is related to degree centrality, eigenvector centrality, and closeness centrality to some extent. Netscience has a hierarchical organizational structure; from Fig. 2, CC and EC perform the worst on it, resulting in low Kendall τ values for the HCM at small infection probabilities. As their τ values increase, the HCM outperforms the other algorithms on the interval [0.06, 0.1]. There is a big difference between the maximum degree Max d and the average degree <d> in EmailEU and AS, which indicates that high degree values are concentrated on a small number of nodes, resulting in poor degree differentiation among the other nodes; the performance of DC is therefore poor. Influenced by DC, the τ value obtained by the HCM is relatively low. However, because EC and CC perform better, the HCM is still better than the other algorithms. Meanwhile, by comprehensively considering the network density and the degree densities of other nodes, the HCM's τ values do not differ significantly across the nine networks.

To further evaluate the effectiveness of the HCM algorithm, we increased the value of α in the SIR model, setting α = 0.2 while keeping β = 1, and conducted the experiment again. The comparison of the Kendall τ values at α = 0.2 is shown in Fig. 4. The experimental results indicate that the proposed HCM still performs well.

Figure 4

The Kendall τ between the node influence given by the SIR model and the nine algorithms. In the six networks Netscience, Ca-GrQc, AS, Lastfm, Ca-Astroph, and EmailEU, the HCM obtains the highest Kendall τ value. In David, Hamsterster, and Dblp, the GSI performs best and the HCM is only marginally inferior, with values above 0.86.

  2. Sorting comparison of node importance.

In this section, nodes are sorted in descending order of output capacity, and their positions in the sequence are compared with those given by the SIR model. One network is selected from each of the three scales: David, AS, and EmailEU. Without loss of generality, α of the SIR model is set to 0.04 in David and AS and to 0.01 in EmailEU, while β is 1. To display the results more intuitively, the top ten important nodes are selected for comparison.

The top ten important nodes excavated by each algorithm in David, AS, and EmailEU are shown in Tables 8, 9, and 10, respectively.

Table 8 The first ten nodes of each algorithm in David.
Table 9 The first ten nodes of each algorithm in AS.
Table 10 The first ten nodes of each algorithm in EmailEU.

From Table 8, the important nodes excavated by GSI, GSM, ALSI, KBKNR, and HCM are completely consistent with those of the SIR model. The order of the first seven nodes of HCM and ALSI, as well as the first six nodes of GSI and KBKNR, is consistent with that of the SIR model; therefore, HCM and ALSI are the most effective at excavating important nodes in David. Table 9 shows that nine of the top ten important nodes of GSM and EC are the same as those of the SIR model, indicating that they perform best in AS. The first eight nodes of HCM, KBKNR, and GSI are the same as those of the SIR model, while only the first six nodes of HCM are in the same order as those of SIR; here the HCM is less effective than GSM and EC, but better than the other algorithms. From Table 10, eight of the first ten nodes of HCM, GSI, and ALSI are the same as those of the SIR model, the best result, followed by BC and GSM with seven matching nodes. CC and DC exhibit the worst performance, with only four matching nodes.

To further verify the effectiveness of the HCM in excavating important nodes, the sorting results of all nodes by the various algorithms are compared with those of the SIR model. For comparability, the infection value from the SIR model is used as the reference: the infection value of each node is first obtained through the SIR model, and the infection values are then re-sorted according to the node order produced by each algorithm. When an algorithm's sorting is consistent with the SIR model, the new sequence of infection values descends from large to small, forming a smooth downward curve from left to right in the graph. To highlight the important nodes, the values are presented on a linear scale in small-scale networks such as David and Netscience and on a log10 scale in the other networks. To ensure the accuracy of the results, the SIR model is run for 100 iterations on the large-scale network EmailEU and for 1000 iterations on the other networks, and the average is taken as the infection value of each node. The results of the nodes sorted by the various algorithms and by the SIR model are compared in Fig. 5.

Figure 5

Results of re-sorting infection values by each algorithm. The x-axis represents the number of nodes in the sorting results of each algorithm. For example, 10 represents the top ten nodes with the highest output capacity. The y-axis is the infection value of each node obtained in the SIR model.

As seen in Fig. 5, the curve formed by the HCM has a narrower overall fluctuation range in David, Netscience, Hamsterster, Dblp, and EmailEU than those of the other algorithms, indicating that the HCM is the most consistent with the SIR model in sorting node importance. In Ca-GrQc, the HCM, ALSI, GSM, GSI, and EC curves are smooth on the left, indicating that these five algorithms sort the most important nodes consistently with the SIR model; each has burrs on the right side, indicating that the positions of some individual nodes differ from those of the SIR model. Among them, EC exhibits the best effect with relatively fewer burrs. In AS, the results of GSM, EC, and CC form smooth downward curves compared with the SIR model; although burrs exist, they are few in number and small in amplitude, so these three algorithms perform best. The effects of HCM and GSI are worse than those of GSM, EC, and CC, but better than those of BC, DC, KBKNR, and ALSI. In Lastfm, the left part of the EC curve shows a smooth downward trend; the GSI curve has few but large-amplitude burrs, while the GSM curve has many but small-amplitude burrs. The left part of the HCM curve fluctuates greatly, but the right part fluctuates less. Therefore, EC performs best there, followed by GSI and GSM; although the HCM has no obvious advantage, it is still better than the other algorithms. In Ca-Astroph, the algorithms perform similarly except for BC. Considering the performance across all nine networks, the HCM has the best overall performance.

  3. Comparison of the infection capacity of the top ten nodes.

In the previous experiments, the SIR model is used as the criterion to evaluate the output capacity and sorting results of the important nodes excavated by the different algorithms. Next, from the node sequences sorted by the various algorithms, the top ten nodes are selected as infective nodes. The SIR model is used to calculate their infection values and to measure the infection capacity of multiple nodes. For the SIR model, α is set to 0.5, β to 1, the infection time t to 30, and the number of iterations to 1000 (refs. 55,56). The infection results of each algorithm are shown in Fig. 6.

Figure 6

Infection values of the top ten nodes of the nine algorithms. The x-axis represents time t, and the y-axis represents the number of infective nodes at time t. As the infection values are close, the results in some networks are amplified.

In Fig. 6, the infection values of the ten infective nodes gradually increase with time t. When t ≈ 5, the number of infective nodes reaches its maximum. In the beginning stage, because the infective nodes selected by BC lie on the largest number of shortest paths and those selected by DC have the most neighbors, the nodes selected by BC and DC infect the most susceptible nodes, so these two algorithms perform best initially. The infection values of the nodes selected by the HCM are not the largest at the beginning, but when t > 9 they exceed those of the other algorithms in David, Hamsterster, Ca-GrQc, AS, Lastfm, Dblp, Ca-Astroph, and EmailEU. In these eight networks, the nodes selected by the HCM can infect other nodes more easily and infect more of them. Netscience has a hierarchical organizational structure, and the nodes identified by DC and BC there have the strongest infection capacity, followed by the HCM. The analysis of Fig. 6 shows that the HCM has the best overall performance in the evaluation of multi-node infection capacity.

To further test the performance, we repeated the experiment with α set to 0.4 and β to 1; the results (Supplementary Tables S23 to S31) suggest that the HCM method is still very competitive.

To further demonstrate the effectiveness of the method, this study also conducted multi-node propagation experiments using the IC model. The experiments used the top ten nodes identified by each method as the seed set. Selecting 2, 4, 6, 8, and 10 seed nodes in turn, the other nodes were activated with the propagation probability P set to 0.5 and the number of iterations set to 1000, and the average value was taken as the propagation value (a simulation sketch is given below). The propagation results of each method are shown in Fig. 7.
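A minimal sketch of the IC simulation described above, under the same assumptions as the earlier sketches; the function name and the seed-selection usage note are ours, with P = 0.5 and 1000 iterations as stated.

```python
import random

def ic_spread(G, seeds, p=0.5, runs=1000):
    """Average number of nodes activated under the independent cascade model."""
    total = 0
    for _ in range(runs):
        active = set(seeds)
        frontier = set(seeds)
        while frontier:
            new = set()
            for u in frontier:
                for w in G.neighbors(u):
                    # u gets exactly one chance to activate each inactive neighbor
                    if w not in active and random.random() < p:
                        new.add(w)
            active |= new
            frontier = new  # only newly activated nodes try in the next round
        total += len(active)
    return total / runs

# Seed sets of size 2, 4, 6, 8, 10 drawn from the hypothetical HCM `ranking`
# list in the earlier usage note.
# spread = {k: ic_spread(G, ranking[:k]) for k in (2, 4, 6, 8, 10)}
```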

Figure 7

Comparison of the numbers of activated nodes by HCM and other algorithms on nine networks. The x-axis represents seed set size, and the y-axis represents the number of activated nodes.

From Fig. 7, it can be observed that the number of activated nodes gradually rises as the number of seed nodes increases. In the David, Hamsterster, Ca-GrQc, and AS networks, the HCM outperforms the other methods in selecting seed nodes for propagation. In the Dblp, Ca-Astroph, and EmailEU networks, the HCM outperforms all other algorithms when the number of seed nodes is 4, 6, 8, or 10. In the Lastfm network, the HCM shows a significant advantage when the number of seed nodes is 2, 4, 8, or 10. In the Netscience network, the HCM performs worse than BC, CC, and DC. Overall, the HCM achieves good results in the David, Hamsterster, Ca-GrQc, AS, Lastfm, Dblp, Ca-Astroph, and EmailEU networks, and the experimental results under the IC model are consistent with those under the SIR model.

The source code is available at https://github.com/hhf602/HCM/blob/main/Code.

Conclusion

In this paper, an algorithm for excavating important nodes, the HCM, is proposed from the perspective of node output capacity. Inspired by degree centrality, eigenvector centrality, and closeness centrality, it considers the degree value, eigenvector value, and distance between nodes when measuring node importance. Meanwhile, the network density and the degree densities of other nodes are introduced to reduce the influence of network structure characteristics on the algorithm's accuracy. Finally, the output capacity of each node is calculated by the HCM formula and used as the indicator of node importance. Nine real networks from real-world complex systems are selected, and similarity experiments on node output capacity, comparison experiments on node importance sorting, and multi-node infection capacity experiments are carried out with the SIR model as the evaluation criterion. Furthermore, the top-2, top-4, top-6, top-8, and top-10 nodes of each algorithm are taken as seed nodes for multi-node concurrent propagation experiments in the IC model. Compared with eight algorithms for excavating important nodes, the experimental results show that the HCM outperforms the other algorithms overall, verifying its accuracy and effectiveness.

The advantage of the HCM is that the output capacity of nodes is calculated through five attributes: the degree value, eigenvector value, distance, network density, and degree density. As the output capacity results from the combined influence of the five attributes, the accuracy of the results is not dominated by any single over-large or over-small attribute. At the same time, considering the network density and the degree densities of other nodes reduces the influence of network structure characteristics on the calculation results, which makes the HCM highly universal. As the HCM incorporates more attribute information, it improves accuracy but also increases time complexity. Future research will focus on reducing the time complexity while maintaining accuracy.