Improved collective influence of finding most influential nodes based on disjoint-set reinsertion

Identifying vital nodes in complex networks is a critical problem in the field of network theory. To this end, the Collective Influence (CI) algorithm has been introduced and shows high efficiency and scalability in searching for the influential nodes in the optimal percolation model. However, the crucial part of the CI algorithm, reinsertion, has not been significantly investigated or improved upon. In this paper, the author improves the CI algorithm and proposes a new algorithm called Collective-Influence-Disjoint-Set-Reinsertion (CIDR) based on disjoint-set reinsertion. Experimental results on 8 datasets with scales of a million nodes and 4 random graph networks demonstrate that the proposed CIDR algorithm outperforms other algorithms, including Betweenness centrality, Closeness centrality, PageRank centrality, Degree centrality (HDA), Eigenvector centrality, Nonbacktracking centrality and Collective Influence with original reinsertion, in terms of the Robustness metric. Moreover, CIDR is applied to an international competition on optimal percolation and ultimately ranks in 7th place.

Researchers have developed numerous measures to evaluate node importance. The most widely used centrality methods include Betweenness centrality 19 , Closeness centrality 20 , PageRank centrality 21 , Degree centrality (HDA) 22 , Eigenvector centrality 23 , and Nonbacktracking centrality 24 . Betweenness centrality 19 is defined to represent a node as the number of shortest paths from all vertices to all other paths that pass through that node. Closeness centrality 20 uses the sum of the length of the shortest paths between the node and all other nodes in a graph as a node's value. PageRank centrality 21 was first proposed by Google to rank websites and works by counting the number and quality of links to a page to determine a rough estimate the importance of a website. Degree centrality (HDA) 22 ranks nodes directly according to the number of connections and recalculates the degree after each removal of the top ranked node. Eigenvector centrality 23 computes the centrality for a node based on the idea that the importance of a node is recursively related to the importance of the nodes pointing to it. A high eigenvector score means that a node is connected to many nodes who themselves have high scores. However, for some graphs, the Eigenvector centrality will produce an echo chamber effect and localization onto a hub. Nonbacktracking centrality 24 modified the standard Eigenvector centrality based on the Nonbacktracking matrix to ignore the reflection mechanism on hubs, therein being asymptotically equivalent to Eigenvector centrality for dense networks and avoiding hub localization on sparse networks.
However, for these methods, the node importance is evaluated by regarding a node as an isolated agent in a non-interacting setting. Consequently, these methods are considered as heuristic methods and fail to provide the optimal solution in the general case of finding a single influential node among multiple spreaders 4,25 . To address the issue, a scalable theoretical framework called the Collective Influence (CI) algorithm, which attempts to find the minimal fraction of nodes that can fragment the network in optimal percolation, was recently proposed 4 . If the influencers are removed in a network, the network will face structural collapse, and a giant connected component G of the graph will be 0. The CI would improve and maximize the collective influence of multiple influencers, and the accurate optimization objective is highly adaptable for giving an optimal set of spreaders for various networks. For networks with millions of nodes, such as massive social media and social networks 22 , CI also performs well in processing centrality efficiently.
The implementation of the CI algorithm contains 2 steps. In the first step, CI calculates the value of each node in the network and removes the node with the highest value according to their importance one by one until the giant component is destroyed. Then, in the second step, CI adds back removed nodes and reconstructs the collapsed network, i.e., reinsertion. Reinsertion 4,22 is the refined post-processing in the CI algorithm and minimizes the giant component G of the graphs for the target G > 0. Although CI has already demonstrated its efficiency in searching for the potential influential nodes in the optimal percolation model, the reinsertion step in CI has rarely been discussed. Until now, the optimal percolation model only addressed the issue of dismantling networks in the first step of CI. The second procedure, reinsertion, is not designed to be optimal, which leads to the fact that the former optimal percolation model is unable to achieve optimal results. Therefore, it is necessary to address this issue by designing a better reinsertion step.
Robustness 26 is a recently proposed measure for quantifying the performance of methods for ranking nodes. This paper improves the reinsertion method with respect to the Robustness metric to find the most influential nodes in CI, and it proposes a new algorithm named Collective-Influence-Disjoint-Set-Reinsertion (CI DR ). CI DR mainly employs disjoint sets 27 as the data structure to optimize reinsertion in the CI algorithm and reorder the removed nodes into a new sequence.
The proposed CI DR method is verified in the International Competition of optimal percolation 28 and ultimately ranks in 7th place. The competition adopts the Robustness metric as the scoring criteria and provides 4 real networks from different fields, i.e., autonomous system networks, Internet networks, road networks and social networks, and 4 classical artificial networks (8 datasets in total). The node counts of these networks range from 0.4 million to 2 million. Therefore, the competition network benchmark quite representative overall. The results of the experiments indicate that the proposed CI DR method outperforms the other 7 methods on 8 competition datasets. The methods include Betweenness centrality 19 , Closeness centrality 20 , PageRank centrality 21 , Degree (HDA) centrality 22 , Eigenvector centrality 23 , Nonbacktracking centrality 24 and Collective Influence with original reinsertion 22 as comparison algorithms.
A total of 4 extra random graphs in the ER model generated locally are also utilized to verify CI DR . The results on the 4 random graphs show that CI DR is also better than the other methods listed in the paper, similar to the results on the 8 above competition datasets.

Results
Difference between reinsertion in CI and Collective-Influence-Disjoint-Set-Reinsertion (CI DR ). The CI algorithm contains 2 steps: removing nodes and reinsertion. For the removing node step, CI calculates the value of node i in Formula 1 and removes the node with the highest value. ) k i is defined as the degree of node i . δB(i, l) is the frontier of the ball centered on node i with Radius l, which refers to the shortest path l from frontier nodes to node i 4 . The newly proposed CI DR method also calculates the value of each node following Formula 1, which is the same as in CI. The difference between CI and CI DR is that they implement different strategies in the reinsertion step. In CI, the original reinsertion step is invoked 22 in Algorithm 1 after the networks are broken down into many pieces through the process of removing nodes. An initial collapsed graph G c is generated after CI removes nodes from the graph G, and then, reinsertion selects the removed nodes to reconstruct the collapsed graph G c .
For the reinsertion step in the improved CI DR , 2 main enhancements are proposed in this paper: • CI DR implements disjoint sets as the data structure 27 to store the indices of the connected components during reinsertion for a collapsed graph. • CI DR considers the rejoined node count instead of the number of rejoined clusters to decide which node will be reinserted. In the original reinsertion in Algorithm 1, CI utilizes brute-force graph traversal to label the indices of the connected components in a dismantled network (Step 1). The time cost is high since the labeling operation will be executed multiple times until all removed nodes are reinserted. In particular, when a dismantled network has nearly reached the completion of the reinsertion process and most of the removed nodes are reinserted, the labeling operation must build up from nothing every time. Considering that deciding which node will be reinserted is performed several times in the original reinsertion process, the information of the reinsertion for each iteration can be reserved and prepared for the next round of decisions on which removed node will be reinserted.
The first enhancement of CI DR implements disjoint sets to optimize the data structure to reduce the computational resource consumption. A disjoint set 27 is a tree structure, where each node stores a pointer to the parent node. If the parent pointer of a node points to itself, this node is the root of a tree and is the representative index of its cluster. In Algorithm 1, the index information of the connected components in the reconstructed graph is abandoned at the end of each iteration for updating indices. Using the disjoint-set data structure, it is possible to maintain the indices of the connected components for a collapsed network when the dismantled nodes are reinserted into the graphs. The disjoint-set data structure provides 2 nearly constant-time operations. The first operation is called the Find operation, which determines which indices of connected components the current nodes stay in. The second operation is the Union operation, which merges several clusters into one.
The Find operation locates which connected components a node belongs to. The operation can follow the parent node continuously in a tree of a cluster until it finds the root node, which denotes the index of a connected component. The Find operation is utilized to replace Step 2 in Algorithm 1 and is capable of retrieving the index set I i of the connected components of the neighboring nodes around node i .
The Union operation merges clusters to which 2 nodes belong into one connected component. This operation uses the Find operation to determine the roots of the trees. If the roots of 2 nodes are distinct, the trees are combined by attaching the root of one to the root of the other node. When newly removed nodes are reinserted, the Union operation is capable of preserving the index information of the connected components in the iteration when updating indices continuously.
Several optimization methods on disjoint sets, such as Path Compression and Union by Size, are applied in the implementation to improve the Find and Union operations 29 . Path Compression flattens the structure of the tree by making every node point to the root when the first Find operation is invoked on the tree. This will speed up and decrease the complexity of future Find operations. Union by Size means that the Union operation attaches the tree with fewer nodes to the root of the nodes containing more elements. This is also another method for flattening the structure of the tree. The size of the cluster is stored in the root node of a tree, and the new size of the cluster following the Union operation is equal to the sum of the sizes of the root nodes of the original trees.
The computational complexity of both the Find and Union operations is O(inverse_oka(n)) when the Path Compression and Union by Size optimization methods are utilized, where inverse_oka(n) represents the inverse Ackermann function. The inverse Ackermann function contains a value inverse_oka(n) < 5 for any very large value of n that can be written in this physical universe. Therefore, inverse_oka(n) in the Find and Union operations is optimal and can essentially be regarded as constant time 29,30 . In the experiment analysis section below, the statistical results also show that utilizing the disjoint-set data structure in CI DR achieves greater efficiency and is faster than the original reinsertion algorithm in CI. For the second enhancement, CI DR considers the number of rejoined nodes instead of the number of rejoined clusters when deciding which removed nodes will be reinserted. This improvement is applied in Step 4 in Algorithm 1. The purpose of the modification is to enhance the final Robustness score of the original reinsertion operation in the CI algorithm. The original reinsertion operation adds back the nodes that rejoin the smallest number of clusters. Nevertheless, the method does not consider the smallest node count of the rejoined clusters globally. In contrast to the number of rejoined clusters, the information about the rejoined node counts is more representative for a connected component. Because the first enhancement implements the disjoint-set data structure and because the Union by Size optimization is enabled, the node count of each connected component is stored in the root node of the corresponding tree in the disjoint set. The smallest node count of rejoined connected components can be conveniently selected from all candidate nodes. Figure 1 is an example of different choices of the candidate reinserted nodes decided by the original reinsertion and CI DR methods. Round nodes have been in the collapsed network, and there are 2 candidate nodes to be reinserted: the square node and the triangle node. If the original reinsertion method in CI is applied, it will reinsert the square node because the number of rejoined clusters is 2; fewer than 3 clusters are reinserted by the triangle node. If CI DR is exploited, it will reinsert the triangle node because there are 3 rejoined nodes, and fewer than 4 rejoined nodes are reinserted by the square nodes. CI DR more strongly considers the global impact of the rejoined nodes on the collapsed network compared with the original reinsertion process. In conclusion, the CI DR algorithm with the first and second enhancements for the reinsertion process is shown in Algorithm 2.

Experiments and comparison of different methods on 8 datasets provide by DataCastle Master
Competition. In this subsection, several centrality methods are verified on 8 datasets provided by the DataCastle Master Competition 28 . The task of the competition is a generic challenge identifying vital nodes in  networks that are important for sustaining connectivity. The competition provides 4 real networks from different fields, i.e., autonomous system networks, Internet networks, road networks and social networks, and 4 classical artificial networks (for a total of 8 datasets). These networks each include 0.4 million to 2 million nodes, and all networks are considered undirected networks. Table 1 reflects the network name and corresponding number of nodes. Robustness 26 is utilized as the scoring criterion in the competition. It is introduced to quantify the performance of the methods for ranking nodes. For the calculation of the Robustness score, refer to Formula 2.
The parameter p is defined as the proportion of removed nodes. δ is the size of the giant component of the remaining networks in proportion after removing a proportion p of the nodes. The δ-p curve can be derived from plotting p on the x-axis and δ on the y-axis. The Robustness is defined as the area under the δ-p curve. δ ( ) i n is the size of the giant component after removing = p i n of the nodes from a network 28 . Generally, the goal of an algorithm that finds the most influential nodes is to give a ranked list of nodes according to their importance, where the top-ranked nodes will have greater importance. Nodes can be removed from a network according to the ranking list. The removal operation breaks down the network into many disconnected pieces. If the size of the giant component is calculated after the removal of each node, the ratio of the giant component will ultimately go to 0. Therefore, a better algorithm for ranking nodes will dismantle networks sooner and produce better Robustness scores.
The Robustness under CI without Reinsertion, with Original Reinsertion and CI DR on 8 competition datasets is presented in Table 2 and Fig. 2. CI without Reinsertion refers to the case in which CI only invokes the process of removing nodes and does not reinsert nodes into networks. CI with Original Reinsertion refers to the case with the node removal step and original reinsertion in Algorithm 1. For various similar CI algorithms, the radius is the required input parameter for the node removal step. A larger radius will optimize removing steps and produce a smaller set of minimal influential nodes when dismantling networks. As a trade-off, the step of removing nodes   will cost more computing resources and take longer. Meanwhile, a larger radius will speed up reinsertion because fewer dismantled nodes are reinserted. The result of using different input radii of 0, 1, and 2 for the various CI methods is also shown in Table 2 and Fig. 2.   Table 4. Robustness of different heuristic algorithms, including Betweenness centrality, Closeness centrality, PageRank centrality, Degree centrality (HDA), Eigenvector centrality and Nonbacktracking centrality, on 8 competition datasets. Table 2 and Fig. 2 show that the total Robustness (the lower, the better) of CI DR on the 8 datasets is better than in the cases without Reinsertion and with Original Reinsertion. This is because in the reinsertion process, the rejoined node count is more representative in vital nodes than the number of rejoined clusters in the Robustness metric. For a radius of 0, the total Robustness score under CI with Original Reinsertion of 1.1863 decreases by 12% to 1.0403 for CI DR . For a radius of 1, the total Robustness score of 1.1099 for CI decreases by 6% to 1.0409 for CI DR . For a radius of 2, the total Robustness score of 1.0701 decreases 4% to 1.0246 for CI DR . For each individual dataset, CI DR performs better than CI in terms of Robustness in 7 of the 8 networks when a radius of 0 is used in CI and CI DR . CI DR ranks second to CI only in the model 3 network. When a radius of 1 and a radius of 2 are adopted in CI and CI DR , CI DR obtains a better score than CI on 5 of the 8 networks. CI DR ranks second behind CI in the model 1, model 2 and model 3 networks and obtains nearly the same and best result.
For different radii as input parameters, the total Robustness values in CI DR are all better than those with Original Reinsertion. Even for the case when the radius is 0 and the process of removing nodes degenerates into that of Degree centrality (HDA) 22 , CI DR is capable of achieving a considerably better score of 1.0403 compared with the previous best result of 1.0701 under Original Reinsertion with a radius of 2. For the smaller radius, CI DR exploits potential performance increases in terms of the Robustness metric, in contrast to Original Reinsertion. Therefore, there is no need to set a higher radius, which increases the complexity of removing nodes in CI. A lower radius is able to achieve nearly the same results in CI DR .   In Table 3 and Fig. 3, the time consumption of Original Reinsertion and CI DR excluding removing nodes on 8 competition datasets is presented. The datasets are verified on the same machine with a 4-core CPU (Intel Xeon E5-2667v4 Broadwell 3.2 GHz) with 8 GB of memory concurrently. For most cases with different radii, the statistics show that CI DR is better in terms of speed than Original Reinsertion, excluding the node removal steps, in CI. For the real 3 dataset with a radius of 2, the time consumption can be reduced 79.3% from 262 s to 54 s. The statistics evidence that implementing the disjoint-set data structure in CI DR is more efficient than the Original Reinsertion algorithm.
Heuristic algorithms, including Betweenness centrality 19 , Closeness centrality 20 , PageRank centrality 21 , Degree centrality (HDA) 22 , Eigenvector centrality 23 and Nonbacktracking centrality 24 , are verified on these datasets as competitors in Table 4 and Fig. 4. The Robustness values of these heuristic algorithms are all worse than CI and CI DR for the 8 competition networks. The worst value of Closeness centrality is only 2.88.
The recently proposed Nonbacktracking centrality 24 is also verified on these 8 competition datasets. Nonbacktracking centrality was introduced by Newman et al. and modified from the standard Eigenvector centrality based on the Hashimoto or Nonbacktracking matrix [31][32][33] . Nonbacktracking centrality is very similar to Eigenvector centrality, where the main improvement is to ignore the echo chamber effect producing localization on a hub. This is asymptotically equivalent to Eigenvector centrality for dense networks and avoids the hub localization on sparse networks introduced by Eigenvector centrality. Therefore, the performance of Nonbacktracking centrality in dense networks will be highly similar to Eigenvector centrality. From Table 4 and Fig. 4, the statistics show that corresponding scores of 2.7725 and 2.7320 for Eigenvector centrality and NonBacktracking centrality, respectively, are quite similar. Both Nonbacktracking centrality and Eigenvector centrality are not superior to CI and the proposed CI DR in terms of Robustness for the 8 competition datasets.

Experiments and comparison of different methods on 4 randomly generated graphs.
In addition to the 8 above-mentioned competition datasets, 4 random graph networks in the ER model 34 are also adopted as additional test cases. Table 5 shows the information about the number of nodes and mean degree for each graph. The Robustness values under CI with Original Reinsertion, CI DR and other heuristic algorithms, including Betweenness centrality, Closeness centrality, PageRank centrality, Eigenvector centrality and Nonbacktracking centrality, are presented in Table 6 and Fig. 5 for 4 random graphs.
The total Robustness of CI DR is 0.3443, 0.3420 and 0.3429 for radii of 0, 1, and 2 on 4 randomly generated graphs, which all outperform the other listed centrality methods. For each individual dataset, the Robustness of CI DR also ranks 1st out of all applied methods. For a radius of 0, CI DR performs better in terms of Robustness than CI with Original Reinsertion for a radius of 2. The same result as that on the 8 above-mentioned competition datasets that a lower radius under CI DR outperforms a higher radius under CI with Original Reinsertion is found.
The NonBacktracking centrality achieves a score of 0.9886, which is slightly better than the score of 1.1089 of the Eigenvector centrality; this is because the former method is based on the latter method. However, a score of 0.9886 is unable to compete with CI and CI DR , similar to the results on the 8 above competition datasets.

Discussion
For 8 competition datasets and 4 local randomly generated graphs under the ER model, the best overall result from the previous algorithms is CI with Original Reinsertion for a radius of 2. After the newly proposed algorithm CI DR is applied, even CI DR employing a radius of 0 (degenerate to HDA) is capable of achieving a better result. This indicator shows that the proposed disjoint-set reinsertion in CI DR is able to achieve better Robustness compared to Original Reinsertion. The recently proposed Nonbacktracking centrality and the other above-mentioned algorithms are also unable to outperform CI DR in terms of Robustness.
CI with Original Reinsertion uses the number of rejoined clusters to decide which node will be reinserted. On the other hand, CI DR considers the rejoined node count in the second proposed enhancement. Nevertheless, CI with Original Reinsertion and CI DR implement different methods; both methods attempt to obtain a score that is capable of representing node i . Therefore, a reinsertion framework derived from CI DR can be extended to a more general model. The Generic Disjoint-set Reinsertion Framework (GDRF) is proposed in Algorithm 3 as a general method for describing the process of disjoint-set reinsertion. Step 4 in Algorithm 3 retrieves the score K i by a specific kernel indicating node i , and the other steps in Algorithm 3 are the same as in the reinsertion in the CI DR Algorithm 2. GDRF employing the number Of Clusters kernel (Algorithm 4) and the number Of Nodes kernel (Algorithm 5) is able to achieve the same Robustness score as CI with Original Reinsertion and CI DR , respectively.
Since the reinsertion of independent post-processing can be combined with any network dismantling method to obtain an improved Robustness score, the greater potential of GDRF can be investigated. For the task of finding the most influential nodes in complex networks, the number Of Clusters kernel and the number Of Nodes kernel are implemented to reinsert the nodes with less importance in priority.
In future work, more kernels can be studied to investigate whether GDRF can be applied to other issues. For instance, reconstructing damaged networks is also a widely studied field, and various repair strategies have been presented to repair collapsed networks [35][36][37][38][39] . A new kernel for GDRF, which would be designed to reinsert the nodes with greater importance in priority, can be developed for the task of recovering attacked networks as soon as possible. The Recover Nodes kernel in Algorithm 6 combined with GDRF is an example of selecting the nodes with greater importance in priority. S i represents the total number of rejoined clusters if the removed node i is reinserted. A larger S i means that recovering node i would connect and repair more connected components in the reinsertion process. Compared with the random reinsertion of removed nodes, the recover Nodes kernel tends to reinsert nodes combining with more connected components, which means that recovering a network to certain giant components G > 0 would need fewer reinserted nodes. Since GDRF will search for the minimum K min value of node min among all removed node values K i , Algorithm 6 would return the reciprocal of S i as K i , and node i with larger values of S i would be reinserted in priority.
The RecoverNodes kernel is only suitable when the nodes in a network, instead of the edges, are attacked. The RecoverNodes kernel will also not modify the original network topology after the recovery process. Because this paper mainly focuses on the influential nodes and not repairing attacked networks, additional research on the performance of the recoverNodes kernel compared with previous algorithms can be conducted in future studies.

Methods
As mentioned above, CI DR calculates the value of each node of a network using Formula 1 and removes the nodes with the highest value. In particular, if the radius is set to 0, the step of removing nodes in CI and CI DR will degenerate to the High Degree Adaptive (HDA) algorithm. The concept of HDA was proposed 4 as a better strategy and is slightly different from the original Degree centrality method. The degree of the remaining nodes in adaptive HDA is recomputed after each node removal.  To verify CI and CI DR on the datasets, 2 implementations of the algorithm are utilized: CI_HEAP 40 and ComplexCi 41 . CI_HEAP was provided by the original paper written in the C language and generates the statistics of the Original Reinsertion method. ComplexCi is newly developed as a C++ implementation and produces the statistics of CI DR . CI_HEAP and ComplexCi share the same parameters as follows: • The start points of the reinsertion in CI_HEAP and ComplexCi are the same. Both start to reinsert a node when the size of the giant component collapses to 1% of the whole network. • The finite fractions of nodes at each reinserted step in CI_HEAP and ComplexCi are the same, and both methods reinsert 0.1% at each step. • The intervals of the computing component in CI_HEAP and ComplexCi are the same. To determine whether 1% of the giant component has been reached, CI_HEAP and ComplexCi both need to compute the size of the giant component periodically. The interval parameter is 1%, which means that they will calculate the giant component after the CI algorithm removes 1% of the network nodes.
There are several differences between CI_HEAP and ComplexCi when implementing their algorithms as follows.
• CI_HEAP uses the Original Reinsertion method, and ComplexCi uses CI DR .
• Compared with the initial proposed CI 4 , CI_HEAP enhances the algorithm by utilizing the max-heap data structure 22 for very efficiently processing the CI values. The computational complexity of CI is O(Nlog N) when removing nodes one by one, made possible through an appropriate data structure for processing CI. The ComplexCi application uses a red-black tree with the STL (Standard Template Library) container SET as a different data structure to store and update the CI values. In the field of C++ programming, the SET and MAP containers in STL are usually implemented as red-black trees, which are a type of self-balancing binary search tree. The average computational complexity of a red-black tree in searching, inserting and deleting nodes is O(log N). Although the red-black tree does not outperform the performance of deleting and updating, in contrast to max-heap, red-black tree is still able to achieve an overall computational complexity of O(Nlog N). In CI DR , Introselect algorithm 42 is used to select the top N qualified nodes without a sort algorithm, therein simply being of O(N). We do not need to know the order of the S min array using full sort; we simply need to know the top N qualified nodes.
For Betweenness centrality 19 , Closeness centrality 20 and PageRank centrality 21 , a complex network python library GraphTools 43 is utilized to obtain the statistics. For Eigenvector centrality and Nonbacktracking centrality, the python tool NetworkX 44 is implemented to generate the statistics. To obtain the Nonbacktracking centrality of a network, if the leading eigenvector of its Nonbacktracking matrix B is computed directly according to the definition, the computational complexity will be high. In practice, a faster computation can be executed by utilizing the so-called Ihara (or Ihara-Bass) determinant formula 31,45,46 . It can be shown that the centralities on the Nonbacktracking matrix are equal to the first n elements of the leading eigenvector of the 2N * 2N matrix in Formula 3: where A is the adjacency matrix, I is the identity matrix, and D is the diagonal matrix, with the degrees of the vertices along the diagonal 24 . As mentioned above, the 8 competition datasets in the paper were obtained from the DataCastle Master Competition 28 , therein providing 4 real networks and 4 classical artificial networks. The 4 extra randomly generated graphs under the ER model are generated locally by the python utility NetworkX 44 .
For the calculation of the Robustness score, the code used in this paper is implemented from the DataCastle Master Competition and can be found at the official DataCastle website 47 .