Identifying vital nodes based on reverse greedy method

The identification of vital nodes that maintain the network connectivity is a long-standing challenge in network science. In this paper, we propose a reverse greedy method in which the least important nodes are preferentially chosen so as to keep the size of the largest component in the corresponding induced subgraph as small as possible. Accordingly, the nodes chosen later are more important in maintaining the connectivity. Empirical analyses on eighteen real networks show that the reverse greedy method performs remarkably better than well-known state-of-the-art methods.


Algorithms
The core of the RG algorithm is the reverse process, which adds nodes one by one to an empty network while minimizing the cost function, until all nodes of the considered network have been added. Nodes are then ranked in the inverse order of addition; that is to say, the later-added nodes are more important in maintaining the network connectivity. Denote by $G(V,E)$ the original network under consideration, where $V$ and $E$ are the sets of nodes and edges, respectively. This paper focuses on simple networks, where the weights and directions of edges are ignored and self-loops are not allowed. The reverse process starts from an empty network $G_0(V_0,E_0)$, where $V_0=\emptyset$ and $E_0=\emptyset$. At the $(n+1)$th time step, one node from the remaining set $V\setminus V_n$ is selected and added to the current network $G_n(V_n,E_n)$ to form a new network of $n+1$ nodes, say $G_{n+1}(V_{n+1},E_{n+1})$. Note that all progressive networks $G_n$ ($n=0,1,2,\ldots,N$, with $N$ being the size of the original network $G$) in the process are induced subgraphs of $G$; for example, $G_n$ consists of all edges in $G$ with both ends belonging to $V_n$. According to the greedy strategy, the selected node $i$ should minimize the size of the largest component in $G_{n+1}$. If there are multiple nodes satisfying this condition, we choose among them with the help of another structural feature of node $i$ in $G$ (e.g., degree, betweenness, and so on). Therefore, the cost function can be defined as

$$\mathrm{cost}(i,\,n+1)=\left|G_{n+1}^{\max}(i)\right|+\epsilon\, f(i),$$

where $\left|G_{n+1}^{\max}(i)\right|$ is the size of the largest component after adding node $i$ into $G_n$, $f(i)$ is a certain structural feature of node $i$ in $G$, and $\epsilon$ is a very small positive parameter that has no effect on the results of RG as long as $\epsilon\cdot\max_i f(i)<1$ is guaranteed. The parameter $\epsilon$ takes effect only when $\left|G_{n+1}^{\max}\right|$ is indistinguishable among multiple candidate nodes.
At each time step, we add the node minimizing the cost function to the network; if multiple nodes attain the minimum cost, we select one of them at random. The process stops after $N$ time steps, namely when all nodes have been added and $G_N\equiv G$. An illustration of this process on a small network is shown in Fig. 1.
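The reverse process described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it takes an undirected, unweighted edge list, tracks component sizes incrementally with a union-find structure, and breaks residual ties deterministically by node label rather than at random. The argument `f` is the structural feature used in the cost function (e.g., the degree of the node in the original network $G$).

```python
from collections import defaultdict

def reverse_greedy(edge_list, f, eps=1e-7):
    """Rank nodes by the reverse greedy process: repeatedly add the node
    whose addition keeps the largest component of the induced subgraph
    smallest (cost = largest-component size + eps * f(i)); nodes added
    later are ranked as more important."""
    adj = defaultdict(set)
    nodes = set()
    for u, v in edge_list:
        nodes.update((u, v))
        if u != v:                          # simple network: no self-loops
            adj[u].add(v)
            adj[v].add(u)

    parent, size = {}, {}                   # union-find over present nodes

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    present, order, cur_max = set(), [], 0

    def cost(i):
        # distinct components of already-present neighbors that i would merge
        roots = {find(j) for j in adj[i] if j in present}
        merged = 1 + sum(size[r] for r in roots)
        # if the current largest component is merged, merged >= cur_max,
        # so max(merged, cur_max) is the new largest-component size
        return max(merged, cur_max) + eps * f(i)

    remaining = set(nodes)
    while remaining:
        best = min(remaining, key=lambda i: (cost(i), i))
        roots = {find(j) for j in adj[best] if j in present}
        parent[best], size[best] = best, 1
        for r in roots:                     # union merged components under best
            parent[r] = best
            size[best] += size[r]
        cur_max = max(cur_max, size[best])
        present.add(best)
        remaining.remove(best)
        order.append(best)

    return order[::-1]                      # most important (added last) first
```

On a star graph, for instance, the hub is added last (adding it earlier would immediately merge components) and is therefore ranked most important.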

Data
In this paper, eighteen real networks from disparate fields are used to test the performance of RG, including four collaboration networks (Jazz, NS, Ca-AstroPh and Ca-CondMat), two communication networks (Email-Univ and Email-EuAll), four social networks (PB, Sex, Facebook and Loc-Gowalla), one transportation network (USAir), one infrastructure network (Power), one citation network (Cit-HepPh), one road network (RoadNet-TX), one web graph (Web-Google) and three autonomous systems graphs (Router, AS-733 and AS-Skitter). Jazz 24 is a collaboration network of jazz musicians. The basic topological features of these networks (the number of nodes, the number of edges, the average degree, the clustering coefficient 33, the assortative coefficient 38 and the degree heterogeneity 39) are shown in Table 1.

Figure 1 illustrates the reverse process on a small network with $\epsilon=0.01$ (note that $\epsilon$ should be positive yet small enough; it is set to $10^{-7}$ in the later experiments). Initially, the network is empty. At the first time step, node $v_6$ is added because $\mathrm{cost}(v_6,1)$ is the smallest. At the second time step, only the three candidates $v_1$, $v_2$ and $v_3$ need to be compared.

Results
We apply two widely used metrics to evaluate the algorithms' performance. One is the robustness $R$ 41. Given a network, we remove one node at each time step and calculate the size of the largest component of the remaining network, until the remaining network is empty. The robustness $R$ is defined as 41

$$R=\frac{1}{N}\sum_{Q=1}^{N}s(Q),$$

where $s(Q)$ is the number of nodes in the largest component divided by $N$ after removing $Q$ nodes. The normalization factor $1/N$ ensures that the values of $R$ for networks of different sizes can be compared. Obviously, a smaller $R$ means a quicker collapse and thus a better performance. The other metric, denoted by $\rho_{\min}$, is the number of nodes that have to be removed to make the size of the largest component of the remaining network no more than $0.01N$. Again, a smaller $\rho_{\min}$ means a better performance.
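The robustness $R$ can be computed efficiently by replaying the removal sequence backwards: adding the nodes back with a union-find structure yields $s(Q)$ for every $Q$ in nearly linear time. The following sketch is our own illustration (the function name and interface are assumptions), taking an undirected edge list and a complete removal order over all $N$ nodes.

```python
from collections import defaultdict

def robustness(edge_list, removal_order):
    """R = (1/N) * sum_{Q=1..N} s(Q), with s(Q) the fraction of nodes in
    the largest component after removing the first Q nodes of the order."""
    adj = defaultdict(set)
    for u, v in edge_list:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    N = len(removal_order)
    parent, size = {}, {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    present, largest, s = set(), 0, []
    for v in reversed(removal_order):   # add nodes back, last removed first
        parent[v], size[v] = v, 1
        for u in adj[v]:
            if u in present:
                ru, rv = find(u), find(v)
                if ru != rv:
                    parent[ru] = rv
                    size[rv] += size[ru]
        present.add(v)
        largest = max(largest, size[find(v)])
        s.append(largest / N)           # fraction when len(present) nodes remain
    # s[k] corresponds to Q = N - k - 1 removals; Q = N contributes s = 0,
    # and s[-1] (Q = 0, nothing removed) is excluded from the sum
    return sum(s[:-1]) / N
```

For example, removing the hub of a four-node star first gives $s(1)=s(2)=s(3)=0.25$ and $s(4)=0$, hence $R=0.75/4=0.1875$.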
Here we use BC, CC, DC, H-index, KS, PR, CI, CI with reinsertion (CI+ for short), BPD, CoreHD and EI as the benchmark algorithms (see Methods for details of these algorithms). Tables 2 and 3 compare $R$ and $\rho_{\min}$ of RG and the benchmark algorithms, respectively. Notice that we use random removal (Random) as the background benchmark in order to show the improvement achieved by each method. Both BPD and CoreHD need a refinement process that inserts back some removed nodes: at each step, the increase of the largest component size caused by each candidate insertion is calculated, and the node giving the smallest increase is inserted. This process stops when the largest component reaches $0.01N$. These two methods do not provide a ranking of all nodes, and thus we cannot obtain $R$ for them. For EI, following ref. 23, we set $K=6$ and the number of candidates to 2000, so EI cannot be applied to networks with fewer than 2000 nodes (those cases are marked N/A in Tables 2-4). Each result of EI is obtained by averaging over 20 independent realizations. The radii of CI and CI+ are set to 3, except for AS-Skitter (radius 2), because its size is so large that a radius of 3 would lead to too much computation.
As shown in Table 2, subject to $R$, CI, CI+, EI and RG perform better than the classical centralities (e.g., BC, CC, DC, H-index, KS and PR) in almost all networks, and RG is overall the best method subject to $R$ for real networks. Figure 2 shows the collapsing processes of four representative networks resulting from node removal by RG and the benchmark algorithms. Obviously, RG leads to a faster collapse than all other algorithms, with CI+ being the second best. As shown in Table 3, subject to $\rho_{\min}$, BPD, CoreHD, EI and RG perform better than the classical centralities and CI/CI+ in almost all networks. For real networks, RG and BPD perform remarkably better than the other methods. For random networks, RG is among the best and much better than the classical centralities; however, RG is slightly worse than CI+ subject to $R$ and than BPD subject to $\rho_{\min}$. In random networks, the topological difference among nodes is relatively small, and thus RG may mistakenly select some influential nodes as unimportant at an early stage, leading to unsatisfactory results. Table 4 compares the CPU times of the 12 methods under consideration, from which one can see that RG is slower than DC, H-index, KS, PR and CoreHD, similar to CC and BPD, and faster than BC, CI/CI+ and EI. Generally speaking, RG is an efficient method.

Discussion
To our knowledge, most previous methods directly identify the critical nodes by looking at the effects of their removal 10. In contrast, our method first tries to find the least important nodes, so that the remaining ones are the critical nodes. To our surprise, such a simple idea eventually results in an efficient algorithm that outperforms many well-known benchmark algorithms. Beyond the percolation process considered in this paper, the reverse method provides a novel perspective that may find successful applications in other network-based optimization problems related to rankings of nodes or edges.
The performance of RG can be further improved by introducing some sophisticated techniques. For example, instead of degree, $f(i)$ can be chosen differently for different networks or set in a more complicated way to improve the algorithm's performance. In addition, the simple adoption of the greedy strategy may lead to local optima. This shortcoming can to some extent be overcome by applying beam search 42, which searches for the best set of $m$ nodes to add to the network that optimizes the cost function. The present algorithm is the special case $m=1$. Although beam search is still a kind of greedy strategy, it usually performs much better when $m$ is larger. At the same time, beam search with large $m$ incurs high time and space costs. Therefore, how to find a good tradeoff is also an open challenge in real practice.
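The beam-search variant can be sketched as follows. This is an illustrative toy under our own parameterization, not the authors' code: at every step the $m$ lowest-cost partial addition orders are kept instead of the single greedy best, and component sizes are recomputed by a plain depth-first search for clarity rather than speed.

```python
from collections import defaultdict

def largest_component(nodes_present, adj):
    """Size of the largest component of the subgraph induced by nodes_present."""
    seen, best = set(), 0
    for start in nodes_present:
        if start in seen:
            continue
        stack, count = [start], 0
        seen.add(start)
        while stack:
            x = stack.pop()
            count += 1
            for y in adj[x]:
                if y in nodes_present and y not in seen:
                    seen.add(y)
                    stack.append(y)
        best = max(best, count)
    return best

def beam_reverse_greedy(edge_list, f, m=2, eps=1e-7):
    """Keep the m lowest-cost partial node orderings at each step of the
    reverse process (m = 1 recovers the plain greedy RG)."""
    adj = defaultdict(set)
    nodes = set()
    for u, v in edge_list:
        nodes.update((u, v))
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    beam = [((), 0.0)]                  # (partial order, accumulated cost)
    for _ in range(len(nodes)):
        expanded = []
        for order, acc in beam:
            present = set(order)
            for i in nodes - present:
                step = largest_component(present | {i}, adj) + eps * f(i)
                expanded.append((order + (i,), acc + step))
        expanded.sort(key=lambda t: (t[1], t[0]))
        beam = expanded[:m]             # keep the m best partial solutions
    return list(beam[0][0])[::-1]       # most important first
```

With $m=1$ this reduces to the plain reverse greedy process; larger $m$ trades computation for a wider search of the cost landscape.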

Methods
Benchmark centralities. DC 11 of node $i$ is defined as $DC(i)=\sum_{j}a_{ij}$, where $A=\{a_{ij}\}$ is the adjacency matrix, that is, $a_{ij}=1$ if $i$ and $j$ are directly connected and $a_{ij}=0$ otherwise. H-index 12 of node $i$, denoted by $H(i)$, is defined as the maximal integer satisfying that there are at least $H(i)$ neighbors of node $i$ whose degrees are all no less than $H(i)$. This index is an extension of the famous H-index in scientific evaluation 43 to network analysis. KS 13 is implemented by the following steps. Firstly, remove all nodes with degree one, and keep deleting nodes until all remaining nodes have degrees larger than one; all of the removed nodes are assigned to the 1-shell. Then, recursively remove the nodes with degree no larger than two and assign them to the 2-shell. This procedure continues until all higher-layer shells have been identified and all network nodes have been removed.
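The H-index and k-shell definitions above translate directly into code. A minimal sketch (function names are ours, assuming an adjacency-set representation of the network):

```python
def h_index(adj, i):
    """H-index of node i: the largest h such that i has at least h
    neighbors whose degrees are all at least h."""
    degs = sorted((len(adj[j]) for j in adj[i]), reverse=True)
    h = 0
    for r, d in enumerate(degs, start=1):
        if d >= r:          # the r-th largest neighbor degree is still >= r
            h = r
    return h

def k_shell(adj):
    """k-shell decomposition: iteratively peel nodes of minimum residual
    degree; nodes removed while the threshold is k belong to the k-shell."""
    deg = {v: len(adj[v]) for v in adj}
    shell, remaining, k = {}, set(adj), 0
    while remaining:
        k = max(k, min(deg[v] for v in remaining))  # threshold never decreases
        peel = [v for v in remaining if deg[v] <= k]
        while peel:                                  # cascade removals
            v = peel.pop()
            if v not in remaining:
                continue
            shell[v] = k
            remaining.remove(v)
            for u in adj[v]:
                if u in remaining:
                    deg[u] -= 1
                    if deg[u] <= k:
                        peel.append(u)
    return shell
```

For a triangle with one pendant node, the pendant falls in the 1-shell while the triangle nodes form the 2-shell.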
PR 14 of node $i$ is defined as the solution of the equations

$$PR_i=\frac{1-s}{N}+s\sum_{j}\frac{a_{ij}}{k_j}PR_j,$$

where $k_j$ is the degree of node $j$ and $s$ is a free parameter controlling the probability of a random jump. In this paper, $s$ is set to 0.85. CC 15 of node $i$ is defined as

$$CC(i)=\frac{N-1}{\sum_{j\neq i}d_{ij}},$$

where $d_{ij}$ is the shortest distance between nodes $i$ and $j$. BC 16 of node $i$ is defined as

$$BC(i)=\sum_{s\neq i\neq t}\frac{g_{st}(i)}{g_{st}},$$

where $g_{st}$ is the number of shortest paths between nodes $s$ and $t$, and $g_{st}(i)$ is the number of those paths passing through node $i$. CI of node $i$ at radius $\ell$ is defined as

$$CI_\ell(i)=(k_i-1)\sum_{j\in\partial \mathrm{Ball}(i,\ell)}(k_j-1),$$

where $\mathrm{Ball}(i,\ell)$ is the set of nodes inside a ball of radius $\ell$ around node $i$, consisting of all nodes at distance no more than $\ell$ from $i$, and $\partial \mathrm{Ball}(i,\ell)$ is the frontier of this ball. CI+ is the version of CI with reinsertion. BPD 21 is rooted in the spin glass model for the feedback vertex set problem 44. At time $t$ of the iterative process, the algorithm estimates the probability $q_i^0(t)$ that each node $i$ of the remaining network $G(t)$ is suitable to be deleted. The explicit formula for this probability (see ref. 21) involves an adjustable reweighting parameter $x$ and, for each neighbor $j\in\partial i(t)$ (the set of node $i$'s neighbors at time $t$), two cavity probabilities: $q_{j\to i}^{0}(t)$, the probability that the neighboring node $j$ is suitable to be deleted if node $i$ is absent from the network $G(t)$, and $q_{j\to i}^{j}(t)$, the probability that node $j$ is suitable to be the root node of a tree component in the absence of node $i$. The node with the highest probability of being suitable for deletion is deleted from $G(t)$ along with all its edges. This node deletion process stops after all the loops in the network have been destroyed. Then the size of each tree component in the remaining network is checked; if a tree component is too large (which occurs only rarely), an appropriately chosen node from this tree, achieving the maximal decrease in tree size, is deleted. This node deletion process is repeated until all tree components are sufficiently small.
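As an example of the ball construction, $CI_\ell$ can be computed with a single breadth-first search from the focal node. The sketch below is our own helper (assuming an adjacency-set representation), not the reference implementation of ref. 20:

```python
from collections import deque

def collective_influence(adj, i, ell):
    """CI_ell(i) = (k_i - 1) * sum of (k_j - 1) over the frontier nodes j
    lying at distance exactly ell from node i."""
    dist = {i: 0}
    queue = deque([i])
    frontier = []
    while queue:
        x = queue.popleft()
        if dist[x] == ell:
            frontier.append(x)          # on the boundary of the ball
            continue
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    k_i = len(adj[i])
    return (k_i - 1) * sum(len(adj[j]) - 1 for j in frontier)
```

Note that leaves automatically get $CI_\ell=0$ because of the $(k_i-1)$ prefactor.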
CoreHD 22 simply removes the highest-degree nodes from the 2-core in an adaptive way (updating node degrees as the 2-core shrinks), until the remaining network becomes a forest. Then CoreHD breaks the trees into small components. In case the original network has many small loops, a refined dismantling set is obtained after reinsertion of nodes that do not increase (much) the size of the largest component.
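The adaptive 2-core removal at the heart of CoreHD can be sketched as follows. This is a simplified illustration (our own function name and tie-breaking) that stops once the 2-core is empty; the tree-breaking and reinsertion refinements described above are omitted.

```python
from collections import defaultdict

def corehd_attack(edge_list):
    """Adaptively remove the highest-degree node of the current 2-core
    until no 2-core remains, i.e. the remaining network is a forest."""
    adj = defaultdict(set)
    for u, v in edge_list:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    removed = []
    while True:
        # compute the 2-core by iteratively pruning nodes of degree < 2
        deg = {v: len(adj[v]) for v in adj}
        queue = [v for v in adj if deg[v] < 2]
        core = set(adj)
        while queue:
            v = queue.pop()
            if v not in core:
                continue
            core.remove(v)
            for u in adj[v]:
                if u in core:
                    deg[u] -= 1
                    if deg[u] < 2:
                        queue.append(u)
        if not core:
            break                       # remaining network is a forest
        # delete the highest-degree node of the 2-core (ties: smallest id)
        target = max(core, key=lambda v: (deg[v], -v))
        removed.append(target)
        for u in adj[target]:
            adj[u].discard(target)
        del adj[target]
    return removed
```

On a triangle with a pendant node, removing a single triangle node already destroys the 2-core.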
EI 23 selects $m$ candidate nodes from the set of absent nodes at each step and calculates the score $\sigma_i$ of each of them using the kernel

$$\sigma_i=k_i^{\mathrm{eff}}+\sum_{j\in N_i}\left(\sqrt{C_j}-1\right),$$

where $N_i$ represents the set of all connected components linked to node $i$, each of which has a size $C_j$, and $k_i^{\mathrm{eff}}$ is an effective degree attributed to each node (please see ref. 23 for the details). Then the nodes with the lowest scores are added to the network. This procedure continues until the size of the giant connected component exceeds a predefined threshold. The minus-one term in the kernel excludes any leaves connected to node $i$, since they do not contribute to the formation of the giant connected component and should be ignored.
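To make the kernel concrete, a toy scoring function might look as follows. Note the deliberate simplification: the recursive effective degree $k_i^{\mathrm{eff}}$ of ref. 23 is replaced here by the raw degree, so this sketch illustrates only the component-size term of the score, not the full EI algorithm.

```python
import math

def ei_score(adj, present, i):
    """Toy EI kernel: raw degree (standing in for k_i_eff) plus the sum of
    (sqrt(C_j) - 1) over the distinct components of `present` linked to i.
    Leaves (components of size 1) contribute sqrt(1) - 1 = 0."""
    comp_sizes, seen = [], set()
    for s in adj[i]:
        if s in present and s not in seen:
            comp, stack = {s}, [s]      # traverse one present component
            seen.add(s)
            while stack:
                x = stack.pop()
                for y in adj[x]:
                    if y in present and y not in comp:
                        comp.add(y)
                        seen.add(y)
                        stack.append(y)
            comp_sizes.append(len(comp))
    return len(adj[i]) + sum(math.sqrt(c) - 1 for c in comp_sizes)
```

For a star hub adjacent to two isolated present leaves, the component terms vanish and the score equals the hub's degree.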

Data availability
All relevant data are available at https://github.com/ChinaYiqun/network-data.