Super-Spreader Identification Using Meta-Centrality

Super-spreaders are the nodes of a network that can maximize their impact on other nodes, e.g., in the case of information spreading or virus propagation. Many centrality measures have been proposed to identify such nodes in a given network. However, it has been observed that the identification accuracy of those measures is not always satisfactory across different types of networks. In addition, the nodes identified by a single centrality measure are not always placed in the top section of the ranking generated by simulation, where the super-spreaders are supposed to be. In this paper we take a meta-centrality approach, combining different centrality measures with a modified version of the Borda count aggregation method. As a result, we are able to improve the performance of super-spreader identification for a broad range of real-world networks. While doing so, we discover a pattern in the centrality measures involved in the aggregation with respect to the topological structures of the networks used in the experiments. Further, we study the eigenvalues of the Laplacian matrix, also known as the Laplacian spectrum, and by using the Earth Mover's distance as a metric for the spectrum, we are able to identify four clusters that explain the aggregation results.


Terminologies and definitions
In what follows, $G(V, E)$ denotes an undirected, connected, weighted network, where $V$ is the set of nodes, $E \subseteq V \times V$ is the set of edges, and $w_{ij}$ is the weight of the edge $e(v_i, v_j)$. We denote by $A$ and $W$ the adjacency and weight matrices of $G$, where $A_{ij} = 1$ indicates the presence of the edge $e(v_i, v_j)$ and $W_{ij} = w_{ij}$ gives the weight of that connection.

Centrality measures
Degree and Strength The degree centrality is defined as the number of neighbors incident to a node, thus $C_D(i) = \sum_{j=1}^{|V|} A_{ij}$ is the degree of node $i$. A straightforward extension to the weighted case, also called strength [1], is given by $C_S(i) = \sum_{j=1}^{|V|} A_{ij} w_{ij}$, that is, the sum of the weights of the edges incident to $i$.
Betweenness and Closeness Betweenness and closeness centralities both make use of shortest paths. Betweenness measures the information flow through a node as the fraction of the shortest paths between each pair of nodes that pass through it. Thus, the betweenness of node $i$ is defined as $C_B(i) = \sum_{s \neq t} \sigma_{st}(i) / \sigma_{st}$, where $\sigma_{st}(i)$ is the number of shortest paths between $s$ and $t$ that pass through $i$, and $\sigma_{st}$ is the total number of shortest paths between $s$ and $t$. The closeness of node $i$ is defined as the reciprocal of the sum of the distances between $i$ and all the other nodes in the network. Formally, $C_C(i) = \left[ \sum_{z} d(i, z) \right]^{-1}$, where $d(i, z)$ is the shortest path distance between $i$ and $z$. A natural weighted version of both measures is obtained by using weighted shortest paths; $C^w_B$ and $C^w_C$ denote the weighted betweenness and closeness, respectively.
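The four definitions above can be tried on a toy network. The following is a minimal sketch using NetworkX, which the paper names as its tool; the example graph and the helper `closeness` are illustrative assumptions, not the authors' code. Edge weights are treated as distances when computing weighted shortest paths, which is how NetworkX interprets them, and closeness is computed directly from the formula because `nx.closeness_centrality` applies an extra normalization.

```python
import networkx as nx

# Toy weighted, undirected, connected network (hypothetical data).
G = nx.Graph()
G.add_weighted_edges_from([
    ("a", "b", 1.0), ("b", "c", 2.0), ("c", "d", 1.0), ("b", "d", 3.0),
])

# C_D: number of incident neighbors; C_S: sum of incident edge weights.
degree = dict(G.degree())
strength = dict(G.degree(weight="weight"))

# C_B and its weighted variant. With normalized=False, NetworkX counts
# each unordered pair (s, t) once on undirected graphs.
betweenness = nx.betweenness_centrality(G, normalized=False)
betweenness_w = nx.betweenness_centrality(G, normalized=False, weight="weight")


def closeness(G, weight=None):
    """C_C(i) = 1 / sum_z d(i, z), following the definition in the text."""
    result = {}
    for i in G:
        # Distances from i to every reachable node (d(i, i) = 0 is included).
        dist = nx.shortest_path_length(G, source=i, weight=weight)
        result[i] = 1.0 / sum(dist.values())
    return result


closeness_unweighted = closeness(G)
closeness_weighted = closeness(G, weight="weight")
```

Whether a large weight should count as a short or a long distance depends on the data set; the sketch takes the weights as distances without transformation.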
Eigenvector and PageRank Eigenvector centrality and PageRank use the scores of a node's neighbours to calculate its importance. The eigenvector centrality of node $i$ is defined as $C_E(i) = \lambda^{-1} \sum_{j} A_{ij} C_E(j)$. More formally, the eigenvector centrality is the solution of the equation $Ae = \lambda e$, where $e$ is an eigenvector of the adjacency matrix $A$ and $\lambda$ is a positive eigenvalue (its existence is guaranteed by the Perron-Frobenius theorem [2]). PageRank centrality has been used by Google to rank web pages in its search engine. It was originally designed for directed graphs. The PageRank of node $i$ is defined as $C_P(i) = \frac{1 - d}{|V|} + d \sum_{j} A_{ij} \frac{C_P(j)}{\deg(j)}$, where $d$ is a damping factor (conventionally fixed to 0.85) and $\deg(j)$ is the degree of node $j$. For both measures, the weighted versions, $C^w_E$ and $C^w_P$, are obtained by using the weighted adjacency matrix $A^w$. For completeness, the two centralities are calculated using the power iteration method [3].
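As an illustration of the power iteration mentioned above, the following is a minimal NumPy sketch for the eigenvector centrality. The function name, tolerance, and iteration cap are assumptions made for this example. The iteration uses $A + I$ rather than $A$; this shift is an addition of the sketch that leaves the eigenvectors unchanged but avoids oscillation on bipartite networks.

```python
import numpy as np


def eigenvector_centrality(A, tol=1e-10, max_iter=1000):
    """Leading eigenvector of a symmetric non-negative matrix A.

    Hypothetical helper: a plain power iteration on A + I, normalized
    to unit Euclidean length at every step.
    """
    n = A.shape[0]
    e = np.ones(n) / n
    for _ in range(max_iter):
        e_next = A @ e + e          # one multiplication by (A + I)
        e_next /= np.linalg.norm(e_next)
        if np.linalg.norm(e_next - e) < tol:
            return e_next
        e = e_next
    raise RuntimeError("power iteration did not converge")


# Path network on three nodes: the middle node should score highest.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
e = eigenvector_centrality(A)
lam = e @ A @ e   # Rayleigh quotient, an estimate of the leading eigenvalue
```

Passing the weighted adjacency matrix instead of $A$ gives the weighted variant under the same assumptions.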
K-shell The K-shell method is based on recursive pruning. The algorithm starts by pruning all the nodes with degree $k = 1$. This first pruning may leave other nodes with degree one, so the process continues until no node with degree one remains. All the removed nodes form the 1-shell and are labeled $K_s = 1$. The pruning and labelling procedure is then repeated for $k \geq 2$ until every node is assigned to its shell.
The weighted version [4] uses not only the degree as the pruning rule but also, for each node $i$, the strength of the connections with its neighbours. Specifically, the weighted degree is $k'_i = \sqrt{k_i \sum_{j} w_{ij}}$, where $k_i$ is the degree of $i$ and $\sum_{j} w_{ij}$ is the sum of the weights of all its incident links. To follow the previous notation, $C_K$ denotes the conventional K-shell and $C_{KW}$ its weighted version.
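The pruning procedure for the unweighted K-shell can be sketched as follows. This is an illustrative implementation written for this text (the function name `k_shell` is hypothetical) and covers only the conventional, unweighted case; on networks without isolated nodes it should agree with NetworkX's `core_number`.

```python
import networkx as nx


def k_shell(G):
    """Assign each node its shell index K_s by recursive pruning."""
    H = G.copy()          # work on a copy so the input network is untouched
    shell = {}
    k = 1
    while H.number_of_nodes() > 0:
        # Keep pruning at level k until no node of degree <= k is left,
        # since removing nodes can lower the degree of their neighbours.
        while True:
            to_prune = [v for v, d in H.degree() if d <= k]
            if not to_prune:
                break
            for v in to_prune:
                shell[v] = k
            H.remove_nodes_from(to_prune)
        k += 1
    return shell


# Triangle (1, 2, 3) with a pendant node 4 attached to node 3.
G = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4)])
shells = k_shell(G)
```

In this toy network the pendant node falls in the 1-shell and the triangle forms the 2-shell.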

Spearman correlation
Spearman rank-order correlation is a non-parametric measure that quantifies the correlation between two rankings. Let $\tau_k$ and $\tau_t$ be two rankings of a set $C$ with $|C| = n$, and let $\tau_k^{(i)}$ be the position of item $i$ in the ranking $\tau_k$ (and likewise for $\tau_t$). The Spearman correlation $\rho$ is defined as $\rho = 1 - \frac{6 \sum_{i=1}^{n} \left( \tau_k^{(i)} - \tau_t^{(i)} \right)^2}{n (n^2 - 1)}$. A value of 1 indicates that the two rankings have a perfect monotonic relation, while a value of 0 implies no correlation. Note that when the rankings contain ties, as sometimes happens in our case, the formula is applied with tied items assigned the average of their positions [5].
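A short sketch of this computation with SciPy, the library the paper names for this step, is given below. The score vectors are made-up data and `spearman_no_ties` is a hypothetical helper. The sketch checks the closed-form expression on rankings without ties and, for the tied case, compares `scipy.stats.spearmanr` with the Pearson correlation of the average ranks, which is how SciPy handles ties.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr


def spearman_no_ties(pos_k, pos_t):
    """Closed-form Spearman correlation for two rankings without ties."""
    n = len(pos_k)
    d2 = sum((a - b) ** 2 for a, b in zip(pos_k, pos_t))
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))


# Rankings without ties: the closed form and SciPy should agree.
pos_k = [1, 2, 3, 4, 5]
pos_t = [2, 1, 4, 3, 5]
rho_formula = spearman_no_ties(pos_k, pos_t)
rho_scipy, _ = spearmanr(pos_k, pos_t)

# Scores with ties (hypothetical centrality and simulation values).
scores_a = [0.9, 0.7, 0.7, 0.1, 0.4]
scores_b = [10, 8, 6, 1, 6]
rho_ties, _ = spearmanr(scores_a, scores_b)

# Same quantity by hand: Pearson correlation of the average ranks.
ranks_a = rankdata(scores_a)   # default method assigns average ranks to ties
ranks_b = rankdata(scores_b)
rho_manual = np.corrcoef(ranks_a, ranks_b)[0, 1]
```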

Simulation ranking details
Algorithm 1 presents the pseudocode of the procedure that assigns a spreading power value to each node through the Susceptible-Infected (SI) simulation. In the algorithm, AVG(x) is the average value of the vector x, and SIM(G) represents a single SI simulation run on the network G. The output of the latter is a vector containing the ratio of infected nodes in the network at each time step of the simulation run. The length of these vectors may vary among the 100 runs, since each simulation stops when all the network nodes are infected. Averaging over multiple runs is chosen because different simulation runs starting from the same infected node can produce different outcomes.

Algorithm 1: SI simulation ranking
Data: G = (V, E)
Result: A list R of nodes, with the spreading value
Initialize R as an empty vector
Set v as the infected seed
Set all the nodes as susceptible
[...]
end
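One possible reading of the procedure described above can be sketched in Python. This is not the authors' algorithm: the infection probability `beta`, the synchronous update rule, the use of unweighted edges, and the function names are all assumptions introduced for illustration. Each run plays the role of SIM(G), returning the ratio of infected nodes at every time step until the whole network is infected, and the spreading value of a seed is the mean over runs of the per-run average.

```python
import random
import networkx as nx


def si_run(G, seed_node, beta, rng):
    """One SI run: ratios of infected nodes per time step (assumed rule)."""
    n = G.number_of_nodes()
    infected = {seed_node}            # every other node is susceptible
    ratios = [len(infected) / n]
    while len(infected) < n:          # stops once all nodes are infected
        newly = set()
        for u in infected:
            for w in G.neighbors(u):
                if w not in infected and rng.random() < beta:
                    newly.add(w)
        infected |= newly
        ratios.append(len(infected) / n)
    return ratios


def si_ranking(G, runs=100, beta=0.5, seed=0):
    """Rank nodes by the mean over runs of AVG(SIM(G)), highest first."""
    rng = random.Random(seed)
    R = {}
    for v in G:
        per_run = []
        for _ in range(runs):
            x = si_run(G, v, beta, rng)
            per_run.append(sum(x) / len(x))     # AVG(x) for this run
        R[v] = sum(per_run) / runs
    return sorted(R.items(), key=lambda kv: kv[1], reverse=True)


# Star network: the hub (node 0) should be the strongest spreader.
ranking = si_ranking(nx.star_graph(4), runs=10, beta=1.0)
```

The sketch assumes a connected network and `beta > 0`; otherwise a run would never reach full infection.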

SI evaluation
Instead of using the average time needed to infect the whole network to measure the spreading power of each node, in our current study we adopt a different measurement that incorporates more information about the spreading process. Specifically, we use the average number of infected nodes among the simulations to capture the spreading dynamics. The new measurement is strongly correlated with the average time of full network infection, but at the same time it also reflects the speed of the infection propagation. The measurement yields a more even distribution of values, and is thus a better characterization of the spreading power.
In the literature, various epidemic models have been used to tackle the problem of super-spreader identification. Two of the most commonly used are the Susceptible-Infected (SI) and Susceptible-Infected-Recovered (SIR) models. In our current study, we have mainly focused on the SI model in our experiments, as we believe it allows us to adequately characterize the nature of network-based disease propagation. In order to evaluate whether the proposed measurement reflects the average time of full network infection and improves the representation of the nodes' spreading power, we use all the networks from our data-sets. We run 100 realizations of the SI simulation starting from each node, and we record the average time of full network infection. To simplify our notation, we refer to the average number of infected nodes (i.e., our measurement) as $AVG_I$, and to the average time of full network infection as $AVG_T$. We then compare the results obtained from the two measurements using the Spearman correlation coefficient. In all networks we find a very high correlation value, except for the Adolescent network, where the correlation is slightly lower. The exact correlation values can be found in Figs 1-4, where we show binned scatter plots of $AVG_I$ against $AVG_T$. Furthermore, we also show the standard deviation (std) values of the two measurements. The std values are high in all networks, and they are similar in the Astro-ph, AS, and Metro networks. More detailed results are given in Table 1.

In the preceding section, we have presented the results of our evaluation based on the SI epidemic model. Nevertheless, it is also desirable and interesting to evaluate our method using the Susceptible-Infected-Recovered (SIR) model, where another state (i.e., Recovered) is added to represent nodes that will no longer spread the infection.
In the case of the SIR model, we can characterize the spreading power of a node by averaging the number of Recovered nodes [6] at the end of the epidemic spreading.
In order to evaluate the generality of our method, we test it in all the networks of our data-sets. We run 100 SIR simulations starting from each node. When a node spreads the infection to its neighbors, it will change its state to Recovered. We record the average number of Recovered nodes at the end of each simulation. Using this as the ground truth ranking for the nodes in the network, we test our method. The aggregated results are found to have the overall best predictions about super-spreaders for the tested networks. More detailed results are given in Table 2 and Figs 9-12 of Section 3.
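The SIR variant described above can be sketched in the same spirit. As before, the infection probability `beta`, the synchronous update, and the function names are assumptions of this example rather than details given in the text. A node tries once to infect each susceptible neighbor and then becomes Recovered, and the spreading power of a seed is the average number of Recovered nodes at the end of the runs.

```python
import random
import networkx as nx


def sir_run(G, seed_node, beta, rng):
    """One SIR run; returns the number of Recovered nodes at the end."""
    infected = {seed_node}
    recovered = set()
    while infected:
        newly = set()
        for u in infected:
            for w in G.neighbors(u):
                susceptible = w not in infected and w not in recovered
                if susceptible and rng.random() < beta:
                    newly.add(w)
        recovered |= infected     # nodes recover after one spreading attempt
        infected = newly
    return len(recovered)


def sir_spreading_power(G, runs=100, beta=0.5, seed=0):
    """Average final number of Recovered nodes for every possible seed."""
    rng = random.Random(seed)
    return {
        v: sum(sir_run(G, v, beta, rng) for _ in range(runs)) / runs
        for v in G
    }


# Sanity checks on a path network with the two extreme probabilities.
path = nx.path_graph(5)
power_all = sir_spreading_power(path, runs=5, beta=1.0)
power_none = sir_spreading_power(path, runs=5, beta=0.0)
```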

Computational tools
For the calculation of the centrality measures and for all the simulations, we used NetworkX [7], a Python library for network manipulation, except for Expected Force, for which we used the R code provided by its author. To calculate the Spearman correlation, we used the built-in function of the SciPy [8] library, which handles rankings with ties. For the calculation of the eigenvalues of the Laplacian matrix, we used the sparse matrix class of the SciPy library [8,9]. As for visualization, we used matplotlib [10] in combination with the seaborn [11] library.
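The spectrum comparison can be illustrated with the same libraries. The following sketch is an assumption about how the pieces might fit together, not the authors' pipeline: it converts the Laplacian to a dense array, which is only practical for small networks, and it uses `scipy.stats.wasserstein_distance` as the one-dimensional Earth Mover's distance between two sets of eigenvalues. The function names are hypothetical.

```python
import networkx as nx
import numpy as np
from scipy.stats import wasserstein_distance


def laplacian_spectrum(G):
    """Eigenvalues of the Laplacian matrix of G, in ascending order."""
    # Dense conversion keeps the sketch short; large networks would need
    # a sparse eigenvalue solver instead.
    L = nx.laplacian_matrix(G).toarray().astype(float)
    return np.linalg.eigvalsh(L)


def spectrum_distance(G1, G2):
    """Earth Mover's distance between two Laplacian spectra."""
    return wasserstein_distance(laplacian_spectrum(G1), laplacian_spectrum(G2))


path = nx.path_graph(4)
complete = nx.complete_graph(4)
d_same = spectrum_distance(path, nx.path_graph(4))
d_diff = spectrum_distance(path, complete)
```

A matrix of such pairwise distances is the kind of input a clustered heat map would take.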

Detailed results
In what follows, we present detailed results of the proposed solutions. Tables 1 and 2 show the best single centrality measures among different values of f and the average mean improvements obtained with the aggregated solutions, using the SI and SIR models, respectively. In the figures that immediately follow each of the tables, we display different values of f on the x-axes to show how the recognition factor (y-axes) changes; each figure shows four of the networks used in our experiments. The last figure in this section shows the reordered heat map of the spectrum pair distance matrix (i.e., the cluster-map) and the histogram plots of the Laplacian spectrum, with eigenvalues on the x-axis and their frequencies on the y-axis (i.e., the spectrum plots).
Figure 9: Different values of f on the x-axes to show how the recognition factor (y-axes) changes, using SIR as the spreading model. The following data-sets are evaluated: Astro-ph (a), C. Elegans (b), Rail (c), and US2013 (d).