Eigencentrality based on dissimilarity measures reveals central nodes in complex networks

One of the most important problems in complex network theory is locating the entities that are essential or play a main role within the network. For this purpose, we propose using dissimilarity measures (drawn from classification theory and data mining) to enrich centrality measures in complex networks. The centrality method used is eigencentrality, which is based on the heuristic that the centrality of a node depends on how central the nodes in its immediate neighbourhood are (akin to the rich-get-richer phenomenon). This can be described as an eigenvalue problem; however, the information about the neighbourhood and the connections between neighbours is not taken into account, which neglects their relevance when one evaluates the centrality, importance or influence of a node. The contribution calculated by the dissimilarity measure is parameter independent, which makes the proposed method parameter independent as well. Finally, we perform a comparative study of our method versus other methods reported in the literature, obtaining more accurate and computationally less expensive results in most cases.


Supplementary Information
Given two nodes i and j, we are interested in how dissimilar they are, which we assess by comparing their respective neighbourhoods. Two nodes with the same neighbourhood have dissimilarity equal to zero, whereas two nodes with very different neighbourhoods have a dissimilarity greater than zero (in this work we use normalized dissimilarities, so the maximum value is 1). We call the measures that allow this comparison structural dissimilarities. Note that, given a node i, we can weigh the local importance of each node in the neighbourhood of i through its dissimilarity level. The greater the dissimilarity between node i and a node j, the greater the number of new contacts that i will reach through j, so j allows i to spread information, viruses, opinions, rumours, signals, etc., beyond its immediate neighbourhood, i.e., j "opens the world" to i.
If we have a normalized similarity measure S(·, ·), we can build a normalized dissimilarity measure D(·, ·) by taking D(·, ·) = 1 − S(·, ·). In the following, we test a set of structural dissimilarities popular in the literature.

Jaccard Dissimilarity
Jaccard similarity [8] is defined as the ratio of the number of nodes shared by the neighbourhoods of i and j to the number of nodes in the union of both neighbourhoods. Using the inclusive neighbourhood $V^+(i) = V(i) \cup \{i\}$ instead of the simple neighbourhood $V(i)$ is necessary: when two nodes are connected to each other and to no other node, the inclusive neighbourhoods make them completely similar, whereas the simple neighbourhoods fail to capture the direct connection and produce a dissimilarity equal to 1. Mathematically, Jaccard dissimilarity is defined by equation (1) as [7]:

$$D_J(i,j) = 1 - \frac{|V^+(i) \cap V^+(j)|}{|V^+(i) \cup V^+(j)|} \tag{1}$$
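The role of the inclusive neighbourhood can be checked on the two-node case described above. The following is a minimal sketch, assuming the network is stored as a dictionary that maps each node to its set of neighbours; the function names and the `pair` graph are illustrative and not taken from the paper.

```python
def inclusive_neighbourhood(adj, i):
    """Return V+(i) = V(i) ∪ {i} for an adjacency dict {node: set of neighbours}."""
    return adj[i] | {i}


def jaccard_dissimilarity(adj, i, j, inclusive=True):
    """Jaccard dissimilarity between the neighbourhoods of nodes i and j."""
    a = inclusive_neighbourhood(adj, i) if inclusive else adj[i]
    b = inclusive_neighbourhood(adj, j) if inclusive else adj[j]
    union = a | b
    if not union:
        # Only possible with simple neighbourhoods of two isolated nodes;
        # treating them as identical is a choice made for this sketch.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical network: two nodes connected to each other and to nothing else.
pair = {"u": {"v"}, "v": {"u"}}
d_inclusive = jaccard_dissimilarity(pair, "u", "v", inclusive=True)
d_simple = jaccard_dissimilarity(pair, "u", "v", inclusive=False)
print(d_inclusive, d_simple)
```

With inclusive neighbourhoods both nodes see the set {u, v}, so they are fully similar; with simple neighbourhoods the two sets are disjoint and the dissimilarity is maximal.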

Meet/Min Dissimilarity or Topological Overlapping
Topological Overlapping [8] or Meet/Min similarity [9] is defined as the ratio of the number of nodes shared by the neighbourhoods of nodes i and j to the number of nodes in the smaller of the two neighbourhoods. Mathematically, the associated dissimilarity is

$$D_{MM}(i,j) = 1 - \frac{|V^+(i) \cap V^+(j)|}{\min\{|V^+(i)|,\, |V^+(j)|\}}$$

Although this measure is mathematically similar to Jaccard dissimilarity, its behaviour is generally quite different. If the inclusive neighbourhood of one node is entirely contained in the inclusive neighbourhood of another, the two nodes have zero dissimilarity, which is not true for Jaccard dissimilarity. This shows the importance of the normalization factor when calculating the dissimilarity between two nodes in a network.
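The containment property can be illustrated on a small case. The sketch below assumes the Meet/Min dissimilarity is one minus the shared count divided by the size of the smaller inclusive neighbourhood, as described in the text; the star graph and the function names are hypothetical.

```python
def meet_min_dissimilarity(a, b):
    """1 - |a ∩ b| / min(|a|, |b|) for two non-empty inclusive neighbourhoods."""
    return 1.0 - len(a & b) / min(len(a), len(b))


def jaccard_dissimilarity(a, b):
    """1 - |a ∩ b| / |a ∪ b| for two non-empty inclusive neighbourhoods."""
    return 1.0 - len(a & b) / len(a | b)


# Hypothetical star: hub 0 linked to leaves 1, 2 and 3.
hub = {0, 1, 2, 3}   # V+(0)
leaf = {0, 1}        # V+(1), entirely contained in V+(0)

d_meet_min = meet_min_dissimilarity(hub, leaf)
d_jaccard = jaccard_dissimilarity(hub, leaf)
print(d_meet_min, d_jaccard)
```

Meet/Min regards the hub and the leaf as fully similar, while Jaccard still separates them, which is the permissiveness of the normalization factor that the text points out.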

Geometric Dissimilarity
Given the nodes i and j in the network, the geometric similarity [8] between them is the ratio of the square of the number of nodes shared by the neighbourhoods of i and j to the product of the number of nodes in each neighbourhood. Mathematically, the dissimilarity is

$$D_G(i,j) = 1 - \frac{|V^+(i) \cap V^+(j)|^2}{|V^+(i)|\,|V^+(j)|}$$

Note that both the numerator and the denominator differ from those of the two measures above.

Sørensen-Dice Dissimilarity
Given two nodes i and j, the Sørensen-Dice dissimilarity (SD) [10] between these nodes is given by

$$D_{SD}(i,j) = 1 - \frac{2\,|V^+(i) \cap V^+(j)|}{|V^+(i)| + |V^+(j)|}$$

with range between 0 and 1. Again, this measure resembles Jaccard but behaves differently; in fact, one can show that it is not a metric, since it does not satisfy the triangle inequality, unlike Jaccard dissimilarity.
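A concrete counterexample makes the failure of the triangle inequality visible. The sketch below assumes the usual Sørensen-Dice form, one minus twice the shared count over the sum of the neighbourhood sizes, and uses a hypothetical three-node path a - c - b with inclusive neighbourhoods. Exact fractions are used to avoid rounding effects in the comparison.

```python
from fractions import Fraction


def sorensen_dice(a, b):
    """Sørensen-Dice dissimilarity 1 - 2|a ∩ b| / (|a| + |b|), as an exact fraction."""
    return 1 - Fraction(2 * len(a & b), len(a) + len(b))


def jaccard(a, b):
    """Jaccard dissimilarity 1 - |a ∩ b| / |a ∪ b|, as an exact fraction."""
    return 1 - Fraction(len(a & b), len(a | b))


# Inclusive neighbourhoods on the hypothetical path a - c - b.
v_a = {"a", "c"}
v_b = {"b", "c"}
v_c = {"a", "b", "c"}

sd_direct = sorensen_dice(v_a, v_b)                          # 1/2
sd_via_c = sorensen_dice(v_a, v_c) + sorensen_dice(v_c, v_b)  # 1/5 + 1/5
j_direct = jaccard(v_a, v_b)                                 # 2/3
j_via_c = jaccard(v_a, v_c) + jaccard(v_c, v_b)              # 1/3 + 1/3

print(sd_direct > sd_via_c)   # triangle inequality violated for Sørensen-Dice
print(j_direct <= j_via_c)    # triangle inequality holds for Jaccard here
```

A single example cannot prove that Jaccard is a metric; it only shows that the same configuration that breaks Sørensen-Dice does not break Jaccard.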

Maryland Bridge Dissimilarity
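The Maryland Bridge dissimilarity (MB) [11] has been used mainly in the genome classification problem and is defined as

$$D_{MB}(i,j) = 1 - \frac{1}{2}\left(\frac{|V^+(i) \cap V^+(j)|}{|V^+(i)|} + \frac{|V^+(i) \cap V^+(j)|}{|V^+(j)|}\right) \tag{6}$$

Note that the similarity associated with (6) measures the average proportion of the overlap between the respective neighbourhoods.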
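As a small illustration, the sketch below assumes the Maryland Bridge similarity is the mean of the two overlap proportions (the shared count divided by the size of each inclusive neighbourhood), which is how the text describes it. The star graph and the function name are hypothetical.

```python
def maryland_bridge_dissimilarity(a, b):
    """1 minus the average of |a ∩ b| / |a| and |a ∩ b| / |b| for non-empty sets."""
    shared = len(a & b)
    similarity = 0.5 * (shared / len(a) + shared / len(b))
    return 1.0 - similarity


# Hypothetical star: hub 0 linked to leaves 1, 2 and 3.
hub = {0, 1, 2, 3}   # V+(0)
leaf = {0, 1}        # V+(1)

d_mb = maryland_bridge_dissimilarity(hub, leaf)
print(d_mb)
```

On this pair half of the hub's neighbourhood and all of the leaf's neighbourhood are shared, so the average overlap is 0.75 and the dissimilarity is 0.25.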

Czekanovski-Dice Dissimilarity
The Czekanovski-Dice dissimilarity (CD) [12] is defined in terms of the symmetric difference between the inclusive neighbourhoods of the two nodes. It has been used mainly in protein-protein interaction data mining.

Korbel Dissimilarity
Given two nodes i and j, the Korbel dissimilarity is defined following [13]. Korbel used this measure to take advantage of the information in complete genomes and classify species in a phylogeny.

Other possible measures
There are other dissimilarity measures that are not explicitly based on the neighbourhoods of the nodes being compared. For example, we can (i) compare the dynamical behaviour of nodes in a system of coupled oscillators (in general, coupled maps) on the network under study, or (ii) consider the dissimilarity of the paths travelled by random walks of k steps starting from nodes i and j, respectively. In the following, we discuss these dissimilarity measures.

A Dynamical Dissimilarity
Consider a system of coupled maps on the network, specified by the adjacency matrix A and a coupling strength. Given a random initial condition (initial state vector), letting the system evolve yields a time series for each node. We can measure how similar these time series are (and therefore the associated nodes) using, e.g., the Pearson, Spearman or Kendall correlation coefficients [14] or normalized mutual information [15]. We propose to quantify the relationship between the dynamics of two nodes using the Pearson correlation coefficient, given by

$$\rho_{ij} = \frac{\operatorname{cov}\big(W_i(x_0,t_1,t_2),\, W_j(x_0,t_1,t_2)\big)}{\sigma\big(W_i(x_0,t_1,t_2)\big)\,\sigma\big(W_j(x_0,t_1,t_2)\big)} \tag{11}$$

where $W_i(x_0, t_1, t_2)$ represents an observation window in the time series associated with node i, with initial condition $x_0$ and taking values between the time steps $t_1$ and $t_2$, and $\sigma(W_i(x_0, t_1, t_2))$ is the standard deviation of $W_i(x_0, t_1, t_2)$. The correlation coefficient in (11) ranges over the interval [−1, 1] and characterizes the behaviour of one time series relative to the other as follows:
1. If $\rho_{ij} = 1$, there is a perfect positive correlation. The index indicates a total dependency between the two time series, known as a direct relationship: when one increases, so does the other at a constant rate.
2. If $0 < \rho_{ij} < 1$, there is a positive correlation, i.e., the time series have similar but not identical behaviour.
3. If $\rho_{ij} = 0$, there is no linear relationship. This does not necessarily imply that the variables are independent: nonlinear relationships may exist between the two variables.
4. If $-1 < \rho_{ij} < 0$, there is a negative correlation, i.e., the time series tend to move in opposite directions, but not perfectly.
5. If $\rho_{ij} = -1$, there is a perfect negative correlation. The index indicates a total dependence between the two variables, called an inverse relationship: when one increases, the other decreases at a constant rate.
To construct a dissimilarity measure between nodes, we propose

$$D_{ij} = \frac{1 - \rho_{ij}}{2}$$

which measures how dissimilar two nodes are through the level of correlation between the associated time series. Note that $D_{ij} = 0$ if the time series of the corresponding nodes are perfectly positively correlated and $D_{ij} = 1$ if there is a perfect negative correlation between the time series.
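The mapping from correlation to dissimilarity can be sketched in a few lines. The code below takes D = (1 − ρ)/2, which is the affine map consistent with the two endpoint values stated above (D = 0 at ρ = 1 and D = 1 at ρ = −1). The coupled-map form in `simulate` (logistic maps coupled through the neighbour average) is an assumption made only to produce time series; it is not taken from the source, and every name and parameter value is illustrative.

```python
import math
import random


def logistic(x, r=4.0):
    """Logistic map, used here as an assumed local dynamics."""
    return r * x * (1.0 - x)


def simulate(adj, eps, steps, seed=0):
    """Assumed coupled-map form (not from the source):
    x_i(t+1) = (1 - eps) f(x_i(t)) + eps * mean of f(x_j(t)) over neighbours j.
    Every node must have at least one neighbour."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    x = {i: rng.random() for i in nodes}
    series = {i: [] for i in nodes}
    for _ in range(steps):
        fx = {i: logistic(x[i]) for i in nodes}
        x = {i: (1.0 - eps) * fx[i] + eps * sum(fx[j] for j in adj[i]) / len(adj[i])
             for i in nodes}
        for i in nodes:
            series[i].append(x[i])
    return series


def pearson(u, v):
    """Pearson correlation of two equal-length series with non-zero variance."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def dynamical_dissimilarity(u, v):
    """D = (1 - rho) / 2, so rho = 1 gives 0 and rho = -1 gives 1."""
    return (1.0 - pearson(u, v)) / 2.0


# Hypothetical path 0 - 1 - 2; discard a transient, then compare windows.
path = {0: {1}, 1: {0, 2}, 2: {1}}
series = simulate(path, eps=0.3, steps=200)
d_02 = dynamical_dissimilarity(series[0][100:], series[2][100:])
print(d_02)
```

The observation window (steps 100 to 199 here) plays the role of the interval between t_1 and t_2; its length and the coupling strength are free choices in this sketch.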

A Random Walk Dissimilarity
Given two nodes in a network, we can try to measure their dissimilarity by comparing the paths travelled by a uniform random walker that starts from each node and is left to evolve as many times as necessary. Although this strategy has been used to locate central nodes in networks [26,27], mainly with the frequency of visits as the centrality criterion, in this paper we propose using measures such as the Pearson coefficient, mutual information or set-theoretic dissimilarity measures over the trajectories in order to feed centrality methods in complex networks, particularly the one proposed in our work. It is noteworthy that this strategy has also been widely exploited for community detection [28,29]. Dynamical and random walk dissimilarity measures require a deeper study of their behaviour, because their high computational cost and the number of realizations required could make them impractical for big data analysis over networks; nevertheless, we propose them as interesting alternatives to structural dissimilarities.
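One possible reading of this idea is sketched below: run several uniform random walks of k steps from each node, collect the sets of visited nodes, and compare the two sets with a set-theoretic (here Jaccard) dissimilarity. This is only an illustrative assumption about how the trajectories might be compared; the function names, the number of walks and the test graph are hypothetical.

```python
import random


def visited_nodes(adj, start, steps, walks, rng):
    """Union of the nodes visited by `walks` uniform random walks of `steps`
    steps that start at `start`. Every node must have at least one neighbour."""
    seen = {start}
    for _ in range(walks):
        node = start
        for _ in range(steps):
            # Sorting makes the draw reproducible for a fixed seed.
            node = rng.choice(sorted(adj[node]))
            seen.add(node)
    return seen


def random_walk_dissimilarity(adj, i, j, steps=3, walks=200, seed=0):
    """Jaccard dissimilarity between the sets of nodes reached from i and from j."""
    rng = random.Random(seed)
    a = visited_nodes(adj, i, steps, walks, rng)
    b = visited_nodes(adj, j, steps, walks, rng)
    return 1.0 - len(a & b) / len(a | b)


# Hypothetical network: two disconnected triangles {0, 1, 2} and {3, 4, 5}.
triangles = {0: {1, 2}, 1: {0, 2}, 2: {0, 1},
             3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
d_same = random_walk_dissimilarity(triangles, 0, 1)
d_apart = random_walk_dissimilarity(triangles, 0, 3)
print(d_same, d_apart)
```

Walkers that start in different components can never share a visited node, so their dissimilarity is maximal, while walkers from the same triangle quickly cover the same nodes. The cost grows with both the number of walks and their length, which is the practical concern raised in the text.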

Comparative analysis of the structural dissimilarity measures
In order to test the performance of each structural dissimilarity measure presented here, we use the Les Miserables network as a benchmark. Table 1 shows the ranking obtained by our method using each of the structural dissimilarity measures. Note that most measures lead to similar rankings; however, no two are equal, as can be seen by simple inspection of Table 1. From our point of view, the Meet/Min dissimilarity produced the worst results. We think the reason is that its normalization factor is very permissive, considering two nodes fully similar when one of the inclusive neighbourhoods is totally contained in the other. Furthermore, the Meet/Min and Korbel dissimilarities do not detect Javert as the second most important character in the network of Les Miserables, unlike the rest of the dissimilarity measures presented here.

Datasets
In the following, we use as benchmarks a set of network datasets in which the relevance of each node is partially or totally known. The centrality of each node is calculated and the most central nodes are shown in each case. We use the methodology presented in the paper, that is, we calculate the centrality of each node by solving the eigenvector problem

$$W\,\mathbf{c} = \lambda\,\mathbf{c}$$

where $W_{ij} = A_{ij} \cdot D_{ij}$, $D_{ij}$ is the Jaccard dissimilarity matrix, $\lambda$ is the largest eigenvalue of $W$ and the entries of the associated eigenvector $\mathbf{c}$ are the node centralities.
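The procedure can be sketched as follows: weight each edge by the Jaccard dissimilarity of its endpoints and take the principal eigenvector of the resulting matrix. This is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the adjacency-dict representation, the function names, the sum-to-one normalization and the use of a shifted power iteration are all choices made for illustration.

```python
def jaccard_dissimilarity(adj, i, j):
    """Jaccard dissimilarity between the inclusive neighbourhoods of i and j."""
    a, b = adj[i] | {i}, adj[j] | {j}
    return 1.0 - len(a & b) / len(a | b)


def contribution_centrality(adj, iterations=1000, tol=1e-12):
    """Principal eigenvector of W, with W_ij = A_ij * D_ij, normalized to sum 1."""
    nodes = sorted(adj)
    # Only existing edges (A_ij = 1) carry a weight.
    w = {i: {j: jaccard_dissimilarity(adj, i, j) for j in adj[i]} for i in nodes}
    c = {i: 1.0 / len(nodes) for i in nodes}
    for _ in range(iterations):
        # Iterate with W + I instead of W: the eigenvectors are the same, but the
        # shift avoids the oscillation of plain power iteration on bipartite graphs.
        new = {i: c[i] + sum(w[i][j] * c[j] for j in w[i]) for i in nodes}
        total = sum(new.values())
        new = {i: value / total for i, value in new.items()}
        if max(abs(new[i] - c[i]) for i in nodes) < tol:
            return new
        c = new
    return c


# Hypothetical star: hub 0 linked to leaves 1, 2 and 3.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
centrality = contribution_centrality(star)
ranking = sorted(centrality, key=centrality.get, reverse=True)
print(ranking[0], centrality)
```

On the star the hub comes out on top and the three leaves tie, as expected by symmetry. If every edge joins nodes with identical inclusive neighbourhoods (for instance in a complete graph), all weights vanish and this sketch simply returns the uniform starting vector.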

Florentine Marriages Network
This network, taken from [1], was constructed from historical documents on the social relations among Renaissance Florentine families. Based on the analysis of the network, [1,2] provide evidence that supports why the Medici were the most powerful family in early fifteenth-century Florence. In Figures 1 and 2, we note that there is a difference between the ranking produced by contribution centrality and those of the other centrality measures, most notably between the results obtained with contribution centrality and those obtained with eigenvector centrality, although both measures have similar heuristics. In Figure 3, the network is illustrated with the centrality values produced by our method. In this case, all measures were able to reproduce the result obtained in [1,2], i.e., the Medici family is the most important and influential in the network.

Zachary's Karate Club Network
This network was taken from [3], where the nodes are members of a university karate club and the links represent the presence of ties among the members of the club. For our purposes, we take only the topology, ignoring the weights of the links.
In Figures 4 and 5, there is a significant difference between the ranking produced by contribution centrality and those of the other centrality measures. Most methods detected nodes 34, 33 and 1 as the main nodes, which is the correct ranking. Moreover, our method is the only one to rank node 33 above node 1. We think that node 33 is in fact the second most important node in the network, because the student club is led by the chief administrator (node 34) and the sub-officer (node 33), who decided to employ a part-time karate instructor (node 1); that is, all nodes except node 1 were students and members of the university club before its fission. Note that nodes 33 and 34 belong to the larger community [16,17], and node 33 strongly influences node 34 through their direct connection and shares much of its neighbourhood, i.e., node 33 can assume the role of node 34 in its absence. In this regard, we consider node 33 to be the true second-ranked node. Closeness, eigenvector and information centralities placed node 3 above node 33. The results obtained by contribution, betweenness, communicability and degree centralities indicate that node 3 should be in position 4. In Figure 6, the network is illustrated with the centrality values produced by contribution centrality, and Table 3 shows the rankings obtained with the centrality measures compared here.

Rank  Contribution  Betweenness  Closeness  Communicability  Degree  Eigenvector  Information
1     34            1            1          34               34      34           34
2     33            34           3          1                1       1            1
3     1             33           34         33               33      3            3
4     3             3            32         3                3       33           33
5     9             32           9          2                2       2            2
6     32            9            33         4                4       9            32
7     14            2            14         14               32      14           9
8     2             14           20         9                9       4            4
9     31            20           2          32               24      32           14
10    20            7            4          8                14      31           24

Table 3: Comparative ranking of nodes in the network of Zachary's karate club with each centrality measure studied.

Les Miserables Coappearances Network
We take the coappearances network of characters in Victor Hugo's novel "Les Miserables". Nodes represent characters, labelled by their names. Two characters are linked if they appear in the same chapter of the book [4]. We keep only the topology, ignoring the number of such coappearances. It is well known that the most important characters are Valjean and Javert, who have the roles of protagonist and antagonist, respectively. The rankings obtained by each method are shown in Figures 7 and 8. It is interesting to note that eigenvector centrality is unable to classify Valjean as the most important node, despite the remarkable likeness of its underlying heuristic to that of contribution centrality. Communicability is another measure that is unable to detect Valjean as the most important node. On the other hand, closeness and degree are unable to detect Javert as the second most important node. So, only contribution and information centralities were able to correctly classify Valjean and Javert as the first and second most important characters in this novel. Finally, information centrality ranks Enjolras above Cosette, unlike our method, which is why we believe that contribution centrality generates a better ranking of the nodes in this network, illustrated in Figure 9. Table 4 shows the rankings according to each centrality measure compared here.

Dolphin social network
This is a social network in which the nodes are the bottlenose dolphins (genus Tursiops) of a community living off Doubtful Sound and an edge between two dolphins indicates a frequent association. The dolphins were observed between 1994 and 2001, and the network was reported in [5]. In this case, it is known that the three most important dolphins are SN4, Grin and Topless. However, the betweenness coefficient tells us that Grin and Topless have less intermediation than SN4 in the network, appearing at rank 4 for this measure (see Table 5). Also, according to closeness, SN4 is closer to the rest of the nodes in the network than Topless and Grin are, as shown in Table 5. Moreover, a recent work suggests that SN4 is the most central node in the network, using the Path Score (PS) [18]. Thus, including a dissimilarity measure makes it possible to detect more subtle details than the classical measures do. If we sum the rankings over all measures and average them, we again obtain SN4 as the top node, and if we study the community structure of this network, we see that SN4 is in the larger community [17]. Figures 10 and 11 show the rankings produced by each centrality measure explored here. Figure 12 represents the centrality of each node produced by our method.

Terrorist Network
This network shows the contacts between suspected terrorists involved in the train bombings in Madrid on March 11, 2004 [19]. A node represents a terrorist, and two terrorists are linked if there was a contact between them. The main material author was Jamal Zougam [20], who was accused and sentenced to 34,715 years [21]. The second in the ranking is Mohamed Chaoui, half-brother of Jamal Zougam [22], accused of purchasing the thirteen mobile phone SIM cards [23] used to detonate the explosive devices. In third position appears Imad Eddin Barakat (also known as Abu Dahdah), sentenced to twenty-seven years in prison for his participation in the September 11th terror attacks. Spanish intelligence officer Rafael Gomez Menor speculated that Imad Eddin Barakat oversaw the planning of the train bombings as intellectual author [24]. Thus, almost all centrality measures seem to have correctly identified the main members involved in the planning and execution of this terrorist attack, with the exception of closeness and betweenness, which ranked Imad Eddin Barakat in fifth and fourth position, respectively, as we can see in Tables 6 and 7. Figures 13 and 14 show the differences between the rankings in more detail. Figure 15 illustrates the centrality of each node according to contribution centrality.

Rank  Contribution  Closeness  Betweenness  Communicability  Degree  Eigenvector  Information
1     1             63         1            1                1       1            1
2     3             1          3            3                3       3            3
3     7             3          41           7                7       7            7
4     41            40         7            41               11      41           41
5     11            7          31           11               41      11           11
6     61            31         40           18               24      18           24
7     18            24         24           30               18      30           18
8     40            19         30           15               19      15           19
9     19            61         19           61               31      61           31
10    44            25         11           16               61      16           30
11    24            11         18           58               33      58           33
12    30            21         28

Airport USA 97
This is a model of the commercial air traffic network among airports in the United States [6]. The network consists of 332 nodes (representing airports) and 2126 links, where each link connects two airports through a direct flight. It is important to study centrality in this network, since the hubs are related to points of optimal disease spread, which is a serious public health problem. Detecting such nodes would help prevent or control the spread of diseases at local and global scales. The hubs are also related to vulnerability points in the airport network (points whose failure hinders all flights), which is of interest to airlines, since flight delays produce losses. Detecting and preventing such problems contributes to efficient and safe transportation. Using only the underlying topology, without information about the number of passengers, cargo transported or any other statistics or weights, we observe that contribution centrality places Chicago O'Hare Intl., Dallas/Fort Worth Intl. and The William B. Hartsfield Atlanta within its top 4; these are three of the four major airports to date [25], and only Los Angeles Intl. is missing. In this case, betweenness centrality is the only measure that obtains just two of the four most important airports, while the others produce a similar ranking, as we can see in Table 8, where the nodes appear encoded by numbers; the corresponding names are given in Table 9.

Runtime performance
The performance of our method was analysed on a set of Barabási-Albert networks [31], with a number of nodes N distributed logarithmically from $10^2$ to $10^5$ and parameter k = 2. For each N, 10 networks were generated and the runtime of our algorithm was averaged. The results are shown in Table 10 and in Figure 18. Runtimes were quite short, with 50 minutes for the largest benchmark network. The software used for these simulations was written in Python, which is commonly reported to be around 20 times slower for intensive calculations than languages like C/C++ or Fortran. We therefore estimate that analysing a network of $10^5$ nodes with code written and optimized in C/C++ or Fortran would take no more than 3 minutes (50 minutes / 20 = 2.5 minutes), which would allow processing networks in the range of a million nodes in a few minutes. In all of these networks, the hub was detected as the most important node, as expected.
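A benchmark of this kind could be set up along the following lines. This is a sketch under several assumptions that the source does not specify: the generator starts from a complete graph on m + 1 nodes, the timed workload is only the computation of the Jaccard edge weights (a placeholder for the full method), and the sizes are kept small so that the example runs quickly. All function names are illustrative.

```python
import random
import time


def barabasi_albert(n, m=2, seed=0):
    """Preferential-attachment graph as {node: set of neighbours}.
    Assumption: growth starts from a complete graph on m + 1 nodes."""
    rng = random.Random(seed)
    adj = {i: set(range(m + 1)) - {i} for i in range(m + 1)}
    # Each node appears in `stubs` once per incident edge, so a uniform draw
    # from this list is a degree-proportional draw.
    stubs = [i for i in adj for _ in adj[i]]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        adj[new] = set(targets)
        for t in targets:
            adj[t].add(new)
            stubs.extend((t, new))
    return adj


def edge_weights(adj):
    """Placeholder workload: Jaccard weights W_ij = A_ij * D_ij on every edge."""
    return {(i, j): 1.0 - len((adj[i] | {i}) & (adj[j] | {j}))
            / len(adj[i] | adj[j] | {i, j})
            for i in adj for j in adj[i]}


def mean_runtime(func, n, repeats=10):
    """Average wall-clock time of func(graph) over `repeats` graphs of size n."""
    total = 0.0
    for seed in range(repeats):
        graph = barabasi_albert(n, seed=seed)
        start = time.perf_counter()
        func(graph)
        total += time.perf_counter() - start
    return total / repeats


# Logarithmically spaced sizes, here 10^2, 10^2.5 and 10^3.
sizes = [int(round(10 ** (2 + k / 2))) for k in range(3)]

if __name__ == "__main__":
    for n in sizes:
        print(n, mean_runtime(edge_weights, n, repeats=3))
```

Replacing `edge_weights` with a full centrality routine and extending `sizes` up to 10^5 would approximate the experiment described above; the absolute timings depend on the machine and the implementation.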