Locating influential nodes in complex networks

Understanding and controlling spreading processes in networks is an important topic with many diverse applications, including information dissemination, disease propagation and viral marketing. It is of crucial importance to identify which entities act as influential spreaders that can propagate information to a large portion of the network, in order to ensure efficient information diffusion, optimize available resources or even control the spreading. In this work, we capitalize on the properties of the K-truss decomposition, a triangle-based extension of the core decomposition of graphs, to locate individual influential nodes. Our analysis on real networks indicates that the nodes belonging to the maximal K-truss subgraph show better spreading behavior than nodes selected by previously used importance criteria, including node degree and k-core index, leading to faster and wider epidemic spreading. We further show that nodes belonging to such dense subgraphs dominate the small set of nodes that achieve the optimal spreading in the network.

Each plot depicts the cumulative difference in the number of infected nodes per step between the truss method and the core (truss − core) and top degree (truss − degree) methods. Parameter γ of the SIR model is set to γ = 0.8. In all cases, the proposed truss method outperforms the baselines, leading to more effective information spreading. Observe that in almost all datasets the truss method achieves a higher cumulative number of infected nodes.
In this work we have mainly experimented with social networks, but our results can also be extended to networks from other disciplines. A full description of the datasets used in the paper follows, while some basic properties are given in Table 1 of the main text. All datasets are publicly available from [1].
(i) EMAIL-ENRON. This email communication network was created from the email interactions between the members of Enron Corporation and was made public by the Federal Energy Regulatory Commission during its investigation [2]. It covers data from 150 users and a total of half a million messages. Each node represents an email address, and an undirected edge is formed between two nodes i and j if address i sent at least one email to address j.
(ii) EMAIL-EUALL. This email network was collected from the email communication of a large research institution [3]. To create the graph, each email address is considered as a node and an edge is created between two nodes if they have exchanged messages in both directions. Overall, there are 3,038,531 emails between 287,755 different email addresses, recorded from October 2003 to May 2005.
(iii) EPINIONS. This is a trust-based (who-trusts-whom) online social network between the members of the Epinions.com (www.epinions.com) product review website [4]. The nodes of the network correspond to users of the website and the edges capture trust relationships between them. Although the network is signed, we discard this information in our experiments; we also convert the graph to an undirected one.
(iv) WIKI-VOTE. The graph was created from the online encyclopedia Wikipedia (www.wikipedia.org) and more precisely from the elections conducted to promote users to administrators (till January 2008) [5]. The nodes of the social network correspond to Wikipedia users and an edge between users i, j denotes that user i voted for user j.
(v) WIKI-TALK. This network is also created from data imported from Wikipedia [5]. Each user has a talk page that any interested user can edit in order to discuss updates to various articles on Wikipedia. The specific dataset contains all discussions between users until January 2008. The nodes of this network correspond to users of Wikipedia and an edge from node i to node j indicates that user i edited a talk page of user j at least once.
(vi) SLASHDOT. Slashdot (slashdot.org) is a technology news website. The nodes of the social network that is created correspond to users and the edges capture friendship relationships among them (till February 2009) [6]. In fact, users are able to tag other users as friends or foes, forming a signed social network with positive and negative types of edges. In our experiments, we do not take into account the type of the edges.

SUPPLEMENTARY NOTE 2: PROPERTIES OF THE K-TRUSS SUBGRAPHS
We have examined the distribution of the node truss numbers t_node of the graphs presented in Table 1 of the main text; the results are depicted in Supplementary Fig. 2. Each plot shows the complementary cumulative distribution function (CCDF) of the nodes' truss numbers in log-log scale. As we can observe, in most cases the distribution is skewed, indicating that very few nodes have a high truss number; the majority of the nodes belong to "low" K-truss subgraphs, i.e., subgraphs for small values of parameter K of the decomposition. We have also fitted a power-law distribution [7] to the data (red colored line), and the exponent is shown in Supplementary Fig. 2. Here, we do not claim that the truss number distribution is fully captured by a power-law; nevertheless, it corresponds to a heavy-tailed distribution, and this fact can help us to better understand the underlying properties of the data. In our case, this means that we can reduce the graph into a subgraph with exponentially smaller size and try to locate influential spreaders within this subgraph.
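As an illustration of the quantity plotted in each panel, the following minimal sketch computes an empirical CCDF from a list of node truss numbers. The input values are hypothetical, and the convention P(X ≥ x) is an assumption on our part (the alternative P(X > x) is also common).

```python
from collections import Counter

def ccdf(values):
    """Empirical CCDF as a list of (x, P(X >= x)) pairs, sorted by x.

    Assumes the convention P(X >= x); 'values' is any non-empty list of
    node truss numbers.
    """
    counts = Counter(values)
    n = len(values)
    remaining = n  # number of observations that are >= the current x
    result = []
    for x in sorted(counts):
        result.append((x, remaining / n))
        remaining -= counts[x]
    return result

# Hypothetical truss numbers for seven nodes (not taken from any dataset).
truss_numbers = [2, 2, 3, 3, 3, 4, 5]
for x, p in ccdf(truss_numbers):
    print(x, round(p, 3))
```

Plotting these pairs on log-log axes gives a curve of the kind described above; a heavy tail appears as a slow, roughly linear decay.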

SUPPLEMENTARY NOTE 3: COMPARISON OF MAXIMAL K-TRUSS AND k-CORE SUBGRAPHS
It has been shown that the maximal k-core and K-truss subgraphs (i.e., maximum values for k, K) overlap, with the latter being a subgraph of the former; in fact, K-truss represents the core of a k-core that filters out less important information. Supplementary Fig. 3 shows an example of a graph and its k-core and K-truss subgraphs respectively. The red colored nodes correspond to the set C (i.e., the maximal k-core subgraph of the graph). The edges of the gray shadowed region are the edges that belong to the maximal K-truss of the original graph and the corresponding nodes are those belonging to set T, i.e., the nodes with the maximum node truss number. In our approach, we argue that those nodes correspond to highly influential ones in the graph.
As we discussed in the main text, the K-truss decomposition is computationally more demanding than the k-core decomposition. In our approach, as we are only interested in the maximal K-truss subgraph, we take into account the structure shown in Supplementary Fig. 3. That is, we first compute the maximal k-core subgraph in linear time with respect to the total number of edges (the subgraph defined by the red colored nodes in Supplementary Fig. 3) and then we extract the maximal K-truss subgraph (gray area in Supplementary Fig. 3). That way, the overall complexity of the task is significantly reduced.
SUPPLEMENTARY FIGURE 3. Schematic representation of the maximal k-core and K-truss subgraphs of a graph. The red colored nodes correspond to the 3-core subgraph of the graph (set C); the gray shadowed region indicates the 4-truss subgraph (set T).
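The two-stage extraction described in this note can be sketched as follows. This is a minimal illustration and not the implementation used in the experiments: the helper names and the toy edge list are hypothetical, the peeling is done naively level by level rather than with the linear-time core algorithm, and the code relies on the containment of the maximal K-truss in the maximal k-core as stated above. A K-truss is taken to be the maximal subgraph in which every edge belongs to at least K − 2 triangles.

```python
from collections import defaultdict

def build_graph(edges):
    """Undirected adjacency map {node: set(neighbors)}; self-loops are ignored."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)

def k_core(adj, k):
    """k-core: repeatedly delete nodes whose remaining degree is below k."""
    adj = {u: set(nb) for u, nb in adj.items()}
    stack = [u for u in adj if len(adj[u]) < k]
    while stack:
        u = stack.pop()
        if u not in adj:
            continue  # already deleted
        for v in adj.pop(u):
            adj[v].discard(u)
            if len(adj[v]) < k:
                stack.append(v)
    return adj

def k_truss(adj, K):
    """K-truss: repeatedly delete edges that lie in fewer than K - 2 triangles."""
    adj = {u: set(nb) for u, nb in adj.items()}
    changed = True
    while changed:
        changed = False
        for u in list(adj):
            for v in list(adj[u]):
                # Common neighbors of u and v = triangles through edge (u, v).
                if len(adj[u] & adj[v]) < K - 2:
                    adj[u].discard(v)
                    adj[v].discard(u)
                    changed = True
    return {u: nb for u, nb in adj.items() if nb}  # drop isolated nodes

def maximal_subgraph(adj, peel, start):
    """Increase the parameter until the next level would be empty."""
    level, best = start, peel(adj, start)
    while True:
        nxt = peel(best, level + 1)
        if not nxt:
            return level, best
        level, best = level + 1, nxt

# Hypothetical toy graph: a 4-clique {0,1,2,3}, two extra nodes 4 and 5
# attached to it with three edges each, and a pendant node 6.
EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
         (0, 4), (1, 4), (4, 5), (2, 5), (3, 5), (5, 6)]
graph = build_graph(EDGES)
k_max, C = maximal_subgraph(graph, k_core, 0)   # stage 1: maximal k-core
K_max, T = maximal_subgraph(C, k_truss, 2)      # stage 2: truss inside C
print(k_max, sorted(C))   # 3 [0, 1, 2, 3, 4, 5]
print(K_max, sorted(T))   # 4 [0, 1, 2, 3]
```

On this toy graph the result mirrors the situation of Supplementary Fig. 3: set C is the 3-core and set T is the 4-truss contained in it. Running the truss peeling only on C means that the triangle counting touches far fewer edges than on the full graph.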

SUPPLEMENTARY NOTE 4: SIMULATING THE SPREADING PROCESS
Supplementary Algorithm 1 presents the steps of the proposed framework for (i) selecting the initial node that will trigger the epidemic (cascade) and (ii) evaluating the impact of this individual node with respect to the epidemic spreading under the SIR model. Initially, we choose a node that belongs to T (i.e., the maximal K-truss subgraph) and set it to the infected (I) state. Note that we keep track of the infected, susceptible and recovered nodes at each time step of the process. At each time step, an infected node can infect a susceptible neighbor with probability β. Additionally, any node that was infected at a previous time step of the process can recover with probability γ. The process is repeated until no infected nodes remain. Finally, the algorithm returns M_v, the number of individuals infected in the cascade triggered by node v.
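The process described above can be sketched as a discrete-time simulation. The code below is a minimal illustration under stated assumptions, not a transcription of Supplementary Algorithm 1: the function names are hypothetical, M_v is taken to count every node that was ever infected (including the seed v), and nodes infected in the current step cannot recover until the next step.

```python
import random

def simulate_sir(adj, seed_node, beta, gamma, rng):
    """One SIR cascade from seed_node.

    Returns (M_v, new_per_step): the total number of nodes ever infected
    (assumed to include the seed) and the newly infected count per step.
    """
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must be in (0, 1] so that the cascade terminates")
    infected, recovered = {seed_node}, set()
    new_per_step = [1]  # step 0: only the seed is infected
    while infected:
        new_infected = set()
        for u in infected:
            for w in adj[u]:
                susceptible = w not in infected and w not in recovered
                if susceptible and w not in new_infected and rng.random() < beta:
                    new_infected.add(w)
        # Only nodes infected at earlier steps may recover in this step.
        new_recovered = {u for u in infected if rng.random() < gamma}
        recovered |= new_recovered
        infected = (infected - new_recovered) | new_infected
        new_per_step.append(len(new_infected))
    return len(recovered), new_per_step

def average_spread(adj, seed_node, beta, gamma, runs=100, seed=0):
    """Mean M_v over repeated cascades (runs and seed are illustrative defaults)."""
    rng = random.Random(seed)
    totals = [simulate_sir(adj, seed_node, beta, gamma, rng)[0] for _ in range(runs)]
    return sum(totals) / runs

# Hypothetical path graph 0 - 1 - 2 - 3, used only to exercise the code.
PATH = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(average_spread(PATH, 0, beta=0.5, gamma=0.8))
```

Because a single cascade is random, the impact of a node is better judged from the average over many runs, together with its standard deviation.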

Supplementary Algorithm 1 Identify nodes and evaluate spreading process
for each node w ∈ V do

SUPPLEMENTARY NOTE 5: IMPACT OF INFECTION AND RECOVERY RATE ON THE SPREADING PROCESS
As we presented in the main text, parameters β and γ of the SIR model have been set to some constant values; the infection rate β is typically set close to the epidemic threshold of the graph (as defined by the maximum eigenvalue of the adjacency matrix of the graph), while the recovery rate is considered constant and always set to γ = 0.8. Here, we examine the impact of the infection and recovery rate on the epidemic spreading achieved by the proposed method (truss) and the two baseline methods (core and top degree). To that end, we simulate the spreading process for each of the above methods, setting parameters β and γ as follows: (i) Parameter β is set close to the epidemic threshold of the graph, while varying parameter γ ∈ {0.5, 0.8, 1}.
Parameter γ = 1 implies that each infected node moves to the recovered (R) state with probability one in the next step of the model. (ii) The recovery rate is set to γ = 0.8, while considering different values of parameter β, always above the epidemic threshold of the graph. As we discussed in the main text, if we consider high values of the infection rate β, a relatively high fraction of nodes will be infected, and thus the spreading capabilities of individual nodes are diminished.
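To make the choice of β concrete, the sketch below estimates the largest eigenvalue λ_max of the adjacency matrix by power iteration and derives a threshold from it. It assumes the commonly used convention that the epidemic threshold equals 1/λ_max; the function names, the toy graphs and the multipliers used for the sweep are hypothetical.

```python
import math

def largest_eigenvalue(adj, iterations=1000, tol=1e-12):
    """Largest adjacency eigenvalue of a connected undirected graph.

    Power iteration is applied to A + I rather than A; the shift avoids
    oscillation on bipartite graphs and leaves the eigenvectors unchanged.
    """
    nodes = list(adj)
    x = {u: 1.0 for u in nodes}
    lam = 0.0
    for _ in range(iterations):
        y = {u: x[u] + sum(x[v] for v in adj[u]) for u in nodes}
        norm = math.sqrt(sum(val * val for val in y.values()))
        y = {u: val / norm for u, val in y.items()}
        # Rayleigh quotient of A for the unit vector y.
        new_lam = sum(y[u] * sum(y[v] for v in adj[u]) for u in nodes)
        if abs(new_lam - lam) < tol:
            return new_lam
        x, lam = y, new_lam
    return lam

def epidemic_threshold(adj):
    """Threshold under the assumed convention beta_c = 1 / lambda_max."""
    return 1.0 / largest_eigenvalue(adj)

# Hypothetical toy graphs: the complete graph on 4 nodes and a star with 4 leaves.
K4 = {u: [v for v in range(4) if v != u] for u in range(4)}
STAR = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}

beta_c = epidemic_threshold(K4)
# Illustrative sweep of infection rates at and above the threshold.
betas = [m * beta_c for m in (1.0, 1.5, 2.0)]
print(beta_c, betas)
```

Each value in `betas` would then be combined with a fixed γ (setting (ii) above), while setting (i) keeps β near `beta_c` and varies γ.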
Supplementary Fig. 4 shows the results. In all cases, we have computed the cumulative fraction of infected nodes I_t per step of the process, for each of the three methods, along with the standard deviation (depicted as error bars in the plot). As we can observe, as the recovery probability γ decreases, the number of infected nodes increases both during the first time steps of the process and at the end of the epidemic. This behavior is expected since, as we discussed above, with a high recovery rate γ most of the nodes will move to the R state, thus being inactive in subsequent iterations of the model. Regarding the performance of the methods, it is evident that the proposed truss method outperforms both baselines for all settings of parameter γ.
In the second case, where the recovery rate γ is constant while the infection probability increases, the number of infected nodes naturally increases. However, for higher values of β, the total number of infected nodes is almost the same for all methods. This behavior is rather expected; by increasing the infection rate, the importance of individual nodes in the epidemic process is reduced. For these values of β, the difference between the methods can be observed during the outbreak of the epidemic (i.e., the first steps of the process), where the truss method performs qualitatively better.
In this section, we briefly discuss extended related work for the problem of identifying influential spreaders. Building upon the fact that the k-core decomposition is an effective (and efficient) measure to capture the spreading properties of nodes, as introduced by Kitsak et al. [8], several extensions have been proposed. The authors of Supplementary Ref. [9] introduced a modified version of the k-core decomposition in which the nodes are ranked taking into account their connections to the remaining nodes of the graph as well as to the nodes removed at previous steps of the process. They showed that the proposed node ranking method is able to identify nodes with better spreading properties compared to the traditional k-core decomposition. Bae et al. [10] extended the metric of the k-core number of each node by considering the core number of its neighbors. That way, the ranking produced by the method is more fine-grained, in the sense that the effect of assigning the same score (i.e., k-core number) to many nodes is eliminated. Basaras et al. [11] proposed to rank the nodes according to a criterion that combines the degree and the k-core number of a node within a µ-hop neighborhood. In Supplementary Ref. [12], the authors introduced a criterion that combines three previously examined measures, namely degree, betweenness centrality and core number. The intuition is that most of the widely used centrality criteria produce highly correlated rankings of nodes; by combining them in a proper way, we are able to obtain a more accurate indicator of influential nodes. Zhang et al. [13] proposed a method to locate influential nodes taking into account the existence of community structure in networks. In Supplementary Ref. [14], the authors considered real social media data in order to examine to what extent the structural position of a user in the network allows us to characterize the ability of an individual to spread rumors effectively. Their results indicate that, although the most appropriate feature is the degree of a node, only a few such highly-connected individuals exist; however, by considering the k-core number metric, we are able to locate a larger set of individuals that are likely to trigger large cascades. For a detailed review of the area, we refer to the article by Pei and Makse [15]. As we also discussed in the main text, it is important to stress that most of the above mentioned extensions can also be applied to the proposed K-truss decomposition-based approach.