Evaluating link prediction by diffusion processes in dynamic networks

Link prediction (LP) makes it possible to infer missing or future connections in a network. The network organization defines how information spreads through the nodes. In turn, the spreading may induce changes in the connections and speed up the network evolution. Although many LP methods have been reported in the literature, as well as methodologies to evaluate them as a classification task or a ranking problem, none has systematically investigated their effects on spreading and on the structural evolution of the network. Here, we systematically analyze LP algorithms in a framework concerning: (1) different diffusion processes (Epidemic, Information, and Rumor models); (2) which LP method most improves the spreading on the network through the addition of new links; (3) the structural properties of the LP-evolved networks. From extensive numerical simulations with representative existing LP methods on different datasets, we show that spreading improves in evolved scale-free networks with lower shortest-path lengths and structural holes. We also find that properties like triangles, modularity, assortativity, or coreness may not increase the propagation. This work provides an overview of LP methods and network evolution and can be used as a practical guide for selecting and evaluating LP methods in terms of computational cost, spreading capacity, and network structure.


S.1 Supplementary material of the spreading results
We evaluate the impact of adding new edges on the spreading capacity of the networks by considering the most unlikely links according to the LP methods, i.e., those with the lowest recommendation scores.
Figure 1 shows the resulting spreading capacities and Figure 2 the normalized distribution of the 20% lowest LP-scores for all the datasets. In these simulations, we consider only the IC model with $\lambda = 0.3$, because the patterns of the spreading capacities of the LP methods are similar across diffusion models and propagation parameters. This point is discussed in more detail in the main paper.
We observe that iAA and iCN barely affect the spreading capacity of the evolved networks, in contrast to iRP, iSR, iGD, and iJC, in that order (Figure 1). However, the results of RN are better than those of the inverse LP-score approach. In terms of the LP-score distribution (Figure 2), iCN and iAA present the lowest dispersion of scores: in iCN all of the 20% lowest scores are equal to the same value, and in iAA most of the values are equal to the median. iSR, iRP, and iJC present a more notable dispersion in the 20% lowest values.
Figure 1: Spreading capacity when adding 1, 5, 10, and 20% of new links in the real-world networks: (a) Email [5]; (b) Hamsterster [8]; (c) Facebook [8]; (d) Advogato [11]; (e) Astrophysics [14]; and (f) GooglePlus [12]. We consider the edges with the lowest recommendation scores from the methods, denoted as inverse: SimRank (iSR), Rooted Pagerank (iRP), Common Neighbors (iCN), Jaccard Coefficient (iJC), Graph Distance (iGD), and Adamic Adar (iAA). RN is the random addition of new links.
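The inverse selection used in this section (ranking candidate links by ascending LP score and adding the lowest-scored ones) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the graph is a plain dictionary of neighbor sets, the helper names `lowest_scored_links` and `add_links` are hypothetical, the percentage of new links is assumed to be taken relative to the current number of edges, and ties are broken by node labels.

```python
from itertools import combinations


def common_neighbors(adj, i, j):
    """Common Neighbors score, used here only as an example LP method."""
    return len(adj[i] & adj[j])


def lowest_scored_links(adj, score, fraction):
    """Return the non-existing links with the lowest LP scores.

    Assumption: the number of new links is `fraction` times the current
    number of edges (at least one link is always returned).
    """
    nodes = sorted(adj)
    non_edges = [(i, j) for i, j in combinations(nodes, 2) if j not in adj[i]]
    n_edges = sum(len(nb) for nb in adj.values()) // 2
    n_new = max(1, round(fraction * n_edges))
    # Ascending score = inverse recommendation; ties broken by node labels.
    ranked = sorted(non_edges, key=lambda e: (score(adj, e[0], e[1]), e))
    return ranked[:n_new]


def add_links(adj, links):
    """Return a copy of the network evolved with the given new links."""
    evolved = {i: set(nb) for i, nb in adj.items()}
    for i, j in links:
        evolved[i].add(j)
        evolved[j].add(i)
    return evolved
```

Replacing the ascending sort with a random shuffle of `non_edges` would give a counterpart of the RN baseline under the same assumptions.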

S.2 Network science concepts
Several measures have been proposed for network characterization [16]. It is important to consider a set of features of the network, like the heterogeneity and dispersion of the degree distribution, the number of triangles, the proportion of shortest paths, and the community or modular structure, among others.
In an undirected network $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of links, the degree of node $i$, denoted $k_i$, is the number of links incident on $i$. In addition, we denote by $\Gamma(i)$ the set of neighbors of node $i$, so that $|\Gamma(i)| = k_i$. Hubs are the nodes that have a very high degree in the network. The degree distribution $P(k)$ of an undirected network is the probability of randomly selecting a node with degree $k$ or, for finite networks, the fraction of nodes with degree equal to $k$ [16]. Social networks exhibit a heavy-tailed degree distribution following a power law [13] of the form $P(k) \sim k^{-\gamma}$, in which most of the individuals have a low degree and a few of them have a very high degree (hubs). When the exponent of the power law takes values $2 \leq \gamma \leq 3$, the network is called scale-free [16]. Most real-world social networks follow a scale-free degree distribution [15].
We can calculate the moments of the degree distribution according to
$$\langle k^n \rangle = \sum_k k^n P(k),$$
in which the first and the second moments are the average degree (for $n = 1$), $\langle k \rangle = 2M/N$, and the dispersion (for $n = 2$), $\langle k^2 \rangle$, respectively, with $N = |V|$ nodes and $M = |E|$ links. The average degree provides information about the density of the network. The second moment, on the other hand, captures the variability of the degrees: it diverges for very heterogeneous networks, reflecting the large connectivity fluctuations associated with heavy-tailed distributions.
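As a small illustration of the moments of the degree distribution, the following Python sketch builds $P(k)$ as the fraction of nodes with degree $k$ and evaluates $\sum_k k^n P(k)$. The dictionary-of-sets graph representation and the function names are assumptions made for this example only.

```python
from collections import Counter


def degree_distribution(adj):
    """P(k): fraction of nodes with degree k in a finite network."""
    n_nodes = len(adj)
    counts = Counter(len(neighbors) for neighbors in adj.values())
    return {k: c / n_nodes for k, c in counts.items()}


def moment(p_k, n):
    """n-th moment of the degree distribution, sum_k k^n P(k)."""
    return sum((k ** n) * p for k, p in p_k.items())


# Example: a star with one hub (node 0) and four leaves.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
p_k = degree_distribution(star)
average_degree = moment(p_k, 1)   # equals 2M/N with M = 4 links, N = 5 nodes
second_moment = moment(p_k, 2)
```

In this star the single hub dominates the second moment, which is the effect the text attributes to heterogeneous networks.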
We characterize the level of heterogeneity through the heuristic Network Complexity measure, according to [2]:
$$C = \frac{\langle k^2 \rangle}{\langle k \rangle},$$
which is the ratio between the second and the first moment of the degree distribution. This normalization of the second moment reflects the degree fluctuations in the network. $C$ is small when the network follows a more regular degree distribution (its minimum, $C = \langle k \rangle$, is reached when all nodes have the same degree). In contrast, for scale-free networks, the second moment of the degree distribution diverges ($\langle k^2 \rangle \to \infty$) in the limit of infinite network size ($N \to \infty$). However, in the case of Poisson-distributed networks, the second moment remains finite, $\langle k^2 \rangle = \langle k \rangle (\langle k \rangle + 1)$, so that $C = \langle k \rangle + 1$.
Other options to measure the level of disorder or the similarity of connection patterns between nodes are the entropy and the degree-degree correlation. The normalized version of the Shannon entropy [21] is useful for measuring the entropy of the degree distribution, i.e., $\tilde{H} = H / H_{\max}$ with $H = -\sum_k P(k) \log P(k)$ and $0 \leq \tilde{H} \leq 1$. The greater the disparity in degree, the higher the entropy, with the maximum value occurring for a uniform $P(k)$. The lowest possible entropy occurs when all nodes have the same degree.
The other network property is the degree-degree correlation (or assortativity), which describes the tendency of nodes with similar degree to be connected. The level of assortativity ($\rho$) can be quantified by the Pearson correlation coefficient of the degrees [13]. According to this measure, a network is classified as assortative, or positively correlated, when $\rho > 0$; disassortative, when low-degree nodes tend to connect with high-degree nodes, with $\rho < 0$; or non-assortative, with no connection pattern and $\rho \approx 0$.
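The three heterogeneity measures discussed here can be illustrated with a short Python sketch. Several choices are assumptions of this example rather than statements from the text: the entropy is normalized by $\log N$ (the entropy of a uniform distribution over the $N$ possible degree values), the assortativity is computed as the Pearson correlation over both orientations of every link, and the function names are hypothetical.

```python
import math


def complexity_and_entropy(adj):
    """Return (C, normalized entropy) of the degree distribution."""
    degrees = [len(neighbors) for neighbors in adj.values()]
    n_nodes = len(degrees)
    k1 = sum(degrees) / n_nodes                  # first moment <k>
    k2 = sum(k * k for k in degrees) / n_nodes   # second moment <k^2>
    counts = {}
    for k in degrees:
        counts[k] = counts.get(k, 0) + 1
    entropy = -sum((c / n_nodes) * math.log(c / n_nodes)
                   for c in counts.values())
    # Assumption: normalize by log(N), the maximum for N possible degrees.
    return k2 / k1, entropy / math.log(n_nodes)


def assortativity(adj):
    """Pearson correlation between the degrees at the two ends of a link."""
    xs, ys = [], []
    for i, neighbors in adj.items():
        for j in neighbors:          # each link is visited in both directions
            xs.append(len(adj[i]))
            ys.append(len(adj[j]))
    mean = sum(xs) / len(xs)
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    # Undefined for regular networks (zero variance): return NaN.
    return cov / var if var > 0 else float("nan")
```

For a star network the correlation is exactly negative, since every link joins the hub to a leaf; this is the disassortative pattern described in the text.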
Concerning the structural analysis, the network can be decomposed into shells or cores [18]. The K-core (the core of order $K$, with $K \in \mathbb{N}_{>0}$) is the maximal subset of nodes in which every node has degree $k_i \geq K$ within the subset, and the K-Core centrality (coreness) of a node is the highest-order core it belongs to [18]. Nodes with the highest coreness are the most central. Conversely, nodes located at the periphery of the network have lower K-Core centrality, even hubs located in this region. We denote by $KC$ the average K-Core centrality over the nodes.
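The K-core decomposition can be computed by repeatedly peeling off the node with the lowest remaining degree. The Python sketch below is a simple quadratic-time version of that standard procedure, written for illustration; the graph representation and the function name `core_numbers` are assumptions of this example.

```python
def core_numbers(adj):
    """Coreness of every node, obtained by iterative minimum-degree peeling."""
    degree = {i: len(neighbors) for i, neighbors in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Remove the node with the smallest degree in the remaining subgraph.
        i = min(remaining, key=degree.get)
        k = max(k, degree[i])   # the core order never decreases while peeling
        core[i] = k
        remaining.remove(i)
        for j in adj[i]:
            if j in remaining:
                degree[j] -= 1
    return core


# Example: a triangle (0, 1, 2) and a peripheral hub (3) with three leaves.
example = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1},
           3: {0, 4, 5, 6}, 4: {3}, 5: {3}, 6: {3}}
coreness = core_numbers(example)
average_coreness = sum(coreness.values()) / len(coreness)
```

In this example node 3 has the highest degree but coreness 1, while the three triangle nodes have coreness 2, which matches the remark about peripheral hubs.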
The presence of triangles is a common property found in real-world networks [14]. In topological terms, it is the number of cycles of order three in the network [2,14]. These triangles have been shown to be a relevant feature of complex systems, especially of social networks [15,24]. They are related to the similarity or homophily of the connections between individuals [15,19] and to how cohesive the social circles are [6]. Here, we characterize the triangles by the clustering coefficient, which is the average over the nodes of the proportion of triangles around each node [22]:
$$CC = \frac{1}{N} \sum_{i} \frac{2\, t_i}{k_i (k_i - 1)},$$
where $k_i$ is the degree of node $i$ and $t_i$ is the number of triangles centered on $i$. For $k_i \in \{0, 1\}$, the value of the fraction is assumed to be zero. The maximum value occurs when all the neighbors of $i$ are interconnected.
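A direct Python transcription of the clustering coefficient may help to make the definition concrete. It counts, for each node, the connected pairs among its neighbors and applies the convention that nodes with degree 0 or 1 contribute zero. The function names and the graph representation are assumptions of this sketch.

```python
from itertools import combinations


def local_clustering(adj, i):
    """Fraction of pairs of neighbors of i that are themselves connected."""
    k = len(adj[i])
    if k < 2:
        return 0.0   # convention for nodes with degree 0 or 1
    triangles = sum(1 for a, b in combinations(adj[i], 2) if b in adj[a])
    return 2 * triangles / (k * (k - 1))


def average_clustering(adj):
    """Average of the local clustering over all nodes."""
    return sum(local_clustering(adj, i) for i in adj) / len(adj)
```

For a triangle with one pendant node attached, the two nodes that belong only to the triangle reach the maximum value of 1, the attachment node drops to 1/3, and the pendant node contributes 0.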
Some nodes work as bridges between clusters or other nodes, and when they are removed, a structural hole appears. Nodes that act as structural-hole spanners connect communities or groups of nodes that have no direct connections. These individuals are essential to the connectivity of local regions. The structural holes centrality [3] considers the ego network of each node, ignoring connections not related to it. Nodes with higher degree centrality have low values of structural holes centrality, because hubs have larger and more densely interconnected ego networks, which diminishes the presence of isolated holes. We denote by $SH$ the average of the structural holes centralities of all the nodes.
Concerning the network distances, the shortest path or geodesic path $\ell_{ij}$ is the shortest distance between two nodes $i$ and $j$. On a global scale, we can compute the average shortest path length ($\langle \ell \rangle$), which measures how close the nodes are to each other. In addition, the diameter of the network ($\max(\ell_{ij})$) is the longest of all shortest paths and represents the linear size of the network.
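For unweighted networks, both distance measures can be obtained with a breadth-first search from every node. The following Python sketch assumes a connected network with at least two nodes, stored as a dictionary of neighbor sets; the function names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def path_statistics(adj):
    """Return (average shortest path length, diameter).

    Assumption: the network is connected, so every pair has a finite distance.
    """
    lengths = []
    for s in adj:
        dist = bfs_distances(adj, s)
        lengths.extend(d for t, d in dist.items() if t != s)
    return sum(lengths) / len(lengths), max(lengths)
```

Adding a link can only shorten or preserve these distances, so both quantities are non-increasing as an LP method evolves the network.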
The proportion of shortest paths in which a specific node appears is related to the capacity of information transmission of that node [16]. Thus, the betweenness centrality of a node $j$ is the fraction of shortest paths between all pairs of nodes $(i, l)$ that contain $j$ [4], where $i$, $j$, and $l$ are different. The betweenness centrality has proved to be important for understanding many dynamical and complex systems [16]. For example, in the social sciences, it was found that intrepid individuals were less likely to cluster, but more likely to have high betweenness centrality by connecting different communities [6].
We characterize the overall betweenness of the network ($B$) by averaging over the nodes:
$$B = \frac{1}{N} \sum_{j} \sum_{i \neq j \neq l} \frac{\sigma_{i,l}(j)}{\sigma_{i,l}},$$
where $\sigma_{i,l}$ is the total number of different shortest paths between $i$ and $l$, and $\sigma_{i,l}(j)$ is the number of those paths that pass through $j$.
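The betweenness of every node can be accumulated with one breadth-first search per source, following the well-known dependency-accumulation scheme of Brandes. The Python sketch below is an illustration for unweighted, undirected networks. It is not taken from the paper, and it assumes that each unordered pair $(i, l)$ is counted once and that no further normalization is applied; the function names are hypothetical.

```python
from collections import deque


def betweenness(adj):
    """Betweenness of every node, counting each unordered pair (i, l) once."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        sigma = dict.fromkeys(adj, 0)   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}    # predecessors on shortest paths
        order = []                      # nodes in non-decreasing distance
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)  # dependency of s on each node
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Every pair was visited from both endpoints, so halve the totals.
    return {v: b / 2 for v, b in bc.items()}


def average_betweenness(adj):
    """Overall betweenness B: average of the node values."""
    values = betweenness(adj)
    return sum(values.values()) / len(values)
```

If ordered pairs are intended instead, the final halving step would simply be dropped.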
Last, the community structure is another critical feature of networks, which affects the performance of the propagation process [20,23]. Communities are sets of nodes that are densely interconnected and sparsely connected with the rest of the network [13]. One of the most relevant measures for evaluating the community structure of networks is the modularity $Q$ [13]. It compares the density of intra-community and inter-community edges relative to a non-correlated random network of similar size. Moreover, this measure is employed by several techniques to identify communities in networked systems, especially in divisive and agglomerative approaches [2,5,13].
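For a given partition, the modularity can be evaluated with the common form $Q = \sum_c \left[ L_c / M - (d_c / 2M)^2 \right]$, where $L_c$ is the number of links inside community $c$, $d_c$ is the sum of the degrees of its nodes, and $M$ is the total number of links. The text does not state the formula, so this form, the symbols $L_c$ and $d_c$, and the function name are assumptions of the following Python sketch.

```python
def modularity(adj, communities):
    """Modularity Q of a partition given as an iterable of node collections."""
    m = sum(len(neighbors) for neighbors in adj.values()) / 2
    q = 0.0
    for community in communities:
        members = set(community)
        # Each internal link is seen from both of its endpoints.
        internal = sum(1 for i in members for j in adj[i] if j in members) / 2
        total_degree = sum(len(adj[i]) for i in members)
        q += internal / m - (total_degree / (2 * m)) ** 2
    return q


# Example: two triangles joined by a single link (2, 3).
two_triangles = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
                 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
q_split = modularity(two_triangles, [{0, 1, 2}, {3, 4, 5}])
q_single = modularity(two_triangles, [{0, 1, 2, 3, 4, 5}])
```

Splitting the example into its two triangles gives a positive $Q$, while placing all nodes in a single community gives $Q = 0$.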

S.3 Description of the adopted LP methods
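Jaccard Coefficient (JC): it measures the probability of randomly selecting a node that is a common neighbor of $i$ and $j$ from the set of all neighbors of either $i$ or $j$ [17]. Therefore, the higher the probability, the higher the similarity between the nodes:
$$s^{JC}_{i,j} = \frac{|\Gamma(i) \cap \Gamma(j)|}{|\Gamma(i) \cup \Gamma(j)|}.$$
Common Neighbors (CN): it refers to the size of the set of all common neighbors of both $i$ and $j$ [9]:
$$s^{CN}_{i,j} = |\Gamma(i) \cap \Gamma(j)|.$$
Adamic Adar (AA): it refines the simple counting of common neighbors by weighting rarer neighbors, i.e., those with lower degree, more heavily [1]:
$$s^{AA}_{i,j} = \sum_{z \in \Gamma(i) \cap \Gamma(j)} \frac{1}{\log k_z}.$$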
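The three neighborhood-based scores translate directly into set operations. The Python sketch below uses the standard definitions of these indices on a dictionary of neighbor sets; the function names are chosen for this example, and the natural logarithm is assumed for Adamic Adar.

```python
import math


def common_neighbors(adj, i, j):
    """CN: number of neighbors shared by i and j."""
    return len(adj[i] & adj[j])


def jaccard(adj, i, j):
    """JC: shared neighbors divided by all neighbors of either node."""
    union = adj[i] | adj[j]
    return len(adj[i] & adj[j]) / len(union) if union else 0.0


def adamic_adar(adj, i, j):
    """AA: shared neighbors weighted by the inverse log of their degree.

    A common neighbor of two distinct nodes has degree >= 2, so the
    logarithm is always positive.
    """
    return sum(1.0 / math.log(len(adj[z])) for z in adj[i] & adj[j])
```

All three scores are zero for pairs without common neighbors, which is consistent with the low dispersion of the lowest iCN and iAA scores reported in S.1.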
Rooted Pagerank (RP): it is a ranking method that inherently scales according to node distance [25]. The score $s^{RP}_{i,j}$ is defined as the probability that a random walker starting from node $i$ is located at node $j$ in the steady state [10]; it is obtained from the transition matrix $P$ of the walk.
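One common way to obtain such a steady-state probability is a random walk that returns to the root node with a fixed probability at every step and otherwise moves to a uniformly chosen neighbor. The Python sketch below follows that formulation by power iteration. The restart probability `alpha`, its default value, the number of iterations, and the function name are assumptions of this example; the exact expression used in the paper is not reproduced here.

```python
def rooted_pagerank(adj, root, alpha=0.15, n_iter=200):
    """Stationary distribution of a random walk with restart at `root`.

    Assumption: with probability `alpha` the walker jumps back to the root,
    otherwise it moves to a neighbor chosen uniformly at random.
    """
    prob = {v: 0.0 for v in adj}
    prob[root] = 1.0
    for _ in range(n_iter):
        nxt = {v: 0.0 for v in adj}
        nxt[root] = alpha               # restart mass (total mass is 1)
        for v, mass in prob.items():
            if adj[v]:
                share = (1 - alpha) * mass / len(adj[v])
                for w in adj[v]:
                    nxt[w] += share
            else:
                # Isolated node: send its walking mass back to the root.
                nxt[root] += (1 - alpha) * mass
        prob = nxt
    return prob
```

The entry of the returned dictionary at node $j$ plays the role of the score of the pair (root, $j$). Some formulations symmetrize the score by adding the probability obtained in the opposite direction.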
SimRank (SR): it is defined through a random walk on the collaboration graph: the expected value of $\iota^{l}$, where $l$ is a random variable giving the time at which random walks started from $i$ and $j$ first meet [7]:
$$s^{SR}_{i,j} = \iota \, \frac{\sum_{a \in \Gamma(i)} \sum_{b \in \Gamma(j)} s^{SR}_{a,b}}{k_i \, k_j},$$
where $s^{SR}_{i,i} = 1$ and $\iota \in [0, 1]$ is the decay factor [10].
Graph Distance (GD): it follows the small-world property and seeks to recommend pairs of nodes that are closely connected:
$$s^{GD}_{i,j} = -\ell_{ij},$$
where $\ell_{ij}$ is the shortest path distance between $i$ and $j$. This way, the negative sign ensures that closer pairs $(i, j)$ receive higher scores $s^{GD}_{i,j}$.
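Both global scores can be illustrated with a few lines of Python. SimRank is computed here by fixed-point iteration of its recursive definition, starting from the identity, and Graph Distance by a breadth-first search. The default decay value, the number of iterations, the handling of unreachable pairs, and the function names are assumptions of this sketch.

```python
from collections import deque


def simrank(adj, decay=0.8, n_iter=10):
    """Iterate the recursive SimRank definition for all pairs of nodes."""
    nodes = list(adj)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(n_iter):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                elif not adj[a] or not adj[b]:
                    new[(a, b)] = 0.0   # isolated nodes are similar to nobody
                else:
                    total = sum(sim[(x, y)] for x in adj[a] for y in adj[b])
                    new[(a, b)] = decay * total / (len(adj[a]) * len(adj[b]))
        sim = new
    return sim


def graph_distance_score(adj, i, j):
    """Negative shortest-path length between i and j (BFS).

    Assumption: unreachable pairs receive minus infinity.
    """
    dist = {i: 0}
    queue = deque([i])
    while queue:
        v = queue.popleft()
        if v == j:
            return -dist[v]
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return float("-inf")
```

In a star with two leaves, the two leaves obtain a SimRank score equal to the decay factor because their only neighbors coincide, while their Graph Distance score is -2.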