Introduction

Networks are effective descriptions of complex systems in society and nature1, 2, with entities denoted as nodes and relations as links. The organization of real networks evolves under the influence of both regular patterns and irregular factors; in principle, only the former can be modeled with physical methodologies. A central concern in complex networks is link prediction, which contributes to explaining these models and revealing the hidden driving mechanisms. Link prediction has therefore drawn considerable attention from various fields covering biology, sociology and others3,4,5,6. For example, in protein-protein interaction experiments in cells, only strong relations between proteins can be detected, owing to the limited precision of equipment. Measuring every interaction between all protein pairs is prohibitive, since experimental costs rise sharply with the number of proteins7, 8; an appropriate alternative is to evaluate the likelihood of potential relations and specifically test the non-existing relations with high likelihood. Also, in social contexts, two persons who share many common friends or attributes are likely to build a friendship in the near future, which can be exploited to uncover lost friends or predict future ones9,10,11. Further extensive applications include personalized recommendation in e-commerce12, 13 and aircraft route planning14, etc.

The crux of link prediction is to evaluate the likelihood of potential edges: the potential edges are ranked in descending order of likelihood, and those at the top of the ranking list are predicted as underlying or future edges15, 16. Similarity-based approaches, which equate likelihood with similarity, are the most common frameworks; they argue that prospective edges tend to exist between similar nodes. To achieve this, traditional attribute-based methods measure the likelihood of a link by how many common features (e.g. common hobbies, ages, tastes, geographical locations) the two endpoints share17. Much research on social networks has shown that pervasive homophily promotes ties between similar people18, 19. However, these methods suffer from inaccessible and unreliable node information caused by privacy policies in real scenarios20. Fortunately, the development of complex network theory provides an alternative path in which only the network's topological structure is required, with no private information at all. According to the structural information they use when evaluating the similarity between nodes, structure-based methods can be classified into three categories: local, global and quasi-global methods. Local similarity is mainly based on common neighbors, such as the well-known Common Neighbor (CN) index that counts the number of common neighbor nodes21, and the Adamic-Adar (AA) and Resource Allocation (RA) indices that suppress the contribution of large-degree neighbors22, 23. For large networks, Cui et al. proposed a fast algorithm for calculating the number of common neighbors24. Global similarity emphasizes the global topology of the network, such as the Katz index that counts all of the paths between two nodes25.
Quasi-global similarity is a good trade-off between local and global similarity methods, such as the Local Path (LP) index that considers only the short paths of the Katz index23, and the Local Random Walk (LRW) index that restricts random walks to a local area26. Beyond that, some algorithms based on maximum likelihood and other elaborate models have been proposed. Clauset et al. proposed a Hierarchical Structure Model that performs well in hierarchical networks by using a dendrogram27. Lü et al. proposed a Structural Perturbation Method that approximates the observed network through repeated random perturbations; this method outperforms state-of-the-art methods in accuracy and robustness28. In terms of information theory, Xu et al. proposed the Path Entropy index, which considers the information entropies of shortest paths and penalizes long paths29. Tan et al. proposed a Mutual Information (MI) method with high accuracy and reasonable computation time, which considers the features of common neighbors and denotes the likelihood of a link as the conditional self-information of the link existing between a node pair given their common neighbors30. Zhu et al. generalized the MI index into the Neighbor Set Information index, which is applicable to multiple structural features and further enhances accuracy31.

Real networks are highly dynamic, with nodes and edges coming and going32. However, the aforementioned algorithms uniformly ignore the temporal aspects of real networks, in particular the trend of nodes: nodes that were active yesterday and contacted numerous neighbors may be unpopular today. Inspired by this, we propose the hypothesis that the emergence of future links is determined not only by the existing network structure, but also by the popularity of the endpoints. Figure 1 illustrates the effect of popularity. The red node 10 will enter the network and connect with one of the existing nodes. In Fig. 1(a), according to static analysis, node 10 prefers to connect with the large-degree node 1. However, once the birth time of each edge is given, as in Fig. 1(b), we can see that node 3 is highly popular, because it alone attracts edges at the present time \(t_2\). In practice, the fresh edge is more likely to occur between node 10 and the active node 3 at the next period \(t_3\). To comply with this scenario, and unlike previous works that predict potential links mostly on static networks, we propose a popularity based structural perturbation method (PBSPM) and its fast algorithm, which integrate the popularity of nodes with the observed network topology to predict future edges. Experimental results on real-world networks show that the proposed methods outperform traditional approaches in accuracy and robustness.

Figure 1

Illustration of popularity. The fresh link and node 10 will be added to the existing network at the next time \(t_3\). In panel (a), the attractiveness of nodes is determined by static features; according to preferential attachment, node 10 prefers to connect with node 1, which has the largest degree. In panel (b), temporal effects are considered: the currently popular node 3 may become attractive and connect with node 10 at time \(t_3\).

Results

Popularity metrics

The definition of popularity is related to the temporal trend of nodes, which can be obtained through statistics and analysis of relevant historical information. Consider two nodes with the same degree: one connected with its neighbors at an early stage and formed no connections later, while the other developed most of its connections at a late stage. Intuitively, the latter node is more likely to attract fresh edges in the near future. Given this, a straightforward way to evaluate the popularity of a node is to count the edges it has recently attracted.

Given an undirected and unweighted network G(V, E), where V and E represent the sets of nodes and links respectively, each link carries a time-stamp that records its entering time. In this work, multi-links and self-loops are not allowed. Let \({k}_{i}(t)\) denote the degree of node i at time t. In the next time span T, node i attracts \({\rm{\Delta }}{k}_{i}(t,T)\) new edges,

$${\rm{\Delta }}{k}_{i}(t,T)={k}_{i}(t+T)-{k}_{i}(t)\mathrm{.}$$
(1)

Note that \({\rm{\Delta }}{k}_{i}(t,T)\) in Eq. (1), determined by both t and T, cannot reflect the relative popularity of node i: even when large-degree nodes become inactive, they still attract more fresh edges than small-degree nodes because of the preferential attachment mechanism. To solve this issue, for a dataset spanning from \({t}_{a}\) to \({t}_{c}\), we divide its edges into a fresh set and an old set according to a boundary \({t}_{b}\in ({t}_{a},{t}_{c})\). If an edge was constructed in \(({t}_{a},{t}_{b})\), it belongs to the old set; otherwise it belongs to the fresh set. The fractions of old and fresh edges are denoted as \({p}_{older}\) and \({p}_{fresher}\); \({p}_{fresher}\) can be interpreted as the observation length of historical information. The popularity of node i is then

$${s}_{i}=\frac{{\rm{\Delta }}{k}_{i}({t}_{b},{t}_{c}-{t}_{b})}{{\rm{\Delta }}{k}_{i}({t}_{a},{t}_{c}-{t}_{a})}=\frac{{k}_{i,fresher}}{{k}_{i,all}},$$
(2)

where \({k}_{i,all}\) and \({k}_{i,fresher}\) denote the whole degree and the fresher degree of node i. Equation (2) overcomes the drawback of simply counting new edges and quantifies popularity on a normalized scale. Clearly, if all links of node i lie in the fresh set, \({s}_{i}=1\); if they all lie in the old set, node i has become dormant and \({s}_{i}=0\). Therefore \({s}_{i}\in [0,1]\), and a higher \({s}_{i}\) means a higher popularity.
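As a concrete illustration, the popularity of Eq. (2) can be computed directly from a timestamped edge list. This is a minimal sketch under our own naming, assuming edges arrive as (u, v, timestamp) triples and the boundary \({t}_{b}\) is given:

```python
def popularity(edges, t_b):
    """Eq. (2): s_i = k_{i,fresher} / k_{i,all}, the fraction of a node's
    edges created after the boundary time t_b (the fresh set)."""
    k_all, k_fresh = {}, {}
    for u, v, t in edges:
        for node in (u, v):
            k_all[node] = k_all.get(node, 0) + 1
            if t > t_b:  # the edge belongs to the fresh set
                k_fresh[node] = k_fresh.get(node, 0) + 1
    return {node: k_fresh.get(node, 0) / k_all[node] for node in k_all}
```

A node whose links all fall after \({t}_{b}\) gets \({s}_{i}=1\); a dormant node gets 0, matching the boundary cases above.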

Popularity based structural perturbation method

In this section, we hypothesize that the observed network is determined by some latent attractors (e.g. similar hobbies, ages, gender, location) that independently influence the structural properties. For an attractor \({x}_{k}={[{x}_{k,1},{x}_{k,2},\ldots ,{x}_{k,n}]}^{T}\), \({x}_{k,i}\) represents the attractiveness of node i for the latent attractor \({x}_{k}\). Inspired by the configuration model, the probability \({p}_{ij}\) that an edge exists between two nodes i and j is proportional to \({x}_{k,i}{x}_{k,j}\). Supposing that there are m kinds of attractors, the probability \({p}_{ij}\) is defined as the weighted influence of each attractor,

$${p}_{ij}=\sum _{k=1}^{m}{w}_{k}{x}_{k,i}{x}_{k,j},$$
(3)

where \({w}_{k}\) is a tunable parameter that balances the relative influence of each attractor \({x}_{k}\). The problem is how to seek the optimal \({w}_{k}\) and \({x}_{k,i}\) that make \({p}_{ij}\) best approximate \({a}_{ij}\). Considering a network G with adjacency matrix \(A={({a}_{ij})}_{n\times n}\), a special case is \({p}_{ij}=1\) if \({a}_{ij}=1\), and \({p}_{ij}=0\) otherwise. For optimal \({w}_{k}\) and \({x}_{k}\),

$${A}_{p}={({p}_{ij})}_{n\times n}=\sum _{k=1}^{m}{w}_{k}{x}_{k}{x}_{k}^{T}.$$
(4)

If m = n in Eq. (4), where n is the size of the network, then Eq. (4) can be understood as an eigendecomposition, with \({w}_{k}\) and \({x}_{k}\) representing eigenvalues and eigenvectors respectively. In practice, networks contain many random connections; Lü et al. proposed the structural perturbation method (SPM), which reduces the influence of this randomness28. In SPM, a small fraction \({p}^{H}\) of edges \({\rm{\Delta }}A\) is removed from the network, and the adjacency matrix \({A}^{R}\) of the remaining network is decomposed into

$${A}^{R}=\sum _{k=1}^{n}{\lambda }_{k}{x}_{k}{x}_{k}^{T},$$
(5)

where \({\lambda }_{k}\) and \({x}_{k}\) are the eigenvalues and eigenvectors of \({A}^{R}\), with \(|{x}_{k}|=1\). We can then use \({A}^{R}\) to estimate A with

$$\tilde{A}=\sum _{k=1}^{n}({\lambda }_{k}+{\rm{\Delta }}{\lambda }_{k})\,{x}_{k}{x}_{k}^{T},$$
(6)

where \({\rm{\Delta }}{\lambda }_{k}\approx \frac{{x}_{k}^{T}{\rm{\Delta }}A{x}_{k}}{{x}_{k}^{T}{x}_{k}}\) is the coupling influence of \({x}_{k}\) on \({\lambda }_{k}\). Ã is in fact a special case of \({A}_{p}\): \(({\lambda }_{k}+{\rm{\Delta }}{\lambda }_{k})\) and the elements of the eigenvector \({x}_{k}\) represent the attractor weight and the attractiveness for attractor \({x}_{k}\), respectively.
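The perturbation of Eqs. (5)-(6) can be sketched as follows. This is an illustrative implementation under our own naming, assuming a dense symmetric adjacency matrix; since `numpy.linalg.eigh` returns normalized eigenvectors, the denominator \({x}_{k}^{T}{x}_{k}\) is 1:

```python
import numpy as np

def spm_perturbation(A, p_H=0.1, rng=None):
    """One SPM realization: remove a random fraction p_H of edges (dA),
    eigendecompose the remainder A^R (Eq. 5), and rebuild the perturbed
    matrix of Eq. (6) with corrected eigenvalues lambda_k + dlambda_k."""
    rng = rng or np.random.default_rng()
    iu, ju = np.triu_indices_from(A, k=1)
    edges = np.flatnonzero(A[iu, ju])            # indices of existing edges
    removed = rng.choice(edges, size=max(1, int(p_H * edges.size)),
                         replace=False)
    dA = np.zeros_like(A, dtype=float)
    dA[iu[removed], ju[removed]] = 1.0
    dA += dA.T                                   # keep dA symmetric
    lam, X = np.linalg.eigh(A - dA)              # Eq. (5) on A^R = A - dA
    dlam = np.einsum('ik,ij,jk->k', X, dA, X)    # x_k^T dA x_k for each k
    return (X * (lam + dlam)) @ X.T              # Eq. (6)
```

The result is a symmetric score matrix; in the full method it is averaged over several independent perturbations.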

As argued above, the ability of node i to attract new edges is determined by both the latent attractors and its current popularity. To better match practice, an advanced attractiveness \({x}_{k,i}^{^{\prime} }\) is proposed as

$${x}_{k,i}^{^{\prime} }={x}_{k,i}(1+\alpha {s}_{i}),$$
(7)

where α controls the strength of temporal popularity. Equation (7), a combination of static attractiveness and popularity, captures both the static features and the temporal information of the evolving pattern. Substituting \({x}_{k}\) with \({x}_{k}^{^{\prime} }\) in Eq. (6) to predict future links yields

$$\mathop{{A}^{^{\prime} }}\limits^{ \sim }=\sum _{k=1}^{n}({\lambda }_{k}+{\rm{\Delta }}{\lambda }_{k}){x}_{k}^{^{\prime} }{x}_{k}^{^{\prime} T}.$$
(8)

Equation (5) degenerates into Eq. (4) when only m < n attractors are retained. Supposing that \(|{\lambda }_{1}| > |{\lambda }_{2}| > \ldots > |{\lambda }_{n}|\), we substitute \({w}_{k}\) and \({x}_{k}\) in Eq. (4) with \({\lambda }_{k}\) and \({x}_{k}\) from Eq. (5). Applying the same transition as from Eq. (5) to Eq. (8), we obtain

$$A^{\prime} ={({p}_{ij})}_{n\times n}=\sum _{k=1}^{m}({\lambda }_{k}+{\rm{\Delta }}{\lambda }_{k}){x}_{k}^{^{\prime} }{x}_{k}^{^{\prime} {{\rm T}}},$$
(9)

which reduces to Eq. (8) when m = n. In the following experiments, we first measure the performance of Eq. (8), and then show that the computational complexity can be reduced by using only a few eigenvalues and eigenvectors, that is, \(m\ll n\) in Eq. (9).
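The truncated reconstruction of Eq. (9) can be sketched with a sparse eigensolver. This is an assumed implementation (our naming), taking a dense symmetric \({A}^{R}\), the removed-edge matrix \({\rm{\Delta }}A\), the popularity vector s and the parameters α and m:

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def fast_pbspm_matrix(AR, dA, s, alpha, m):
    """Eq. (9) with m << n: keep only the m eigenpairs of A^R with the
    largest magnitude, boost each eigenvector entry by (1 + alpha * s_i)
    (Eq. 7), and rebuild the score matrix in O(m n^2) rather than O(n^3)."""
    lam, X = eigsh(AR, k=m, which='LM')              # top-m by magnitude
    dlam = np.einsum('ik,ij,jk->k', X, dA, X)        # first-order corrections
    Xp = X * (1.0 + alpha * np.asarray(s))[:, None]  # advanced x'_k (Eq. 7)
    return (Xp * (lam + dlam)) @ Xp.T                # Eq. (9)
```

With α = 0 and \({\rm{\Delta }}A=0\) this is just the rank-m eigendecomposition of \({A}^{R}\), which is how it reduces to the SPM baseline.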

Experiments on real networks

The proposed method PBSPM, integrating the attractiveness \({x}_{k,i}\) and popularity \({s}_{i}\), reduces to the original SPM when α = 0. As α increases, PBSPM increasingly prefers to predict links between popular nodes. Figure 2 shows the performance of PBSPM in contrast to SPM (α = 0) under different \({p}_{fresher}\). The precision values tend to stabilize, or reach their best, when α balances static attractiveness against popularity. Clearly, the optimal value of α varies across networks. For Hypertext, Infec and UcSoci, future links are highly likely to exist between active nodes. For the Haggle dataset, however, the temporal trend of nodes is less obvious; its precision curve peaks at α = 2, in contrast to the other three networks, whose curves eventually stabilize as α increases. Overall, for \(\alpha \in [3,5]\), PBSPM outperforms SPM on all four networks. Moreover, given different lengths of historical information \({p}_{fresher}\), all the curves show some level of superiority in precision, suggesting a general and robust range of \({p}_{fresher}\). Choosing the optimal value is difficult; it should follow the principle of balancing the length of historical information against the future information (probe set). With a 10% probe set in this experiment, \({p}_{fresher}=0.1\) is the balanced option, because the corresponding curves all show large improvements.

Figure 2

Precision versus α obtained by PBSPM. The experiments are performed on a 90% training set and 10% probe set. Each data point is averaged over 10 independent realizations. The values of \({p}_{fresher}\) and α corresponding to the optimal precision reported in Table 1 vary across networks: 0.05 and 9 for Hypertext, 0.05 and 2 for Haggle, 0.10 and 11 for Infec, 0.10 and 7 for UcSoci.

Reducing the number of eigenvectors lowers the computational complexity. To this end, we propose the fast PBSPM, which takes into account only a few eigenvectors with large eigenvalues; these reflect the backbone structure of the network well33. In practical networks, a huge gap exists in the eigenvalue spectrum, and eigenvectors with large eigenvalues play more important roles than those with small ones. Taking Hypertext as an example, Fig. 3(a) plots the precision for various m in Eq. (9). Compared with SPM, the curve shows significant improvement and peaks at m = 1, confirming the effectiveness of Eq. (9). Figure 3(b) gives the differences between adjacent eigenvalues \({g}_{m}=|{\lambda }_{m}|-|{\lambda }_{m+1}|\) \((|{\lambda }_{1}| > |{\lambda }_{2}| > \ldots > |{\lambda }_{n}|)\). The distinct \({g}_{1}\) indicates a huge gap between \(|{\lambda }_{1}|\) and \(|{\lambda }_{2}|\), while the other gaps (m ≥ 2) are all close to 0, suggesting that this huge gap \({g}_{1}\) induces the decline of precision when m > 1. We therefore choose m = 1 as the optimal value for Hypertext; analogously, the values for Haggle, Infec and UcSoci are determined as m = 2, 19 and 2 respectively, after which \({g}_{m}\) approximately approaches 0. In consequence, it only requires O(n^2) time to calculate the top-m eigenvalues and corresponding eigenvectors, and the reconstruction of the similarity matrix (Eq. 9) needs O(m × n^2) time. To reduce randomness, the fast PBSPM repeats the random perturbation ten times and averages the similarity matrices, taking O(10 × (mn^2 + n^2)) time. Hence, with \(m\ll n\) and increasing size n, the time complexity of fast PBSPM is O(n^2), in contrast with the O(n^3) of PBSPM and SPM, whose decomposition and reconstruction consume O(n^3) time. For comparison, the time complexity is O(n^2) for local similarity based methods such as CN, RA and AA, and O(n^3) for Katz and SRW.

Figure 3

Precision versus m and gap \({g}_{m}=|{\lambda }_{m}|-|{\lambda }_{m+1}|\) for Hypertext. \({\lambda }_{m}\) is an eigenvalue of the adjacency matrix \({A}^{T}\). Panel (a) shows the performance of Eq. (9) for \(m\in [1,30]\) with fixed \({p}_{fresher}=0.05\) and α = 9. Each data point is obtained over ten simulations. Panel (b) shows the difference \({g}_{m}\) between \(|{\lambda }_{m}|\) and \(|{\lambda }_{m+1}|\). \({g}_{1}=34.37\) is distinct, and the others are all close to 0.
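The gap rule used above to pick m can be automated. A sketch under our own naming, with a tolerance threshold of our choosing: keep all eigenvalues up to the last gap that is still sizeable relative to \(|{\lambda }_{1}|\).

```python
import numpy as np

def choose_m(A, tol=1e-2):
    """Sort |lambda| in descending order and return m just past the last
    gap g_m = |lambda_m| - |lambda_{m+1}| exceeding tol * |lambda_1|;
    beyond that point the remaining gaps are close to 0."""
    lam = np.sort(np.abs(np.linalg.eigvalsh(A)))[::-1]
    gaps = lam[:-1] - lam[1:]
    big = np.flatnonzero(gaps > tol * lam[0])
    return int(big[-1]) + 1 if big.size else 1
```

On a spectrum such as Hypertext's (one dominant gap \({g}_{1}\), the rest near 0) this returns m = 1.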

Tables 1 and 2 list the precision values and computation times of the different link prediction algorithms. The proposed methods achieve remarkable improvements: up to 84.84% for Hypertext, 28.42% for Haggle, 6.19% for Infec and 95.97% for UcSoci. Nevertheless, PBSPM suffers from a huge computational cost that limits its extensive application. Fast PBSPM, a good trade-off between computational complexity and accuracy, has a reasonable computational cost and high accuracy. Owing to the repeated steps in the experimental procedure, the fast algorithm still consumes more time than some traditional predictors with the same time complexity. Additionally, the attractors ignored by the fast algorithm carry secondary information that may either improve accuracy as useful signal or deteriorate performance as network noise; hence its precision fluctuates slightly around that of PBSPM. In general, the proposed methods are highly robust, performing well on disparate networks, while the other baselines give poor predictions on some of them. Apart from precision, we also quantify the physical difference in the age of links selected by the various methods, expressed as the average popularity of endpoints \(\overline{s}=\sum \frac{{s}_{i}+{s}_{j}}{2|{E}^{P}|}\) over the edges \({e}_{ij}\) selected by a given predictor. According to Table 3, the links selected by the proposed methods are much younger than the others; that is, the potential links prefer to form between active nodes in the near future.

Table 1 Precision of different methods for four networks. All the results are calculated under the optimal cases by adjusting parameters if any.
Table 2 Computation time of different methods for four networks.
Table 3 Average age of links selected by predictors.

In the following, we focus on SPM and PBSPM to explore the underlying reasons for the improvements. To figure out the effect of popularity, four typical nodes from the training set of Hypertext are chosen: the large-degree nodes 1 and 3 (\({k}_{1,training}=78\), \({s}_{1}=0.051\); \({k}_{3,training}=93\), \({s}_{3}=0.032\)) and the active nodes 91 and 113 (\({k}_{91,training}=29\), \({s}_{91}=0.289\); \({k}_{113,training}=14\), \({s}_{113}=1\)). We analyse their predicted connections and the corresponding variation of attractiveness. Figure 4 plots the future links attached to the selected nodes as predicted by SPM and PBSPM with \({p}_{fresher}=0.05\) and α = 9. The principal eigenvector \({x}_{1}\) of \({A}^{R}\) and the advanced \({x}_{1}^{^{\prime} }\) under the optimal case are then calculated to quantify the attractiveness for the most weighted attractor. In addition, the principal eigenvector also characterizes the ranking, i.e. the importance, of nodes34, 35. In Fig. 4(a), nodes 1 and 3 (\({x}_{1,1}=0.1715\), \({x}_{1,3}=0.1899\)), with high importance, are much more attractive than nodes 91 and 113 (\({x}_{1,91}=0.0648\), \({x}_{1,113}=0.0329\)); in particular, node 113, with the lowest importance, has no predicted connections at all. In contrast, high popularity boosts the active nodes (\({x}_{1,91}^{^{\prime} }=0.1158\), \({x}_{1,113}^{^{\prime} }=0.1923\)) and produces the burst of links connecting to them in Fig. 4(b), most notably for the most active node 113. In summary, PBSPM emphasizes nodes with higher popularity so that they attract many more links, whereas inactive nodes, despite their importance, are weakened and lose connections.

Figure 4

Predicted connections of the large-degree nodes 1 and 3 and the active nodes 91 and 113 in Hypertext. Only the selected nodes and their neighbors are plotted, and the connections shown are a subset of the top-\(|{E}^{P}|\) predicted links. Panel (a) shows the connections predicted by SPM: nodes 1 and 3 are much more attractive than node 91, and node 113 is absent because it has no connections. Panel (b) shows the connections predicted by PBSPM: the active nodes 91 and 113 attract numerous nodes, giving rise to an explosive growth of edges.

The figures above help in understanding how popularity affects several typical nodes; note, however, that it is reasonable to infer that the improvements result from the advanced attractiveness of all nodes. As argued above, the principal eigenvector denotes the attractiveness for the most weighted attractor. Because \(({\lambda }_{1}+{\rm{\Delta }}{\lambda }_{1})({x}_{1}{{x}_{1}}^{T})\) occupies the main body of Ã, neglecting the constant term \(({\lambda }_{1}+{\rm{\Delta }}{\lambda }_{1})\), the similarity \({\tilde{a}}_{i,j}\) is mainly determined by the eigenvector \({x}_{1}\). The Pearson correlation coefficient (CC) between the principal eigenvector and the degree in the probe set, which holistically reflects the extent to which the attractiveness \({x}_{1,i}\) coincides with the real degree increment \({k}_{i,probe}\), is computed as

$$cc=\frac{1}{n}\sum _{i=1}^{n}(\frac{{x}_{\mathrm{1,}i}-{\overline{x}}_{\mathrm{1,}i}}{{\delta }_{{x}_{\mathrm{1,}i}}})(\frac{{k}_{i,probe}-{\overline{k}}_{i,probe}}{{\delta }_{{k}_{i,probe}}}),$$
(10)

where \({\overline{x}}_{\mathrm{1,}i}\) and \({\overline{k}}_{i,probe}\) are the means of \({x}_{\mathrm{1,}i}\) and \({k}_{i,probe}\). The CC between the advanced \({x}_{1}^{^{\prime} }\) and the degree in the probe set is obtained analogously. Table 4 lists the variation of CC after adding popularity, together with the coupling influence \({\rm{\Delta }}{\lambda }_{1}\) averaged over ten independent perturbations. The positive ΔCC of the four networks suggests that the attractiveness of some nodes is corrected to match their future degree increments, and the positive \({\rm{\Delta }}{\lambda }_{1}\) further strengthens the improvement in correlation. As a result, popular nodes are assigned more connecting opportunities, which promotes precision.
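Up to its normalization constant, Eq. (10) is the standard Pearson correlation, so the ΔCC of Table 4 can be reproduced with `numpy.corrcoef`. The function names here are ours:

```python
import numpy as np

def cc(x1, k_probe):
    """Pearson correlation between the attractiveness x_{1,i} and the
    degree increment k_{i,probe} in the probe set (Eq. 10)."""
    return float(np.corrcoef(x1, k_probe)[0, 1])

def delta_cc(x1, x1_adv, k_probe):
    """Delta CC of Table 4: how much the popularity-boosted vector x'_1
    improves the correlation with future degree increments."""
    return cc(x1_adv, k_probe) - cc(x1, k_probe)
```

A positive `delta_cc` means the popularity boost moved the attractiveness vector closer to the real degree increments.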

Table 4 Variation of correlation coefficient ΔCC and coupling influence Δλ 1.

Finally, to demonstrate the feasibility of the proposed methods in practical applications, we compare the fast PBSPM with time series (TS) based methods, which have been effectively applied to temporal link prediction36,37,38, on continuous temporal networks. Each dataset is divided into \({T}_{N}\) snapshots \(({G}_{1},{G}_{2},\ldots ,{G}_{{T}_{N}})\) with time-period length \({P}_{length}=7\) days. Setting a time window T = 5, we use the graph series \(({G}_{t},{G}_{t+1},\ldots ,{G}_{t+T-1})\) and its reduced static graph \({G}_{t \sim t+T-1}\) to predict the links that will occur in \({G}_{t+T}\) \((t=1,2,\ldots ,{T}_{N}-T)\). The popularity of each node is then calculated as:

$${s}_{i}=\frac{{k}_{i,{G}_{t+T-1}}}{{k}_{i,{G}_{t \sim t+T-1}}}=\frac{{k}_{i,fresher}}{{k}_{i,all}}.$$
(11)

During the evolution, certain mechanisms drive the network organization regularly and the structural features remain relatively stable. Hence, we obtain the optimal α and m from the networks observed in the time period 1 ≤ t ≤ 6 (\({G}_{1 \sim 5}\) as the training set, \({G}_{6}\) as the probe set) and apply them to the subsequent predictions. Figure 5 shows the precision at successive time steps and the average accuracy of the different methods. For LKMLR, although the fast PBSPM falls behind occasionally, its average value shows a slight advantage in precision (Fig. 5(a),(c)). For Wiki, the fast PBSPM not only wins at every time step, but also achieves much higher average accuracy than the TS based methods (Fig. 5(b),(d)). These results demonstrate that the fast PBSPM has promising applications in evolving networks.
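The sliding-window popularity of Eq. (11) can be sketched as follows. This assumes each snapshot is an edge list over integer node ids 0..n-1 (our own data layout), and for simplicity counts repeated edges across snapshots rather than collapsing them into unique links:

```python
import numpy as np

def window_popularity(snapshots, n):
    """Eq. (11): s_i = k_{i,fresher} / k_{i,all}, where the fresh degree is
    counted in the newest snapshot G_{t+T-1} and the total degree over the
    whole window G_{t~t+T-1}."""
    k_all = np.zeros(n)
    k_fresh = np.zeros(n)
    last = len(snapshots) - 1
    for idx, edges in enumerate(snapshots):
        for u, v in edges:
            k_all[u] += 1
            k_all[v] += 1
            if idx == last:                # the newest snapshot
                k_fresh[u] += 1
                k_fresh[v] += 1
    # nodes unseen in the window get popularity 0
    return np.divide(k_fresh, k_all, out=np.zeros(n), where=k_all > 0)
```

As the window slides forward one snapshot per step, the popularity vector is recomputed before each prediction.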

Figure 5

Precision at different time steps and their average values. \({G}_{t \sim t+T-1}({G}_{t},{G}_{t+1},\ldots ,{G}_{t+T-1})\) and \({G}_{t+T}\) play the role of the training set \({E}^{T}\) and probe set \({E}^{P}\). Panels (a) and (b) show the precision values at different time steps for LKMLR and Wiki; the red curves are obtained by the fast PBSPM with α = 2 and 5 and m = 2 and 2, respectively. The other results are obtained under the optimal cases of the different forecasting models. Panels (c) and (d) give the average precision values of the different methods for the two networks.

Discussion

In this paper, we propose PBSPM and its fast algorithm to predict future links. The main contribution is to investigate the popularity (activeness) of nodes in real-world evolving networks and apply it to link prediction. Unlike previous works that model temporal effects with complex theories, we infer the popularity of each node from its recently attracted edges. We then hypothesize that the future network is influenced by both the existing structure and the popularity of nodes. By introducing popularity into the perturbation method, PBSPM can distinguish active from inactive historically important nodes, and prefers to predict new edges attached to active nodes. A fast method is further proposed to avoid the high computational complexity. Experimental results on real-world evolving networks reveal that, compared with traditional methods, the proposed methods achieve better precision and robustness. Further experiments uncover the underlying reasons for these improvements.

Admittedly, the performance of the proposed methods largely depends on the popularity of each node. In other words, the popularity based methods are more applicable to networks with obvious temporal effects, where the popularity metric can effectively quantify each node's popularity. Hence, an important open issue is that improving the popularity metric would further enhance the precision of link prediction, which we leave for future work. Since our work mainly explores prediction in evolving networks, it has possible applications in traffic prediction, airline control, recommendation in social networks, and so on.

Methods

Experimental procedures

To predict the future links of evolving networks with PBSPM, there are five detailed steps to follow:

Step 1: We first divide the network into the training set \({E}^{T}\) and the probe set \({E}^{P}\) based on the birth time of each edge; the corresponding adjacency matrices are denoted by \({A}^{T}\) and \({A}^{P}\).

Step 2: The training set is further divided into the old set and the fresh set to calculate the popularity via Eq. (2) or Eq. (11).

Step 3: We perturb the training set by randomly removing a small fraction \({p}^{H}=0.1\) of edges \({\rm{\Delta }}A\); obviously, \({A}^{T}={A}^{R}+{\rm{\Delta }}A\).

Step 4: We decompose the matrix \({A}^{R}\) and obtain \(\tilde{A^{\prime} }\) via Eq. (7) and Eq. (8).

Step 5: We repeat steps 3 and 4 ten times, i.e. we implement ten perturbations and obtain the averaged \(\langle \tilde{A^{\prime} }\rangle \), where the score \(\langle {\tilde{a^{\prime} }}_{ij}\rangle \) represents the likelihood of a link between nodes i and j. Finally, the non-observed edges with the top-\(|{E}^{P}|\) scores are chosen as potential future edges.
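Steps 3-5 above can be sketched end to end as follows (an illustrative dense-matrix implementation under our own naming; observed entries are masked so that only non-observed pairs are ranked):

```python
import numpy as np

def pbspm_scores(A_T, s, alpha=3.0, p_H=0.1, runs=10, seed=0):
    """Average the popularity-boosted perturbed matrix (Eq. 8) over
    `runs` independent random perturbations (Steps 3-5), then mask the
    observed links and the diagonal."""
    rng = np.random.default_rng(seed)
    n = A_T.shape[0]
    boost = 1.0 + alpha * np.asarray(s)
    iu, ju = np.triu_indices(n, k=1)
    edge_idx = np.flatnonzero(A_T[iu, ju])
    n_rm = max(1, int(p_H * edge_idx.size))
    acc = np.zeros((n, n))
    for _ in range(runs):
        removed = rng.choice(edge_idx, size=n_rm, replace=False)  # Step 3
        dA = np.zeros((n, n))
        dA[iu[removed], ju[removed]] = 1.0
        dA += dA.T
        lam, X = np.linalg.eigh(A_T - dA)            # Step 4: decompose A^R
        dlam = np.einsum('ik,ij,jk->k', X, dA, X)
        Xp = X * boost[:, None]                      # Eq. (7)
        acc += (Xp * (lam + dlam)) @ Xp.T            # Eq. (8)
    scores = acc / runs                              # Step 5: average
    scores[A_T > 0] = -np.inf                        # rank non-observed pairs only
    np.fill_diagonal(scores, -np.inf)
    return scores
```

The top-\(|{E}^{P}|\) entries of the returned matrix are then taken as the predicted future edges.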

Data description

In this work, six datasets are considered to evaluate the performance of algorithms. (1) Hypertext 2009 (Hypertext): a network of face-to-face contacts of the attendees of the ACM Hypertext 2009 conference from June 30 to July 1, 2009, including 113 nodes and 2196 unique links39. (2) Haggle: an undirected network representing contacts between people measured by carried wireless devices40, including 188 nodes and 1947 unique links. The time span is 4 days. (3) Infectious (Infec): a network describing the face-to-face behavior of people during the exhibition INFECTIOUS: STAY AWAY in 200939, including 301 nodes and 2145 unique links. The time span is 8 hours. (4) UC Irvine messages (UcSoci): a directed network of messages between the users of an online community of students from the University of California, Irvine41, including 1692 nodes and 13037 unique links. The dataset spans from April 15 to October 25, 2004. (5) Linux kernel mailing list replies (LKMLR): a communication network of the Linux kernel mailing list. The data considered in experiments is from January to June, 2013, including 2907 nodes and 78955 links. (6) Wikipedia elections (Wiki): a network of users from the English Wikipedia that voted for and against each other in admin elections. The data considered in experiments spans from October, 2005 to April, 2006, including 2309 nodes and 23707 links42.

To simplify the problem, we ignore the direction and weight of links and remove isolated nodes. The networks are divided into a historical training set and a future probe set solely according to the timestamps attached to the edges.

Evaluation metric

AUC (Area Under the receiver operating characteristic Curve) and Precision are two standard metrics used to evaluate link prediction algorithms43, 44. The former randomly compares the score of a missing link with that of a non-existent link. The latter focuses on the links with the top-L scores. When dealing with highly skewed datasets, precision gives a more informative picture of an algorithm's performance45. Hence, we choose the Precision index as the metric to evaluate the accuracy of the proposed method and the baselines. Precision is defined as the ratio of accurately predicted links to all selected links: if we select the top-L links among all ranked non-observed links and only \({L}_{r}\) of them are found in the probe set \({E}^{P}\), then the accuracy of the predictor follows

$${Precision}=\frac{{{L}}_{{r}}}{{L}}.$$
(12)

In our experiments, we set \({L}=|{{E}}^{{P}}|\) and count how many of the top-\(|{{E}}^{{P}}|\) links really exist in the probe set.
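Eq. (12) can be computed as follows; a sketch assuming the candidate scores are stored in a dict keyed by node pair (our own layout):

```python
def precision_at_L(scores, probe_edges, L):
    """Eq. (12): the fraction of the top-L ranked non-observed pairs that
    actually appear in the probe set E^P."""
    top = sorted(scores, key=scores.get, reverse=True)[:L]
    # frozensets make the comparison orientation-independent (undirected)
    probe = {frozenset(e) for e in probe_edges}
    hits = sum(1 for pair in top if frozenset(pair) in probe)
    return hits / L
```

For instance, if the two highest-scored pairs contain one true probe edge, the precision is 0.5.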

Baselines

For comparison, we briefly introduce five traditional algorithms, covering all three kinds of structural similarity, together with the time series based approach.

  1. (1)

    Common Neighbors (CN), related to the concept of triadic closure, is the most well-known method. It assumes that two target nodes tend to connect with each other if the new connection would produce many more triangles in the graph.

    $${s}_{xy}^{CN}=|{\rm{\Gamma }}(x)\cap {\rm{\Gamma }}(y)|,$$
    (13)

    where Γ(x) is the set of neighbors of node x and \(|{\rm{\Gamma }}(x)\cap {\rm{\Gamma }}(y)|\) represents the set of common neighbors of x and y.

  2. (2)

    Adamic-Adar (AA), derived from CN, restricts the contributions of common neighbors by introducing a penalty factor, namely the reciprocal of the logarithm of their degrees.

    $${s}_{xy}^{AA}=\sum _{z\in {\rm{\Gamma }}(x)\cap {\rm{\Gamma }}(y)}\frac{1}{\mathrm{log}\,{k}_{z}},$$
    (14)

    where k z denotes the degree of common neighbor z.

  3. (3)

    Resource Allocation (RA), motivated by resource transfer between two unconnected nodes, views each common neighbor as an intermediary whose transfer capability equals the reciprocal of its degree.

    $${s}_{xy}^{RA}=\sum _{z\in {\rm{\Gamma }}(x)\cap {\rm{\Gamma }}(y)}\frac{1}{{k}_{z}}.$$
    (15)
  4. (4)

    Katz index, based on the global information of the network, counts all the paths connecting two endpoints while exponentially weakening the contributions of longer paths:

    $${s}_{xy}^{Katz}=\sum _{l=1}^{\infty }{\alpha }^{l}\cdot |path{s}_{x,y}^{\langle l\rangle }|.$$
    (16)

    When \(|\alpha | < 1/{\lambda }_{{\rm{\max }}}\), it can be rewritten as:

    $$S={(I-\alpha \cdot A)}^{-1}-I,$$
    (17)

    where I is the identity matrix, α > 0 is a tunable parameter, and λ max is the largest eigenvalue of the adjacency matrix A.

  5. (5)

    Superposed Random Walk (SRW) sums local random walks within t steps, weighted by the degrees of the two endpoints, to emphasize the local properties of real networks26.

    $${s}_{xy}^{SRW}(t)=\sum _{\tau =1}^{t}[{q}_{x}{\pi }_{xy}(\tau )+{q}_{y}{\pi }_{xy}(\tau )],$$
    (18)

    where \({q}_{x}=\frac{{k}_{x}}{2|E|}\) denotes the initial resource distribution and \({\pi }_{xy}(\tau )\) represents the transfer probability from x to y in τ steps.

  6. (6)

    Time series based methods explore the evolution of topological metrics to predict future links37. They follow the steps below:

Step 1: Choose a static-structure method (e.g. CN, RA, Katz, etc.);

Step 2: Establish the time series by calculating the similarity between unconnected nodes in each time period;

Step 3: Compute the final score of unconnected node pairs with a forecasting model (e.g. Moving Average, Linear Regression, Simple Exponential Smoothing, etc.);

Step 4: Evaluate the algorithms against the links that appear in the next time period.
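For reference, the local and global baselines above have compact matrix forms. A sketch with dense numpy arrays and our own function names (degrees with k ≤ 1 get zero AA weight, since 1/log k is undefined there):

```python
import numpy as np

def cn_index(A):
    """Eq. (13): number of common neighbours for every node pair."""
    return A @ A

def aa_index(A):
    """Eq. (14): common neighbours weighted by 1 / log(k_z)."""
    k = A.sum(axis=0)
    w = np.where(k > 1, 1.0 / np.log(np.maximum(k, 2)), 0.0)
    return A @ np.diag(w) @ A

def ra_index(A):
    """Eq. (15): common neighbours weighted by 1 / k_z."""
    k = A.sum(axis=0)
    w = np.where(k > 0, 1.0 / np.maximum(k, 1), 0.0)
    return A @ np.diag(w) @ A

def katz_index(A, alpha):
    """Eq. (17): S = (I - alpha A)^{-1} - I, valid for alpha < 1/lambda_max."""
    n = A.shape[0]
    return np.linalg.inv(np.eye(n) - alpha * A) - np.eye(n)
```

On the path graph 0-1-2, for example, nodes 0 and 2 share the single common neighbour 1 of degree 2, so RA gives them score 1/2 and AA gives 1/log 2.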