Introduction

Link prediction aims at revealing missing or potential relations between data entries in large data sets that are subject to dynamic change and uncertainty. With the rapid development of studies on complex networks, the problem of link prediction has received extensive attention from researchers in various fields including physics, mathematics, computer science and social science1,2. On the one hand, the investigation of complex networks may provide novel insights into real-world linking patterns, which is helpful for link prediction. On the other hand, by trying to predict missing or potential links with high accuracy, link prediction may also provide us with a deeper understanding of the organization of real-world networks, which is a longstanding challenge in many branches of science3. From a practical point of view, link prediction is also a fundamental issue for many modern applications in disparate fields, such as recommending friends in online social networks, recommending products on e-commerce web sites4,5,6, and uncovering missing parts of social and biological networks7,8,9.

In link prediction, the central issue is how to predict the missing links efficiently and accurately. Of these two aspects, accuracy is the more fundamental from a theoretical point of view when computational cost is not a major concern. In the literature this is termed the "predictability problem", i.e., to what extent can the missing or potential links be predicted, which to our knowledge was first proposed and investigated in10. Although the exact predictability can never be obtained due to the complexity and uncertainty intrinsic to real-world networks, the practical predictability depends on the extent to which the available information can be exploited and the noise eliminated. In principle, information is represented by the regularity or consistency in the network structure when the network changes dynamically due to, say, some evolution or perturbation process. Noise, on the other hand, is represented by irregularity or inconsistency, which is always unavoidable in real data sets. From this point of view, to improve link predictability, one should first find a proper description of the information and noise in the available data sets and then design an effective method accordingly.

In the context of complex networks, which are usually described as graphs, the global topological information lies in their adjacency matrices, in which nonzero entries denote links between the corresponding nodes, while zero entries denote missing or nonexistent links. The adjacency matrix thus provides the fundamental information for link prediction, and many existing link prediction algorithms are in fact based on some manipulation of the adjacency matrix or one of its variants. For example, the CN index11 of a node pair is the inner product of their corresponding rows of the adjacency matrix, and the RA index11 is obtained analogously from a weighted adjacency matrix whose column sums are normalized to 1. These are local indices with explicit physical meanings, while the Katz index12 uses the global information obtained from a power series of the adjacency matrix. Motivated by these observations, new link prediction methods based on different kinds of manipulations of the adjacency matrix have been developed. Since the network structure is well reflected by the eigenvectors of its adjacency matrix13, it is natural to make use of them in link prediction. Following the idea that the consistency in network structure can be represented by the eigenvectors of the adjacency matrix, the authors of10 proposed a structural perturbation method (SPM) in which a new matrix is constructed for prediction by perturbing the eigenvalues of the adjacency matrix while fixing the eigenvectors. On the other hand, since real network data are always subject to interfering noise, it is necessary to eliminate the noise to uncover the unobserved links. In14, by introducing the robust PCA technique, the authors developed a global information-based link prediction algorithm that decomposes the adjacency matrix into a low-rank backbone structure and a sparse noise matrix.

Although the two aforementioned works achieve considerable improvements over many existing methods, they still have limitations. An important issue is that each of them focuses on only one of the two complementary aspects of accurate link prediction: information completion and noise reduction. Motivated by this observation, here we propose a novel link prediction method that combines these two complementary approaches. That is, we first exploit the information in the adjacency matrix by perturbing its eigenvalues; then, using the decomposition technique from robust PCA, we remove the sparse noise from the resulting matrix to reveal the backbone matrix used for the final prediction. Furthermore, for weighted directed networks, which may have asymmetric adjacency matrices, we extract the symmetric part by introducing a new decomposition technique so that the original SPM remains applicable. Thus, our new method extends to weighted directed networks. Experimental studies indicate that the new method achieves considerable improvement over each individual method on most of the networks.

Results

Consider a weighted directed network G(V, E, A), where \(V=\{{v}_{1},\ldots ,{v}_{n}\}\) and \(E\subseteq V\times V\) are the sets of nodes and links, respectively, and \(A={[{a}_{ij}]}_{i,j=1}^{n}\) is the weighted directed adjacency matrix such that \({a}_{ij} > 0\) if the link \(({v}_{j},{v}_{i})\in E\) and \({a}_{ij}=0\) otherwise. If the network is undirected and unweighted, then A is a real symmetric matrix, i.e., \({a}_{ij}={a}_{ji}\) for each i, j = 1, 2, …, n. Otherwise, A may be an asymmetric matrix that has complex eigenvalues and may not be diagonalizable; in such cases the original SPM is not applicable. To test the accuracy of our new prediction algorithm, we randomly divide the link set E into a training set \({E}^{T}\) and a probe set \({E}^{P}\). Here \({E}^{T}\) is treated as the known or observed information, while \({E}^{P}\) is considered as the set of missing links. Obviously, \({E}^{T}\cap {E}^{P}=\varnothing \) and \({E}^{T}\cup {E}^{P}=E\). Our purpose is to predict the links in \({E}^{P}\) based on the information in \({E}^{T}\).
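As an illustration, the random division of the links can be sketched as follows. This is a minimal sketch in Python/NumPy; the function name and interface are our own, and this and the later sketches assume a single `import numpy as np`.

```python
import numpy as np

def split_links(A, p_train=0.9, rng=None):
    """Randomly split the links of A into a training matrix and a
    probe matrix, keeping a fraction p_train for training."""
    rng = np.random.default_rng(rng)
    rows, cols = np.nonzero(A)                 # indices of all observed links
    n_links = rows.size
    keep = np.zeros(n_links, dtype=bool)
    keep[rng.choice(n_links, size=int(round(p_train * n_links)),
                    replace=False)] = True
    A_train = np.zeros_like(A)
    A_probe = np.zeros_like(A)
    # for undirected networks, one would sample from the upper
    # triangle only and mirror the result to keep A symmetric
    A_train[rows[keep], cols[keep]] = A[rows[keep], cols[keep]]
    A_probe[rows[~keep], cols[~keep]] = A[rows[~keep], cols[~keep]]
    return A_train, A_probe
```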

In the experiment, we first apply the perturbation procedure to \({A}^{T}\), the adjacency matrix of G(V, \({E}^{T}\)). To implement the perturbation, we randomly select a fraction of links from \({E}^{T}\) to constitute the perturbation set \({\rm{\Delta }}{E}^{T}\), whose adjacency matrix \({\rm{\Delta }}{A}^{T}\) acts as the perturbation to \({A}^{T}-{\rm{\Delta }}{A}^{T}\). As in10, in each prediction the final perturbed matrix \({\tilde{A}}_{T}\) is obtained by averaging over 10 independent selections of \({\rm{\Delta }}{E}^{T}\). Then we apply the robust PCA technique to \({\tilde{A}}_{T}\) to obtain the backbone structure \({\tilde{A}}_{B}\). In this procedure, the parameter λ for each network is chosen as the optimal value found in the simulations. In particular, based on some preliminary simulations, we found that in almost all cases the optimal value of λ falls in the interval (0, 0.4). Thus in the experiment, we evaluate values of λ from 0.01 to 0.39 in steps of 0.01 and select the value with the best performance. In the final prediction, we use the matrix \({\tilde{A}}_{B}\) as the score matrix, whose entries corresponding to unconnected node pairs are interpreted as their connection likelihoods.
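The averaging step can be sketched as follows, continuing the sketch above. The helper stage1_perturb is the hypothetical Stage-1 routine sketched in the Methods section, and the 10% perturbation fraction is our own illustrative default, not a value prescribed by the text.

```python
def averaged_perturbation(A_T, frac=0.1, n_rounds=10, rng=None):
    """Average the Stage-1 perturbed matrix over n_rounds independent
    selections of the perturbation set Delta E^T (cf. ref. 10)."""
    rng = np.random.default_rng(rng)
    acc = np.zeros(A_T.shape)
    for _ in range(n_rounds):
        # reuse split_links: the held-out part plays the role of Delta A^T
        A_rest, dA = split_links(A_T, p_train=1.0 - frac, rng=rng)
        acc += stage1_perturb(A_rest, dA)   # see Methods for a sketch
    return acc / n_rounds
```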

To measure the prediction accuracy of the algorithm, we use two standard metrics, precision15 and AUC16, which are defined as follows.

Given the ranking of the non-observed links according to their scores in descending order, if \({L}_{r}\) of the top-L links, which are taken as the predicted ones, appear in the probe set, then precision = \({L}_{r}/L\). In the experiment, we take L as the number of links in the probe set.
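A direct implementation reads as follows (a sketch; scores is assumed to map each non-observed node pair to its score, and probe_set is the set of missing links):

```python
def precision_at_L(scores, probe_set, L):
    """precision = L_r / L, where L_r counts how many of the top-L
    ranked non-observed links appear in the probe set."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    hits = sum(1 for pair in ranked[:L] if pair in probe_set)
    return hits / L
```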

Given the ranking of the non-observed links, AUC is the probability that a randomly chosen missing link has a higher score than a randomly chosen nonexistent link. In the algorithmic implementation, instead of computing its exact value, AUC is usually approximated by comparing the scores of node pairs randomly chosen from the set of missing links and the set of nonexistent links. If among n independent comparisons, there are n′ times that the missing link has the higher score and n″ times that the two scores are equal, then

$$AUC=\frac{{n}^{{\prime} }+0.5{n}^{{\prime\prime}}}{n}$$
(1)
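The sampled approximation of Eq. (1) can be sketched as follows (our own interface; missing and nonexistent are assumed to be lists of node pairs, and scores a mapping from pairs to scores):

```python
def auc_sampled(scores, missing, nonexistent, n=10000, rng=None):
    """Approximate AUC by n independent score comparisons, Eq. (1)."""
    rng = np.random.default_rng(rng)
    s_miss = np.array([scores[missing[i]]
                       for i in rng.integers(len(missing), size=n)])
    s_non = np.array([scores[nonexistent[i]]
                      for i in rng.integers(len(nonexistent), size=n)])
    n_higher = np.sum(s_miss > s_non)
    n_equal = np.sum(s_miss == s_non)
    return (n_higher + 0.5 * n_equal) / n
```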

In the experiment, we test our algorithm on both undirected and directed networks from disparate fields, and we compare the results of our method with the original individual methods SPM and Low Rank (LR), as well as some typical local indices including Common Neighbors (CN)11, Adamic-Adar (AA)17 and Resource Allocation (RA)11. The prediction results in precision and AUC are presented in Tables 1 and 2 for undirected networks and in Tables 3 and 4 for directed networks, respectively, where for each network the result is obtained by averaging over 100 independent runs. Since the original SPM applies only to symmetric matrices, the results for directed networks are obtained by the generalized method described in Stage 1 of our method.

Table 1 The average prediction precision over 100 independent runs on 19 real undirected networks. The training set contains 90% of the total links.
Table 2 The average prediction AUC over 100 independent runs on 19 real undirected networks.
Table 3 The average prediction precision over 100 independent runs on 17 real directed networks.
Table 4 The average prediction AUC over 100 independent runs on 17 real directed networks.

From the experimental results, it can be seen that compared to each of the original individual methods, the new method achieves considerable improvements both in precision and in AUC. It gives the best results on most of the networks, especially the directed ones, and even when it is not the best, it is still close to the best in most cases. On some sparse networks such as CollegeMsg, the precision of our method is more than twice that of the others. It is also remarkable that in some cases, even when one or both of the individual methods performs very poorly, our method still works very well. This implies that the new method is not only very competitive but also very robust, which in our opinion is at least partly due to the complementary nature of SPM and LR and justifies the significance of their combination.

Discussion

In this work, by generalizing and combining two previous link prediction methods of complementary nature, we propose a new algorithm for link prediction via perturbation and decomposition of the adjacency matrices of networks. By exploiting the useful information and eliminating the interfering noise simultaneously, the new method takes advantage of both previous methods and, in our experimental studies, robustly achieves considerable improvements on most real-world networks from disparate fields.

Beyond its competitive performance, the significance of this work, as well as of the previous works it builds upon, is that they open up a new direction for link prediction by directly manipulating the adjacency matrix as a whole. Compared to classical methods such as similarity indices, link prediction algorithms in this direction can take advantage of the rich tools and results available in matrix theory, and we expect many new works in this direction in the near future.

Since the new method is a combination of two existing methods, its computational complexity is roughly the sum of theirs. At Stage 1, the time-consuming part is the computation of the eigenvalues and eigenvectors of the adjacency matrix, whose complexity is \(O({n}^{3})\)18. At Stage 2, it is the singular value decomposition (SVD) of the perturbed matrix, whose complexity is \(O(k{n}^{2})\)14, where k is the estimated rank of the matrix. In summary, the computational complexity of the proposed algorithm is \(O({n}^{3})\).

Despite these advantages, the new method also faces some difficulties, the major one being how to determine the parameter λ. As with many other parameterized methods, the parameter plays an important role in the performance of the algorithm, yet there is no explicit rule to determine its optimal value in advance. In the experiment we can choose an optimal value for each network based on empirical simulations, but this is not realistic for real-world applications. In that case, as in some learning algorithms, we can only obtain an estimated value of λ from the training data. That is, we can divide the observed links into a training set and a probe set, as we do in the experiment, and then obtain the "optimal" value of λ from the simulations. Although this value is generally not the true optimum of λ, it should be at least an acceptable approximation, especially when the network is large enough. Moreover, to be safer, when possible we can repeat this process many times and then determine an optimal value of λ based on the distribution of the outputs.
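Such an internal estimation procedure might look as follows. This is a sketch composing the hypothetical helpers sketched elsewhere in this paper (split_links, averaged_perturbation, precision_at_L, and the robust_pca routine of the Methods section); the candidate-scoring step is our own abbreviation.

```python
def estimate_lambda(A_T, lambdas=np.arange(0.01, 0.40, 0.01),
                    n_repeats=5, rng=None):
    """Pick lambda by an internal train/probe split of the observed
    links, using precision on the held-out part as the criterion."""
    rng = np.random.default_rng(rng)
    best_lam, best_prec = None, -np.inf
    for lam in lambdas:
        prec = 0.0
        for _ in range(n_repeats):
            A_in, A_val = split_links(A_T, p_train=0.9, rng=rng)
            A_tilde = averaged_perturbation(A_in, rng=rng)
            A_B, _ = robust_pca(A_tilde, lam)
            # score only the pairs not linked in A_in
            probe = set(zip(*np.nonzero(A_val)))
            cand = {(i, j): A_B[i, j]
                    for i, j in zip(*np.nonzero(A_in == 0)) if i != j}
            prec += precision_at_L(cand, probe, L=len(probe))
        prec /= n_repeats
        if prec > best_prec:
            best_lam, best_prec = lam, prec
    return best_lam
```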

Methods

Consider a weighted directed network G(V, E, A), where V, E, and A are defined as before. In the following, we treat undirected networks as special cases of directed networks and present the method in terms of directed networks in general.

Our method consists of two stages, the perturbation stage and the decomposition stage, which are described as follows.

Stage 1: Structural perturbation

This stage can be divided into the following three steps.

Step 1: Preprocessing. Given the weighted adjacency matrix A of a directed network, to apply the structural perturbation method we first decompose A into the following two parts:

$$A={A}^{S}+{A}^{AS}$$
(2)

where \({A}^{S}=(A+{A}^{{\rm{T}}})/2\) is the symmetric part of A and \({A}^{AS}=(A-{A}^{{\rm{T}}})/2\) is the antisymmetric part, with \({A}^{{\rm{T}}}\) being the transpose of A. Intuitively, we interpret the entries of \({A}^{S}\) as the average linking tendency between the corresponding nodes, and the entries of \({A}^{AS}\) as the bias of this tendency between the two links in opposite directions.
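In code, this preprocessing step is immediate (a sketch, continuing the NumPy sketches above):

```python
def sym_asym_split(A):
    """Decompose A into symmetric and antisymmetric parts, Eq. (2)."""
    A_S = (A + A.T) / 2     # average linking tendency
    A_AS = (A - A.T) / 2    # directional bias
    return A_S, A_AS        # A_S + A_AS reconstructs A exactly
```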

Step 2: Structural perturbation. Apply the structural perturbation method to \({A}^{S}\) as described in10, which we briefly recall for completeness. Since \({A}^{S}\) is a symmetric matrix, it can be written as

$${A}^{S}=\sum _{k=1}^{n}\,{\lambda }_{k}{x}_{k}{x}_{k}^{{\rm{T}}}$$
(3)

where \({\lambda }_{k}\) and \({x}_{k}\), k = 1, 2, …, n, are the eigenvalues and the corresponding eigenvectors of \({A}^{S}\), respectively.

After a perturbation \({\rm{\Delta }}{A}^{S}\) is applied to \({A}^{S}\), the eigenvalues of \({A}^{S}+{\rm{\Delta }}{A}^{S}\) change to \({\lambda }_{k}+{\rm{\Delta }}{\lambda }_{k}\) and the corresponding eigenvectors to \({x}_{k}+{\rm{\Delta }}{x}_{k}\). Thus we have

$$({A}^{S}+{\rm{\Delta }}{A}^{S})\,({x}_{k}+{\rm{\Delta }}{x}_{k})=({\lambda }_{k}+{\rm{\Delta }}{\lambda }_{k})\,({x}_{k}+{\rm{\Delta }}{x}_{k}),\quad k=1,\ldots ,n{\rm{.}}$$
(4)

Left-multiplying both sides of Eq. (4) by \({x}_{k}^{{\rm{T}}}\) and neglecting the second-order terms \({\rm{\Delta }}{\lambda }_{k}{x}_{k}^{{\rm{T}}}{\rm{\Delta }}{x}_{k}\) and \({x}_{k}^{{\rm{T}}}{\rm{\Delta }}{A}^{S}{\rm{\Delta }}{x}_{k}\), we obtain

$${\rm{\Delta }}{\lambda }_{k}\approx \frac{{x}_{k}^{{\rm{T}}}{\rm{\Delta }}{A}^{S}{x}_{k}}{{x}_{k}^{{\rm{T}}}{x}_{k}},\quad k=1,\ldots ,n{\rm{.}}$$
(5)

Fixing the eigenvectors and using the perturbed eigenvalues, we can obtain the perturbed matrix,

$${\tilde{A}}^{S}=\sum _{k=1}^{n}\,({\lambda }_{k}+{\rm{\Delta }}{\lambda }_{k}){x}_{k}{x}_{k}^{{\rm{T}}}.$$
(6)
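A compact NumPy sketch of this step follows; since eigh returns orthonormal eigenvectors, the denominators \({x}_{k}^{{\rm{T}}}{x}_{k}\) in Eq. (5) equal 1:

```python
def spm_perturb(A_S, dA_S):
    """Perturb the eigenvalues of a symmetric matrix while keeping
    its eigenvectors fixed, Eqs. (3)-(6)."""
    lam, X = np.linalg.eigh(A_S)                 # columns of X are x_k
    dlam = np.einsum('ik,ij,jk->k', X, dA_S, X)  # Eq. (5), first order
    return (X * (lam + dlam)) @ X.T              # Eq. (6)
```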

Step 3: Postprocessing. Add the antisymmetric part of A to \({\tilde{A}}^{S}\) to get the final perturbed matrix:

$$\tilde{A}={\tilde{A}}^{S}+{A}^{AS}.$$
(7)

Here we keep the antisymmetric part of A fixed, based on the assumption that the difference between the linking tendencies in the two opposite directions does not change appreciably during the perturbation process.
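Putting the three steps together, Stage 1 can be sketched as:

```python
def stage1_perturb(A, dA):
    """Stage 1: split off the symmetric part (Eq. 2), perturb it
    (Eqs. 3-6), and restore the antisymmetric part (Eq. 7)."""
    A_S, A_AS = sym_asym_split(A)
    dA_S, _ = sym_asym_split(dA)
    return spm_perturb(A_S, dA_S) + A_AS
```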

Stage 2: Noise reduction

In this stage we remove the supposed noise from the perturbed matrix \(\tilde{A}\) and recover the backbone structure for prediction. For this purpose we introduce robust principal component analysis (robust PCA) as in14, which we briefly present below for completeness. If the network is highly organized, its backbone structure should have a low-rank property, while the noise should be sparse. Thus we decompose the matrix \(\tilde{A}\) into two parts: a low-rank part \({\tilde{A}}_{B}\) as the backbone structure and a sparse part \({\tilde{A}}_{N}\) as the noise. Mathematically, this can be formulated as the following optimization problem:

$$\mathop{{\rm{\min }}}\limits_{{\tilde{A}}_{B},{\tilde{A}}_{N}}\,{\rm{rank}}({\tilde{A}}_{B})+\gamma {\parallel {\tilde{A}}_{N}\parallel }_{0}\quad {\rm{subject}}\,{\rm{to}}\quad \tilde{A}={\tilde{A}}_{B}+{\tilde{A}}_{N},$$
(8)

where rank(·) denotes the rank of a matrix, \({\parallel \cdot \parallel }_{0}\) is the \({l}_{0}\)-norm of a matrix, and γ is a parameter that balances the two terms. Since this is a highly nonconvex optimization problem that is hard to solve, we use an approximate solution based on robust PCA19, namely the solution of the following convex optimization problem:

$$\mathop{{\rm{\min }}}\limits_{{\tilde{A}}_{B},{\tilde{A}}_{N}}\,{\parallel {\tilde{A}}_{B}\parallel }_{* }+\lambda {\parallel {\tilde{A}}_{N}\parallel }_{1}\quad {\rm{subject}}\,{\rm{to}}\quad \tilde{A}={\tilde{A}}_{B}+{\tilde{A}}_{N},$$
(9)

where \({\parallel \cdot \parallel }_{* }\) is the nuclear norm of a matrix, \({\parallel \cdot \parallel }_{1}\) is the \({l}_{1}\)-norm, and λ is a parameter that balances the two terms.
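Problem (9) is commonly solved by the inexact augmented Lagrange multiplier method19. A minimal sketch is given below; the initialization, step-size schedule and stopping tolerance are common defaults from the robust PCA literature rather than values prescribed here.

```python
def robust_pca(M, lam, tol=1e-7, max_iter=500):
    """Split M into a low-rank part L and a sparse part S by solving
    min ||L||_* + lam * ||S||_1  s.t.  M = L + S  (inexact ALM)."""
    def shrink(X, tau):                            # soft thresholding
        return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

    norm_two = np.linalg.norm(M, 2)                # spectral norm
    Y = M / max(norm_two, np.abs(M).max() / lam)   # dual variable
    mu = 1.25 / norm_two
    mu_max, rho = mu * 1e7, 1.5
    S = np.zeros_like(M, dtype=float)
    for _ in range(max_iter):
        # low-rank update: singular value thresholding
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * shrink(sig, 1.0 / mu)) @ Vt
        # sparse update: elementwise soft thresholding
        S = shrink(M - L + Y / mu, lam / mu)
        Z = M - L - S                              # constraint residual
        Y = Y + mu * Z                             # dual ascent
        mu = min(mu * rho, mu_max)
        if np.linalg.norm(Z) / max(np.linalg.norm(M), 1e-12) < tol:
            break
    return L, S
```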

Finally, we predict the missing or potential links based on the approximated backbone matrix \({\tilde{A}}_{B}\), as in14. That is, we take the entries of \({\tilde{A}}_{B}\) corresponding to the unobserved links as their similarity scores and sort them in descending order. Then we select the top-L links as our prediction result, where L is determined by other considerations.

Data

For the experimental studies, we collected 36 real-world networks from disparate fields, including 19 undirected networks and 17 directed ones. These networks were selected to cover a wide range of properties, including different sizes, average degrees, clustering coefficients, and heterogeneity indices. The basic topological features of the networks are summarized in Tables 5 and 6, respectively. A brief description of these networks is as follows:

Table 5 The basic topological features of 19 real undirected networks.
Table 6 The basic topological features of 17 real directed networks.

Undirected networks

  • Karate20: The network of friendships among the members of a karate club.

  • Football21: The network of American football games between Division IA colleges during the Fall 2000 regular season.

  • Dolphin22: A social network of bottlenose dolphins living in Doubtful Sound (New Zealand).

  • Everglades23: The food web of the Everglades Graminoids during the wet season.

  • WorldTrade24: The network of world trade in miscellaneous manufactures of metal among 80 countries in 1994.

  • Macaca25: The cortical network of the macaque monkey.

  • FWM26: The food web in Mangrove Estuary during the wet season.

  • BUP27: A network of political blogs.

  • WorldAdj28: An adjacency network of common adjectives and nouns in the novel David Copperfield by Charles Dickens.

  • FWF29: The network of predator-prey interactions in Florida Bay in the dry season.

  • Jazz30: A network of jazz musicians, in which a link connects two musicians who have played in the same band at least once.

  • Contact31: A contact network between people, measured by wireless devices carried by the participants.

  • C. elegans32: The neural network of the nematode worm C. elegans, compiled by D. Watts and S. Strogatz.

  • USAir23: The network of the US air transportation system, which contains 332 airports and 2126 airlines.

  • INF27: A network of face-to-face contacts at an exhibition.

  • Metabolic32: The metabolic network of the nematode worm C. elegans.

  • Email33: The network of email interchanges between members of the University Rovira i Virgili.

  • PB34: The network of hyperlinks between weblogs on US politics.

  • Yeast35: The protein-protein interaction network of yeast.

Directed networks

  • Bison36: A directed network of dominance relations among American bison, observed in 1972 on the National Bison Range in Moiese (Montana).

  • Cattle37: A directed network of dominance behaviours observed between dairy cattle at the Iberia Livestock Experiment Station in Jeanerette, Louisiana.

  • Football38: The network of American football games between Division IA colleges during the Fall 2000 regular season.

  • Gramdry39: The network of predator-prey interactions in Everglades Graminoids in the dry season.

  • Gramwet39: The network of predator-prey interactions in Everglades Graminoids in the wet season.

  • Cypdry39: The network of predator-prey interactions in Cypress in the dry season.

  • Cypwet39: The network of predator-prey interactions in Cypress in the wet season.

  • Mangdry39: The network of predator-prey interactions in Mangrove Estuary in the dry season.

  • Mangwet39: The network of predator-prey interactions in Mangrove Estuary in the wet season.

  • Polbooks40: A network of books about US politics published around the time of the 2004 presidential election and sold by the online bookseller Amazon.com.

  • Baydry39: The network of predator-prey interactions in Florida Bay in the dry season.

  • Baywet39: The network of predator-prey interactions in Florida Bay in the wet season.

  • C. elegans32: The neural network of the nematode worm C. elegans, compiled by D. Watts and S. Strogatz.

  • USAir23: The network of the US air transportation system, which contains 332 airports and 2126 airlines.

  • Email-Eu41: A network generated from email data of a large European research institution.

  • PB34: A directed network of hyperlinks between weblogs on US politics, recorded in 2005 by Adamic and Glance.

  • CollegeMsg42: A network of private messages sent on an online social network at the University of California, Irvine.

Benchmarks

For comparison, we introduce three benchmark similarity indices based on structural information: Common Neighbors (CN), Adamic-Adar (AA), and Resource Allocation (RA).

  • Common Neighbors (CN). It assumes that two nodes are more likely to be connected if they have more common neighbors, so the number of common neighbors can be regarded as a measure of their similarity. Let Γ(x) denote the set of neighbors of x and |Q| the cardinality of a set Q; then CN is defined as

    $${S}_{xy}^{CN}=| {\rm{\Gamma }}(x)\cap {\rm{\Gamma }}(y)| .$$
    (10)
  • Adamic-Adar (AA). It can be seen as a refinement of the CN index that assigns different weights to the nodes in the set of common neighbors: the larger the degree of a common neighbor, the less weight it contributes. AA is calculated as

    $${S}_{xy}^{AA}=\sum _{z\in {\rm{\Gamma }}(x)\cap {\rm{\Gamma }}(y)}\,\frac{1}{{\rm{log}}(| {\rm{\Gamma }}(z)| )}{\rm{.}}$$
    (11)
  • Resource Allocation (RA). It is similar to AA but motivated by the resource allocation process on complex networks; it models the transmission of resources between two unconnected nodes through their neighboring nodes. RA is written as

$${S}_{xy}^{RA}=\sum _{z\in {\rm{\Gamma }}(x)\cap {\rm{\Gamma }}(y)}\,\frac{1}{| {\rm{\Gamma }}(z)| }{\rm{.}}$$
(12)
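For a binary undirected network with adjacency matrix A (zero diagonal), all three scores can be computed with a few matrix products. This is a sketch continuing the NumPy sketches above; nodes of degree 0 or 1 cannot be common neighbors, so their weights are set to zero only to avoid spurious infinities in the products.

```python
def benchmark_indices(A):
    """Score matrices for CN (Eq. 10), AA (Eq. 11) and RA (Eq. 12)
    from a binary symmetric adjacency matrix A."""
    deg = A.sum(axis=1).astype(float)          # |Gamma(z)| for each node z
    cn = A @ A                                 # common-neighbor counts
    with np.errstate(divide='ignore'):
        w_aa = np.where(deg > 1, 1.0 / np.log(deg), 0.0)
        w_ra = np.where(deg > 0, 1.0 / deg, 0.0)
    aa = (A * w_aa) @ A                        # sum of 1/log|Gamma(z)|
    ra = (A * w_ra) @ A                        # sum of 1/|Gamma(z)|
    return cn, aa, ra
```

For directed networks, in- and out-neighborhoods would have to be distinguished; the sketch above assumes the undirected case.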