Link prediction in complex networks via matrix perturbation and decomposition

Link prediction in complex networks aims at predicting missing links from available datasets, which are always incomplete and subject to interfering noise. To obtain high prediction accuracy, one should try to complete the missing information and, at the same time, eliminate the interfering noise from the datasets. Given that the global topological information of a network can be extracted from its adjacency matrix, the missing information can be completed by generalizing the observed structure according to some consistency rule, and the noise can be eliminated by proper decomposition techniques. Recently, two related works have been done, each focusing on one of these two aspects, and both obtained satisfactory performance. Motivated by their complementary nature, here we propose a new link prediction method that combines them. Moreover, by extracting the symmetric part of the adjacency matrix, we generalize the original perturbation method and extend our new method to weighted directed networks. Experimental studies on real networks from disparate fields indicate that the prediction accuracy of our method is considerably improved compared with either individual method as well as with some other typical local indices.

Link prediction aims at revealing missing or potential relations between data entries from large volumes of data which are subject to dynamical changes and uncertainty. With the rapid development of studies on complex networks, the problem of link prediction has received extensive attention from researchers in various fields, including physics, mathematics, computer science, and social science 1,2 . On the one hand, the investigation of complex networks may provide novel insights into real-world linking patterns, which are helpful for link prediction. On the other hand, by trying to predict missing or potential links with high accuracy, link prediction may also provide us with a deeper understanding of the organization of real-world networks, which is a longstanding challenge in many branches of science 3 . From a practical point of view, link prediction is also a fundamental issue for many modern applications in disparate fields, such as recommending friends in online social networks, recommending products on e-commerce web sites [4][5][6] , and uncovering missing parts of social and biological networks [7][8][9] .
In link prediction, the central issue is how to predict the missing links effectively and accurately. Of the two aspects, the accuracy problem is the more fundamental from a theoretical point of view when computational cost is not a major concern. In the literature this is termed the "predictability problem", i.e., to what extent can the missing or potential links be predicted, which to our knowledge was first proposed and investigated in 10 . Although the exact predictability can never be obtained, due to the complexity and uncertainty intrinsic to real-world networks, the practical predictability depends on the extent to which the available information can be exploited and the noise eliminated. In principle, information is represented by regularity or consistency in the network structure as the network changes dynamically due to, say, evolution or perturbation processes. Noise, on the other hand, is represented by irregularity or inconsistency, which is unavoidable in real data sets. From this point of view, to improve link predictability, one should first find a proper description of the information and noise in the available data and then design an effective method accordingly.
In the context of complex networks, which are usually described as graphs, the global topological information lies in the adjacency matrix, in which nonzero entries denote links between the corresponding nodes, while zero entries denote missing or nonexistent links. The adjacency matrix thus provides the fundamental information for link prediction, and many existing link prediction algorithms are in fact based on some kind of manipulation of the adjacency matrix or its variants. For example, the CN index 11 of a node pair is the inner product of their corresponding rows of the adjacency matrix, and the RA index 11 is the analogous product for a weighted adjacency matrix whose column sums are normalized to 1. These are local indices with explicit physical meanings, while the Katz index 12 uses the global information obtained from a power series of the adjacency matrix. Motivated by these observations, new link prediction methods based on different kinds of manipulations of the adjacency matrix have been developed. Since the network structure is well reflected by the eigenvectors of its adjacency matrix 13 , it is natural to make use of them in link prediction. Following the idea that the consistency in network structure can be represented by the eigenvectors of the adjacency matrix, the authors of 10 proposed a structural perturbation method (SPM) in which a new matrix is constructed for prediction by perturbing the eigenvalues of the adjacency matrix while fixing the eigenvectors. On the other hand, since real network data are always subject to interfering noise, it is necessary to eliminate the noise to uncover the unobserved links. In 14 , by introducing the robust PCA technique, the authors developed a novel global-information-based link prediction algorithm that decomposes the adjacency matrix into a low-rank backbone structure and a sparse noise matrix.
Although the two aforementioned works have achieved considerable improvements over many existing methods, they still have some limitations. An important issue is that each of them focuses on only one of the two complementary aspects of accurate link prediction: information completion and noise reduction. Motivated by these observations, here we propose a novel link prediction method that combines these two complementary methods. That is, we first exploit the information in the adjacency matrix by perturbing its eigenvalues; then, using the decomposition technique from robust PCA, we remove the sparse noise from the resulting matrix to reveal the backbone matrix for the final prediction. Furthermore, for weighted directed networks, which may have asymmetric adjacency matrices, we extract the symmetric part by introducing a new decomposition technique so that the original SPM remains applicable. Thus, our new method can be extended to weighted directed networks. Experimental studies indicate that this new method achieves considerable improvement over each individual method on most of the networks.

Consider a weighted directed network G(V, E, A), where V is the node set, E the link set, and A = (a_ij) is the weighted directed adjacency matrix such that a_ij > 0 if (v_j, v_i) ∈ E and a_ij = 0 otherwise. If the network is undirected and unweighted, then A is a real symmetric matrix, i.e., a_ij = a_ji for all i, j = 1, 2, …, n. Otherwise, A may be an asymmetric matrix that has complex eigenvalues and is not diagonalizable; in such cases the original SPM is not applicable. To test the accuracy of our new prediction algorithm, we randomly divide the link set E into a training set E^T and a probe set E^P. Here E^T is treated as the known or observed information, while E^P is considered the set of missing links. Obviously, E = E^T ∪ E^P and E^T ∩ E^P = ∅. Our purpose is to predict the links in E^P based on the information in E^T.
In the experiment, we first apply the perturbation procedure to A^T, the adjacency matrix of G(V, E^T). To implement the perturbation, we randomly select a fraction of links from E^T to constitute the perturbation set ΔE^T, whose adjacency matrix ΔA^T acts as the perturbation to A^T − ΔA^T. As in 10 , in each prediction the final perturbed matrix Ã^T is obtained by averaging over 10 independent selections of ΔE^T. Then we apply the robust PCA technique to Ã^T to obtain the backbone structure Ã_B. In this procedure, the parameter λ for each network is chosen as the optimal value found in the simulations. In particular, based on some preliminary simulations, we found that in almost all cases the optimal value of λ falls in the interval (0, 0.4). Thus, in the experiment we evaluate λ from 0.01 to 0.39 in steps of 0.01 and select the value with the best performance. In the final prediction, we use the matrix Ã_B as the score matrix, whose entries corresponding to unconnected node pairs are interpreted as their connection likelihoods.
To measure the prediction accuracy of the algorithm, we use two standard metrics: precision 15 and AUC 16 which are defined as follows.
Given the ranking of the non-observed links according to their scores in descending order, if L_r of the top-L links, which are taken as the predicted ones, appear in the probe set, then precision = L_r/L. In the experiment, we take L as the number of links in the probe set.
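The precision computation just described can be sketched as follows (a minimal illustration; the function name and the dict-based score representation are ours, not the authors'):

```python
def precision_at_L(scores, probe_set, L=None):
    """Precision of the top-L predicted links.

    scores: dict mapping each non-observed link to its score.
    probe_set: set of missing links; L defaults to |probe_set| as in the text.
    """
    if L is None:
        L = len(probe_set)
    # rank non-observed links by score in descending order and keep the top L
    ranked = sorted(scores, key=scores.get, reverse=True)[:L]
    hits = sum(1 for link in ranked if link in probe_set)
    return hits / L
```

For example, with scores {("a","b"): 3.0, ("a","c"): 2.0, ("b","c"): 1.0} and probe set {("a","b"), ("b","c")}, the top-2 predictions are ("a","b") and ("a","c"), of which one is correct, so the precision is 0.5.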
Given the ranking of the non-observed links, AUC is the probability that a randomly chosen missing link has a higher score than a randomly chosen nonexistent link. In the algorithmic implementation, instead of computing its exact value, AUC is usually approximated by comparing the scores of node pairs randomly sampled from the set of missing links and the set of nonexistent links. If, among n independent comparisons, there are n′ times that the score of the missing link is higher than that of the nonexistent link, and n″ times that the two scores are equal, then AUC = (n′ + 0.5 n″)/n.

In the experiment, we test our algorithm on both undirected and directed networks from disparate fields, and we compare the results of our method with those of the original individual methods SPM and Low Rank (LR), as well as some other typical local indices, including Common Neighbors (CN) 11 , Adamic-Adar (AA) 17 and Resource Allocation (RA) 11 . The prediction results in precision and AUC are presented in Tables 1 and 2 for undirected networks and in Tables 3 and 4 for directed networks, where for each network the result is obtained by averaging over 100 independent runs. Since the original SPM only applies to symmetric matrices, the results for directed networks are obtained by the generalized method described in stage 1 of our method.
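The sampling approximation of AUC described in this section can be sketched as follows (an illustrative implementation; names and the dict-based score representation are ours):

```python
import random

def auc_score(scores, missing_links, nonexistent_links, n=10000, seed=0):
    """Approximate AUC by n independent score comparisons between a
    randomly chosen missing link and a randomly chosen nonexistent link."""
    rng = random.Random(seed)
    missing = list(missing_links)
    nonexistent = list(nonexistent_links)
    n_higher = n_equal = 0
    for _ in range(n):
        s_missing = scores[rng.choice(missing)]
        s_nonexistent = scores[rng.choice(nonexistent)]
        if s_missing > s_nonexistent:
            n_higher += 1
        elif s_missing == s_nonexistent:
            n_equal += 1
    # AUC = (n' + 0.5 n'') / n
    return (n_higher + 0.5 * n_equal) / n
```

If every missing link outscores every nonexistent one, the estimate is exactly 1; if scores are identical, each comparison contributes 0.5 and the estimate is 0.5.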
From the experimental results, it can be seen that compared with each of the original individual methods, the new method achieves considerable improvements in both precision and AUC. It gives the best results on most of the networks, especially the directed ones, and even when it is not the best, it is usually quite close to the best. On some sparse networks such as CollegeMsg, the precision of our method is more than twice that of the others. It is also remarkable that in some cases, even when one or both of the two individual methods perform very poorly, our method still works very well. This implies that the new method is not only very competitive but also very robust, which in our opinion is at least partly due to the complementary nature of SPM and LR and justifies the significance of their combination.

Discussion
In this work, by generalizing and combining two previous link prediction methods of complementary nature, we propose a new link prediction algorithm based on perturbation and decomposition of the adjacency matrices of networks. By exploiting the useful information and eliminating the interfering noise simultaneously, the new method takes advantage of both previous methods and robustly achieves considerable improvements on most of the real-world networks from disparate fields in our experimental studies. Beyond its competitive performance, the significance of this work, as well as of the previous works it builds upon, is that it opens up a new direction for link prediction by directly manipulating the adjacency matrices as a whole. Compared with classical methods such as similarity indices, link prediction algorithms in this direction can take advantage of the rich tools and results available in matrix theory, and we expect many new works in this direction in the near future.
Since the new method is a combination of two existing methods, its computational complexity is roughly the sum of theirs. At stage 1, the time-consuming part is the computation of the eigenvalues and eigenvectors of the adjacency matrix, whose complexity is O(n³) 18 . At stage 2, it is the singular value decomposition (SVD) of the perturbed matrix, whose complexity is O(kn²) 14 , where k is the estimated rank of the matrix. Thus, in summary, the computational complexity of the proposed algorithm is O(n³). Despite these advantages, the new method also faces some difficulties, the major one being how to determine the parameter λ. As with many other parameterized methods, the parameter plays an important role in the performance of the algorithm, yet there is no explicit rule to determine its optimal value in advance. In the experiment we can choose for each network an optimal value based on empirical simulations, yet this is not realistic for real-world applications. In that case, as in some learning algorithms, we can only obtain an estimated value of λ from the training data. That is, we can divide the existing links into a training set and a probe set, as we do in the experiment, and then obtain the "optimal" value of λ from the simulations. Although this value is generally not the true optimum, it should be at least an acceptable approximation, especially when the network is large enough. Moreover, to be safer, when possible we can repeat this process many times and then determine an optimal value of λ based on the distribution of the outputs.
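The hold-out estimation of λ sketched in this paragraph could be implemented along the following lines. Here `predict_fn` is a hypothetical scorer (e.g. the full perturbation-plus-robust-PCA pipeline) that, given the internal training links and a value of λ, returns a ranked list of candidate links; all names are illustrative:

```python
import numpy as np

def select_lambda(edges, predict_fn, grid, holdout_frac=0.1, repeats=5, seed=0):
    """Estimate a good lambda from observed links only: repeatedly split the
    observed edges into an internal training/probe pair and keep the lambda
    with the best mean precision on the internal probe set."""
    rng = np.random.default_rng(seed)
    edges = [tuple(e) for e in edges]
    best_lam, best_score = None, -np.inf
    for lam in grid:
        precisions = []
        for _ in range(repeats):
            idx = rng.permutation(len(edges))
            cut = max(1, int(holdout_frac * len(edges)))
            probe = {edges[i] for i in idx[:cut]}
            train = [edges[i] for i in idx[cut:]]
            # top-|probe| predictions from the internal training set
            ranked = predict_fn(train, lam)[: len(probe)]
            precisions.append(sum(e in probe for e in ranked) / len(probe))
        mean_prec = float(np.mean(precisions))
        if mean_prec > best_score:
            best_lam, best_score = lam, mean_prec
    return best_lam, best_score
```

Averaging over several random splits, as the text suggests, reduces the variance of the estimated λ at the cost of extra runs of the prediction pipeline.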

Methods
Consider a weighted directed network G(V, E, A), where V, E, and A are defined as before. In the following, we treat undirected networks as special cases of directed networks and present the method in terms of directed networks in general.
In summary, our method consists of two stages: the perturbation stage and the decomposition one, which are described as follows.
Stage 1: Structural perturbation. This stage can be divided into the following three steps.
Step 1: Preprocessing. Given the weight matrix A of a directed network, to apply the structural perturbation method to it, we first decompose it into the following two parts:

A = A_S + A_AS,

where A_S = (A + Aᵀ)/2 is the symmetric part of A and A_AS = (A − Aᵀ)/2 is the antisymmetric part of A, with Aᵀ being the transpose of A. Intuitively, we interpret the entries of A_S as the average linking tendency between the corresponding nodes, and the entries of A_AS as the bias in the distribution of this tendency between the two links in opposite directions.
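In code, this preprocessing step is a pair of one-liners (a sketch using NumPy; the function name is ours):

```python
import numpy as np

def sym_antisym_split(A):
    """Decompose a weighted directed adjacency matrix A into its symmetric
    part A_S = (A + A^T)/2 and antisymmetric part A_AS = (A - A^T)/2."""
    A = np.asarray(A, dtype=float)
    A_S = (A + A.T) / 2.0
    A_AS = (A - A.T) / 2.0
    return A_S, A_AS
```

For instance, A = [[0, 2], [4, 0]] splits into A_S = [[0, 3], [3, 0]] and A_AS = [[0, -1], [1, 0]], and the two parts always sum back to A.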
Step 2: Structural perturbation. Apply the structural perturbation method to A_S as described in 10 , which for completeness we briefly present as follows. Since A_S is a symmetric matrix, it can be written as

A_S = Σ_{k=1}^{n} λ_k x_k x_kᵀ,

where λ_k and x_k, k = 1, 2, …, n, are the eigenvalues and corresponding eigenvectors of A_S, respectively.

After a perturbation ΔA_S to A_S, the eigenvalues of A_S + ΔA_S change to λ_k + Δλ_k and the corresponding eigenvectors to x_k + Δx_k. Thus we have

(A_S + ΔA_S)(x_k + Δx_k) = (λ_k + Δλ_k)(x_k + Δx_k).

Left-multiplying this eigenequation by x_kᵀ and neglecting second-order terms gives

Δλ_k ≈ (x_kᵀ ΔA_S x_k)/(x_kᵀ x_k).

Fixing the eigenvectors and using the perturbed eigenvalues, we obtain the perturbed matrix

Ã_S = Σ_{k=1}^{n} (λ_k + Δλ_k) x_k x_kᵀ.

Step 3: Postprocessing. Add the antisymmetric part of A to Ã_S to get the final perturbed matrix:

Ã = Ã_S + A_AS.

Here we keep the antisymmetric part of A fixed, based on the idea that the difference between the linking tendencies in opposite directions does not change significantly during the perturbation process.
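Steps 2 and 3 can be sketched compactly with NumPy (first-order eigenvalue shifts with fixed eigenvectors; the function name and signature are ours, and `np.linalg.eigh` returns orthonormal eigenvectors so the denominator x_kᵀx_k is 1):

```python
import numpy as np

def spm_perturbed_matrix(A_S, dA_S, A_AS=None):
    """Structural perturbation with fixed eigenvectors (sketch).

    A_S: symmetric matrix; dA_S: symmetric perturbation matrix.
    Eigenvalues are shifted by the first-order correction
    dlam_k = x_k^T dA_S x_k / (x_k^T x_k), eigenvectors are kept fixed.
    If A_AS is given, it is added back (postprocessing, step 3).
    """
    lam, X = np.linalg.eigh(A_S)          # columns of X are orthonormal x_k
    # first-order shifts x_k^T dA_S x_k for each eigenvector
    dlam = np.einsum('ki,ij,jk->k', X.T, dA_S, X)
    A_tilde_S = (X * (lam + dlam)) @ X.T  # X diag(lam + dlam) X^T
    return A_tilde_S if A_AS is None else A_tilde_S + A_AS
```

As a sanity check, a zero perturbation reproduces A_S exactly, and a perturbation c·I shifts every eigenvalue by c, yielding A_S + c·I.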
Stage 2: Noise reduction. In this stage we remove the supposed noise from the perturbed matrix Ã and recover the backbone structure for prediction. For this purpose we introduce robust principal component analysis (robust PCA), as in 14 , which for completeness we briefly present in the following. If the network is highly regularly organized, then its backbone structure should have a low-rank property, and the noise should be sparse. Thus we decompose the matrix Ã into two parts: a low-rank part Ã_B as the backbone structure and a sparse part Ã_N as the noise. Mathematically, this can be transformed into the following optimization problem:

min rank(Ã_B) + γ‖Ã_N‖₀  subject to  Ã = Ã_B + Ã_N,

where rank(·) denotes the rank of a matrix, ‖·‖₀ is the l0-norm of a matrix, and γ is the parameter that balances the two terms. Since this is a highly nonconvex optimization problem that is hard to solve, we use an approximate solution based on robust PCA 19 , which is the solution of the following convex relaxation:

min ‖Ã_B‖_* + λ‖Ã_N‖₁  subject to  Ã = Ã_B + Ã_N,

where ‖·‖_* is the nuclear norm of a matrix, ‖·‖₁ is the l1-norm, and λ is the parameter that balances the two terms.
Finally, we predict the missing or potential links based on the approximated backbone matrix Ã_B, as in 14 . That is, we take the entries of Ã_B corresponding to the unobserved links as their similarity scores and sort them in descending order. Then we select the top L links as our prediction result, where L is determined by some other rule.
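This final ranking step could look like the following (a sketch; nodes are indexed 0..n−1 and `observed` is the set of directed pairs already in the training set):

```python
import numpy as np

def top_L_links(B, observed, L):
    """Return the L unobserved directed node pairs (i, j) with the highest
    backbone scores B[i, j], ranked in descending order of score."""
    n = B.shape[0]
    candidates = [(float(B[i, j]), (i, j))
                  for i in range(n) for j in range(n)
                  if i != j and (i, j) not in observed]
    candidates.sort(key=lambda t: t[0], reverse=True)
    return [pair for _, pair in candidates[:L]]
```

For undirected networks the same routine applies after restricting the candidates to pairs with i < j, since Ã_B is then symmetric.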

Data
For the experimental studies, we collected 36 real-world networks from disparate fields, including 19 undirected networks and 17 directed ones. These networks were carefully selected to cover a wide range of properties, including different sizes, average degrees, clustering coefficients, and heterogeneity indices. The basic topological features of the networks are summarized in Tables 5 and 6, respectively. A brief description of these networks is as follows.

Undirected networks.
• Karate 20 : The network of relationships among the members of a karate club.

Table 5. The basic topological features of 19 real undirected networks. |V| and |E| represent the total numbers of nodes and links, respectively. C is the clustering coefficient 43 and r denotes the assortativity coefficient 21 . 〈k〉 is the average degree, and 〈d〉 is the average shortest distance. H is the degree heterogeneity, defined as H = 〈k²〉/〈k〉².

Table 6. The basic topological features of 17 real directed networks. |V| and |E| represent the total numbers of nodes and links, respectively. r denotes the assortativity coefficient. 〈k〉 is the average degree. H is the degree heterogeneity, defined as H = 〈k²〉/〈k〉².