Introduction

Link prediction algorithms aim to estimate the likelihood that a link exists between two nodes, based on observed links, node attributes, or dynamical correlations1,2,3. Because our knowledge of many biological networks is very limited (e.g., most molecular interactions in cells are still unknown4), using prediction results to guide laboratory experiments, rather than blindly checking all possible interactions, can greatly reduce experimental costs5,6. Such predictions can also serve as friend recommendations in online social networks7. Likewise, recommending products to a target user on an e-commerce web site is a sub-problem of link prediction in bipartite networks, where the prediction is made for the target user8. Similar algorithms and techniques can further be applied to detect spurious links in noisy environments9, to evaluate network models by mapping evolving mechanisms into link prediction algorithms10 and, more interestingly, to predict U.S. Supreme Court votes11.

The missing link prediction problem1 and the spurious link identification problem9 are illustrated in Figs 1 and 2, respectively, using networks of four nodes. Supplementary Fig. 1 of the Supplementary Information (SI) shows the ensemble of all four-node networks. The total number of four-node networks is $2^{6} = 64$, where $6 = 4 \times 3/2$ is the number of possible links among four nodes. For missing link prediction, the task is to estimate the existence likelihood of every non-observed link based on the known network topology and node attributes (if such information is available). Specifically, consider an undirected network (graph) $G(V, E)$, where $V$ is the set of nodes and $E$ is the set of links. Multiple links and self-connections are not allowed. Let $U$ denote the universal set of all possible links, so that the set of nonexistent links is $U - E$. We assume that some missing links (or links that will appear in the future) lie in $U - E$, and the task of link prediction is to find them. In general we do not know which links are missing or will appear; otherwise no prediction would be needed. Therefore, to test an algorithm's accuracy, the set of observed links $E$ is randomly divided into two parts: the training set $E^{T}$, which is treated as known information, and the probe set (i.e., validation subset) $E^{P}$, which is used for testing and from which no information may be used for prediction. Clearly, $E^{T} \cup E^{P} = E$ and $E^{T} \cap E^{P} = \emptyset$. Taking Fig. 1 as an example, the true network contains four nodes and four links, while the link (1, 3) is missing from the observed network $A^{O}$. This missing link constitutes the probe set $E^{P}$ and the remaining observed links constitute the training set $E^{T}$. The set of non-observed links is $U - E^{T}$.
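As an illustration of the evaluation protocol just described, the following minimal sketch divides a set of observed links into a training set and a probe set and lists the non-observed candidate links. The function names, the `probe_fraction` and `seed` parameters and the four-node edge list are assumptions made for the example; the edge list is not claimed to be the network drawn in Fig. 1.

```python
import random
from itertools import combinations

def split_links(edges, probe_fraction=0.1, seed=0):
    """Randomly divide the observed links E into a training set and a probe set."""
    rng = random.Random(seed)
    links = sorted({tuple(sorted(e)) for e in edges})  # undirected, no duplicates
    rng.shuffle(links)
    n_probe = max(1, round(probe_fraction * len(links)))
    probe = set(links[:n_probe])   # E^P: hidden links used only for testing
    train = set(links[n_probe:])   # E^T: links treated as known information
    return train, probe

def non_observed_links(nodes, train):
    """U - E^T: every possible link that is absent from the training set."""
    universe = {tuple(sorted(p)) for p in combinations(nodes, 2)}
    return universe - train

# Hypothetical four-node network with four links (an assumption for illustration).
E = [(1, 2), (2, 3), (3, 4), (1, 3)]
train, probe = split_links(E, probe_fraction=0.25, seed=1)
candidates = non_observed_links([1, 2, 3, 4], train)
```

With four nodes there are six possible links, so three training links leave three candidates, one of which is the hidden probe link.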

Figure 1

Illustrating network (graph) G(V, E) with nodes and links for predicting missing links.

Figure 2

Illustrating network (graph) G(V, E) with nodes and links for identifying spurious links.

For spurious link identification, the task is to evaluate the reliability of every observed link based on the known network topology and node attributes (if such information is available). Specifically, consider an undirected network $G(V, E)$, where $V$ is the set of nodes and $E$ is the set of links. Multiple links and self-connections are not allowed. The set of observed links is then $E$. We assume that some of the links in $E$ are spurious, and the task of spurious link identification is to find them. Of course, we do not know which links are spurious; otherwise no identification would be needed. Therefore, to test an algorithm's accuracy, we randomly add some nonexistent links, which constitute the probe set $E^{P}$; the given network (which we regard as the true network $E$) together with the probe set constitutes the training set $E^{T}$. Clearly, $E^{T} - E^{P} = E$ and $E \cap E^{P} = \emptyset$. Taking Fig. 2 as an example, the true network contains four nodes and four links, and the spurious link (1, 4) is added to the network to construct the training set. In reality, the training set corresponds to the observed network, which contains errors, while the true network is unknown to us. However, to test an algorithm's performance we assume that the given networks are all true; otherwise no comparison could be made.
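The construction of a test case for spurious link identification can be sketched in the same spirit. The function name, the `n_spurious` and `seed` parameters and the example edge list are assumptions made for illustration.

```python
import random
from itertools import combinations

def add_spurious_links(nodes, true_links, n_spurious, seed=0):
    """Add randomly chosen nonexistent links to a network assumed to be true."""
    rng = random.Random(seed)
    true_set = {tuple(sorted(e)) for e in true_links}
    universe = {tuple(sorted(p)) for p in combinations(nodes, 2)}
    nonexistent = sorted(universe - true_set)
    probe = set(rng.sample(nonexistent, n_spurious))  # E^P: the spurious links
    train = true_set | probe                          # E^T: the noisy observed network
    return train, probe

# Hypothetical four-node network with four links (an assumption for illustration).
E = [(1, 2), (2, 3), (3, 4), (1, 3)]
train, probe = add_spurious_links([1, 2, 3, 4], E, n_spurious=1, seed=1)
```

Removing the probe set from the training set recovers the true network, which is the relation $E^{T} - E^{P} = E$ used in the text.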

Traditional methods for predicting missing links and identifying spurious links can be roughly divided into two classes: probabilistic models and similarity-based algorithms. The former include the probabilistic relational model12, the probabilistic entity relationship model13 and the relational model14, which usually require information about node attributes in addition to the observed network structure. The latter assign a similarity score to every pair of nodes and rank all non-observed links according to their scores. Defining the similarity is a nontrivial challenge: it can be as simple as the common-neighbor-based indices15,16,17 or as complicated as the random-walk-based indices18,19 and iteratively defined indices20,21.

Recently, some novel algorithms based on likelihood analysis were proposed9,22,23 and shown to be more accurate than many similarity-based methods. These algorithms usually presuppose certain organizing rules of networks. Despite differences in detail, the parameters associated with the organizing rules are typically learned from the observed structure and used to build a network ensemble, from which a large number of networks can be sampled to determine the probability of appearance of each link. Representative examples include the hierarchical structure model22, the stochastic block model9 and the Kronecker graphs model23.

This paper introduces an algorithmic framework in which a network's probability is calculated according to a predefined structural Hamiltonian and a non-observed link is scored by the conditional probability of adding the link to the network. The Hamiltonian is defined according to some reasonable organizing principles, so that an observed network usually has a lower Hamiltonian than its randomized version. Here we consider a general principle called the clustering mechanism, which states that two nodes have a high probability of being linked if they share common neighbors or are connected by short paths. This mechanism is directly supported by the high clustering coefficients observed in disparate networks24,25. In this paper, the clustering mechanism is interpreted as a high probability of appearance for a link whose two end nodes are connected by a large number of short paths, and the corresponding Hamiltonian is therefore defined according to the closed walk process. Numerical simulations on seven real networks show that the proposed algorithm is remarkably more accurate than state-of-the-art methods in both uncovering missing links and detecting spurious links.

Results

Common neighbor similarity performs very well in many networks16,17, indicating that the three-order loops (i.e., triangles) are preferred in the network formation. We here generalize this idea to high-order loops and define a structural Hamiltonian:

$$H(A) = -\sum_{k=3}^{k_c} \beta_k \ln\left(\mathrm{Tr}\,A^{k}\right), \tag{1}$$

where $A$ is the $N \times N$ adjacency matrix of the network with $N$ nodes and $\beta_k$ are the temperature parameters. When $k > 2$, the number of loops of length $k$ that start and end at node $i$ is $(A^{k})_{ii}$. Note that a loop is counted several times, since each of its nodes can be the starting node and, given the starting node, the loop is counted twice, once for each direction. Since the loops counted here are not self-avoiding, the counting becomes more complex when a loop contains sub-loops. Roughly speaking, $\mathrm{Tr}\,A^{k}$ is $2k$ times the number of loops of length $k$, while determining the exact number is not feasible. The approximate factor $2k$ can be taken into account by the parameter $\beta_k$. The cases $k = 1$ and $k = 2$ are trivial, since $\mathrm{Tr}\,A^{1}$ is 0 and $\mathrm{Tr}\,A^{2}$ is simply twice the total number of links, so we only consider the terms $\mathrm{Tr}\,A^{k}$ for $k \geq 3$. As $k \to \infty$, $\mathrm{Tr}\,A^{k+1}/\mathrm{Tr}\,A^{k} \to \lambda_1$, i.e., $\mathrm{Tr}\,A^{k}$ grows exponentially at a rate set by the leading eigenvalue $\lambda_1$. Thus we take the logarithm to rescale each term in $H(A)$ to the same magnitude.
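As a minimal sketch, and assuming the Hamiltonian takes the form $H(A) = -\sum_{k=3}^{k_c} \beta_k \ln \mathrm{Tr}\,A^{k}$ implied by the description above, the closed-walk counts and the Hamiltonian can be computed directly from matrix powers. The function name and the values of $\beta_k$ are placeholders chosen for illustration, not fitted parameters.

```python
import numpy as np

def structural_hamiltonian(A, betas):
    """H(A) = -sum_k beta_k * ln(Tr A^k), with betas given as {k: beta_k}."""
    A = np.asarray(A, dtype=float)
    H = 0.0
    for k, beta_k in betas.items():
        closed_walks = np.trace(np.linalg.matrix_power(A, k))
        # The logarithm diverges if the network has no closed walks of length k.
        H -= beta_k * np.log(closed_walks)
    return H

# Complete graph on four nodes: Tr A^3 = 24 and Tr A^4 = 84.
K4 = np.ones((4, 4)) - np.eye(4)
betas = {3: 1.0, 4: 1.0}  # placeholder temperature parameters (assumption)
H_K4 = structural_hamiltonian(K4, betas)
```

For the complete graph, each of the four triangles is counted $2k = 6$ times in $\mathrm{Tr}\,A^{3}$, which gives 24 and matches the counting argument in the text.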

For large $k$, the growth of $\mathrm{Tr}\,A^{k}$ is determined simply by the leading eigenvalue $\lambda_1$, so $\mathrm{Tr}\,A^{k}$ carries less information about local organization; we therefore introduce a cutoff $k_c$. Indeed, even large networks usually retain the small-world property, so that nodes can reach one another within a few steps26. Moreover, recent studies reveal that local information alone is sufficient to closely reproduce many real-world networks27. Thus a relatively small value of $k_c$ is usually sufficient for many networks. The procedure for determining $k_c$ is described in S6 of SI, and the results presented here correspond to the optimal $k_c$.

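The structural Hamiltonian can be rewritten in terms of the eigenvalues of $A$ as:

$$H(A) = -\sum_{k=3}^{k_c} \beta_k \ln\left(\sum_{i=1}^{N} \lambda_i^{k}\right), \tag{2}$$

where $\lambda_i$ are the eigenvalues of $A$. To see this, diagonalize the adjacency matrix as $A = U \Lambda U^{T}$, where $U$ is the matrix whose columns are the eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues. Then $\mathrm{Tr}\,A^{k} = \mathrm{Tr}\left(U \Lambda^{k} U^{T}\right) = \sum_{i=1}^{N} \lambda_i^{k}$, from which the Hamiltonian in equation (2) follows.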
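The equivalence between the trace form and the eigenvalue form can be checked numerically. The sketch below assumes the Hamiltonian is $-\sum_k \beta_k \ln \mathrm{Tr}\,A^{k}$; the small test network and the values of $\beta_k$ are hypothetical.

```python
import numpy as np

def hamiltonian_from_traces(A, betas):
    """Hamiltonian computed from Tr A^k via explicit matrix powers."""
    return -sum(b * np.log(np.trace(np.linalg.matrix_power(A, k)))
                for k, b in betas.items())

def hamiltonian_from_spectrum(A, betas):
    """Same quantity computed from the eigenvalues, using Tr A^k = sum_i lambda_i^k."""
    lam = np.linalg.eigvalsh(A)  # real eigenvalues of the symmetric adjacency matrix
    return -sum(b * np.log(np.sum(lam ** k)) for k, b in betas.items())

# Hypothetical five-node network containing two triangles.
links = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]
A = np.zeros((5, 5))
for x, y in links:
    A[x, y] = A[y, x] = 1.0

betas = {3: 0.7, 4: 0.3}  # placeholder temperature parameters (assumption)
H_trace = hamiltonian_from_traces(A, betas)
H_spec = hamiltonian_from_spectrum(A, betas)
```

A single diagonalization yields every power sum up to $k_c$, which is why the spectral form is convenient when many values of $k$ are needed.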

Given an ensemble $\Omega$ of networks that contains the observed network $A^{O}$ (here $A^{O} = A - A^{P}$, where $A^{P}$ is the adjacency matrix of the probe set), the probability of the appearance of $A^{O}$ is28,29:

$$P(A^{O}) = \frac{1}{Z} \exp\left[-H(A^{O})\right], \tag{3}$$

where $Z = \sum_{A \in \Omega} \exp\left[-H(A)\right]$ is the partition function. Such a model is known as the exponential random graph model in the social science literature30. The parameters $\beta_k$ are then chosen to maximize the probability in equation (3); see Supplementary Methods for details.

After determining the parameters $\beta_k$, the score of a non-observed link $(x, y) \in U - E^{T}$ is defined as the conditional probability of the appearance of the link $(x, y)$ given the observed network:

$$S_{xy} = \frac{1}{Z_{xy}} \exp\left[-H\left(A^{+}(x, y)\right)\right], \tag{4}$$

where $A^{+}(x, y)$ is the observed network with the link $(x, y)$ added and $Z_{xy}$ is a normalization factor. Here we assume that adding the single link $(x, y)$ to $A^{O}$ does not substantially change the topological structure, so that the parameters $\beta_k$ for $A^{+}(x, y)$ are approximately the same as those for $A^{O}$. $S_{xy}$ can be regarded as a kind of similarity index, so all non-observed links are ranked by $S_{xy}$ for prediction: links with higher scores are more likely to exist. Obviously, the normalization factor $Z_{xy}$ plays no role in producing the prediction.
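The ranking step can be sketched as follows, assuming the Hamiltonian $-\sum_k \beta_k \ln \sum_i \lambda_i^{k}$ and using the negative Hamiltonian of the network with the candidate link added as the score (the logarithm of the unnormalized probability). The function names, the example network and the values of $\beta_k$ are assumptions for illustration; in practice the parameters would be fitted first.

```python
import numpy as np
from itertools import combinations

def hamiltonian(A, betas):
    """Structural Hamiltonian computed from the eigenvalues of A."""
    lam = np.linalg.eigvalsh(A)
    return -sum(b * np.log(np.sum(lam ** k)) for k, b in betas.items())

def rank_missing_links(A_obs, betas):
    """Score each non-observed pair by -H of the network with that link added."""
    n = A_obs.shape[0]
    scores = {}
    for x, y in combinations(range(n), 2):
        if A_obs[x, y] == 0:
            A_plus = A_obs.copy()
            A_plus[x, y] = A_plus[y, x] = 1.0
            scores[(x, y)] = -hamiltonian(A_plus, betas)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical five-node observed network.
links = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]
A_obs = np.zeros((5, 5))
for x, y in links:
    A_obs[x, y] = A_obs[y, x] = 1.0

betas = {3: 1.0, 4: 1.0}  # placeholder temperature parameters (assumption)
ranked = rank_missing_links(A_obs, betas)
```

In this example the pair (0, 3) is ranked first, since adding it closes two new triangles, more than any other candidate.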

In the spurious link identification problem, the score of an observed link $(x, y)$ in $A^{O}$ for being spurious can be estimated by the conditional probability of the absence of this link, namely,

$$\tilde{S}_{xy} = \frac{1}{\tilde{Z}_{xy}} \exp\left[-H\left(A^{-}(x, y)\right)\right], \tag{5}$$

where $A^{-}(x, y)$ is the observed network $A^{O}$ with the link $(x, y)$ removed and $\tilde{Z}_{xy}$ is the corresponding normalization factor. Note that, unlike in the missing link prediction problem, here $A^{O} = A + A^{S}$, where $A^{S}$ is the adjacency matrix of the spurious set. A higher value of $\tilde{S}_{xy}$ indicates a higher probability that the link $(x, y)$ is spurious, i.e., a lower reliability of this link. A summary of the notation used for the method is given in S3 of SI.
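The corresponding sketch for spurious links removes each observed link in turn and scores it by the negative Hamiltonian of the remaining network. As before, the Hamiltonian form, the function names, the example network and the values of $\beta_k$ are assumptions made for illustration.

```python
import numpy as np
from itertools import combinations

def hamiltonian(A, betas):
    """Structural Hamiltonian computed from the eigenvalues of A."""
    lam = np.linalg.eigvalsh(A)
    return -sum(b * np.log(np.sum(lam ** k)) for k, b in betas.items())

def rank_spurious_links(A_obs, betas):
    """Score each observed link by -H of the network with that link removed."""
    n = A_obs.shape[0]
    scores = {}
    for x, y in combinations(range(n), 2):
        if A_obs[x, y] == 1:
            A_minus = A_obs.copy()
            A_minus[x, y] = A_minus[y, x] = 0.0
            scores[(x, y)] = -hamiltonian(A_minus, betas)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical network: two triangles joined by a single bridging link (2, 3).
links = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A_obs = np.zeros((6, 6))
for x, y in links:
    A_obs[x, y] = A_obs[y, x] = 1.0

betas = {3: 1.0, 4: 1.0}  # placeholder temperature parameters (assumption)
ranked = rank_spurious_links(A_obs, betas)
```

Here the bridging link (2, 3) receives the highest spurious score, because removing it destroys no triangle, whereas removing any other link does.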

For comparison, we introduce some benchmark methods1, including similarity-based algorithms and likelihood models. The simplest similarity index is the Common Neighbors (CN) index15, where two nodes, $x$ and $y$, are more likely to have a link if they have more common neighbors, namely, $s^{CN}_{xy} = |\Gamma(x) \cap \Gamma(y)|$, where $\Gamma(x)$ denotes the set of neighbors of $x$. Two refined versions of CN are the Adamic-Adar (AA) index31, $s^{AA}_{xy} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} 1/\log k_z$, and the Resource Allocation (RA) index16,32, $s^{RA}_{xy} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} 1/k_z$, where $k_z$ is the degree of node $z$. Very recently, Cannistraci, Alanis-Lobato and Ravasi6 simultaneously took into account the number of common neighbors and the number of local community links (links connecting common neighbors) and proposed a series of similarity indices, including the CAR, CPA, CAA, CRA and CJC indices (see details in Table 1 of ref. 6), for link prediction in brain connectomes and protein connectomes.
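For concreteness, the three local indices can be computed from neighbor sets as in the sketch below, which uses the standard definitions of CN, AA and RA. The helper names and the example network are assumptions for illustration.

```python
import math
from collections import defaultdict

def neighbor_sets(links):
    """Map each node to the set of its neighbors in an undirected network."""
    nbrs = defaultdict(set)
    for x, y in links:
        nbrs[x].add(y)
        nbrs[y].add(x)
    return nbrs

def common_neighbors(nbrs, x, y):
    return len(nbrs[x] & nbrs[y])

def adamic_adar(nbrs, x, y):
    # A common neighbor has degree at least 2, so the logarithm is positive.
    return sum(1.0 / math.log(len(nbrs[z])) for z in nbrs[x] & nbrs[y])

def resource_allocation(nbrs, x, y):
    return sum(1.0 / len(nbrs[z]) for z in nbrs[x] & nbrs[y])

# Hypothetical network: nodes 0 and 3 share the neighbors 1 and 2 (each of degree 2).
links = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
nbrs = neighbor_sets(links)
cn_03 = common_neighbors(nbrs, 0, 3)
aa_03 = adamic_adar(nbrs, 0, 3)
ra_03 = resource_allocation(nbrs, 0, 3)
```

AA and RA differ from CN only in how strongly they down-weight common neighbors of high degree.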

Table 1 The basic topological features of seven real networks.

Unlike the aforementioned local similarity indices, the Katz index33 makes use of global topological information by summing over the collection of paths, exponentially damped according to path length with a parameter $\alpha$. It reads $s^{Katz}_{xy} = \sum_{l=1}^{\infty} \alpha^{l} (A^{l})_{xy}$ and can be rewritten in compact form as $S = (I - \alpha A)^{-1} - I$, where $I$ is the identity matrix. In our experiments, the reported performance of the Katz index corresponds to the optimal $\alpha$.
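The compact form of the Katz index can be evaluated directly, as in the following sketch. The convergence condition $\alpha < 1/\lambda_1$ is a standard property of the underlying series rather than something stated above; the function name and the value of $\alpha$ are chosen for illustration.

```python
import numpy as np

def katz_index(A, alpha):
    """Katz similarity matrix S = (I - alpha*A)^(-1) - I."""
    A = np.asarray(A, dtype=float)
    lam_max = np.max(np.linalg.eigvalsh(A))
    if alpha >= 1.0 / lam_max:
        raise ValueError("alpha must be below 1/lambda_1 for the series to converge")
    I = np.eye(A.shape[0])
    return np.linalg.inv(I - alpha * A) - I

# Complete graph on four nodes (leading eigenvalue 3), with an illustrative alpha.
K4 = np.ones((4, 4)) - np.eye(4)
S = katz_index(K4, alpha=0.1)

# Cross-check against the truncated series sum_{l>=1} alpha^l A^l.
series = sum(0.1 ** l * np.linalg.matrix_power(K4, l) for l in range(1, 60))
```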

We also consider two likelihood models, the Hierarchical Structural Model (HSM)22 and the Stochastic Block Model (SBM)9. HSM is based on the fact that many real networks are hierarchically organized, where nodes can be divided into groups, further subdivided into groups of groups and so forth. SBM is one of the most general network models, where nodes are partitioned into groups and the connecting probability of two nodes depends solely on the groups they belong to.

To quantify the accuracy of the proposed methods, we adopt two standard metrics. The first is the area under the receiver operating characteristic curve (AUC for short)34, which can be interpreted as the probability that a randomly chosen link in $E^{P}$ (i.e., a missing link that exists but has not yet been observed) is ranked higher than a randomly chosen link in $U - E$ (i.e., a nonexistent link). If all link scores were drawn independently from the same distribution, the AUC would be about 0.5. The degree to which the AUC exceeds 0.5 therefore indicates how much better the algorithm performs than pure chance. The second metric is precision35, defined as the ratio of relevant elements to selected elements. That is, if we take the top-$L$ links as the predicted links, among which $L_r$ are right (i.e., $L_r$ of them are in the probe set $E^{P}$), then the precision equals $L_r/L$.
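Both metrics can be computed as in the sketch below. Counting ties as one half is a common convention and is an assumption here, as are the function names and the toy scores.

```python
def auc(probe_scores, nonexistent_scores):
    """Probability that a probe link outscores a nonexistent link (ties count 1/2)."""
    wins = ties = 0
    for p in probe_scores:
        for q in nonexistent_scores:
            if p > q:
                wins += 1
            elif p == q:
                ties += 1
    return (wins + 0.5 * ties) / (len(probe_scores) * len(nonexistent_scores))

def precision(scores, probe, L):
    """Fraction of the top-L scored links that belong to the probe set."""
    top = sorted(scores, key=scores.get, reverse=True)[:L]
    return sum(1 for link in top if link in probe) / L

# Toy scores (assumed values for illustration).
auc_value = auc([3.0, 2.0], [1.0, 2.0])
scores = {(1, 2): 0.9, (1, 3): 0.8, (2, 4): 0.1}
prec_value = precision(scores, probe={(1, 3)}, L=2)
```

In the toy example three of the four comparisons are wins and one is a tie, giving an AUC of 0.875, and one of the top two links is in the probe set, giving a precision of 0.5.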

These two metrics can also be used to quantify performance in detecting spurious links. In this case, a number of spurious links are generated completely at random to constitute the probe set $E^{P}$ (these links are also added to $E$). In contrast to a prediction algorithm, a detection algorithm produces an ordered list of all observed links according to their scores. The AUC in this task becomes the probability that a randomly chosen link in $E^{P}$ (i.e., a spurious link) is ranked lower than a randomly chosen link in $E$ (i.e., an existing link). If we pick the last $L$ links, among which $L_s$ are spurious, then the precision equals $L_s/L$. Calculations of AUC and precision for some simple illustrative networks are given in S2 of SI.

Seven different networks from various research fields are tested. (i) Jazz36: The network of Jazz musicians. (ii) Metabolic37: The metabolic network of the nematode worm C. elegans. (iii) C. elegans38: The neural network of C. elegans. (iv) US Air39: The network of the US air transportation system. (v) FWF40: The food web in Florida Bay during wet season. (vi) FWM41: The food web in Mangrove Estuary during wet season. (vii) Macaca42: cortical networks of the macaque monkey. The basic topological features of such networks are summarized in Table 1. The parameters of a network include the clustering coefficient C38 and the assortative coefficient r43.

For each of the seven networks, the training set $E^{T}$ contains 90% of the links and the remaining 10% constitute the probe set $E^{P}$. To calculate precision, we set $L = |E^{P}|$, which means that the number of selected elements equals the number of relevant elements. Under this specific choice of $L$, precision is equal to another metric, recall, which is formally defined in ref. 35. All data points are obtained by averaging over 10 implementations with independent random divisions into training set and probe set. The prediction accuracies measured by precision and AUC are shown in Tables 2 and 3, respectively. For each network, the bold number in the corresponding row marks the highest accuracy. Surprisingly, for all seven real networks, our method performs best among all the state-of-the-art algorithms considered, usually remarkably better than the second best. The standard deviations of the prediction accuracy can be found in SI. In Figs 3 and 4, we further show that this result is not sensitive to the size of the probe set, expressed as a fraction of the total number of links $|E|$ given in Table 1.

Table 2 The prediction accuracy measured by precision for the seven real networks.
Table 3 The prediction accuracy measured by AUC for the seven real networks.
Figure 3

Predicting missing links for different sizes of probe set.

The prediction accuracy is measured by precision.

Figure 4

Predicting missing links for different sizes of probe set.

The prediction accuracy is measured by AUC.

We next consider the identification of spurious links, i.e., links that are observed but do not really exist, which may result from experimental errors or data noise. Prediction of missing links and identification of spurious links are considered equally important and highly challenging in the reconstruction of networks9,44. The framework and method proposed in this paper can also be applied to identify spurious links.

To test the validity of the algorithms, we randomly add some links to each real network; these constitute the spurious set $E^{P}$, whose adjacency matrix is $A^{S}$. Analogously, for spurious link identification, the AUC can be interpreted as the probability that the spurious score of a randomly chosen link in $E^{P}$ is higher than that of a randomly chosen link in $E$. The precision is defined as the fraction of successfully identified spurious links among the top-$L$ selected links with the highest spurious scores. In the experiments, we set $L = |E^{P}|$ and all data points are averaged over 10 independent runs with different randomly generated spurious sets. The accuracies of spurious link identification measured by precision and AUC are shown in Tables 4 and 5, respectively (see SI for standard deviations). Again, our method is remarkably better than all other state-of-the-art methods and is not sensitive to the size of the training set; see Figs 5 and 6.

Table 4 The accuracy of spurious link identification measured by precision for the seven real networks.
Table 5 The accuracy of spurious link identification measured by AUC for the seven real networks.
Figure 5

Identifying spurious links for different sizes of probe set.

The prediction accuracy is measured by precision.

Figure 6

Identifying spurious links for different sizes of probe set.

The prediction accuracy is measured by AUC.

We now apply our method to the network of the macaque monkey brain, where nodes are cortical areas and links are projections between them45. There are three kinds of links in this network: confirmed existing, confirmed absent and uncertain. The reported links are based on neuroanatomical experiments, and the uncertain links are owing to conflicting reports in the literature. The original network is directed; we eliminate the link directions by treating a link as uncertain if it is uncertain in both directions and as confirmed existing if it is confirmed in either direction. The undirected network consists of 32 nodes with 194 confirmed existing links, 90 confirmed absent links and 212 uncertain links. We then use our algorithm to estimate the probability of the uncertain links. To test the validity of the algorithm, we randomly hide 10% of the confirmed existing links as the probe set, and the prediction task is to recover these hidden links. When calculating the prediction accuracy, we consider both the case in which uncertain links are included among the candidate missing links and the case in which they are excluded.

As shown in Table 6, our method successfully recovers the hidden confirmed existing links. Notably, although the uncertain links far outnumber the confirmed absent links, the accuracy does not drop significantly when they are included. We find that the probability that a hidden confirmed link has a higher score than an uncertain link is 0.796, indicating that uncertain links are indeed generally less reliable. In addition, the probability that an uncertain link has a higher score than a confirmed absent link is 0.634, implying that there are some missing links among the uncertain links.

Table 6 The accuracy of missing link prediction of the macaque brain network.

An interesting question now arises: among the 212 uncertain links, which are more likely to exist? Here we use the full knowledge of the confirmed links to make predictions; the most likely latent links predicted by our method are listed in Table 7. Ref. 42 also gave a prediction for these uncertain links. Among the top-16 predicted links shown in Table 7, only two (PIP-V3A and PIP-V4t) differ from ref. 42, which predicts them to be absent. With continuing progress in studies of the macaque monkey brain, the data have been increasingly extended and improved46. The data in ref. 46 give us an opportunity to better evaluate the algorithms. Surprisingly, in the new data set, the two controversial links are confirmed. That is, our structure-based method gives more accurate predictions than the spatial-based method proposed in ref. 42. Besides these two links, three further links (emphasized in bold) are also confirmed by the data in ref. 46, while the other 11 predicted links await testing in future experiments.

Table 7 The 16 most likely latent links among the uncertain links and their corresponding values of the Hamiltonian.

Discussion

Prediction is a core issue in network science, which is the only solid way to check whether our understanding of network evolution is right10. It covers a variety of problems, such as the prediction of missing links1, future links1, vanishing nodes47, reciprocal relationships48, spurious links9 and so on. In this paper, we used an algorithmic framework, where a network’s probability is estimated according to a predefined structural Hamiltonian and the existence score of a non-observed link is quantified by the conditional probability of adding the focal link to the network while the spurious probability of an observed link is quantified by the conditional probability of deleting the link.

Since the homophily49 and social recommendation50 mechanisms that govern real network formation both exhibit a local clustering property, we define a Hamiltonian according to the closed walk process, which takes structural localization into account. For both missing link prediction and spurious link identification, the present method performs surprisingly well, much better than all the state-of-the-art methods under consideration. Note that, although this method can be applied to small networks or to small parts (e.g., communities) of a network, it is very time-consuming and cannot be directly applied to large-scale networks. One strategy to overcome this computational limit is to use parallel algorithms. Since the individual matrix diagonalizations are completely independent, parallelizing the algorithm is straightforward. The diagonalization of symmetric matrices can itself also be parallelized51. In addition, we found that, when estimating the model parameters, it is not necessary to toggle all matrix elements: roughly 10% of them are sufficient to obtain the same accuracy. After the parameters are determined, matrix perturbation techniques can be used to compute the link scores. We found that, with the perturbation approximations, the algorithm still gives good predictions for many networks.
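One way such a perturbation approximation could look is sketched below, using standard first-order eigenvalue perturbation: adding the link $(x, y)$ shifts eigenvalue $\lambda_i$ by approximately $2 u_i(x) u_i(y)$, where $u_i$ is the corresponding eigenvector. This is an assumption about the kind of technique meant, not a description of the exact procedure used here; the function names and the example network are hypothetical, and degenerate eigenvalues would need more careful treatment.

```python
import numpy as np

def first_order_eigen_shift(A, x, y):
    """Eigenvalues of A and their first-order shifts when link (x, y) is added."""
    lam, U = np.linalg.eigh(A)       # columns of U are the eigenvectors
    dlam = 2.0 * U[x, :] * U[y, :]   # u_i^T (delta A) u_i for each eigenvector i
    return lam, dlam

def approx_log_closed_walks(A, x, y, k):
    """Approximate ln Tr (A + delta A)^k without a new diagonalization."""
    lam, dlam = first_order_eigen_shift(A, x, y)
    return np.log(np.sum((lam + dlam) ** k))

# Hypothetical five-node network; (0, 3) is a non-observed pair.
links = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]
A = np.zeros((5, 5))
for i, j in links:
    A[i, j] = A[j, i] = 1.0

lam, dlam = first_order_eigen_shift(A, 0, 3)
approx_k3 = approx_log_closed_walks(A, 0, 3, k=3)
```

The appeal of this design is that a single diagonalization of the observed network serves all candidate links; the price is an approximation error that should be checked against exact scores on a sample of links.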

The present method can be further used to explore the mechanisms underlying network evolution. For example, we can translate different evolving mechanisms into different Hamiltonians to check indirectly which mechanism best captures the organizing principle of a network, under the assumption that the mechanism yielding the highest link prediction accuracy is the best. We can also fix the Hamiltonian to see whether there are sudden changes in the evolving mechanism. Such changes usually occur in technological networks, such as power grids and the Internet, driven by the application of new techniques, like a new Internet Protocol for AS (autonomous system) level routers, or in online social networks following changes to the rules and interfaces of the web sites. In S7 of SI, we show successful applications to some artificially generated networks with sudden changes in network evolution.

Methods

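Lacking an exact solution for the partition function, we apply the maximum pseudo-likelihood method52 to estimate the parameters $\beta_k$. For any node pair $(x, y)$, let $A^{c}(x, y)$ denote the matrix whose elements are all the same as those of $A^{O}$ except that the element $(x, y)$ is unknown. Then the ratio of the conditional existence probability of the link $(x, y)$ to its conditional nonexistence probability does not depend on the partition function:

$$\frac{P\left(A_{xy} = 1 \mid A^{c}(x, y)\right)}{P\left(A_{xy} = 0 \mid A^{c}(x, y)\right)} = \frac{\exp\left[-H\left(A^{+}(x, y)\right)\right]}{\exp\left[-H\left(A^{-}(x, y)\right)\right]} = \exp\left(-\Delta H_{xy}\right),$$

where $A^{+}(x, y)$ and $A^{-}(x, y)$ denote $A^{c}(x, y)$ with the element $(x, y)$ set to 1 and 0, respectively, and $\Delta H_{xy} = H\left(A^{+}(x, y)\right) - H\left(A^{-}(x, y)\right)$. So the conditional existence probability is $P\left(A_{xy} = 1 \mid A^{c}(x, y)\right) = 1/\left[1 + \exp\left(\Delta H_{xy}\right)\right]$.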

According to the Hammersley-Clifford theorem53, we can replace the joint likelihood of the links of the network with the product of the conditional probabilities of each link, given the rest of the network. The temperature parameters $\beta_k$ can then be estimated by maximizing the log-likelihood:

$$\ell(\beta) = \sum_{(x, y)} \ln P\left(A_{xy} = A^{O}_{xy} \mid A^{c}(x, y)\right),$$

where the summation is over all node pairs. This is a convex optimization problem and we apply the gradient ascent method to estimate $\beta_k$. Detailed steps of the parameter estimation algorithm are given in the SI.
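A minimal sketch of this estimation is given below. It assumes the Hamiltonian $H(A) = -\sum_k \beta_k \ln \mathrm{Tr}\,A^{k}$, in which case the conditional existence probability of a pair is a logistic function of $\sum_k \beta_k d_k(x, y)$, with $d_k(x, y)$ the change in $\ln \mathrm{Tr}\,A^{k}$ between the network with and without the link. The function names, learning rate, number of steps and example network are assumptions; the detailed algorithm is the one given in the SI.

```python
import numpy as np
from itertools import combinations

def log_traces(A, ks):
    """ln Tr A^k for each k; requires at least one closed walk of every length k."""
    return np.array([np.log(np.trace(np.linalg.matrix_power(A, k))) for k in ks])

def change_statistics(A_obs, ks):
    """For every node pair, d_k = ln Tr(A+)^k - ln Tr(A-)^k and the observed state."""
    n = A_obs.shape[0]
    feats, labels = [], []
    for x, y in combinations(range(n), 2):
        A_plus, A_minus = A_obs.copy(), A_obs.copy()
        A_plus[x, y] = A_plus[y, x] = 1.0
        A_minus[x, y] = A_minus[y, x] = 0.0
        feats.append(log_traces(A_plus, ks) - log_traces(A_minus, ks))
        labels.append(A_obs[x, y])
    return np.array(feats), np.array(labels)

def log_pseudo_likelihood(beta, D, a):
    z = D @ beta  # equals -Delta H_xy for each pair
    return np.sum(a * z - np.logaddexp(0.0, z))

def fit_betas(D, a, lr=0.5, steps=200):
    """Gradient ascent on the log pseudo-likelihood (averaged gradient)."""
    beta = np.zeros(D.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(D @ beta)))  # conditional existence probability
        beta += lr * D.T @ (a - p) / len(a)
    return beta

# Hypothetical network: two triangles joined by a bridge.
links = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A_obs = np.zeros((6, 6))
for x, y in links:
    A_obs[x, y] = A_obs[y, x] = 1.0

ks = [3, 4]
D, a = change_statistics(A_obs, ks)
beta_hat = fit_betas(D, a)
ll_start = log_pseudo_likelihood(np.zeros(len(ks)), D, a)
ll_end = log_pseudo_likelihood(beta_hat, D, a)
```

Written this way, the problem is a logistic regression without intercept on the change statistics, which makes the concavity of the objective in $\beta_k$ explicit. On very small networks the estimate may not converge to a finite value, so the sketch only runs a fixed number of steps.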

Additional Information

How to cite this article: Pan, L. et al. Predicting missing links and identifying spurious links via likelihood analysis. Sci. Rep. 6, 22955; doi: 10.1038/srep22955 (2016).