Predicting missing links and identifying spurious links via likelihood analysis

Real network data are often incomplete and noisy, and both link prediction algorithms and spurious link identification algorithms can be applied to them. Thus far, however, a general method for transforming network organizing mechanisms into link prediction algorithms has been lacking. Here we use an algorithmic framework in which a network's probability is calculated according to a predefined structural Hamiltonian that takes into account the network organizing principles, and a non-observed link is scored by the conditional probability of adding the link to the observed network. Extensive numerical simulations show that the proposed algorithm has remarkably higher accuracy than state-of-the-art methods in uncovering missing links and identifying spurious links in many complex biological and social networks. The method also finds applications in exploring the underlying network evolutionary mechanisms.

The missing link prediction problem [1] and the spurious link identification problem [2] are described in the first three paragraphs of the main article and illustrated by Fig. 1 and Fig. 2 of the main article, respectively, which are networks of 4 nodes. The total number of networks with 4 nodes is 64, as shown in Supplementary Fig. 1 of this Supplementary Information (SI).
Supplementary Figure 1: The ensemble of four-node networks. The number inside each network is the total number of different networks (including the original network) that can be generated from this network via rotation and reflection along the vertical line through the center of the network. The sum of numbers inside networks in this figure is 64.

S2. Evaluation Metrics
Two standard metrics are used to quantify the accuracy of the algorithms: precision [3] and the area under the receiver operating characteristic curve (AUC) [4].

A. Missing link prediction
In principle, a link prediction algorithm provides an ordered list of all non-observed links (i.e., U − E^T), or equivalently gives each non-observed link, say (x, y) ∈ U − E^T, a score S_xy quantifying its existence tendency. The AUC evaluates the algorithm's performance according to the whole list, while the precision focuses only on the L links with the top ranks, i.e., the highest scores. A detailed introduction of these two metrics follows.
(i) AUC: Provided the rank of all non-observed links, the AUC value can be interpreted as the probability that a randomly chosen missing link (i.e., a link in E^P) is given a higher score than a randomly chosen nonexistent link (i.e., a link in U − E). In the algorithmic implementation, we usually calculate the score of each non-observed link rather than produce the ordered list, since the latter task is more time-consuming. Then, each time we randomly pick a missing link from E^P and a nonexistent link from U − E and compare their scores. If, among n independent comparisons, the missing link has a higher score n′ times and the two links have the same score n′′ times, the AUC value is

AUC = (n′ + 0.5 n′′) / n.    (S1)

If all the scores were generated from an independent and identical distribution, the AUC value would be about 0.5. Therefore, the degree to which the value exceeds 0.5 indicates how much better the algorithm performs than pure chance.
(ii) Precision: Given the ranking of the non-observed links, the precision is defined as the ratio of relevant items selected to the total number of items selected. That is to say, if we take the top-L links as the predicted ones, among which L_r links are right (i.e., there are L_r links in the probe set), then the precision equals L_r/L. Usually, L is no larger than |E^P|. An alternative metric is recall [3]. For the |E^P| hidden links, if R_r of them are ranked in the top |E^P|, recall is defined as R_r/|E^P|. So when L = |E^P|, precision is equal to recall. Clearly, higher precision means higher prediction accuracy.
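The two metrics above can be sketched in a few lines of Python. This is an illustrative implementation, not the authors' code: the function names are ours, and for simplicity the AUC is computed by comparing every missing-link score with every nonexistent-link score (the exact value that the random sampling described above converges to).

```python
def auc_score(missing_scores, nonexistent_scores):
    """Exact AUC: compare each missing-link score with each nonexistent-link
    score; AUC = (n' + 0.5*n'') / n, where n' pairs have the missing link
    scored higher and n'' pairs are ties."""
    wins = ties = 0
    for sm in missing_scores:
        for sn in nonexistent_scores:
            if sm > sn:
                wins += 1
            elif sm == sn:
                ties += 1
    n = len(missing_scores) * len(nonexistent_scores)
    return (wins + 0.5 * ties) / n


def precision_at_L(scores, probe_set, L):
    """Take the top-L non-observed links by score; precision = L_r / L,
    where L_r of them fall in the probe set E^P."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    hits = sum(1 for link in ranked[:L] if link in probe_set)
    return hits / L
```

Note that with tied scores the precision depends on how ties are broken in the ranking; the sketch simply keeps the order in which the links are stored.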
We take Fig. 1 of the main article as an example to show how to calculate the AUC and precision of predicting missing links. In this simple graph there are four nodes, four existent links ((1, 2), (1, 3), (2, 3) and (3, 4)) and two nonexistent links ((2, 4) and (1, 4)). To test an algorithm's accuracy, we need to select some existent links as probe links. For instance, we pick (1, 3) as the probe link (i.e., the missing link), which is represented by the red line in the right plot. The algorithm can then only make use of the information contained in the training graph (see the middle plot). The common-neighbor algorithm assigns the non-observed links the scores S_13 = 1, S_24 = 1 and S_14 = 0. To calculate the AUC, we compare the score of the probe link with that of each nonexistent link. There are two pairs in total: S_13 = S_24 and S_13 > S_14. Hence, the AUC value equals (1 × 1 + 1 × 0.5)/2 = 0.75. For precision, if L = 1, the predicted link is either (1, 3) or (2, 4). Clearly, the former is right while the latter is wrong, and thus the precision equals 0.5. For more detailed information on the link prediction problem, please see the review article, ref. [1].
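As a concrete check, this toy computation can be reproduced directly (the common-neighbor scoring below is the standard definition; the variable names are ours):

```python
# Four-node example: training links after hiding the probe link (1, 3).
train = {(1, 2), (2, 3), (3, 4)}
nodes = [1, 2, 3, 4]
# Neighbor sets built from the undirected training links.
neigh = {v: {b for a, b in train if a == v} | {a for a, b in train if b == v}
         for v in nodes}

# Common-neighbor score of every non-observed link.
non_observed = [(1, 3), (2, 4), (1, 4)]
S = {(x, y): len(neigh[x] & neigh[y]) for x, y in non_observed}
# S == {(1, 3): 1, (2, 4): 1, (1, 4): 0}

# AUC: compare the probe link (1, 3) with each nonexistent link.
pairs = [((1, 3), (2, 4)), ((1, 3), (1, 4))]
n_win = sum(S[p] > S[q] for p, q in pairs)
n_tie = sum(S[p] == S[q] for p, q in pairs)
auc = (n_win + 0.5 * n_tie) / len(pairs)  # (1*1 + 1*0.5)/2 = 0.75
```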

B. Spurious link identification
Similarly, we also use AUC and precision to evaluate an algorithm's performance on spurious link identification. In this case, a number of spurious links are generated completely at random to constitute the probe set E^P, represented by the adjacency matrix A^S. The observed network is now A^O = A + A^S. In contrast to a predicting algorithm, a detecting algorithm gives an ordered list of all observed links (E + E^P) according to their scores of existing. The AUC value in this task becomes the probability that a randomly chosen link in E^P (i.e., a spurious link) is ranked lower than a randomly chosen link in E (i.e., an existing link). Each time we randomly pick a spurious link and an existent link and compare their scores. If, among n independent comparisons, the spurious link has a lower score n′ times and the two links have the same score n′′ times, the AUC value is

AUC = (n′ + 0.5 n′′) / n.    (S2)

And if we pick the last L links, among which L_s links are spurious (i.e., there are L_s links in the set E^P), then the precision equals L_s/L. We take Fig. 2 of the main article as an example to show how to calculate the AUC and precision of identifying spurious links. In this simple graph there are four nodes, four existent links ((1, 2), (1, 3), (2, 3) and (3, 4)) and two nonexistent links ((2, 4) and (1, 4)). To test the algorithm's accuracy, we randomly add one spurious link, say (1, 4), as the probe link, which is represented by the green dashed line in the right plot. The algorithm can then only make use of the information contained in the training graph (i.e., E^T = E + E^P, see the middle plot). The common-neighbor algorithm assigns the observed links the scores S′_12 = 1, S′_13 = 2, S′_14 = 1, S′_23 = 1 and S′_34 = 1. To calculate the AUC, we compare the score of the probe link with that of each existent link. There are four pairs in total: S′_14 < S′_13, S′_14 = S′_12, S′_14 = S′_23 and S′_14 = S′_34. Hence, the AUC value equals (1 × 1 + 3 × 0.5)/4 = 0.625.
For precision, if L = 1, the predicted link can be any one among (1, 2), (1, 4), (2, 3) and (3, 4). Clearly, only (1, 4) is right, and thus the precision equals 0.25.
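This example, too, can be reproduced in a few lines (again an illustrative sketch with our own variable names, not the authors' code):

```python
# Observed graph: the four true links plus the spurious link (1, 4).
observed = {(1, 2), (1, 3), (1, 4), (2, 3), (3, 4)}
nodes = [1, 2, 3, 4]
neigh = {v: {b for a, b in observed if a == v}
            | {a for a, b in observed if b == v} for v in nodes}

# Common-neighbor score of every observed link.
S = {(x, y): len(neigh[x] & neigh[y]) for x, y in observed}
# S == {(1,2): 1, (1,3): 2, (1,4): 1, (2,3): 1, (3,4): 1}

# AUC: a spurious link should be ranked LOWER than an existing link.
spurious = (1, 4)
true_links = [l for l in observed if l != spurious]
n_low = sum(S[spurious] < S[t] for t in true_links)
n_tie = sum(S[spurious] == S[t] for t in true_links)
auc = (n_low + 0.5 * n_tie) / len(true_links)  # (1*1 + 3*0.5)/4 = 0.625
```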

S3. Table of notations
The definitions of the main notations used in this paper are summarized in Supplementary Table 1.

S4. The algorithm accuracy with standard deviations
The standard deviations of the algorithms' accuracy are shown in Supplementary Tables 2 and 3 for predicting missing links on seven networks, and in Supplementary Tables 4 and 5 for identifying spurious links on seven networks. Each number is obtained by averaging over ten independent runs. In each table, the upper part compares our approach with the CN, AA, RA, Katz, HSM and SBM approaches defined in the main article, and the lower part compares our approach with CAR, CPA, CAA, CRA and CJC, defined in Table 1 of [5].

Notation        Description
A               The adjacency matrix of the true network
A^O             The adjacency matrix of the observed network
A^P             The adjacency matrix of the missing links
A^S             The adjacency matrix of the spurious links
E               The set of edges of the true network
E^T             The training set, i.e., the set of edges of the observed network
E^P             The probe set, i.e., the missing or spurious links
A^O + (x, y)    The adjacency matrix obtained by adding the link (x, y) to A^O
A^O − (x, y)    The adjacency matrix obtained by deleting the link (x, y) from A^O
S_xy            The conditional probability of a node pair (x, y) to be a latent link
S′_xy           The conditional probability of a node pair (x, y) to be a spurious link

Supplementary Table 1

S5. The algorithm for parameter estimation
Instead of maximizing the probability of the entire observed network, we maximize the logarithm of the product of the marginal probabilities of the links,

ln L(β) = Σ_(x,y) ln P_xy,    (S3)

here

P_xy = 1 / (1 + e^{ΔH(xy)}),    (S4)

and

ΔH(xy) = H(A^O) − H(A^O − (x, y)) = Σ_k β_k φ_k(xy),    (S5)

where φ_k(xy) is the statistical difference of the kth term of the Hamiltonian when A^O_xy is toggled from 1 to 0 while the rest of the observed network is held unchanged. Taking the gradient of the likelihood function with respect to β_k, we obtain

∂ ln L / ∂β_k = − Σ_(x,y) (1 − P_xy) φ_k(xy).    (S6)

Now the parameters can be updated using the gradient ascent rule

β_k ← β_k + λ ∂ ln L / ∂β_k,    (S7)

where λ controls the updating rate. The steps of the algorithm are as follows: (i) For all node pairs (x, y), calculate φ_k(xy), the statistical difference of the kth term of the Hamiltonian when A^O_xy is toggled from 1 to 0 while the rest of the observed network remains static. (ii) Initialize the parameters β_k and update them according to equation (S7) until convergence.
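The gradient-ascent loop can be sketched as follows. This is a schematic illustration under the assumption that the φ_k(xy) vectors have been precomputed for each link; the function names, the toy interface, and the learning rate are our own illustrative choices, not part of the method's specification.

```python
import math


def marginal(beta, phi_xy):
    """Marginal probability of one observed link:
    P_xy = 1 / (1 + exp(sum_k beta_k * phi_k(xy)))."""
    return 1.0 / (1.0 + math.exp(sum(b * p for b, p in zip(beta, phi_xy))))


def fit(phi, beta0, lam=0.1, iters=200):
    """Gradient ascent on ln L = sum_xy ln P_xy.
    phi   : list of phi(xy) vectors, one per observed link
    beta0 : initial parameters
    Update rule (S7): beta_k <- beta_k + lam * d(ln L)/d(beta_k),
    with d(ln L)/d(beta_k) = -sum_xy (1 - P_xy) * phi_k(xy)."""
    beta = list(beta0)
    for _ in range(iters):
        grad = [0.0] * len(beta)
        for phi_xy in phi:
            p = marginal(beta, phi_xy)
            for k, ph in enumerate(phi_xy):
                grad[k] -= (1.0 - p) * ph
        beta = [b + lam * g for b, g in zip(beta, grad)]
    return beta
```

In practice one would stop when the change in β falls below a tolerance rather than after a fixed number of iterations.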

S6. Determining the parameter k_c
Empirically, each network has a specific optimal value k*_c, which depends intricately on the structure of the network. It is difficult to obtain an analytical solution for k*_c, so we can only estimate the optimal k*_c numerically from the observed structure (E^T). In our experiments the probe set is known exactly, so the optimal k*_c, subject to the highest accuracy, can be obtained by varying k_c in the experiments. Supplementary Fig. 2 shows the dependence of the prediction accuracy on the cutoff k_c. We call this k*_c the true value. In practice, however, we do not know which links are missing or will appear in the future; all we can use is the observed information. A realistic way is therefore to learn the optimal k*_c from the observed network A^O alone, and then apply this k*_c to predict missing links or identify spurious links of A^O. Specifically, for missing link prediction, after the network is divided into the training set (i.e., E^T, containing 0.9|E| links) and the probe set (i.e., E^P, containing 0.1|E| links), we learn k*_c by running the link prediction algorithm on the training set. First, we hide a further set B consisting of 0.1|E| links from the training set. Then we use the remaining 0.8|E| links of the training set (denoted by the set D) to predict the newly hidden links in B under different values of k_c. Finally, k*_c is the value at which the highest accuracy is reached. The results in Tables 2 and 3 of the main text are obtained by using the training set E^T to predict the links in the probe set E^P under this k*_c. Here we assume that the k*_c values calculated in Case 1 (D as training set and B as probe set) and Case 2 (E^T as training set and E^P as probe set) are approximately equal. In real applications, this procedure can thus be used to choose k*_c from the known links alone. For spurious link identification, the procedure is almost the same.
By adding the probe set E^P to the network, we obtain the training set E^T. A fraction of 0.1|E| links from E^T is then hidden as the set B, and the remaining links constitute the set D. Similarly to the case of predicting missing links, we first use D to predict the missing links in B under different values of k_c, and k*_c is chosen subject to the highest accuracy. The results in Tables 4 and 5 of the main text are obtained by identifying the spurious links in the probe set E^P based on the training set E^T under this k*_c. Note that the k*_c values for precision and for AUC are not always consistent; in this Letter we also use the k*_c chosen for precision to present the AUC results. The approximate k*_c for precision obtained by the procedure described above and the true optimal k*_c (as indicated in Supplementary Fig. 2) are compared in Supplementary Table 6. The approximate k*_c, as well as its performance, is in all cases close to the true optimum. Estimating the optimal k*_c incurs additional computational costs; however, as shown in Supplementary Fig. 2, except for very small or very large k_c, the algorithm's performance varies only slightly over a considerable range of k_c.
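The inner cross-validation described above can be sketched generically. The helper below is our own illustration: it assumes some scorer `predict(D, kc)` returning a score for each candidate link (the interface is hypothetical, standing in for the Hamiltonian-based algorithm), hides 0.1|E| of the 0.9|E| training links as B (hence the fraction 1/9 of E^T), and returns the k_c with the best inner precision.

```python
import random


def choose_kc(train_links, predict, kc_values, frac=1 / 9, seed=0):
    """Select k_c* from the observed links alone: hide a fraction of the
    training links as an inner probe set B, use the rest (D) to predict
    them under each candidate k_c, and keep the best-performing k_c.
    `predict(D, kc)` is any scorer returning a {link: score} dict."""
    rng = random.Random(seed)
    links = sorted(train_links)
    rng.shuffle(links)
    n_hide = max(1, int(frac * len(links)))  # 0.1|E| out of 0.9|E|
    B, D = set(links[:n_hide]), set(links[n_hide:])
    best_kc, best_prec = None, -1.0
    for kc in kc_values:
        scores = predict(D, kc)
        ranked = sorted(scores, key=scores.get, reverse=True)
        prec = sum(1 for l in ranked[:n_hide] if l in B) / n_hide
        if prec > best_prec:
            best_kc, best_prec = kc, prec
    return best_kc
```

The same routine serves both tasks, since in both cases k_c is tuned by predicting the hidden set B from D.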
For example, if we fix k_c = 8 for every network, we get almost the same prediction accuracy as the optimal values shown in Tables 2 to 5 of the main manuscript. Unfortunately, we cannot obtain the tendency of the optimal k*_c versus the network size, since the present method is very time-consuming. In short, this is indeed a limitation, but it may not be a critical problem in practical applications.

As shown in Supplementary Fig. 3, there are six different cases of a sudden switch between two mechanisms, and in each of them the present method is able to capture the jump at T = 6. From these numerical experiments we can see that our method successfully detects sudden change points of the network evolving mechanism. Since the Hamiltonian is a comprehensive description of the network, the method does not require any prior knowledge about the mechanism itself.