Measuring the robustness of link prediction algorithms under noisy environment

Link prediction in complex networks aims to estimate the likelihood that two nodes will interact with each other in the future. As this problem has applications in a large number of real systems, many link prediction methods have been proposed. However, the validation of these methods has so far mainly been conducted on networks assumed to be noise-free. Therefore, we still lack a clear understanding of how the prediction results would be affected if the observed network data is no longer accurate. In this paper, we comprehensively study the robustness of existing link prediction algorithms in real networks where some links are missing, fake or swapped with other links. We find that missing links are more destructive than fake and swapped links for prediction accuracy. An index is proposed to quantify the robustness of link prediction methods. Among the twenty-two studied link prediction methods, we find that some methods, though they have low prediction accuracy, tend to perform reliably in the “noisy” environment.

where $\epsilon$ is a free parameter.
(2) Katz Index. This index is based on the ensemble of all paths: it directly sums over the collection of paths, exponentially damped by length so that shorter paths receive more weight. The mathematical expression reads $s_{xy}^{Katz} = \sum_{l=1}^{\infty} \beta^{l} \, |paths_{xy}^{\langle l \rangle}|$, where $paths_{xy}^{\langle l \rangle}$ is the set of all paths of length $l$ connecting $x$ and $y$, and $\beta$ is a free parameter (i.e., the damping factor) controlling the path weights.
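As an illustration of the Katz index, the following is a minimal sketch that assumes an undirected network given as a symmetric adjacency matrix and counts paths of length $l$ through the powers $A^l$ (which strictly count walks). Under that assumption the damped sum has the closed form $(I - \beta A)^{-1} - I$, valid only when $\beta$ is smaller than the reciprocal of the largest eigenvalue of $A$. The function name `katz_index` and the toy graph are hypothetical.

```python
import numpy as np

def katz_index(A, beta):
    """Katz scores for all node pairs via the closed form (I - beta*A)^-1 - I."""
    A = np.asarray(A, dtype=float)
    # The series sum_l beta^l A^l converges only if beta < 1 / lambda_max(A).
    # eigvalsh assumes a symmetric (undirected) adjacency matrix.
    lam_max = np.max(np.abs(np.linalg.eigvalsh(A)))
    if beta * lam_max >= 1.0:
        raise ValueError("beta must be smaller than 1 / lambda_max(A)")
    I = np.eye(A.shape[0])
    return np.linalg.inv(I - beta * A) - I

# Toy undirected graph (hypothetical): the path 0 - 1 - 2 - 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
S = katz_index(A, beta=0.1)
```

With a small $\beta$ the score is dominated by the shortest connections, so in this toy graph the pair (0, 2) scores higher than the pair (0, 3).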
(3) Salton Index. It is defined as $s_{xy}^{Salton} = |\Gamma(x) \cap \Gamma(y)| / \sqrt{k_x k_y}$, where $\Gamma(x)$ is the set of neighbors of node $x$ and $k_x$ is the degree of node $x$. The Salton index is also called the cosine similarity in the literature.

(4) Sørensen Index. This index is used mainly for ecological community data, and is defined as $s_{xy}^{\text{Sørensen}} = 2|\Gamma(x) \cap \Gamma(y)| / (k_x + k_y)$.

(5) Hub Depressed Index (HDI). Analogously to the Hub Promoted Index, we also consider a measurement with the opposite effect on hubs, defined as $s_{xy}^{HDI} = |\Gamma(x) \cap \Gamma(y)| / \max\{k_x, k_y\}$.

(6) Hub Promoted Index (HPI). This index is proposed for quantifying the topological overlap of pairs of substrates in metabolic networks, and is defined as $s_{xy}^{HPI} = |\Gamma(x) \cap \Gamma(y)| / \min\{k_x, k_y\}$. Under this measurement, the links adjacent to hubs are likely to be assigned high scores since the denominator is determined by the lower degree only.

(7) Leicht-Holme-Newman Index (LHN1). This index assigns high similarity to node pairs that have many common neighbors compared not to the possible maximum, but to the expected number of such neighbors. It is defined as $s_{xy}^{LHN1} = |\Gamma(x) \cap \Gamma(y)| / (k_x k_y)$.

(8) Adamic-Adar Index (AA). This index refines the simple counting of common neighbors by assigning the less-connected neighbors more weight, and is defined as $s_{xy}^{AA} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} 1 / \log k_z$.

(9) Preferential Attachment Index (PA). The mechanism of preferential attachment can be used to generate evolving scale-free networks, where the probability that a new link is connected to the node $x$ is proportional to $k_x$. The probability that this new link will connect $x$ and $y$ is proportional to $k_x k_y$. Motivated by this mechanism, the corresponding similarity index can be defined as $s_{xy}^{PA} = k_x k_y$.

(10) Local Naive Bayes form of CN (LNBCN). It is defined as $s_{xy}^{LNBCN} = |\Gamma(x) \cap \Gamma(y)| \log s + \sum_{w \in \Gamma(x) \cap \Gamma(y)} \log R_w$, where $s = P(A_0) / P(A_1)$, $A_0$ and $A_1$ are the class variables of connection and disconnection, respectively, and $R_w$ is the role function of node $w$.
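The neighborhood-based indices above (Salton, Sørensen, HPI, HDI, LHN1, AA and PA) can all be obtained from the adjacency matrix in a few lines. The following is a minimal sketch, assuming an undirected, unweighted network; the helper names `_ratio` and `local_indices`, the convention of returning 0 when a denominator vanishes, and the toy graph are assumptions made for illustration. LNBCN is omitted because the role function $R_w$ is not specified in this section.

```python
import numpy as np

def _ratio(num, den):
    """Element-wise num / den, returning 0 wherever den is 0 (assumed convention)."""
    out = np.zeros_like(num, dtype=float)
    np.divide(num, den, out=out, where=den > 0)
    return out

def local_indices(A):
    """Common-neighbor-based similarity matrices for an undirected graph."""
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)                    # node degrees
    cn = A @ A                           # cn[x, y] = number of common neighbors
    kx, ky = k[:, None], k[None, :]      # broadcast degrees over all pairs
    # Adamic-Adar weights 1/log(k_z); a common neighbor of two distinct
    # nodes has degree >= 2, so degrees <= 1 can safely get weight 0.
    inv_log_k = np.zeros_like(k)
    inv_log_k[k > 1] = 1.0 / np.log(k[k > 1])
    return {
        "Salton": _ratio(cn, np.sqrt(kx * ky)),
        "Sorensen": _ratio(2 * cn, kx + ky),
        "HPI": _ratio(cn, np.minimum(kx, ky)),
        "HDI": _ratio(cn, np.maximum(kx, ky)),
        "LHN1": _ratio(cn, kx * ky),
        "AA": (A * inv_log_k) @ A,
        "PA": kx * ky,
    }

# Toy graph (hypothetical): triangle 0-1-2 with a pendant node 3 attached to 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
scores = local_indices(A)
```

For the unconnected pair (0, 3), whose only common neighbor is node 2, HPI gives 1 while HDI gives 0.5, which illustrates the opposite treatment of the higher-degree endpoint.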
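where $D$ is the degree matrix with $D_{xy} = \delta_{xy} k_x$ and $\phi$ ($0 < \phi < 1$) is a free parameter.

(13) Average Commute Time (ACT). Denote by $m(x, y)$ the average number of steps required by a random walker starting from node $x$ to reach node $y$. The average commute time between $x$ and $y$ is $n(x, y) = m(x, y) + m(y, x)$. It can be obtained from the pseudoinverse of the Laplacian matrix, $L^{+}$, and the similarity is defined as $s_{xy}^{ACT} = 1 / (l^{+}_{xx} + l^{+}_{yy} - 2 l^{+}_{xy})$, where $l^{+}_{xy}$ denotes the corresponding entry of $L^{+}$.

(14) Cosine based on $L^{+}$. This index is an inner-product-based measure. The cosine similarity is defined as the cosine of the node vectors, namely $s_{xy}^{\cos^{+}} = l^{+}_{xy} / \sqrt{l^{+}_{xx} \, l^{+}_{yy}}$.

(15) Random Walk with Restart (RWR). This index is a direct application of the PageRank algorithm, and it is defined as $s_{xy}^{RWR} = q_{xy} + q_{yx}$, where $q_{xy}$ is the probability that a random walker starting from node $x$ is located at node $y$ in the steady state.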
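The three random-walk-based indices above (ACT, cosine based on $L^{+}$, and RWR) can be sketched as follows. This is a hedged illustration that assumes a connected, undirected network. The function names are hypothetical, and the restart parameter `c` (the probability that the walker moves on rather than returning to its starting node) and its default value are assumptions, since this section does not give the RWR update rule. The sketch uses the commonly stated steady state $q_x = (1 - c)(I - cP^{T})^{-1} e_x$, with $P$ the row-normalized transition matrix.

```python
import numpy as np

def laplacian_pinv(A):
    """Pseudoinverse L+ of the graph Laplacian L = D - A."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.pinv(L)

def act_index(A):
    """ACT similarity 1 / (l+_xx + l+_yy - 2 l+_xy); the diagonal is set to 0."""
    Lp = laplacian_pinv(A)
    d = np.diag(Lp)
    denom = d[:, None] + d[None, :] - 2.0 * Lp
    S = np.zeros_like(denom)
    off = ~np.eye(len(d), dtype=bool)    # skip x == y, where denom is 0
    S[off] = 1.0 / denom[off]
    return S

def cos_plus_index(A):
    """Cosine similarity l+_xy / sqrt(l+_xx * l+_yy)."""
    Lp = laplacian_pinv(A)
    d = np.diag(Lp)
    return Lp / np.sqrt(np.outer(d, d))

def rwr_index(A, c=0.9):
    """RWR similarity q_xy + q_yx, with column x of Q holding q_x."""
    A = np.asarray(A, dtype=float)
    P = A / A.sum(axis=1, keepdims=True)         # row-stochastic transitions
    n = A.shape[0]
    Q = (1.0 - c) * np.linalg.inv(np.eye(n) - c * P.T)
    return Q + Q.T

# Toy connected graph (hypothetical): the path 0 - 1 - 2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
act, cos_plus, rwr = act_index(A), cos_plus_index(A), rwr_index(A)
```

On the path graph the ACT denominator equals the number of steps separating the two nodes, so adjacent nodes receive a score of 1 and the two end nodes a score of 0.5.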
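(16) Local Random Walk (LRW). To measure the similarity between nodes $x$ and $y$, a random walker is initially put on node $x$, and thus the initial density vector is $\pi_x(0) = e_x$. The LRW index at time step $t$ is thus defined as $s_{xy}^{LRW}(t) = q_x \pi_{xy}(t) + q_y \pi_{yx}(t)$, where $\pi_{xy}(t)$ is the probability that the walker starting from $x$ is located at $y$ at time step $t$, and $q$ is the initial configuration function.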
(17) Superposed Random Walk (SRW). Similar to the RWR index, Liu and Lü proposed the SRW index, where the random walker is continuously released at the starting point, resulting in a higher similarity between the target node and the nodes nearby. The mathematical expression reads $s_{xy}^{SRW}(t) = \sum_{\tau=1}^{t} s_{xy}^{LRW}(\tau)$, where $s_{xy}^{LRW}(\tau)$ is the LRW index at time step $\tau$ and $t$ denotes the number of time steps.
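The LRW and SRW indices can be sketched together, since SRW accumulates the LRW scores over time. This is a minimal illustration under two assumptions that are not stated in this section: the walker moves to a uniformly chosen neighbor at each step, and the initial configuration function is proportional to degree, $q_x = k_x / (2M)$ with $M$ the number of links. The function name `lrw_srw` and the toy graph are hypothetical.

```python
import numpy as np

def lrw_srw(A, t):
    """Return the LRW matrix at step t and the SRW matrix summed over steps 1..t."""
    if t < 1:
        raise ValueError("t must be at least 1")
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)
    q = k / k.sum()                 # assumed choice q_x = k_x / (2M)
    P = A / k[:, None]              # P[x, y]: probability of stepping from x to y
    Pi = np.eye(A.shape[0])         # row x holds pi_x(0) = e_x
    srw = np.zeros_like(A)
    for _ in range(t):
        Pi = Pi @ P                 # row x now holds pi_x at the next time step
        flow = q[:, None] * Pi      # flow[x, y] = q_x * pi_xy
        lrw = flow + flow.T         # q_x * pi_xy + q_y * pi_yx
        srw += lrw                  # superpose the walks released so far
    return lrw, srw

# Toy graph (hypothetical): triangle 0-1-2 with a pendant node 3 attached to 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
lrw3, srw3 = lrw_srw(A, t=3)
```

With the degree-proportional choice of $q$, the one-step LRW score reduces to the adjacency matrix divided by $M$, which is a convenient sanity check.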
(18) Matrix Forest Index (MFI). This index is defined as $S = (I + L)^{-1}$, where $L$ is the Laplacian matrix. The similarity between $x$ and $y$ can be understood as the ratio of the number of spanning rooted forests such that nodes $x$ and $y$ belong to the same tree rooted at $x$ to all spanning rooted forests of the network.
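A minimal sketch of the Matrix Forest Index, assuming the commonly used closed form $S = (I + L)^{-1}$ with $L = D - A$ the Laplacian of an undirected network. The function name `matrix_forest_index` and the toy graph are hypothetical.

```python
import numpy as np

def matrix_forest_index(A):
    """MFI similarity matrix S = (I + L)^-1, with L = D - A the graph Laplacian."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.inv(np.eye(A.shape[0]) + L)

# Toy undirected graph (hypothetical): the path 0 - 1 - 2 - 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
S = matrix_forest_index(A)
```

Because every row of $I + L$ sums to one, every row of $S$ also sums to one, which is consistent with reading each entry as a fraction of spanning rooted forests.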
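(19) CN based on transferring similarity (TSCN). This method is the CN index combined with the transferring similarity, and the similarity is defined as $S^{Tr} = (I - \epsilon S)^{-1} S$, where $S$ is the CN similarity matrix, $\epsilon$ is a free parameter, and $S^{Tr}_{xy}$ is the transferring similarity between $x$ and $y$.

We report the link prediction algorithms' robustness versus their AUC in Figs. S1-S4.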
We can see that LRW, SRW, RA and LNBRA have the highest AUC. However, high AUC does not guarantee robustness, as some of these methods are very sensitive to the noisy links in the network. On the other hand, LRW and SRW are almost as robust as the PA method, which only uses node degree for link prediction and is thus very little affected by noise. However, when R+ and Re are considered, the methods with high AUC tend to have low R+ and Re. This indicates that in these cases, one has to sacrifice some AUC in order to improve the robustness of the prediction results. The detailed values of AUC, R+, R− and Re are reported in Tables S2 and S3.

Figure S3. The dependence of the robustness of the algorithms R on the fraction of missing and noisy links in four real-world networks. The training set ratio in this figure is 80%.

Figure S5. The dependence of the robustness of the algorithms R on the fraction of missing and noisy links in four real-world networks. The results are obtained with the 10-fold cross validation.

Figure S7. The link prediction algorithms' robustness versus their AUC when applied to the USAir network.

Table S1. The robustness of link prediction algorithms in ten real networks. R−, R+ and Re are respectively the robustness of the algorithms with missing links, noisy links and swapped links.

III. SUPPLEMENTARY TABLES
The fraction of changed links here is 40%. The highest value for each network is highlighted in bold.