Similarity-based future common neighbors model for link prediction in complex networks

Link prediction aims to infer unknown links from the observed network information. However, most similarity-based algorithms utilize only the current common neighbor information and cannot achieve sufficiently high prediction accuracy in evolving networks. This paper therefore first defines the future common neighbors: nodes that can turn into common neighbors in the future. To analyse whether the future common neighbors contribute to current link prediction, we propose the similarity-based future common neighbors (SFCN) model for link prediction, which accurately locates all the future common neighbors besides the current common neighbors in networks and effectively measures their contributions. We also design and analyse three MATLAB simulation experiments. The first experiment, which adjusts the two parameter weights in the SFCN model, reveals that the future common neighbors make greater contributions than the current common neighbors in complex networks. The two further experiments, which compare the SFCN model with eight algorithms on five networks, demonstrate that the SFCN model has higher accuracy and better performance robustness.

they proposed the resource allocation (RA) index 23 . Motivated by the resource allocation dynamics on complex networks, the RA index effectively improves accuracy by restraining the contributions of large-degree common neighbors. Additionally, Liu et al. proposed a Local Naive Bayes (LNB) model 24 , which holds that different common neighbors play different roles and make different contributions. Based on the LNB model, they improved the CN, RA and AA indexes. As similarity indexes can predict links in networks, they can also be applied to evaluate the evolving mechanisms of evolving networks 1 .
Obviously, similarity-based algorithms for link prediction can predict future links by using the current common neighbor information 25 . However, under the prediction principles of the similarity indexes, some nodes that are currently not common neighbors can turn into common neighbors in the future. More importantly, these nodes raise a series of new questions worth exploring. First, do these nodes contribute to the current prediction for a node pair? Although previous algorithms have proved that the current common neighbors can promote two nodes to connect, it remains unclear whether nodes that are not yet common neighbors but can become common neighbors in the future are also helpful for current link prediction. Second, if they do contribute, how can we simultaneously locate these nodes and measure their contributions? Previous algorithms can easily count the current common neighbors by analyzing the network topology alone. However, the nodes described above have not yet become common neighbors, and they can exhibit different topological structures together with the target node pair and its surrounding nodes. This makes it challenging to locate these nodes and measure their contributions with a simple method.
To address the above problems, we first define the nodes that are currently not common neighbors but can turn into common neighbors in the future as the future common neighbors, and divide them into three types according to their topological structure with other nodes. Second, we propose the similarity-based future common neighbors (SFCN) model for link prediction. The SFCN model accurately finds all the future common neighbors besides the current common neighbors, and simultaneously measures their contributions using only existing similarity indexes. We also design and analyse three MATLAB simulation experiments. First, we conduct a priori experiment on α and β in the FWFB network. The results provide strong evidence that the future common neighbors make a greater positive contribution than the current common neighbors in complex networks. Second, by comparing the SFCN model with eight similarity-based algorithms on five networks, we find that the SFCN model has higher prediction accuracy overall. Third, the experiments in which we change the ratio of the training set to the probe set in five networks demonstrate that the SFCN model also has better performance robustness. Therefore, the proposed SFCN model has higher accuracy and performance robustness than popular algorithms, and the future common neighbors need to be considered for link prediction in evolving networks.

Network and problem description. A network can be represented by an undirected graph G(V, E)
without self-connections or multiple links between node pairs. In G(V, E), V is the set of nodes and E is the set of links, so |V| denotes the number of nodes. Define the fully connected network U, which contains |V|(|V| − 1)/2 links; U − E is then the set of nonexistent links. To evaluate the prediction accuracy of algorithms, we randomly divide the observed link set E into a training set E^T and a probe set E^P. E^T is the known information while E^P is the unknown information. Obviously, E = E^P ∪ E^T and E^P ∩ E^T = ∅. The purpose of link prediction is to accurately detect the missing links or future links in U − E. Each node pair (x, y) in U is given a score s_{x,y} calculated by the link prediction algorithm. All nonexistent links are sorted in descending order of their scores, and the links at the top are the most likely to exist.
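The random division of E into E^T and E^P can be sketched as follows. This is a minimal illustration in Python (the paper's own experiments use MATLAB); the function name, seed handling and toy edge list are ours.

```python
import random

def split_edges(edges, probe_fraction=0.1, seed=0):
    """Randomly divide the observed link set E into a training set E^T
    and a probe set E^P so that E = E^T ∪ E^P and E^T ∩ E^P = ∅."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_probe = int(round(probe_fraction * len(shuffled)))
    probe = set(shuffled[:n_probe])       # unknown information E^P
    train = set(shuffled[n_probe:])       # known information E^T
    return train, probe

# toy undirected network with 10 observed links
edges = {(1, 2), (1, 3), (2, 3), (3, 4), (4, 5),
         (5, 6), (2, 6), (1, 6), (3, 6), (4, 6)}
E_T, E_P = split_edges(edges, probe_fraction=0.1)
```

In the first two experiments of the paper the probe fraction is 0.1, i.e. 90% of the links form E^T.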
The future common neighbors. Most similarity-based algorithms for link prediction predict future links by using the current common neighbors. However, under the prediction principles of the above similarity indexes, some nodes that are currently not common neighbors can turn into common neighbors in the future. To analyze whether such nodes contribute to the current prediction for a node pair, and to accurately locate these nodes and measure their contributions, we define them as the future common neighbors and propose the similarity-based future common neighbors model for link prediction in evolving networks.
The future common neighbors are nodes that are currently not common neighbors but can turn into common neighbors in the future under the principle of the similarity index. According to their topology with other nodes, the future common neighbors are divided into the three types shown in Fig. 1, where x and y are the target node pair for link prediction. The first type, like node i in Fig. 1(a), has a direct link with x but no direct link with y. Currently, the similarity score between i and y is s_{i,y}. According to the prediction principle of the similarity algorithm, i and y may form a link in the future (the greater s_{i,y}, the greater the probability of forming a link). Therefore, i will have connected with y and turned into a common neighbor of x and y at a future time. The second type, like node i in Fig. 1(b), has a direct link with y but no direct link with x. The third type connects with neither x nor y, as node i in Fig. 1(c). According to the existing similarity indexes for link prediction, if s_{x,i} and s_{i,y} are large enough, i will form links with both x and y. Thus, the node i in Fig. 1(c) is also a common neighbor of x and y at a future time.
Similarity-based future common neighbors model. Combining the topology of the future common neighbors with the similarity-based indexes, this paper designs the similarity-based future common neighbors model. The model accurately finds all the future common neighbors in complex networks and effectively measures their contributions.
Taking the chaotic network in Fig. 2 as an example, for node i (i = 1, 2, 3, …, |V|), we assume that s^{C2}_{x,i} is the similarity score between x and i calculated by a similarity algorithm C2. Therefore, s^{C2}_{x,i} also symbolizes the possibility of a link forming between x and i, under the principle of C2, as the network evolves. We let r_{i,y} indicate whether i and y are connected (r_{i,y} = 1 if i and y are connected, otherwise r_{i,y} = 0), which can be obtained from the observed network. Similarly, s^{C2}_{i,y} is the similarity score between i and y calculated by C2, and r_{x,i} indicates whether x and i are connected.
The SFCN model identifies the above three types of future common neighbors in a chaotic network by employing their topological rules. (1) Node i is the first type of future common neighbor only when r_{x,i} · s^{C2}_{i,y} ≠ 0. It is necessary to note that we set s^{C2}_{i,y} = 0 and r_{i,y} = 0 when i = x or i = y in order to keep the network free of self-connections. (2) The remaining rules follow by analogy: i is the second type of future common neighbor if r_{i,y} · s^{C2}_{x,i} ≠ 0. (3) And i is the third type of future common neighbor if and only if s^{C2}_{x,i} · s^{C2}_{i,y} ≠ 0. To accumulate the contributions of the future common neighbors that meet the above rules, we construct four vectors for x and y in eqs 1–4:

Γ_x = (r_{x,1}, r_{x,2}, …, r_{x,|V|})  (1)

S_x^{C2} = (s^{C2}_{x,1}, s^{C2}_{x,2}, …, s^{C2}_{x,|V|})  (2)

(Γ_y)^T = (r_{1,y}, r_{2,y}, …, r_{|V|,y})^T  (3)

(S_y^{C2})^T = (s^{C2}_{1,y}, s^{C2}_{2,y}, …, s^{C2}_{|V|,y})^T  (4)
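The three identification rules can be sketched as a small Python predicate. This is an illustrative classifier under our reading of the rules (the function name is ours); it takes the connection indicators r and the C2 similarity scores for one candidate node i and the target pair (x, y).

```python
def future_cn_type(r_xi, r_iy, s_xi, s_iy):
    """Classify node i relative to the target pair (x, y):
    type 1: linked to x, positive similarity to y  (r_{x,i} * s^{C2}_{i,y} != 0);
    type 2: linked to y, positive similarity to x  (r_{i,y} * s^{C2}_{x,i} != 0);
    type 3: linked to neither, positive similarity to both
            (s^{C2}_{x,i} * s^{C2}_{i,y} != 0).
    Returns 0 for a current common neighbor or an unrelated node."""
    if r_xi and r_iy:
        return 0                    # current common neighbor, not a future one
    if r_xi and s_iy != 0:
        return 1
    if r_iy and s_xi != 0:
        return 2
    if not r_xi and not r_iy and s_xi != 0 and s_iy != 0:
        return 3
    return 0
```

A node connected to both x and y is a current common neighbor and is handled by the C1 term of the model, not by these rules.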
where the superscript T denotes transposition. Γ_x stores the connections of x to all nodes, and S_x^{C2} stores the similarity scores between x and all nodes. Similarly, (Γ_y)^T stores the connections of y to all nodes, and (S_y^{C2})^T stores the similarity scores between y and all other nodes.
Therefore, we obtain the similarity-based future common neighbors model as eq. 5:

s_{x,y} = α · s^{C1}_{x,y} + β · [Γ_x (S_y^{C2})^T + S_x^{C2} (Γ_y)^T + S_x^{C2} (S_y^{C2})^T]  (5)

where s^{C1}_{x,y} is the similarity score between x and y calculated by any similarity algorithm, which we temporarily mark as C1. C1 and C2 are two similarity algorithms, and they can be the same or different. The two free parameters, α and β, adjust the contributions of the current common neighbors and the future common neighbors, respectively. When α ≠ 0 and β = 0, the model considers only the current common neighbor contributions; when α = 0 and β ≠ 0, it considers only the future common neighbor contributions. As a special case, when both C1 and C2 are the CN algorithm and only the first type of future common neighbor is considered, the model degenerates into the LP index. In a word, the SFCN model, employed in evolving networks, takes into account the contributions of the future common neighbors besides the current common neighbors.

Example 1. This section gives an example of how to find the future common neighbors of a node pair and how to measure the contributions of the three types of future common neighbors. Suppose C1 and C2 are the LHN and RA algorithms, respectively. Take the network in Fig. 2 as an example and treat nodes (1, 3) as the target node pair (x, y). From the four vectors and the calculation process (eq. 10), it is easy to observe that only node 4 is a first-type future common neighbor. We can also observe from eq. 11 that only node 5 is a second-type future common neighbor. Finally, eq. 12 shows that nodes 8 and 10 are third-type future common neighbors.

Evaluation Metrics. In the experiments, we introduce two standard metrics to quantify prediction accuracy: the AUC 26 (area under the receiver operating characteristic curve) and precision 27 .
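The SFCN score of eq. 5 can be sketched in pure Python as follows. This is a minimal sketch under our reading of the model (the paper's experiments use MATLAB); it assumes the C2 similarity matrix is precomputed with zero entries for already-connected pairs, and the function name and toy matrices are ours.

```python
def sfcn_score(adj, sim2, s1_xy, x, y, alpha, beta):
    """SFCN score of eq. 5.  adj[i][j] = r_{i,j}; sim2[i][j] = s^{C2}_{i,j},
    taken as 0 when i and j are already connected.  alpha weights the
    current-common-neighbor score s^{C1}_{x,y}; beta weights the three
    future-common-neighbor terms."""
    n = len(adj)
    # enforce s^{C2} = 0 for i in {x, y}, keeping the network self-connection-free
    s_x = [0.0 if i in (x, y) else sim2[x][i] for i in range(n)]
    s_y = [0.0 if i in (x, y) else sim2[y][i] for i in range(n)]
    type1 = sum(adj[x][i] * s_y[i] for i in range(n))  # Γ_x (S_y^{C2})^T
    type2 = sum(s_x[i] * adj[y][i] for i in range(n))  # S_x^{C2} (Γ_y)^T
    type3 = sum(s_x[i] * s_y[i] for i in range(n))     # S_x^{C2} (S_y^{C2})^T
    return alpha * s1_xy + beta * (type1 + type2 + type3)

# toy path network 0-1-2-3; sim2 holds CN scores of unconnected pairs
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
sim2 = [[0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]]
score = sfcn_score(adj, sim2, 0.0, 0, 3, 1.0, 1.0)
```

For the pair (0, 3) in this toy network, node 1 contributes as a first-type future common neighbor and node 2 as a second-type one.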
The AUC evaluates algorithm performance according to the whole ranked list. It can be interpreted as the probability that a link randomly chosen from the probe set E^P receives a higher score than a link randomly chosen from the nonexistent links U − E. In n independent comparisons, we select one link from E^P and one from U − E and denote their similarity scores by S1 and S2. When S1 > S2, set n′ = n′ + 1; when S1 = S2, set n″ = n″ + 1 (n′ and n″ are initialized to 0). The AUC is then defined as eq. 13:

AUC = (n′ + 0.5 n″) / n  (13)

Different from the AUC, precision focuses on the links with the top ranks, i.e. the highest scores. It is the ratio of correct links recovered among the top L links in the candidate list generated by each link predictor. Assume L_r links are accurately predicted among the top-L links. Then precision is defined as eq. 14:

Precision = L_r / L  (14)

Datasets of real networks. In order to compare the prediction accuracy of the SFCN model with the eight mainstream indexes mentioned in this paper, we run MATLAB simulation experiments on five real networks: the network of scientific collaboration (NS) 28 , the US political blogs network (PB) 29 , the protein interaction network (Yeast) 30 , the neural network of C. elegans (CE) 31 , and the food web of Florida Bay (FWFB) 32 . All datasets of the five networks can be found in the electronic supplementary material. The basic features of these networks are summarized in Table 1.
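The two metrics of eqs 13 and 14 can be sketched directly from their definitions. This is an illustrative Python sketch (function names are ours); scores maps each candidate link to its predicted score.

```python
import random

def auc_score(scores, probe, nonexistent, n=10000, seed=0):
    """AUC of eq. 13: in n independent comparisons draw one link from E^P
    and one from U - E; n1 counts wins, n2 counts ties."""
    rng = random.Random(seed)
    probe, nonexistent = list(probe), list(nonexistent)
    n1 = n2 = 0
    for _ in range(n):
        s1 = scores[rng.choice(probe)]
        s2 = scores[rng.choice(nonexistent)]
        if s1 > s2:
            n1 += 1
        elif s1 == s2:
            n2 += 1
    return (n1 + 0.5 * n2) / n

def precision_at_L(ranked_links, probe, L):
    """Precision of eq. 14: L_r correctly predicted links among the top-L."""
    L_r = sum(1 for link in ranked_links[:L] if link in probe)
    return L_r / L
```

An AUC of 0.5 corresponds to random guessing, so any informative predictor should score above 0.5.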
The metrics that characterize the networks are given in the caption of Table 1. We find that NS, PB and CE share similar characteristics, including a high clustering coefficient. For the FWFB network, by contrast, the predator-prey relations give the network a larger average degree and a shorter average distance between node pairs.
Existing similarity indexes based on topological structure. Here, we introduce eight mainstream similarity indexes to compare with the SFCN model.
• CN. Let Γ_x be the set of neighbors of x. The CN index assumes that a node pair (x, y) is more likely to connect if the two nodes have more common neighbors, namely:

s^{CN}_{x,y} = |Γ_x ∩ Γ_y|

• Salton 20 . It is defined as:

s^{Salton}_{x,y} = |Γ_x ∩ Γ_y| / √(k_x k_y)

where k_x is the degree of node x.

• RA 23 . The RA index assumes that each transmitter has a unit of resource which is equally distributed to all its neighbors, giving:

s^{RA}_{x,y} = Σ_{z∈Γ_x∩Γ_y} 1/k_z

• HPI 22 . It is defined as:

s^{HPI}_{x,y} = |Γ_x ∩ Γ_y| / min(k_x, k_y)

• HDI. It is defined as:

s^{HDI}_{x,y} = |Γ_x ∩ Γ_y| / max(k_x, k_y)

• Leicht-Holme-Newman index (LHN) 21 . The LHN index is defined as:

s^{LHN}_{x,y} = |Γ_x ∩ Γ_y| / (k_x k_y)

• LNBRA 24 . The LNBRA index is an improvement of the RA index based on the LNB model, defined as:

s^{LNBRA}_{x,y} = Σ_{z∈Γ_x∩Γ_y} (1/k_z)(log η + log R_z)

where η is the ratio of disconnected to connected node pairs in the network, and R_z = (N_{Δz} + 1)/(N_{▽z} + 1), with N_{Δz} and N_{▽z} respectively the numbers of connected and disconnected node pairs that have the common neighbor z.

• Local Path (LP) 33 . This index considers paths of different orders, defined as:

s^{LP}_{x,y} = (A²)_{x,y} + α(A³)_{x,y}

where α is an adjustable parameter and A is the adjacency matrix of the network; (A^i)_{x,y} is the number of paths of length i between x and y.
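Several of these neighborhood-based indexes reduce to simple operations on neighbor sets. The following Python sketch (our naming; the graph is a dict mapping each node to its neighbor set) illustrates four of them:

```python
def cn(G, x, y):
    """CN index: number of current common neighbors |Γ_x ∩ Γ_y|."""
    return len(G[x] & G[y])

def salton(G, x, y):
    """Salton index: |Γ_x ∩ Γ_y| / sqrt(k_x * k_y)."""
    return len(G[x] & G[y]) / (len(G[x]) * len(G[y])) ** 0.5

def ra(G, x, y):
    """RA index: each common neighbor z contributes 1/k_z of its resource."""
    return sum(1.0 / len(G[z]) for z in G[x] & G[y])

def lhn(G, x, y):
    """LHN index: |Γ_x ∩ Γ_y| / (k_x * k_y)."""
    return len(G[x] & G[y]) / (len(G[x]) * len(G[y]))

# toy graph: triangle 1-2-3 with a pendant node 4 attached to 3
G = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
```

HPI and HDI follow the same pattern with min(k_x, k_y) and max(k_x, k_y) as denominators.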
Experiments and performance analysis. In this section, we carry out three experiments and the corresponding analysis for three purposes. In the first and second experiments, E^T contains 90% of the links, while the remaining 10% constitute E^P. In addition, all of the following results are averaged over 100 independent runs. For the first experiment, to verify whether the contribution of the future common neighbors is necessary, we conduct a priori experiment on α and β in the FWFB network. In step 1, since the training set E^T is known, we divide it into a sub-training set E^{T1} and a sub-probe set E^{P1} in order to learn the values of α and β in step 2.
In step 2, we apply the SFCN model to the sub-training set in order to obtain the similarity scores of the sub-probe set and get the AUC as it varies with α and β. In this way, it is easy to select values of α and β with high AUC for the SFCN model. The experimental results are shown in Fig. 3. Before that, it is necessary to consider the two limiting cases. When α = 0, the current common neighbors in the SFCN model make no contribution and only the future common neighbors contribute. When β = 0, only the current common neighbors contribute and the future common neighbors make no contribution to link prediction. We can draw two results from Fig. 3. First, the AUC when β = 0 is much lower than that when β ≠ 0. Second, the SFCN model obtains the highest AUC when α and β are adjusted to suitable values. For example, for the SFCN-CN-RA, SFCN-Salton-HDI and SFCN-LNBRA-LHN algorithms, we should set α smaller and β larger to get higher prediction accuracy in the FWFB network. These two results illustrate the important contribution of the future common neighbors.
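The two-step a priori procedure amounts to a grid search over (α, β) on the sub-training/sub-probe split. A minimal sketch, assuming an evaluate_auc callback that scores one parameter pair on E^{P1} (the function names are ours):

```python
def tune_alpha_beta(evaluate_auc, alphas, betas):
    """Grid-search the two free parameters: evaluate_auc(alpha, beta)
    returns the AUC on the sub-probe set E^{P1}; the pair with the
    highest AUC is kept for the later experiments."""
    return max(((a, b) for a in alphas for b in betas),
               key=lambda ab: evaluate_auc(*ab))

# toy objective with a unique optimum at (9, 1), mimicking the values
# selected in the paper's later experiments
best = tune_alpha_beta(lambda a, b: -((a - 9) ** 2 + (b - 1) ** 2),
                       range(11), range(4))
```

Only the ratio α/β matters for the ranking of candidate links, so a one-dimensional search over β/α would be equivalent.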
Therefore, in the second and third experiments below, we set α = 9 and β = 1, which meet the above condition. The second experiment compares the SFCN model with the eight other similarity-based indexes: CN, HDI, HPI, LP, RA, Salton, LNBRA and LHN. The prediction results for AUC and precision are listed in Tables 2 and 3, respectively. Most of the comparisons in Table 2 clearly demonstrate that the SFCN model has the best, or close to the best, AUC, especially in the FWFB and Yeast networks. Taking the FWFB network as an example, Table 1 shows that it has 2075 links but only 128 nodes, with an average degree as high as 32.422 and an average clustering coefficient as low as 0.3346. This indicates many random connections and highly obscure similarity between clusters in the FWFB network, which is why all of its nodes tend to gather and form unknown clusters as the network evolves. The SFCN model captures this evolution tendency through the principle of the similarity index; in particular, it greatly improves the AUC in the FWFB network by regarding the future common neighbors as the direction of evolution. Moreover, Table 3 demonstrates that 90% of the precision results predicted by the SFCN model are equal to or higher than those of the corresponding original algorithms. For example, the precision of the SFCN-HDI-RA algorithm is much higher than that of the original HDI algorithm in most networks, because the contributions of the future common neighbors are taken into account. Finally, to explore robustness, we change the ratio of the training set to the probe set in the third experiment. The lower the ratio, the more link information must be predicted 34 ; that is, a small ratio means fewer known links and more unknown links.
Two results can be readily obtained from Fig. 4. On the one hand, at the same ratio, the algorithms based on the SFCN model achieve higher prediction accuracy (measured by AUC) than their corresponding original algorithms. For instance, the SFCN-LHN-RA, SFCN-LHN-LP and SFCN-LHN-HDI algorithms have higher AUC than the original LHN algorithm at the same ratio. On the other hand, even when the ratio is low, the algorithms based on the SFCN model still obtain high AUC, which indicates that the SFCN model has higher stability. Therefore, the SFCN model delivers better prediction accuracy and stability even when little link information is available.

Discussion
Exploring which factors can have a positive impact on link prediction is an important and challenging problem. In this paper, we first identify the existence of the future common neighbors, which are classified into three types according to their topological structure with other nodes. Then, to investigate whether the future common neighbors can make a positive contribution to current link prediction, we propose the similarity-based future common neighbors (SFCN) model, which accurately locates all the future common neighbors besides the current common neighbors and effectively measures their contributions in complex networks.
We design three MATLAB simulation experiments for three different purposes. First, we conduct a priori experiment on α and β in the FWFB network. The results provide strong evidence that the future common neighbors make a greater contribution than the current common neighbors in complex networks. In the second experiment, we compare the SFCN model with eight algorithms on five networks, finding that the SFCN model has higher prediction accuracy, especially in terms of AUC in the FWFB and Yeast networks. Third, to verify whether the SFCN model can retain good accuracy when little link information is known, we change the ratio of the training set to the probe set in the five networks. The experimental results show that the SFCN model has better performance robustness than the eight similarity-based algorithms, even when the ratio is as low as 0.45. Therefore, the proposed SFCN model has higher accuracy and performance robustness than popular similarity-based algorithms, and the future common neighbors make a greater positive contribution than the current common neighbors that are widely used today.
Some extensions of this work deserve further exploration. One limitation is that we restrict attention to the current common neighbors and the future common neighbors in evolving networks; it would be meaningful to study the contributions of future nodes and future links as well. For example, current path-based algorithms only consider the contribution of currently existing paths, so it is worthwhile to further investigate whether, and how much, future paths, which do not exist at present but will exist after one round of prediction, can positively affect current link prediction.

Methods
Algorithm of the SFCN model for link prediction. The adjacency matrix of the complex network is stored as a sparse matrix, and the pseudocode of the SFCN model is presented in Algorithm 1.

Complexity analysis. This part gives a simple complexity analysis of the proposed SFCN model. The most time-consuming part is computing the contributions of the future common neighbors. The time cost of computing (S_y^{C2})^T is O(|V|²), and likewise O(|V|²) for S_x^{C2}. Thus the total time cost of the future common neighbor terms over all node pairs is 3·O(|V|³). Since a complex network can be represented as a sparse matrix, the final computational cost is much lower than 3·O(|V|³).