A Scalable Similarity-Popularity Link Prediction Method

Link prediction is the task of computing the likelihood that a link exists between two given nodes in a network. With countless applications in different areas of science and engineering, link prediction has received the attention of many researchers working in various disciplines. Considerable research efforts have been invested into the development of increasingly accurate prediction methods. Most of the proposed algorithms, however, have limited use in practice because of their high computational requirements. The aim of this work is to develop a scalable link prediction algorithm that offers a higher overall predictive power than existing methods. The proposed solution falls into the class of global, parameter-free similarity-popularity-based methods, and in it, we assume that network topology is governed by three factors: the popularity of the nodes, their similarity, and the attraction induced by the local neighbourhood. In our approach, popularity and neighbourhood-caused attraction are computed directly from the network topology and factored out by introducing a specific weight map, which is then used to estimate the dissimilarity between non-adjacent nodes through shortest path distances. We show through extensive experimental testing that the proposed method produces highly accurate predictions at a fraction of the computational cost required by existing global methods and at a low additional cost compared to local methods. The scalability of the proposed algorithm is demonstrated on several large networks having hundreds of thousands of nodes.


Implementation
All link prediction methods used in the experimental evaluation are implemented in C++, and all experiments are conducted in a Linux environment:
• We implemented in C++ the proposed algorithms and all the topological similarity methods used in the experiments. The description of the topological similarity methods used in the experimental evaluation is given in Table 1.
• For HRG 9, we used the code provided by the authors 1.
• For SBM 10, we used the C code provided by the authors 2. This algorithm has a single parameter, the number of iterations, which we set to the default value of 10000.
• For FBM 11, we translated the Matlab code provided by the authors 3 into C++. This algorithm has a single parameter, the number of iterations, which we set to the default value of 50.
• For HyperMap (HYP) 12,13, we used the code provided by the authors 4. This algorithm has five parameters, which are all set according to the guidelines provided by the authors:
1. m: represents the average number of nodes to which new nodes connect and is set to the minimum node degree.
2. L: represents the average number of nodes to which old nodes connect and is set to L = (k̄ − 2m)/2, where k̄ is the average node degree.
3. γ: the exponent of the power-law degree distribution. In our case, it is estimated using plfit, a C++ implementation by Tamas Nepusz 5 of the Clauset, Shalizi and Newman 14 method for fitting power-law distributions.
4. T: controls the average clustering and is set to 0.8.
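For illustration, the settings of m and L described above can be derived from a network's degree sequence as in the following sketch (the struct and function names are ours, not part of the HyperMap code):

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Illustrative sketch: derive the HyperMap parameters m and L from a
// degree sequence following the guidelines above:
//   m = minimum node degree,  L = (kbar - 2m) / 2,  kbar = average degree.
struct HypParams {
    double m;  // average number of nodes to which new nodes connect
    double L;  // average number of nodes to which old nodes connect
};

HypParams estimate_hyp_params(const std::vector<int>& degrees) {
    double m = *std::min_element(degrees.begin(), degrees.end());
    double kbar = std::accumulate(degrees.begin(), degrees.end(), 0.0)
                  / static_cast<double>(degrees.size());
    return {m, (kbar - 2.0 * m) / 2.0};
}
```

The γ parameter is estimated separately with plfit, and T is a fixed constant, so neither appears in this sketch.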

Data
Several real networks have been used in the experimental evaluation of the proposed method. These networks are publicly available through different data repositories [2][3][4][5][6][7][8]. Table 2 contains the description and some important structural properties of these networks. Statistics on the CAIDA AS relationships networks are shown in Table 3. Notice that, whereas some networks such as Zachary's Karate Club or DNA citation are small in size, networks such as Amazon, Twitter Follows or US Patents contain hundreds of thousands of nodes. These large networks obviously cannot be used to test the performance of methods such as SBM or HRG due to their high computational requirements, and are therefore only used to evaluate the scalable methods. Each data point in the plots included in the main paper is the average result of several test runs. In each test run, 10% of the edges are randomly removed from the network. The number of test runs is fixed to 1000 for small networks (having fewer than 1000 nodes) and 100 for networks with a higher node count.
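The test-run protocol can be sketched as follows. This is an illustrative sketch, not the actual benchmark code; following the table captions, it assumes that edge removal must keep the network connected, so it repeatedly picks a random edge and removes it only if connectivity is preserved:

```cpp
#include <queue>
#include <random>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;

// Breadth-first search connectivity check on an undirected graph with n nodes.
bool is_connected(int n, const std::vector<Edge>& edges) {
    std::vector<std::vector<int>> adj(n);
    for (const Edge& e : edges) {
        adj[e.first].push_back(e.second);
        adj[e.second].push_back(e.first);
    }
    std::vector<bool> seen(n, false);
    std::queue<int> q;
    q.push(0);
    seen[0] = true;
    int visited = 1;
    while (!q.empty()) {
        int u = q.front();
        q.pop();
        for (int v : adj[u])
            if (!seen[v]) { seen[v] = true; ++visited; q.push(v); }
    }
    return visited == n;
}

// Illustrative sketch of one test run: remove 10% of the edges at random
// while keeping the network connected; the removed edges form the test set.
// Assumes the network has enough removable (cycle) edges to reach the target.
std::vector<Edge> sample_test_edges(int n, std::vector<Edge> edges,
                                    std::mt19937& rng) {
    const std::size_t target = edges.size() / 10;
    std::vector<Edge> removed;
    while (removed.size() < target) {
        std::uniform_int_distribution<std::size_t> pick(0, edges.size() - 1);
        std::size_t i = pick(rng);
        std::vector<Edge> rest = edges;
        rest.erase(rest.begin() + static_cast<std::ptrdiff_t>(i));
        if (is_connected(n, rest)) {
            removed.push_back(edges[i]);
            edges = std::move(rest);
        }
    }
    return removed;
}
```

The copy-and-retry strategy is kept deliberately simple for readability; a scalable implementation would test connectivity incrementally rather than re-running a full BFS per candidate edge.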

Detailed results
This section contains detailed, per-network results of the experiments reported in the paper. Tables 4, 5 and 6 show, respectively, the AUPR, AUROC and top-precision obtained using different values of the horizon cutoff on small networks. Table 7 shows the top-precision obtained using different values of the horizon cutoff on large networks. Table 8 shows the per-network results for the comparison of Algorithm 1 against global link prediction methods on 18 small networks. Table 9 shows the comparison of the proposed algorithm against local methods on 40 small networks. Results on 40 large networks are shown in Table 10, whereas Table 11 shows the results on 26 very large networks.

Further results
This section contains additional experimental results. Namely, we report the results obtained on small networks for all 12 methods we compare against. Table 12 contains the average top-precision obtained for each network as well as the average significant rank, whereas Table 13 shows the associated statistical significance tests.

Comparison with the repulsion-attraction pre-weighting rules
In 1, the authors present fast algorithms for embedding large graphs in the hyperbolic circle. They propose the repulsion-attraction rule (RA), which uses local neighbourhood information to estimate the distances between connected nodes. These distances are used to pre-weight the network links. The repulsive force acts between adjacent nodes with high degrees and a low number of common neighbours, because such nodes represent hubs and should be geometrically far apart. On the contrary, the attractive force acts between adjacent nodes with a high number of common neighbours, because such nodes are most likely similar and should be geometrically close. Formally, given the network's adjacency matrix A, the weight w_ij of the link (i, j) can be computed using one of the following formulas:

RA1: w_ij = (κ_i + κ_j + κ_i κ_j) / (1 + |Γ_ij|),
RA2: w_ij = (1 + e_i + e_j + e_i e_j) / (1 + |Γ_ij|),

where κ_i is the degree of node i; e_i is the external degree of node i (links of i neither to Γ_ij nor to j); and Γ_ij is the set of common neighbours of nodes i and j. We compare the performance obtained when using RA1 and RA2 to pre-weight the network links instead of the rule used in Algorithm 1. We use 38 real networks to conduct the comparison. Table 14 contains the average top-precision obtained for each network as well as the average significant rank. Table 15 shows the associated statistical significance tests. The results show that although the RA1 and RA2 rules perform better than a random predictor, meaning they do possess a certain predictive power, they rarely outperform the proposed rule. Furthermore, the difference is statistically significant and considerable in most networks.
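A minimal sketch of the two pre-weighting rules follows. It assumes the standard formulation of RA1 and RA2 from the coalescent-embedding literature (RA1 combines degrees, RA2 external degrees, both normalized by the number of common neighbours); all function names are illustrative:

```cpp
#include <set>
#include <vector>

// Sketch of the repulsion-attraction pre-weighting rules, assuming the
// standard formulation (an assumption on our part; see the referenced work):
//   RA1(i,j) = (k_i + k_j + k_i * k_j) / (1 + |CN_ij|)
//   RA2(i,j) = (1 + e_i + e_j + e_i * e_j) / (1 + |CN_ij|)
// where k_i is the degree of i, CN_ij the common neighbours of i and j, and
// e_i the external degree of i (links of i neither to j nor to CN_ij).

using AdjList = std::vector<std::set<int>>;

static int common_neighbours(const AdjList& adj, int i, int j) {
    int cn = 0;
    for (int k : adj[i])
        if (adj[j].count(k)) ++cn;
    return cn;
}

double ra1(const AdjList& adj, int i, int j) {
    double ki = static_cast<double>(adj[i].size());
    double kj = static_cast<double>(adj[j].size());
    return (ki + kj + ki * kj) / (1.0 + common_neighbours(adj, i, j));
}

double ra2(const AdjList& adj, int i, int j) {
    // external degree of u with respect to v: neighbours of u that are
    // neither v itself nor common neighbours of u and v
    auto ext = [&adj](int u, int v) {
        int e = 0;
        for (int k : adj[u])
            if (k != v && !adj[v].count(k)) ++e;
        return e;
    };
    double ei = ext(i, j), ej = ext(j, i);
    return (1.0 + ei + ej + ei * ej) / (1.0 + common_neighbours(adj, i, j));
}
```

Both rules assign large weights (long estimated distances) to hub-hub links with few shared neighbours and small weights to links embedded in dense common neighbourhoods.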

Table 1. Topological similarity methods used in the experimental evaluation and the score each assigns to the couple (i, j).
• Adamic-Adar index (ADA): Σ_{k ∈ Γ_ij} 1/log κ_k, where Γ_ij is the set of nodes adjacent to both i and j, and κ_k is the degree of node k.
• Common neighbours (CNE): |Γ_ij|.
• Cannistraci-Hebb index (CH), where Γ_k is the set of nodes adjacent to node k.
• Preferential attachment index (PAT): κ_i · κ_j.
• Resource allocation index (RAL): Σ_{k ∈ Γ_ij} 1/κ_k.
• Sorensen index (SOI): 2|Γ_ij| / (κ_i + κ_j).

Macaque Neural 39. The network represents the macaque brain network. Each node corresponds to a brain region and each link represents a long-distance connection between two brain regions.
The dataset is available at http://www.pnas.org/content/107/30/13485.full (the network is named Macaque LongDistance Network connectivity.edgelist by the authors).

Table 4. Effect of the horizon cut-off h on the performance of Algorithm 1 in small networks (AUPR). We report the AUPR for different values of h averaged over 1000 test runs where 10% of the edges are removed while keeping the network connected. For every network, the results having the best significant rank with p = 0.05 are shown in bold. The last row shows the average significant rank over all networks (the lower the better). The columns n and m contain the number of nodes and edges in the network.

Table 5. Effect of the horizon cut-off h on the performance of Algorithm 1 in small networks (AUROC). We report the AUROC for different values of h averaged over 1000 test runs where 10% of the edges are removed while keeping the network connected. For every network, the results having the best significant rank with p = 0.05 are shown in bold. The last row shows the average significant rank over all networks (the lower the better). The columns n and m contain the number of nodes and edges in the network.

Table 6. Effect of the horizon cut-off h on the performance of Algorithm 1 in small networks (top-precision). We report the top-precision for different values of h averaged over 1000 test runs where 10% of the edges are removed while keeping the network connected. For every network, the results having the best significant rank with p = 0.05 are shown in bold. The last row shows the average significant rank over all networks (the lower the better). The columns n and m contain the number of nodes and edges in the network.

Table 7. Effect of the horizon cut-off h on the performance of Algorithm 1 in large networks.
We report the top-precision for different values of h averaged over 100 test runs where 10% of the edges are removed while keeping the network connected. For every network, the results having the best significant rank with p = 0.05 are shown in bold. The last row shows the average significant rank over all networks (the lower the better). The columns n and m contain the number of nodes and edges in the network.

Table 9. Comparison of Algorithm 1 with local link prediction methods on small networks. We report top-precision averaged over 1000 test runs where 10% of the edges are removed at each run and used as test set. For every network, the results having the best significant rank with p = 0.05 are shown in bold. The last row shows the average significant rank over all networks (the lower the better). The columns n and m contain the number of nodes and edges in the network.

Table 14. Comparison of the proposed edge weight method against the RA1 and RA2 rules. We report top-precision averaged over 100 test runs where 10% of the edges are removed at each run and used as test set. For every network, the results having the best significant rank with p = 0.05 are shown in bold. The last row shows the average significant rank over all networks (the lower the better). The columns n and m contain the number of nodes and edges in the network.
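For reference, the local similarity indices listed in Table 1 can be sketched as follows. This is a minimal illustration using their textbook definitions; the Cannistraci-Hebb index is omitted since its exact variant is not specified here, and all function names are ours:

```cpp
#include <cmath>
#include <set>
#include <vector>

// Minimal illustrations of the classical local similarity indices from
// Table 1, using their textbook definitions.

using AdjList = std::vector<std::set<int>>;

static std::set<int> common(const AdjList& adj, int i, int j) {
    std::set<int> cn;
    for (int k : adj[i])
        if (adj[j].count(k)) cn.insert(k);
    return cn;
}

// Adamic-Adar (ADA): sum over common neighbours of 1 / log(degree)
// (a common neighbour of i and j has degree >= 2, so log() is never 0)
double ada(const AdjList& adj, int i, int j) {
    double s = 0.0;
    for (int k : common(adj, i, j))
        s += 1.0 / std::log(static_cast<double>(adj[k].size()));
    return s;
}

// Common neighbours (CNE): |CN_ij|
double cne(const AdjList& adj, int i, int j) {
    return static_cast<double>(common(adj, i, j).size());
}

// Preferential attachment (PAT): k_i * k_j
double pat(const AdjList& adj, int i, int j) {
    return static_cast<double>(adj[i].size()) * adj[j].size();
}

// Resource allocation (RAL): sum over common neighbours of 1 / degree
double ral(const AdjList& adj, int i, int j) {
    double s = 0.0;
    for (int k : common(adj, i, j))
        s += 1.0 / static_cast<double>(adj[k].size());
    return s;
}

// Sorensen (SOI): 2 |CN_ij| / (k_i + k_j)
double soi(const AdjList& adj, int i, int j) {
    return 2.0 * common(adj, i, j).size() / (adj[i].size() + adj[j].size());
}
```

Each index only inspects the immediate neighbourhoods of i and j, which is why these local methods scale to the large networks used in the evaluation.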