Similarity-based link prediction in social networks using latent relationships between the users

Social network analysis has recently attracted considerable attention among researchers due to its wide applicability in capturing social interactions. Link prediction, which estimates the likelihood of a link forming between two nodes of the network that are not currently connected, is a key problem in social network analysis. Many methods have been proposed to solve the problem. Among these methods, similarity-based methods exhibit good efficiency by considering the network structure and using as a fundamental criterion the number of common neighbours between two nodes to establish structural similarity. High structural similarity may suggest that a link between two nodes is likely to appear. However, as shown in the paper, the number of common neighbours may not always be sufficient to provide comprehensive information about structural similarity between a pair of nodes. To address this, a neighbourhood vector is first specified for each node. Then, a novel measure is proposed to determine the similarity of each pair of nodes based on the number of common neighbours and the correlation between the neighbourhood vectors of the nodes. Experimental results on a range of different real-world networks suggest that the proposed method results in higher accuracy than other state-of-the-art similarity-based methods for link prediction.

Different methods have been suggested to determine the similarity score, S_ij, between a pair of nodes v_i and v_j. The number of common neighbours between two nodes is the best-known similarity measure. Based on this measure, the likelihood of formation of e_24 in Fig. 1 is higher than the likelihood of formation of e_45, because nodes 2 and 4 have one common neighbour whereas nodes 4 and 5 have no common neighbour; hence, S_24 = 1 > 0 = S_45. Although computing the number of common neighbours is highly time-efficient, this measure cannot capture the similarity between two nodes accurately. Different measures [14][15][16][17] have been proposed to improve the accuracy of this measure by combining the number of common neighbours with additional information. However, these measures also suffer from low accuracy. In fact, as will be demonstrated in the next section, by relying on the number of common first-order neighbours between two nodes, similarity-based methods cannot capture well the topological similarity between a pair of nodes. Beyond direct relationships, latent relationships between two nodes, such as indirect connectivity, may be important in predicting future relationships. This observation motivates the work in this paper.
To build the argument of the paper, some real-world networks are first analysed to demonstrate the limitation of methods that rely on common first-order neighbours between the nodes as a similarity measure. To address this limitation, a measure is then proposed to take common second-order neighbours into account. Common second-order neighbours indicate a latent relationship between a pair of users. In this paper, we apply the Pearson correlation coefficient to capture the latent relationship between a pair of nodes. Based on the Pearson correlation coefficient, a new measure to estimate the similarity score for link prediction in social networks is proposed.
In the rest of the paper, the motivation for the proposed method is presented in the next section, followed by an overview of related work. Next, the proposed method is described in detail, followed by experimental evaluation. Finally, the paper is concluded with some suggestions for future work.

Motivation
As suggested by Ke-ke et al. [18], the number of common neighbours between a pair of nodes reveals structural similarity between the nodes and has a direct relationship with the link between the pair. However, as already mentioned, although the number of common neighbours is a simple and time-efficient basis for link prediction, it suffers from low accuracy and cannot provide comprehensive information to estimate the likelihood of link formation between the nodes. To demonstrate this, we examine nine different real-world networks: Zachary karate club (KRT) [19], Hamsterster (HAM) [20], Dolphins (DLN) [21], US Airline (UAL) [22], NetScience (NSC) [23], Infectious (INF) [24], Yeast (YST) [25], email (EML) [26] and KHN [27] (detailed characteristics of these networks are summarized later in the paper, in Table 1). There are two key observations, which suggest that relying only on first-order neighbours is not an effective approach to estimate the likelihood of link formation.
• Observation 1: [...] .8%, respectively. The suggestion is that considering common first-order neighbours may not always be a good predictor of future links. Depending on the network, methods whose prediction relies on common first-order neighbours alone may result in low accuracy.
• Observation 2: Sorting all existing links in a network (included in the set E), as well as all hypothetical links that may be formed between nodes without a link (defined as the set of non-existing links, E_N), by frequency for the same number of common neighbours, we find a significant overlap. Consider, for example, Fig. 2. Although the set of non-existing links E_N tends to have fewer common neighbours, on average, than the set E, there is a significant overlap between the two sets and, in some cases (say, around 8 common neighbours for the networks INF, EML and YST), the chance of an existing versus a non-existing link for that number of neighbours is essentially split in half. This is another suggestion that the number of common neighbours may not be a good indicator for link prediction.
In general, it appears that many links may exist between nodes that share no common neighbours at all, while other nodes may share a large number of common neighbours without a direct link between them. Although various methods [14,16,17] have been proposed to improve the accuracy of link prediction based on the number of common neighbours, the key limitation is that they still rely mostly on common first-order neighbours.
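The comparison behind Observation 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy edge list is hypothetical (it is not one of the nine networks), and the function name `common_neighbour_histograms` is introduced here for convenience.

```python
from collections import Counter
from itertools import combinations


def common_neighbour_histograms(adj):
    """Count node pairs by number of common neighbours, separately for
    existing links (E) and non-existing links (E_N)."""
    existing, non_existing = Counter(), Counter()
    for u, v in combinations(sorted(adj), 2):
        cn = len(adj[u] & adj[v])      # number of common first-order neighbours
        if v in adj[u]:
            existing[cn] += 1          # the pair is linked: belongs to E
        else:
            non_existing[cn] += 1      # the pair is not linked: belongs to E_N
    return existing, non_existing


# Hypothetical toy graph, used only to exercise the function.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

existing, non_existing = common_neighbour_histograms(adj)
for cn in sorted(set(existing) | set(non_existing)):
    print(cn, existing[cn], non_existing[cn])
```

Even in this tiny graph, half of the existing links have no common neighbour, while four non-existing links have one, so the two histograms overlap at every value.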
Based on the above, it seems there is scope to depart from common first-order neighbours. For example, two nodes may not have a common first-order neighbour, but they may still have many common second-order neighbours. That is to say, the number of common neighbours shows an explicit relationship between two nodes, but there might be a relationship between two nodes which is not captured using common first-order neighbours. This kind of relationship is termed a latent relationship in this paper. As suggested by Observations 1 and 2, such latent relationships cannot be fully appreciated using simply the common neighbours between the nodes. Considering the neighbourhood of two nodes may more accurately capture latent relationships between the nodes. For instance, in the network shown in Fig. 1, nodes 4 and 5 have no common neighbours, but the correlation between their neighbours, i.e., nodes 2 and 3, may reveal a latent relationship between the two nodes, which correlates with the possibility of a future link between them. This kind of latent relationship should be considered for link prediction.
The above is what, essentially, motivates the research in this paper:
• Hypothesis 1: If there is no common neighbour between the nodes connected to a future link, but the nodes have a significant latent relationship, link formation can be predicted.
• Hypothesis 2: Considering latent relationships helps justify differences in existing and non-existing links between pairs of nodes that may still have the same number of common neighbours.

Related work
There is a plethora of similarity-based methods for link prediction in the literature [4,7]. These methods essentially differ in the approach they use to estimate the similarity score between two nodes, which is then used to compute the likelihood of each non-existing link. Some methods estimate similarity based on neighbourhood, i.e., they are based on local structural information, while other methods may consider paths of different length between the nodes to take semi-local information into account or may first need to traverse the whole graph for global structural information and then estimate the likelihood of non-existing links based on this information. Some of the most commonly used methods (which will also be used later for evaluation) are discussed below:
• Common Neighbours [8]: In this method, the number of common neighbours between each pair of nodes is considered as their similarity score. Thus, the common neighbour similarity score between the pair of nodes v_i and v_j is calculated according to Eq. (1).
• Preferential Attachment Index [10]: The degree of two nodes determines the likelihood of link formation. Thus, Eq. (2) is used to determine the similarity score between a pair of nodes v_i and v_j.
• Jaccard Index [11]: In this method, the similarity score between a pair of nodes v_i and v_j is calculated with the help of Eq. (3).
• Hub Promoted Index [28]: The ratio of the number of common neighbours to the minimum degree of nodes v_i and v_j is defined as the similarity measure. The similarity score of these nodes is calculated with the help of Eq. (4).
• Common Neighbours Degree Penalization [15]: Penalization of common neighbours is considered in this method. The number of common neighbours for each pair of common neighbours of the two nodes is taken into account for this purpose. Then, the similarity score of nodes v_i and v_j is calculated using Eq. (5), where CN [...]
• Node-Coupling Clustering [17]: In this method, the clustering coefficient is used to determine the contribution of each common neighbour and the similarity between each pair of nodes. The similarity score between v_i and v_j is calculated using Eq. (6), where C_z is the clustering coefficient of node v_z.
• Parameterized Algorithm [16]: In this method, the number of common neighbours and the closeness of two nodes are both taken into account to estimate the similarity between a pair of nodes. The parameterized similarity score between v_i and v_j is calculated by Eq. (7), where α is a tunable parameter and d_ij is the shortest distance between nodes v_i and v_j.
• Higher-Order Path Index [29]: Based on the common neighbours, the significance of paths between two nodes is taken into account to propose an iterative method. Summing up the significance of the paths between two nodes determines the likelihood of link formation between them. For this purpose, the significance of a path of length 2 between nodes v_i and v_j is calculated using Eq. (8). The significance of paths of length l > 2 between nodes v_i and v_j is calculated based on the significance of its constituent edges using Eq. (9), where f_1 and f_2 denote the significance of the constituent edge and the significance of the path of the previous iteration, and α is a tunable parameter.
Apart from these methods, various other local and semi-local methods have been used to estimate similarity between a pair of nodes. Local methods include: Adamic Adar index [30], Sorensen index [10], resource allocation index [31], node clustering coefficient [32], node and link clustering coefficient [33], heterogeneity index [34] and tie connection strength index [35]. Semi-local methods, which estimate the likelihood of link formation between a pair of nodes on the basis of the paths between them, include: effective paths index [36], significant paths index [37], penalizing non-contribution links index [38], local paths [39] and friend link [40].
In this paper, a novel method is proposed, which goes beyond the number of common neighbours by taking into account local information from both the first- and second-order neighbourhood of the nodes.
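As an illustration of the simplest local indices listed above, the following Python sketch computes the Common Neighbours, Preferential Attachment, Jaccard and Hub Promoted scores from an adjacency-set representation. The formulas are the standard textbook definitions of these four indices, which the verbal descriptions above match; the function names and the toy graph are hypothetical and are not taken from the paper.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbours}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def common_neighbours(adj, i, j):
    # Number of nodes adjacent to both i and j.
    return len(adj[i] & adj[j])


def preferential_attachment(adj, i, j):
    # Product of the two degrees.
    return len(adj[i]) * len(adj[j])


def jaccard(adj, i, j):
    # Common neighbours divided by the size of the combined neighbourhood.
    union = adj[i] | adj[j]
    return len(adj[i] & adj[j]) / len(union) if union else 0.0


def hub_promoted(adj, i, j):
    # Common neighbours divided by the smaller of the two degrees.
    smallest = min(len(adj[i]), len(adj[j]))
    return len(adj[i] & adj[j]) / smallest if smallest else 0.0


# Hypothetical toy graph, used only to exercise the functions.
adj = build_adjacency([(1, 2), (1, 3), (2, 3), (2, 4), (3, 5)])
print(common_neighbours(adj, 1, 4), preferential_attachment(adj, 1, 4),
      jaccard(adj, 1, 4), hub_promoted(adj, 1, 4))
```

For the pair (4, 5), which shares no neighbour, every index except Preferential Attachment returns zero, which is the situation the proposed method is designed to address.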

A novel method for link prediction based on latent relationships
In this section, we propose a novel method for similarity-based link prediction, which we call Direct-Indirect Common Neighbours (DICN). This method takes into account latent relationships between nodes as will be described next. The idea is first to estimate the impact of common second-order neighbours between each pair of nodes. Then, this is combined with the impact of common first-order neighbours to estimate the similarity between the pair.
In order to determine the impact of common second-order neighbours, a neighbourhood vector N_i is first defined for each node i with |V| entries as in Eq. (10). The zth entry of this vector corresponds to node z. When z = i, we set N_i[i] = d_i, that is, the degree of node i. If node z is a second-order neighbour of node i (in this case, by definition, node z is not a first-order neighbour of node i), we set the corresponding vector entry, N_i[z], to CN_iz (see Eq. (1)), whereas, if node z is a first-order neighbour of node i, we set it to CN_iz + 1. Finally, if node z is neither a first- nor a second-order neighbour of node i, they do not have any common neighbour, so N_i[z] = 0.
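The construction of the neighbourhood vector described above can be sketched as follows. This is a minimal illustration, assuming an undirected graph stored as adjacency sets; the vector is kept sparse (a dictionary in which absent nodes have value 0), and the function name and toy graph are hypothetical.

```python
def neighbourhood_vector(adj, i):
    """Sparse neighbourhood vector N_i as {node: value}; absent nodes are 0."""
    vec = {i: len(adj[i])}                  # N_i[i] = d_i, the degree of i
    for u in adj[i]:
        for z in adj[u]:
            if z != i:
                # u is adjacent to both i and z, so it adds one to CN_iz
                vec[z] = vec.get(z, 0) + 1
    for z in adj[i]:
        vec[z] = vec.get(z, 0) + 1          # first-order neighbours get CN_iz + 1
    return vec


# Hypothetical toy graph, used only to exercise the function.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

print(neighbourhood_vector(adj, 1))
print(neighbourhood_vector(adj, 4))
```

In this example node 5 is more than two hops away from node 1, so it does not appear in the vector of node 1, which corresponds to an entry of zero.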
In order to estimate the likelihood of link formation between nodes v_i and v_j, the union neighbourhood set, UN_ij, for these nodes is calculated using Eq. (11).
Greater correlation between the vectors N_i and N_j over the union neighbourhood set, UN_ij, indicates higher structural similarity between nodes i and j. Thus, the correlation coefficient between the vectors, restricted to the union neighbourhood set, is calculated to determine the correlation between two nodes. We use the Pearson correlation coefficient for this purpose; thus, the correlation between the vectors N_i and N_j over the union neighbourhood set is calculated using Eq. (12).
In Eq. (12), N̄_i is the mean of the values of vector N_i over the union neighbourhood set; it is calculated using Eq. (13).
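Eqs. (12) and (13) are not reproduced in this text. From the description above (a Pearson correlation coefficient taken over the entries indexed by UN_ij, with the mean computed over the same entries), they would take the following standard form. This is a reconstruction from the verbal definition, not a copy of the published equations, and the notation Corr_ij follows the usage later in this section.

```latex
\mathrm{Corr}_{ij} =
  \frac{\sum_{z \in UN_{ij}} \bigl(N_i[z] - \bar{N}_i\bigr)\bigl(N_j[z] - \bar{N}_j\bigr)}
       {\sqrt{\sum_{z \in UN_{ij}} \bigl(N_i[z] - \bar{N}_i\bigr)^2}\;
        \sqrt{\sum_{z \in UN_{ij}} \bigl(N_j[z] - \bar{N}_j\bigr)^2}},
\qquad
\bar{N}_i = \frac{1}{|UN_{ij}|} \sum_{z \in UN_{ij}} N_i[z]
```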
In our method, two nodes that do not have common neighbours may still have significant structural similarity. Thus, a relationship may be detected through correlation between their neighbours. Take, for example, the links e_31 and e_38 in the network shown in Fig. 3. Based on Eq. (12), nodes 3 and 8 have higher structural similarity than nodes 3 and 1, because Corr_38 ≅ 0.32 and Corr_31 ≅ 0.01. When the neighbours of two nodes are highly correlated, a latent relationship between the nodes is implied. Thus, in Eq. (12), greater correlation between two nodes shows higher indirect similarity between the nodes, and formation of a link between them can be regarded as likely. Direct similarity between two nodes is calculated based on the number of common first-order neighbours. We combine indirect and direct similarity in Eq. (14) to calculate the Direct-Indirect Common Neighbours (DICN) similarity score of nodes i and j.
Pseudo-code to implement the proposed method is shown in Algorithm 1. In lines 1-5 of the algorithm, the neighbourhood vector, N_i, for each node v_i is calculated. The likelihood of formation of each non-existing link between nodes v_i and v_j is calculated in lines 6-10, whereas the union neighbourhood set and the indirect similarity between the nodes are calculated in lines 7 and 8, respectively. The link formation likelihood is computed in line 9, resulting in the DICN similarity score.
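Algorithm 1 is not reproduced in this text, so the following Python sketch only mirrors the steps described above and should not be read as the authors' implementation. Two points are assumptions made for illustration: the union neighbourhood set of Eq. (11) is taken to be the set of nodes with a non-zero entry in N_i or N_j, and the combination of Eq. (14) is taken to be (1 + CN_ij)(1 + Corr_ij). All function names and the toy graph are hypothetical.

```python
from itertools import combinations
from math import sqrt


def neighbourhood_vector(adj, i):
    """Sparse neighbourhood vector N_i; absent nodes have value 0."""
    vec = {i: len(adj[i])}
    for u in adj[i]:
        for z in adj[u]:
            if z != i:
                vec[z] = vec.get(z, 0) + 1      # accumulates CN_iz
    for z in adj[i]:
        vec[z] = vec.get(z, 0) + 1              # +1 for first-order neighbours
    return vec


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if one is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def dicn_scores(adj):
    """Score every non-existing link, following the steps described for Algorithm 1."""
    vectors = {i: neighbourhood_vector(adj, i) for i in adj}        # lines 1-5
    scores = {}
    for i, j in combinations(sorted(adj), 2):                       # lines 6-10
        if j in adj[i]:
            continue                                                # existing link
        union = sorted(set(vectors[i]) | set(vectors[j]))           # line 7 (ASSUMED form of UN_ij)
        corr = pearson([vectors[i].get(z, 0) for z in union],
                       [vectors[j].get(z, 0) for z in union])       # line 8
        cn = len(adj[i] & adj[j])                                   # direct similarity
        scores[(i, j)] = (1 + cn) * (1 + corr)                      # line 9 (ASSUMED form of Eq. 14)
    return scores


# Hypothetical toy graph, used only to exercise the functions.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

scores = dicn_scores(adj)
for pair in sorted(scores, key=scores.get, reverse=True):
    print(pair, round(scores[pair], 3))
```

With the assumed combination, a pair with no common neighbours still receives a score that depends on the correlation term alone, which is the behaviour the method relies on for such pairs.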

Experimental results
Setting. In order to evaluate the performance of the proposed DICN method, this method and eight other representative methods from the literature were implemented in Java and executed on a PC with an i5 2.3 GHz processor and 8 MB memory. The eight methods used for comparison are: Common Neighbours (CN) [8], Preferential Attachment Index (PA) [10], Jaccard Index (JC) [11], Hub Promoted Index (HPI) [28], Common Neighbours Degree Penalization (CNDP) [15], Node-coupling Clustering (NCC) [17], Parameterized Algorithm (CCPA) [16] and Significance of Higher-Order Path Index (SHOPI) [29].
Nine different real-world networks with a variety of features were used in the experiments. Zachary karate club (KRT) [19] and Hamsterster (HAM) [20] are social networks. Dolphins (DLN) [21] is an animal network. US Airline (UAL) [22] is an airport traffic network. NetScience (NSC) [23] and KHN [27] are co-authorship networks. Infectious (INF) [24] is a network of face-to-face contacts in an exhibition. Yeast (YST) [25] is a biological network. U. Rovira i Virgili email (EML) [26] is an email communication network. Specific characteristics for each of the networks are shown in Table 1.
We follow an evaluation strategy that is in line with the evaluation strategies used in other related work [16,17]. For each network, the set of existing edges, E, is randomly divided into two sets: the set of training edges E_T and the set of test edges E_P, where E_T ∩ E_P = ∅ and E_T ∪ E_P = E. We randomly select β percent of the edges as E_T and the remaining (100 − β) percent of the edges as E_P. To increase confidence in the obtained results, the process is repeated 15 times and the average of the obtained results is reported in each experiment. The metric Area Under the receiver operating characteristic Curve (AUC), widely applied in the relevant literature [1], is used to assess the accuracy of methods. The AUC is computed by picking an edge from E_P and an edge from the set of non-existing edges, E_N, and calculating the similarity score between the pair of nodes connected to each of the edges. This process is repeated n times and AUC is calculated using Eq. (15).
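The evaluation procedure can be sketched in Python as follows. Eq. (15) is not reproduced in this text; the sketch assumes the sampled form commonly used in the link-prediction literature, AUC = (n' + 0.5 n'') / n, where n' counts comparisons in which the test edge scores higher than the non-existing edge and n'' counts ties. The function names, the `score` callback and the toy data are hypothetical.

```python
import random


def split_edges(edges, beta, rng):
    """Randomly place beta percent of the edges in E_T and the rest in E_P."""
    shuffled = list(edges)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * beta / 100)
    return shuffled[:cut], shuffled[cut:]


def sampled_auc(score, test_edges, non_edges, n, rng):
    """Sampled AUC, ASSUMING the common form (n' + 0.5 * n'') / n."""
    better = ties = 0
    for _ in range(n):
        s_test = score(*rng.choice(test_edges))   # random edge from E_P
        s_non = score(*rng.choice(non_edges))     # random edge from E_N
        if s_test > s_non:
            better += 1
        elif s_test == s_non:
            ties += 1
    return (better + 0.5 * ties) / n


# Hypothetical usage on a path graph with a placeholder scoring function.
rng = random.Random(0)
edges = [(k, k + 1) for k in range(10)]
train, test = split_edges(edges, 80, rng)
non_edges = [(0, 5), (2, 7), (1, 9)]
auc = sampled_auc(lambda u, v: 1.0 / abs(u - v), test, non_edges, 1000, rng)
print(len(train), len(test), auc)
```

In practice the score function would be computed on the training graph only, and the whole procedure would be repeated (15 times in the setting above) with the results averaged.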

Results.
Four different experiments are performed. Their objective is, respectively, to: (1) assess the accuracy of DICN when compared to other methods; (2) assess the robustness of DICN with different sizes of training data; (3) and (4) validate Hypotheses 1 and 2 described earlier in the motivation section.

Experiment 1.
In the first experiment, we consider a value of β equal to 80, as this is a value commonly used in other related experiments [9,16]. Then, for each of the nine methods and each of the nine networks, we calculate the value of AUC. The results are shown in Table 2. It can be seen that in eight of the nine networks, DICN outperforms all other methods. Even for the UAL network, DICN's accuracy is very close to the best accuracy. As it relies on both the number of common neighbours and the correlation between the neighbours, DICN takes into account both direct and indirect similarity between the nodes, which leads to better accuracy in distinguishing the links in E_P and E_N than other methods.

Experiment 2.
In the next experiment, the robustness of the different methods with respect to the size (that is, the value of β) of the training set E_T is evaluated. For this purpose, the value of β is varied from 50 to 90 in steps of 10, a range where reasonably good accuracy is expected and which is in line with other studies [9]. The accuracy of different methods for each value of β is calculated by AUC. As all networks tend to follow a similar trend where higher values of β tend to increase accuracy, we show results in Fig. 5. Although, for small values of β, DICN does not have the best accuracy for some networks, this method is consistently best when the value of β is 70 or higher in seven of the nine networks. This is because, when the training set is smaller, it is harder to detect the latent relationship between the nodes due to the lower correlation between them. So DICN may not be so accurate in networks with a relatively small training set. However, in the presence of a large training set, the correlation between the nodes is detected more accurately and the latent relationship is estimated by DICN more accurately. It is also interesting to observe that in some networks DICN outperforms all other methods significantly, something that could be investigated further to document the advantages of DICN.

Table 1. Characteristics of the nine networks used in the experiments showing the number of nodes (|V|), the number of edges (|E|), average clustering coefficient (C), average degree (d) and degree assortativity (r).

Experiment 3.
This experiment is dedicated to the validation of Hypothesis 1, which relates to the ability of the methods to distinguish links between nodes with no common neighbours. To do so, for each of the nine networks we take the set of test edges, E_P, and the set of non-existing edges, E_N. From these two sets, we select those edges that connect nodes that have no common neighbours and whose degree is greater than 1. Then we calculate the similarity score for each of these edges for our proposed method DICN and all other methods. We note that, with the exception of PA, CCPA and SHOPI, all other methods will result in a similarity score of zero, as the edges we selected are between nodes that have no common neighbour; hence, these methods are omitted from further analysis. The AUC of the PA, CCPA, SHOPI and DICN methods is shown in Table 3. It can be seen that DICN is more accurate than other methods when distinguishing links between nodes with no common neighbours for five of the nine networks, while it has an accuracy very close to the best for the remaining four networks. In this experiment, by default the value of direct similarity in Eq. (14) is zero for all compared edges. Still, DICN can accurately distinguish the test and non-existing edges. Once again, this experiment suggests that calculating the correlation between neighbourhood vectors provides good accuracy in detecting indirect similarity between nodes when there are no common neighbours between them.
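The edge selection used in this experiment can be sketched as a simple filter. This is an illustration only: it assumes that common neighbours and degrees are evaluated on an adjacency map `adj` (for instance, one built from the training edges, which is an assumption and is not stated above), and the function name and toy data are hypothetical.

```python
def no_common_neighbour_pairs(adj, pairs):
    """Keep pairs whose endpoints share no neighbour and both have degree > 1."""
    return [(u, v) for u, v in pairs
            if not (adj[u] & adj[v])                   # no common neighbour
            and len(adj[u]) > 1 and len(adj[v]) > 1]   # both degrees exceed 1


# Hypothetical toy graph and candidate pairs, used only to exercise the filter.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

candidates = [(1, 4), (1, 5), (2, 5), (1, 6)]
print(no_common_neighbour_pairs(adj, candidates))
```

Here the pair (1, 4) is dropped because both nodes are adjacent to node 3, and (1, 6) is dropped because node 6 has degree 1.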

Experiment 4.
This experiment is dedicated to the validation of Hypothesis 2, which relates to assessing the ability of the methods to distinguish links between nodes with the same number of common neighbours. To do so, for each of the nine networks we again take the set of test edges, E_P, and the set of non-existing edges, E_N. From these two sets, we select the edges that connect nodes with the same number of common neighbours. Then we calculate the similarity score for each of these edges using our proposed method DICN and the best performing methods from Experiment 2: NCC, CNDP, CCPA and SHOPI. The AUC of each method is shown in Table 4. Once again, the ability of DICN to consider latent relationships leads to higher accuracy in five of the nine networks. In the KRT, DLN and YST networks, DICN has results that are close to the best method. Only in the UAL network do the NCC, CNDP and SHOPI methods significantly outperform DICN. Overall, the results obtained in this experiment confirm that assessing correlation using a neighbourhood vector for nodes is an accurate way to distinguish the test and non-existing edges of nodes with an equal number of common neighbours.

Conclusion
The prediction of future links and the identification of missing links have attracted significant research in social network analysis. Different methods have been proposed for it, many of which are based on the number of common neighbours. The idea behind this paper has been that latent relationships between the nodes are not captured by the number of common neighbours. Thus, to take into account such latent relationships, a correlation-based measure was proposed and its accuracy was compared to other related methods, giving superior accuracy results. Further work can look into more elaborate experimentation and networks with varying characteristics, including directed and weighted networks. In addition, the definition of latent relationship can be expanded beyond second-order relationships, for example including correlation with the number of paths between the nodes or global properties, such as centrality of the nodes, and so on.

Table 3. Ability of methods to distinguish links between nodes with no common neighbours. The best result in each network is shown in bold face.