Abstract
Social network analysis has recently attracted lots of attention among researchers due to its wide applicability in capturing social interactions. Link prediction, related to the likelihood of having a link between two nodes of the network that are not connected, is a key problem in social network analysis. Many methods have been proposed to solve the problem. Among these methods, similarity-based methods exhibit good efficiency by considering the network structure and using as a fundamental criterion the number of common neighbours between two nodes to establish structural similarity. High structural similarity may suggest that a link between two nodes is likely to appear. However, as shown in the paper, the number of common neighbours may not be always sufficient to provide comprehensive information about structural similarity between a pair of nodes. To address this, a neighbourhood vector is first specified for each node. Then, a novel measure is proposed to determine the similarity of each pair of nodes based on the number of common neighbours and correlation between the neighbourhood vectors of the nodes Experimental results, on a range of different real-world networks, suggest that the proposed method results in higher accuracy than other state-of-the-art similarity-based methods for link prediction.
Similar content being viewed by others
Introduction
Social networks are getting lots of attention to capture people’s interactions, partly as a result of the increased use of social media platforms. The large amount of data that may be associated with social networks has motivated research in a number of topics. Among these topics, the identification of missing links and prediction of future links is an important branch of social network analysis1. Link prediction is defined as the estimation of the likelihood of link formation between each pair of nodes for which a link does not exist. It has applications in a number of areas, such as, prediction of evolution in dynamic networks2, providing recommendation for friends in social networks3, finding latent links in an area of concern for security4, or finding missing links in networks5,6.
Different methods for the link prediction problem have been proposed4,7. In similarity-based methods8,9,10,11, structural similarity between a pair of nodes is taken into account to estimate the probability of link formation between the nodes. Nodes with high similarity tend to have a future relationship. Conversely, in probabilistic methods12,13, information beyond structure, such as behaviour of users and link features are required. However, the lack of sufficient and/or accurate information4 about such features has motivated researchers to focus primarily on similarity-based methods and how to estimate structural similarity from which the likelihood of link formation between each pair of nodes can be derived.
A social network can be modelled as a graph G(V, E), where \(V=\{v_1,v_2,v_3,\dots ,V_{\mid V\mid }\}\) denotes the set of nodes (users) and \(\mid V\mid \) the number of nodes. The set \(E \subset V \times V\) is a set of links indicating the relationships between nodes. If there is a link between two nodes \(v_i\) and \(v_j\), it is denoted by the edge \(e_{ij}\), and the nodes are considered as neighbours or friends. Here, we use \(\Gamma _i\) and \(\Gamma _i^{(2)}\) to denote the set of first- and second-order neighbours of node \(v_i\), i.e., \(\Gamma _i=\{v_j \mid e_{ij} \in E \}\) and \(\Gamma _i^{(2)}=\{v_k \mid e_{ij} \in E ,\, e_{jk} \in E ,\, e_{ik} \notin E\}\), respectively. The size of \(\Gamma _i\) represents the degree of node \(v_i\), i.e., \(d_i=\mid \Gamma _i \mid \). Link prediction aims to estimate the probability of existence (or formation) of each of the non-existing links in the network in order to identify a set of missing or future links between the users. The set of non-existing links is denoted by \(E^{N}=U-E\), where U is the universal set of the links in the network, i.e., \(U=V \times V\). For example, consider the network shown in Fig. 1. In this network, \(V=\{v_1,v_2,v_3,v_4,v_5\}\), \(\mid V\mid =5\), \(E=\{e_{12},e_{23},e_{25},e_{34}\}\). The set of non-existing edges is \(E^{N}=\{e_{13},e_{14},e_{15},e_{24},e_{35},e_{45}\}\). The problem is to estimate the likelihood of formation for each of the links in \(E^{N}\). In similarity-based methods, the likelihood of formation of a non-existing edge is estimated using a similarity score, which, for each pair of nodes, captures structural similarity of the nodes.
Different methods have been suggested to determine the similarity score, \(S_{ij}\), between a pair of nodes \(v_i\) and \(v_j\). The number of common neighbours between two nodes is the best-known measure of similarity score. Based on this measure, the likelihood of formation of \(e_{24}\) in Fig. 1 is higher than the likelihood of formation of \(e_{45}\), because nodes 2 and 4 have one common neighbour whereas nodes 4 and 5 have no common neighbour, hence, \(S_{24}=1>0=S_{45}\). Although computing the number of common neighbours is highly time-efficient, this measure cannot capture the similarity between two nodes accurately. Different measures14,15,16,17 have been proposed to improve the accuracy of this measure by combining the number of common neighbours with additional information. However, these measures also suffer from low accuracy. In fact, as will be demonstrated in the next section, relying on the number of common first-order neighbours between two nodes, similarity-based methods cannot capture well the topological similarity between a pair of nodes. Beyond direct relationships, latent relationships between two nodes, such as indirect connectivity, may be important in predicting future relationships. This observation motivates the work in this paper.
To build the argument of the paper, some real-world networks are first analysed to demonstrate the limitation of methods that rely on common first-order neighbours between the nodes as a similarity measure. To address this limitation, a measure is then proposed to take common second-order neighbours into account. Common second-order neighbours indicate a latent relationship between a pair of users. In this paper, we apply the Pearson correlation coefficient to capture the latent relationship between a pair of nodes. Based on the Pearson correlation coefficient, a new measure to estimate the similarity score for link prediction in social networks is proposed.
In the rest of the paper, the motivation for the proposed method is presented in the next section, followed by an overview of related work. Next, the proposed method is described in detail, followed by experimental evaluation. Finally, the paper is concluded with some suggestions for future work.
Motivation
As suggested by Ke-ke et al.18, the number of common neighbours between a pair of nodes reveals structural similarity between the nodes and has a straight relationship with the link between the pair. However, as already mentioned, the number of common neighbours may be a simple and time-efficient method for link prediction, but it suffers from low accuracy and cannot provide comprehensive information to estimate the likelihood of link formation between the nodes. To demonstrate this, we examine nine different real-world networks including Zachary karate club (KRT)19, Hamsterster (HAM)20, Dolphins (DLN)21, US Airline (UAL)22, NetScience (NSC)23, Infectious (INF)24, Yeast (YST)25, email (EML)26 and KHN27 (detailed characteristics of these networks are summarized later in the paper, in Table 1). There are two key observations, which suggest that relying only on first-order neighbours is not an effective approach to estimate the likelihood of link formation.
-
Observation 1: In real world-networks, a significant percentage of links may exist where the nodes connected by these links have no common neighbours. A quick check of the nine networks above reveals that this may indeed be a significant percentage. For example, 53.7% of the edges of the YST network have no common neighbour. In networks DLN, EML and KHN this value is 23.9%, 22.4% and 28.2% respectively. In the KRT network 14.1% of the links have no common neighbour. Finally, only in INF, NSC, UAL and HAM networks, this percentage is rather small: 4.9%, 4%, 3.1% and 3.8%, respectively. The suggestion is that considering common first-order neighbours may not always be a good predictor of future links. Depending on the network, methods whose prediction relies on common first-order neighbours alone may result in low accuracy.
-
Observation 2: Sorting all existing links in a network (included in the set E), as well as all hypothetical links that may be formed between nodes without a link (defined as the set of non-existing links, \(E^N\)), by frequency for the same number of neighbours, we realize that there is a significant overlap. Consider, for example, Fig. 2. Although the set of (non-existing) links \(E^N\) tends to have fewer common neighbours, on average, than the set E, there is a significant overlap between the two sets and, in some cases (say, around 8 common neighbours for the sets INF, EML, YST) the chance of an existing versus a non-existing link for that number of neighbours is essentially split in half. This is another suggestion that the number of common neighbours may not be a good indicator for link prediction.
In general, it appears that many links may exist between nodes that share no common neighbours at all, while, other nodes may share a large number of common neighbours without a direct link between them. Although it is true that various methods14,16,17 have been proposed to improve the accuracy of link prediction based on the number of common neighbours, the key limitation is that they still rely mostly on common first-order neighbours.
Based on the above, it seems there is scope to depart from common first-order neighbours. For example, two nodes may not have a common first-order neighbour, but they may still have many common second-order neighbours. That is to say, the number of common neighbours shows an explicit relationship between two nodes but there might be a relationship between two nodes which is not captured using common first-order neighbours. This kind of relationship is termed latent relationship in this paper. As suggested by observation 1 and 2, such latent relationship cannot be fully appreciated using simply common neighbours between the nodes. Considering the neighbourhood of two nodes may more accurately capture latent relationships between the nodes. For instance in the network shown in Fig. 1, nodes 4 and 5 have no common neighbours, but the correlation between their neighbours, i.e., nodes 2 and 3, may reveal a latent relationship between the two nodes, which correlates with the possibility of a future link between them. This kind of latent relationship should be considered for link prediction.
The above is what, essentially, motivates the research in this paper:
-
Hypothesis 1: If there is no common neighbour between the nodes connected to a future link, but the nodes have a significant latent relationship, link formation can be predicted.
-
Hypothesis 2: Considering latent relationships helps justify differences in existing and non-existing links between pairs of nodes that may still have the same number of common neighbours.
Related work
There is a plethora of similarity-based methods for link prediction in the literature4,7. These methods essentially differ on what approach they use to estimate the similarity score between two nodes, which is then used to compute the likelihood of each non-existing link. Some methods estimate similarity based on neighbourhood, i.e., they are based on local structural information, while other methods may consider paths of different length between the nodes to take semi-local information into account or may first need to traverse the whole graph for global structural information and then estimate the likelihood of non-existing links based on this information.
Some of the most commonly used methods (which will also be used later for evaluation) are discussed below:
-
Common Neighbours8: In this method, the number of common neighbours between each pair of nodes is considered as their similarity score. Thus, the common neighbour similarity score between the pair of nodes \(v_i\) and \(v_j\) is calculated according to Eq. (1).
$$\begin{aligned} CN_{ij}=\mid \Gamma _i \cap \Gamma _j \mid \end{aligned}$$(1) -
Preferential Attachment Index10: The degree of two nodes determines the likelihood of link formation. Thus, Eq. (2) is used to determine the similarity score between a pair of nodes \(v_i\) and \(v_j\).
$$\begin{aligned} PA_{ij}=d_i \cdot d_j \end{aligned}$$(2) -
Jaccard Index11: In this method, the similarity score between a pair of nodes \(v_i\) and \(v_j\) is calculated with the help of Eq. (3).
$$\begin{aligned} JC_{ij}=\frac{\mid \Gamma _i \cap \Gamma _j \mid }{\mid \Gamma _i \cup \Gamma _j \mid } \end{aligned}$$(3) -
Hub Promoted Index28: The ratio of the number of common neighbours to the minimum degree of nodes \(v_i\) and \(v_j\) is defined as the similarity measure. The similarity score of these nodes is calculated with the help of Eq. (4).
$$\begin{aligned} HPI_{ij}=\frac{\mid \Gamma _i \cap \Gamma _j \mid }{min \{d_i,d_j\}} \end{aligned}$$(4) -
Common Neighbours Degree Penalization15: Penalization of common neighbours is considered in this method. The number of common neighbours for each pair of common neighbours of the two nodes is taken into account for this purpose. Then, the similarity score of nodes \(v_i\) and \(v_j\) is calculated using Eq. (5), where \(CN_z^{(2)}=\{\Gamma _z \cap \Gamma _i \cap \Gamma _j\} \cup \{v_i, v_j\}\).
$$\begin{aligned} CNDP_{ij}=\sum _{v_z \in \Gamma _i \cap \Gamma _j} \mid CN_z^{(2)}\mid (d_z^{\, -\beta C}) \end{aligned}$$(5) -
Node-Coupling Clustering17: In this method, the clustering coefficient is used to determine the contribution of each common neighbour and the similarity between each pair of nodes. The similarity score between \(v_i\) and \(v_j\) is calculated using Eq. (6), where \(C_z\) is the clustering coefficient of node \(v_z\).
$$\begin{aligned} NCC_{ij}=\sum _{v_n \in \Gamma _i \cap \Gamma _j} \frac{\sum _{v_z \in CN_n^{(2)}}(\frac{1}{d_z}+C_z)}{\sum _{v_w \in \Gamma _n}(\frac{1}{d_w}+C_w)} \end{aligned}$$(6) -
Parameterized Algorithm16: In this method, the number of common neighbours and the closeness of two nodes are both taken into account to estimate the similarity between a pair of nodes. The parameterized similarity score between \(v_i\) and \(v_j\) is calculated by Eq. (7), where \(\alpha \) is a tunable parameter and \(d_{ij}\) is the shortest distance between nodes \(v_i\) and \(v_j\).
$$\begin{aligned} CCPA_{ij}=\alpha (\mid \Gamma _i \cap \Gamma _j \mid )+ (1-\alpha ) \frac{\mid V \mid }{d_{ij}} \end{aligned}$$(7) -
Higher-Order Path Index29: Based on the common neighbours, the significance of paths between two nodes is taken into account to propose an iterative method. Summing up the significance of the paths between two nodes determines the likelihood of link formation between them. For this purpose, the significance of a path of length 2 between nodes \(v_i\) and \(v_j\) is calculated using Eq. (8).
$$\begin{aligned} S_{ij}=\sum _{v_n \in \Gamma _i \cap \Gamma _j}\frac{1}{d_z} \end{aligned}$$(8)The significance of paths of length \(l>2\) between nodes \(v_i\) and \(v_j\) is calculated based on the significance of its constituent edges using Eq. (9).
$$\begin{aligned} S_{ij}=\sum _{k=3}^{l-2} f_1 \cdot f_2 \cdot \alpha ^{l-2}, \end{aligned}$$(9)where \(f_1\) and \(f_2\) denote the significance of the constituent edge and the significance of the path of previous iteration, and \(\alpha \) is a tunable parameter.
Apart from these methods, various other local and semi-local methods have been used to estimate similarity between a pair of nodes. Local methods include: Adamic Adar index30, Sorensen index10, resource allocation index31, node clustering coefficient32, node and link clustering coefficient33, heterogeneity index34 and tie connection strength index35. Semi-local methods, which estimate the likelihood of link formation between a pair of nodes on the basis of the paths between them, include: effective paths index36, significant paths index37, penalizing non-contribution links index38, local paths39 and friend link40.
In this paper, a novel method is proposed, which goes beyond the number of common neighbours by taking into account local information from both first- and second-order neighbourhood of the nodes.
A novel method for link prediction based on latent relationships
In this section, we propose a novel method for similarity-based link prediction, which we call Direct-Indirect Common Neighbours (DICN). This method takes into account latent relationships between nodes as will be described next. The idea is first to estimate the impact of common second-order neighbours between each pair of nodes. Then, this is combined with the impact of common first-order neighbours to estimate the similarity between the pair.
In order to determine the impact of common second-order neighbours, a neighbourhood vector \(N_i\) is first defined for each node i with \(\mid V\mid \) entries as in Eq. (10). The zth entry of this vector corresponds to node z. When \(z=i\), we set \(N_i[i]=d_i\), that is, the degree of node i. If node z is a second-order neighbour of node i (in this case, by definition, node z is not a first-order neighbour of node i), we set the corresponding vector entry, \(N_i[z]\), to \(CN_{iz}\) (see Eq. (1)), whereas, if node z is a first-order neighbour of node i, we add 1 to this quantity. Finally, if node z is not a first- or second-order neighbour of node i, they do not have any common neighbour, so \(N_i[z]=0\).
In order to estimate the likelihood of link formation between nodes \(v_i\) and \(v_j\), the union neighbourhood set, \(UN_{ij}\), for these nodes is calculated using Eq. (11).
Greater correlation between the union neighbourhood set, \(UN_{ij}\), of the vectors \(N_i\) and \(N_j\) indicates higher structural similarity between nodes i and j. Thus, the correlation coefficient between the union neighbourhood set of the vectors is then calculated to determine the correlation between two nodes. We use Pearson correlation coefficient for this purpose, thus, the correlation between the union neighbourhood set of the vectors \(N_i\) and \(N_j\) is calculated using Eq. (12).
In Eq. (12), \(\overline{N_i}\) is the mean of the values in the union neighbourhood set of vector \(N_i\); it is calculated using Eq. (13).
In our method, two nodes that do not have common neighbours may still have significant structural similarity. Thus, a relationship may be detected through correlation between their neighbours. Take, for example, the links \(e_{31}\) and \(e_{38}\) in the network shown in Fig. 3. Based on Eq. (12), nodes 3 and 1 have higher structural similarity, because \(Corr_{38}\cong 0.32\) and \(Corr_{31}\cong 0.01\). When the neighbours of two nodes are highly correlated a latent relationship between the nodes is implied. Thus, in Eq. (12), greater correlation between two nodes shows higher indirect similarity between the nodes and formation of a link between them can be regarded as likely. Direct similarity between two nodes is calculated based on the number of common first-order neighbours. We combine indirect and direct similarity in Eq. (14) to calculate the Direct-Indirect Common Neighbours (DICN) similarity score of nodes i and j.
Pseudo-code to implement the proposed method is shown in Algorithm 1. In lines 1–5 of the algorithm, the neighbourhood vector, \(N_i\), for each node \(v_i\) is calculated. The likelihood of formation of each non-existing link between nodes \(v_i\) and \(v_j\) is calculated in lines 6–10, whereas the union neighbourhood set and the indirect similarity between the nodes are calculated in lines 7 and 8, respectively. The link formation likelihood is computed in line 9 resulting in the DICN similarity score.
Example: Take the network in Fig. 3, as an example. In this network \(\mid V\mid =11\). Vectors \(N_2\) and \(N_5\) are calculated as follows:
Furthermore, \(UN_{25}=\{1,2,3,4,5,6,7,8,9,10,11\}\). The indirect similarity between \(v_2\) and \(v_5\) is calculated below:
Finally, the DICN similarity score between the nodes is given by:
Experimental results
Setting
In order to evaluate the performance of the proposed DICN method, this method and another 8 representative methods from the literature were implemented in Java and executed on a PC with an i5 2.3 GHz processor and 8 MB memory. The eight methods used for comparison are: Common Neighbours (CN)8, Preferential Attachment Index (PA)10, Jaccard Index (JC)11, Hub Promoted Index (HPI)28, Common Neighbours Degree Penalization (CNDP)15, Node-coupling Clustering (NCC)17, Parameterized Algorithm (CCPA)16 and Significance of Higher-Order Path Index (SHOPI)29.
Nine different real-world networks with a variety of features were used in the experiments. Zachary karate club (KRT)19 and Hamsterster (HAM)20 are social networks. Dolphins (DLN)21 is an animal network. US Airline (UAL)22 is an airport traffic network. NetScience (NSC)23 and KHN27 are co-authorship networks. Infectious (INF)24 is a network of face-to-face contacts in an exhibition. Yeast (YST)25 is a biological network. U. Rovira i Virgili email (EML)26 is an email communication network. Specific characteristics for each of the networks are shown in Table 1.
We follow an evaluation strategy, which is in line with the evaluation strategies used in other related work16,17. For each network, the set of existing edges, E, is randomly divided into two sets: the set of training edges \(E^T\) and the set of test edges \(E^P\), where \(E^T \cap E^P=\) and \(E^T \cup E^P=E\). We randomly select \(\beta \) percent of edges as \(E^T\) and the remaining, \(1-\beta \) percent of edges, as \(E^P\). To increase the confidence of the obtained results, the process is repeated 15 times and the average of the obtained results is reported in each experiment. The metric Area Under the receiver operating characteristic Curve (AUC), widely applied in the relevant literature1, is used to assess the accuracy of methods. The AUC is computed by picking an edge from \(E^P\) and an edge from the set of non-existing edges, \(E^N\), and calculating the similarity score between the pair of nodes connected to each of the edges. This process is repeated n times and AUC is calculated using Eq. (15).
In Eq. (15), \(n_1\) is the number of times when the similarity score of the nodes connected by the edge picked from the set \(E^P\) is higher than the similarity score of the nodes connected by the edge picked from the set \(E^N\), and \(n_2\) is the number of times when the two similarity scores are equal. With respect to the value of n, in our experiments we always compare every pair of links in \(E^P\) and \(E^N\). This means that \(n=|E^P| \cdot |E^N| = (1 - \beta /100) \cdot |E| \cdot (\frac{|V| \cdot (|V| -1)}{2}-|E|)\), where \(\beta \) is the percentage of edges in the training set, \(E^T\). The value of AUC is between [0, 1], where a higher value shows higher accuracy.
We highlight the process of calculating AUC using an example. Consider the network shown in Fig. 4a and assume \(\beta =80\%\). This network has 5 edges which, as shown in Fig. 4b,c, are randomly divided into a training edges set and a test edges set with 4 edges and 1 edge, respectively. The non-existing edges set for this network is shown in Fig. 4d. In order to calculate AUC in this example the likelihood of formation for the test edge \(e_{35}\) must be compared to non-existing edges \(e_{13}\), \(e_{14}\), \(e_{15}\), \(e_{24}\) and \(e_{45}\). Applying Eq. (14), \(DICN_{35}=2.5\), \(DICN_{13}=2.5\), \(DICN_{14}=0.59\), \(DICN_{15}=2.0\), \(DICN_{24}=2.82\) and \(DICN_{45}=0.59\). Thus, \(n_1=3\) and \(n_2=1\) and \(AUC=\frac{3+\frac{1}{2}\times 1}{5}=0.7\).
Results
Four different experiments are performed. Their objective is, respectively, to: (1) assess the accuracy of DICN when compared to other methods; (2) assess the robustness of DICN, with different sizes of training data; (3) and (4) validate Hypothesis 1 and 2 described earlier in the motivating section.
Experiment 1
In the first experiment, we consider a value of \(\beta \) equal to 80, as this is a value commonly used in other related experiments9,16. Then, for each of the nine methods and each of the nine networks, we calculate the value of AUC. The results are shown in Table 2. It can be seen that in eight of the nine networks, DICN outperforms all other methods. Even for the UAL network, DICN’s accuracy is very close to the best accuracy. As it relies on both the number of common neighbours and the correlation between the neighbours, DICN takes into account both direct and indirect similarity between the nodes which leads to better accuracy in distinguishing the links in \(E^P\) and \(E^N\) than other methods.
Experiment 2
In the next experiment, the robustness of the different methods with respect to the size (that is, the value of \(\beta \)) of the training set \(E^T\), is evaluated. For this purpose, the value of \(\beta \) is varied from 50 to 90 in steps of 10, a range where some reasonably good accuracy is expected and is in line with other studies9. The accuracy of different methods for each value of \(\beta \) is calculated by AUC. As all networks tend to follow a similar trend where higher values of \(\beta \) tend to increase accuracy, we show results in Fig. 5. Although, for small values of \(\beta \), DICN does not have the best accuracy for some networks, this method is consistently best when the value of \(\beta \) is 70 or higher in seven of the nine networks. This is because, when the training set is smaller it is harder to detect the latent relationship between the nodes due to the lower correlation between them. So DICN may not be so accurate in networks with a relatively small training set. However, in the presence of a large training set the correlation between the nodes is detected more accurately and the latent relationship is estimated by DICN more accurately. It is also interesting to observe that in some networks DICN outperforms all other methods significantly, something that could be investigated further to document the advantages of DICN.
Experiment 3
This experiment is dedicated to the validation of Hypothesis 1, which relates to the ability of the methods to distinguish links between nodes with no common neighbours. To do so, for each of the nine networks we take the set of test edges, \(E^P\) and the set of non-existing edges, \(E^N\). From these two sets, we select those edges that connect nodes that have no common neighbours and the degree of these nodes is greater than 1. Then we calculate the similarity score for each of these edges for our proposed method DICN and all other methods. We note that, with the exception of PA, CCPA and SHOPI, all other methods will result in a similarity score of zero, as the edges we selected are between nodes that have no common neighbour; hence, these methods are omitted for further analysis. The AUC of PA, CCPA, SHOPI and DICN methods is shown in Table 3. It can be seen that DICN is more accurate than other methods when distinguishing links between nodes with no common neighbours for five of the nine networks, while it has an accuracy very close to the best for the remaining four networks. In this experiment, by default the value of direct similarity in Eq. (14) is zero for all compared edges. Still, DICN can accurately distinguish the test and non-existing edges. Once again, this experiment suggests that calculating the correlation between neighbourhood vectors provides a good accuracy to detect indirect similarity between nodes when there are no common neighbours between them.
Experiment 4
This experiment is dedicated to validation of Hypothesis 2, which relates to assessing the ability of the methods to distinguish links between nodes with the same number of common neighbours. To do so, for each of the nine networks we take again the set of test edges, \(E^P\) and the set of non-existing edges, \(E^N\). From these two sets, we select the edges that connect nodes with the same number of common neighbours. Then we calculate the similarity score for each of these edges using our proposed method DICN, and the best performing methods from Experiment 2: NCC, CNDP, CCPA and SHOPI. The AUC of each method is shown in Table 4. Once again, the ability of DICN to consider latent relationships leads to higher accuracy in five of the nine networks. In the KRT, DLN and YST networks, DICN has results that are close to the best method. Only in the UAL network the NCC, CNDP and SHOPI methods significantly outperform DICN. Overall, the results obtained in this experiment confirm that assessing correlation using a neighbourhood vector for nodes is an accurate way to distinguish the test and non-existing edges of nodes with an equal number of common neighbours.
Conclusion
The prediction of future links and the identification of missing links have attracted significant research in social networks analysis. Different methods have been proposed for it, many of which are based on the number of common neighbours. The idea behind this paper has been that latent relationships between the nodes are not captured by the number of common neighbours. Thus, to take into account such latent relationships, a correlation-based measure was proposed and its accuracy was compared to other related methods, giving superior accuracy results. Further work can look into more elaborate experimentation and networks with varying characteristics, including directed and weighted networks. In addition, the definition of latent relationship can be expanded beyond second-order relationships, for example including correlation with the number of paths between the nodes or global properties, such as centrality of the nodes, and so on.
References
Lü, L. & Zhou, T. Link prediction in complex networks: a survey. Phys. A 390, 1150–1170 (2011).
Zhu, L., Guo, D., Yin, J., Ver Steeg, G. & Galstyan, A. Scalable temporal latent space inference for link prediction in dynamic social networks. IEEE Trans. Knowl. Data Eng. 28, 2765–2777 (2016).
Ma, C., Zhou, T. & Zhang, H.-F. Playing the role of weak clique property in link prediction: a friend recommendation model. Sci. Rep. 6, 1–12 (2016).
Kumar, A., Singh, S. S., Singh, K. & Biswas, B. Link prediction techniques, applications, and performance: a survey. Phys. A Stat. Mech. Appl. 124289 (2020).
Pan, L., Zhou, T., Lü, L. & Hu, C.-K. Predicting missing links and identifying spurious links via likelihood analysis. Sci. Rep. 6, 1–10 (2016).
Clauset, A., Moore, C. & Newman, M. E. Hierarchical structure and the prediction of missing links in networks. Nature 453, 98–101 (2008).
Martínez, V., Berzal, F. & Cubero, J.-C. A survey of link prediction in complex networks. ACM Comput. Surveys 49 (2016).
Newman, M. E. Clustering and preferential attachment in growing networks. Phys. Rev. E 64, 025102 (2001).
Yang, J. & Zhang, X.-D. Predicting missing links in complex networks based on common neighbors and distance. Sci. Rep. 6, 38208 (2016).
Lü, L., Jin, C.-H. & Zhou, T. Similarity index based on local paths for link prediction of complex networks. Phys. Rev. E 80, 046122 (2009).
Liben-Nowell, D. & Kleinberg, J. The link-prediction problem for social networks. J. Am. Soc. Inform. Sci. Technol. 58, 1019–1031 (2007).
Wang, C., Satuluri, V. & Parthasarathy, S. Local probabilistic models for link prediction. In Seventh IEEE international conference on data mining (ICDM 2007), 322–331 (IEEE, 2007).
Yu, K., Chu, W., Yu, S., Tresp, V. & Xu, Z. Stochastic relational models for discriminative link prediction. Adv. Neural Inf. Process. Syst. 1553–1560 (2007).
Martínez, V., Berzal, F. & Cubero, J.-C. Adaptive degree penalization for link prediction. J. Comput. Sci. 13, 1–9 (2016).
Rafiee, S., Salavati, C. & Abdollahpouri, A. Cndp: Link prediction based on common neighbors degree penalization. Phys. A 539, 122950 (2020).
Ahmad, I., Akhtar, M. U., Noor, S. & Shahnaz, A. Missing link prediction using common neighbor and centrality based parameterized algorithm. Sci. Rep. 10, 1–9 (2020).
Li, F. et al. Node-coupling clustering approaches for link prediction. Knowl. Based Syst. 89, 669–680 (2015).
Shang, K.-K., Yan, W.-S. & Small, M. Evolving networks—using past structure to predict the future. Phys. A 455, 120–135 (2016).
Zachary, W. W. An information flow model for conflict and fission in small groups. J. Anthropol. Res. 33, 452–473 (1977).
Kunegis, J. Hamsterster full network dataset—konect (2014).
Lusseau, D. et al. The bottlenose dolphin community of doubtful sound features a large proportion of long-lasting associations. Behav. Ecol. Sociobiol. 54, 396–405 (2003).
Xu, Z. & Harriss, R. Exploring the structure of the us intercity passenger air transportation network: a weighted complex network approach. GeoJournal 73, 87 (2008).
Rossi, R. A. & Ahmed, N. K. The network data repository with interactive graph analytics and visualization. In AAAI (2015).
Isella, L. et al. What’s in a crowd? Analysis of face-to-face behavioral networks. J. Theoret. Biol. 271, 166–180 (2011).
Von Mering, C. et al. Comparative assessment of large-scale data sets of protein–protein interactions. Nature 417, 399–403 (2002).
Guimera, R., Danon, L., Diaz-Guilera, A., Giralt, F. & Arenas, A. Self-similar community structure in a network of human interactions. Phys. Rev. E 68, 065103 (2003).
Batagelj, V. & Mrvar, A. Pajek datasets (2006) (2009).
Bliss, C. A., Frank, M. R., Danforth, C. M. & Dodds, P. S. An evolutionary algorithm approach to link prediction in dynamic social networks. J. Comput. Sci. 5, 750–764 (2014).
Kumar, A., Mishra, S., Singh, S. S., Singh, K. & Biswas, B. Link prediction in complex networks based on significance of higher-order path index (shopi). Phys. A 545, 123790 (2020).
Adamic, L. A. & Adar, E. Friends and neighbors on the web. Soc. Netw. 25, 211–230 (2003).
Lü, L. & Zhou, T. Link prediction in weighted networks: the role of weak ties. EPL (Europhysics Letters) 89, 18001 (2010).
Wu, Z., Lin, Y., Wang, J. & Gregory, S. Link prediction with node clustering coefficient. Phys. A 452, 1–8 (2016).
Wu, Z., Lin, Y., Wan, H. & Jamil, W. Predicting top-l missing links with node and link clustering information in large-scale networks. J. Stat. Mech: Theory Exp. 2016, 083202 (2016).
Shang, K.-K., Li, T.-C., Small, M., Burton, D. & Wang, Y. Link prediction for tree-like networks. Interdiscip. J. Nonlinear Sci. 29, 061103 (2019).
Yang, Y., Zhang, J., Zhu, X., Ma, J. & Su, X. Link prediction based on the tie connection strength of common neighbor. Int. J. Mod. Phys. C 30, 1950089 (2019).
Zhu, X., Tian, H. & Cai, S. Predicting missing links via effective paths. Phys. A 413, 515–522 (2014).
Zhu, X., Tian, H., Cai, S., Huang, J. & Zhou, T. Predicting missing links via significant paths. EPL Europhys. Lett. 106, 18008 (2014).
Zhu, X., Tian, Y. & Tian, H. Link prediction in complex network via penalizing noncontribution relations of endpoints. Math. Probl. Eng. 2014 (2014).
Zhou, T., Lü, L. & Zhang, Y.-C. Predicting missing links via local information. Eur. Phys. J. B 71, 623–630 (2009).
Papadimitriou, A., Symeonidis, P. & Manolopoulos, Y. Fast and accurate link prediction in social networking systems. J. Syst. Softw. 85, 2119–2132 (2012).
Acknowledgements
We would like to thank the anonymous reviewers whose comments helped improve the quality of the manuscript.
Author information
Authors and Affiliations
Contributions
A.Z. proposed original idea, developed code and designed and conducted the experiments under guidance from R.S. Both authors planned the work, analyzed the results and reviewed the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zareie, A., Sakellariou, R. Similarity-based link prediction in social networks using latent relationships between the users. Sci Rep 10, 20137 (2020). https://doi.org/10.1038/s41598-020-76799-4
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-020-76799-4
This article is cited by
-
Identifying influential users using homophily-based approach in location-based social networks
The Journal of Supercomputing (2024)
-
Incorporating self-attentions into robust spatial-temporal graph representation learning against dynamic graph perturbations
Computing (2024)
-
A new stochastic diffusion model for influence maximization in social networks
Scientific Reports (2023)
-
A Graph Representation Learning Framework Predicting Potential Multivariate Interactions
International Journal of Computational Intelligence Systems (2023)
-
A hybrid clustering approach for link prediction in heterogeneous information networks
Knowledge and Information Systems (2023)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.