Weight prediction in complex networks based on neighbor set

Link weights are essential to network functionality, so weight prediction is important for understanding weighted networks given incomplete real-world data. In this work, we develop a novel method for weight prediction based on the local network structure, namely, the set of neighbors of each node. The performance of this method is validated in two cases. In the first case, some links are missing altogether along with their weights, while in the second case all links are known and weight information is missing for some links. Empirical experiments on real-world networks indicate that our method can provide accurate predictions of link weights in both cases.

A weight prediction method based on neighbor set. Our method relies on the assumption that the formation of link weights is regulated by local clusterings in which homogeneous links tend to have similar weights. The local structure we consider is the neighbor set of a node, defined as the set of nodes linked to it, which captures a great deal of information about the node. For instance, in online social networks, the neighbors of a node represent the friends of a person. The co-existence of two people in the same neighbor set increases the probability that their relationship changes from non-friends to friends.
Our method of weight prediction is best explained through a simple example, illustrated in Fig. 1. If a link is generated between nodes a and b, one wishes to estimate the weight w_ab of link (a, b). Since nodes a and b appear together in neighbor set A, according to our fundamental hypothesis, the weight of (a, b) is related to the weights of other similar links in A. Because only one other link, (b, c), exists in A, we focus on its relationship to the candidate link. From our perspective, the weight of (b, c) is correlated with the weights of the links to the common node α. For example, if Fig. 1 depicts a social network whose weights indicate time commitments, then the amount of time b spends with c depends on the time α spends with b and with c (ref. 33). If the events "α is with b" and "α is with c" are independent of each other, then the event "b is with c" would have probability equal to the product of their probabilities. Based on this, one can simply estimate w_ab as […]. If other similar links exist, an averaging strategy is applied to combine the estimates. Similarly, one can use the weight of link (d, e) to infer the weight of link (a, f) across neighbor sets A and B by estimating the value of w_af to be […].
In practice, a node often belongs to more than one neighbor set. For example, in Fig. 1, node d belongs to both neighbor sets A and B. Indeed, any two candidate nodes may coexist in multiple neighbor sets. In this case, a natural approach is to calculate the individual contributions of the different neighbor sets, or pairs of neighbor sets, to the candidate node pair and then combine them to obtain a more accurate weight. Viewed in terms of link formation, our hypothesis is that loosely connected clusterings are less likely than densely connected clusterings to form a link (ref. 34) and thus contribute less to the weight of the link once it is formed.
Under this assumption, a more refined weight can be estimated based on the connection probabilities in or across different neighbor sets.
Figure 1. An illustration of neighbor set. In this example, neighbor set A is defined as the set of neighbors of node α, which are a, b, c and d. Neighbor set B consists of three nodes, namely, d, e and f. Note that node α and node β have one common neighbor, which belongs to both neighbor sets A and B. Within neighbor set A, because there is only one link, the existence probability for the remaining possible links is […].
A detailed explanation is as follows. Suppose our goal is to estimate the weight of the link (x, y). Let Γ(x) be the set of neighbors of node x and L^1_xy be the event that nodes x and y are connected. If nodes x and y both belong to the neighbor set of node α, i.e., x, y ∈ Γ(α), the weight of (x, y) can be written as
[…]
which is the average clustering weight over links similar to (x, y). Note that we apply add-one smoothing to preclude the possibility of an undefined fraction. Based on our hypothesis, in order to quantify the contribution of the neighbor set Γ(α) to the formation of w_xy, we need to calculate the probability that the pair (x, y) is connected, given that both are in the neighborhood Γ(α). This probability can be estimated through
[…]
On the other hand, if nodes x and y belong to different neighbor sets, say x ∈ Γ(α) and y ∈ Γ(β), the weight of (x, y) can be described as
[…]
which is the average weight across clusterings. When nodes x and y appear in separate neighborhoods, we can use the connection probability across the two neighbor sets to measure their contribution to the formation of w_xy. Then the probability that nodes x and y are connected can be written as
[…]
Clearly, this equation measures the connection density across neighbor sets Γ(α) and Γ(β). Finally, by considering the contributions of different neighbor sets to the formation probability of the link (x, y), we can estimate w_xy by
[…]
where […] are normalized probabilities, defined as […].
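The notions of neighbor set and connection density used above can be illustrated with a minimal Python sketch. It is not the estimator of this paper: it computes plain link densities without the add-one smoothing mentioned in the text, and the function names, the node labels and the toy edge list (modeled loosely on the description of Fig. 1) are assumptions made for illustration only.

```python
from itertools import combinations

def neighbor_sets(edges):
    """Map each node to its neighbor set, i.e. the set of nodes linked to it."""
    gamma = {}
    for u, v in edges:
        gamma.setdefault(u, set()).add(v)
        gamma.setdefault(v, set()).add(u)
    return gamma

def density_within(gamma, alpha):
    """Fraction of node pairs inside the neighbor set of alpha that are linked."""
    pairs = list(combinations(sorted(gamma[alpha]), 2))
    if not pairs:
        return 0.0
    linked = sum(1 for x, y in pairs if y in gamma[x])
    return linked / len(pairs)

def density_across(gamma, alpha, beta):
    """Fraction of distinct pairs (x in Gamma(alpha), y in Gamma(beta)) that are linked."""
    pairs = {frozenset((x, y))
             for x in gamma[alpha] for y in gamma[beta] if x != y}
    if not pairs:
        return 0.0
    linked = 0
    for pair in pairs:
        x, y = tuple(pair)
        if y in gamma[x]:
            linked += 1
    return linked / len(pairs)

# Hypothetical toy graph: alpha is linked to a, b, c, d; beta to d, e, f.
edges = [
    ("alpha", "a"), ("alpha", "b"), ("alpha", "c"), ("alpha", "d"),
    ("beta", "d"), ("beta", "e"), ("beta", "f"),
    ("b", "c"), ("d", "e"),
]
gamma = neighbor_sets(edges)
within_a = density_within(gamma, "alpha")          # 1 linked pair out of 6
within_b = density_within(gamma, "beta")           # 1 linked pair out of 3
across_ab = density_across(gamma, "alpha", "beta") # 1 linked pair out of 11
```

In this toy graph the neighbor set of beta is denser than that of alpha, so under the hypothesis stated above it would contribute more to the weight of a link formed inside it.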
Scientific Reports | 6:38080 | DOI: 10.1038/srep38080
In this experiment, the training set contains 90% of the links, and the validation set contains the remaining 10%. With the help of link predictors, the candidate node pairs are sorted by their scores, and the top-L links are selected as the predicted link set E_L. In this paper, we set L equal to the size of the validation set for the purpose of weight prediction. After link prediction, the weight prediction algorithm is applied. The corresponding predicted weight set and actual weight set are denoted as Ŵ_L and W_L, respectively. The actual weights of non-observed links are set to zero. In most cases, this default value is reasonable. For example, in transportation networks, if there is no connection between two nodes, then the traffic flow directly between these two nodes is zero. In some special cases, such as networks whose weights denote distances between nodes, zero is not an appropriate default. However, it is appropriate for all the networks presented here. The accuracy of the weight predictor can then be estimated by calculating the Pearson correlation coefficient and the root mean squared error (RMSE) between the vectors Ŵ_L and W_L. Table 1 compares the accuracy of the linear-correlation method (ref. 30; see the Methods section for details) with that of our method, as measured by the Pearson correlation coefficient under different link prediction approaches. Each Pearson correlation coefficient is calculated between the vectors of predicted and actual weights for the top-L ranked links. A larger correlation coefficient indicates a stronger linear relationship between predicted and actual weights.
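The evaluation step described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in the experiments: the helper names, the dictionary layout of the held-out weights and the example numbers are assumptions. It follows the convention stated in the text that a predicted link absent from the validation set has actual weight zero.

```python
import math

def actual_weights(predicted_links, held_out):
    """Look up the true weight of each predicted link; non-observed links get 0."""
    return [held_out.get(frozenset(link), 0.0) for link in predicted_links]

def pearson(pred, actual):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(pred, actual))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov / math.sqrt(var_p * var_a)

def rmse(pred, actual):
    """Root mean squared error between two equal-length vectors."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred))

# Hypothetical top-L predicted links and their predicted weights.
predicted_links = [("a", "b"), ("a", "f"), ("c", "d")]
w_hat = [0.7, 0.1, 0.3]

# Hypothetical validation set: (a, f) is not a real link, so its weight is 0.
held_out = {frozenset(("a", "b")): 0.8, frozenset(("c", "d")): 0.2}
w_true = actual_weights(predicted_links, held_out)

r = pearson(w_hat, w_true)
err = rmse(w_hat, w_true)
```

The two metrics are complementary: the Pearson coefficient is insensitive to a constant rescaling of the predictions, whereas RMSE penalizes any deviation in magnitude.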
As shown in the table, almost all of the correlation coefficients achieved by our method are larger than those from the linear-correlation method under every link prediction algorithm, indicating that the good performance is due to our method itself, regardless of the specific link prediction algorithm. The linear-correlation method assumes that the weights of links measure similarities or affinities between nodes, so the correlation between similarities and weights is weak for networks whose weights do not reflect similarities between nodes, such as the Everglades network. On such networks the linear-correlation method performs poorly. In contrast, our method can be applied to a wider range of networks, in which weights do not necessarily characterize similarities between nodes.
We also calculate the RMSE between the vectors of predicted and actual weights for the top-L ranked links. Detailed results are summarized in Table 2. As shown in the table, in most networks the weights predicted by our method have markedly smaller errors than those predicted by the linear-correlation method under a variety of link prediction algorithms. In Everglades and USAir1, our method performs similarly to the linear-correlation method as measured by RMSE. However, taken together with the Pearson correlation coefficient, the results indicate that our method performs significantly better than the linear-correlation method on those networks. Furthermore, the linear-correlation method employs information from the validation set to estimate the scaling coefficient in Eq. (19). As a result, predictive information from the validation set leaks into the optimization step and leads to optimistically biased performance estimates (ref. 35). This does not happen in our method because only information from the training set is used.
Table 2. Comparison of prediction accuracy under the metric of root mean squared error for the top-L ranked links. In each network, the first row shows the results achieved by the linear-correlation method, while the second row shows the accuracy of our method. Each accuracy value is an average over 100 independent random divisions of the links into a training set and a validation set.
Next, we consider the case where only the weight information is missing. In this case, we can directly set E_L to E_V. For our method, link prediction is not needed, and we can perform weight prediction directly. However, since the linear-correlation method uses link prediction to calculate the similarity scores S_V, it still needs to perform both link prediction and weight prediction.
The Pearson correlation coefficient and RMSE between the vectors of predicted weights and actual weights are presented in Tables 3 and 4, respectively. Compared with the linear-correlation method, our method generally gives better estimates of weights in most networks. On the other hand, the advantages of our method are not so apparent when using RMSE to measure accuracy.
Altogether, the empirical experiments indicate that link weights are recovered more accurately by our method than by the linear-correlation method.
Furthermore, to assess the robustness of our method, we also present the accuracy of weight predictions for different sizes of the training set (ranging from 40% to 90%) in Figs 2 and 3. The results demonstrate that the advantages of our method are not sensitive to the density of the network. Because the CN-based indices (CN, WCN and rWCN) have similar precisions in link prediction, our weight prediction method yields roughly the same results with these indices, as observed from the nearly identical points in Fig. 2. The same phenomenon occurs with the AA-based (AA, WAA and rWAA) and RA-based (RA, WRA and rWRA) indices. These figures also show that our method outperforms the linear-correlation method in most cases.
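The protocol of repeated random divisions at different training-set sizes can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the fixed seed, the toy link list and the placeholder `evaluate` callback are all hypothetical, and a real run would plug in a weight predictor together with one of the accuracy metrics described earlier.

```python
import random

def split_links(links, train_percent, rng):
    """Randomly divide the links into a training set and a validation set."""
    shuffled = list(links)
    rng.shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]

def average_over_splits(links, train_percent, evaluate, repeats=100, seed=0):
    """Average an accuracy value over independent random divisions."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        train, valid = split_links(links, train_percent, rng)
        scores.append(evaluate(train, valid))
    return sum(scores) / len(scores)

# Hypothetical weighted links (u, v, weight) on a path of 21 nodes.
links = [(i, i + 1, 0.5) for i in range(20)]

def evaluate(train, valid):
    """Placeholder metric: here simply the size of the validation set."""
    return float(len(valid))

# Training-set sizes from 40% to 90% of the links, as in the robustness test.
results = {pct: average_over_splits(links, pct, evaluate)
           for pct in range(40, 100, 10)}
```

Using integer percentages keeps the split sizes exact and avoids floating-point rounding when the cut point is computed.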

Discussion
In this paper, we explore the problem of weight prediction in weighted networks. A novel weight prediction algorithm that examines the local structure of neighbor sets is proposed. To assess the prediction accuracy of our method, empirical experiments are conducted on six real-world networks. The results demonstrate that our method can predict weights much more accurately than the linear-correlation method, as measured by the Pearson correlation coefficient and root mean squared error. Furthermore, our method can be used whether or not information about the existence of links is missing.

Methods
Link prediction algorithms. As described above, if the existence of some links is unknown, we need to determine which candidate links are most likely to exist before inferring their weights. Many methods have been proposed to address this link prediction problem. Among them, Common Neighbors (CN) is the simplest framework for determining which non-connected node pairs are more likely to become connected. Its basic […] obtain a set of the candidate links most likely to exist.
Table 3. Comparison of prediction accuracy under the metric of Pearson correlation coefficient when only the weight information is missing for some links. The validation set always contains 10% of the links from the example network. Each accuracy value is an average over 100 independent random divisions of the links into a training set and a validation set. In each network, the best performance is emphasized in bold.
Table 4. Comparison of prediction accuracy under the metric of root mean squared error when only the weight information is missing for some links. The validation set always contains 10% of the links from the example network. Each accuracy value is an average over 100 independent random divisions of the links into a training set and a validation set. In each network, the best performance is emphasized in bold.
Because local information is used by both our method and that of ref. 30, we compare our performance only with that method. In ref. 30, the authors assumed that the similarity index for link prediction between two unconnected nodes also reflects their interaction strength. Then, inspired by a linear correlation between similarity scores and link weights in many empirical networks, they set the weights of missing links proportional to their similarity scores. Formally, let the weighted adjacency matrices corresponding to the training set E_T and the validation set E_V be denoted by W_T and W_V, respectively, and let S_V be the vector of similarity scores for links in E_V.
Given the linear correlation mentioned above, we want to find the prediction function F(W_T) = λ · S_V that minimizes the difference between λ · S_V and W_V, where λ is a free parameter. This parameter can be estimated by solving the following optimization problem:
\lambda^{*} = \arg\min_{\lambda} \lVert \lambda \cdot S_V - W_V \rVert_F
where ||·||_F is the Frobenius norm, defined as the square root of the sum of the squares of the matrix's elements. For the sake of brevity, we call this weight prediction method linear-correlation in this paper.
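As a rough illustration of the linear-correlation baseline, the sketch below scores node pairs with the Common Neighbors index (taken here in its standard form, the number of neighbors two nodes share) and fits the scaling parameter λ by least squares. For a single scalar λ, minimizing the norm of λ · S_V − W_V has the closed-form solution λ = (S_V · W_V) / (S_V · S_V). The function names and toy data are assumptions for illustration, not the implementation of ref. 30.

```python
def neighbor_sets(edges):
    """Map each node to the set of nodes linked to it in the training set."""
    gamma = {}
    for u, v in edges:
        gamma.setdefault(u, set()).add(v)
        gamma.setdefault(v, set()).add(u)
    return gamma

def common_neighbors(gamma, x, y):
    """Common Neighbors score: number of neighbors shared by x and y."""
    return len(gamma.get(x, set()) & gamma.get(y, set()))

def fit_lambda(scores, weights):
    """Least-squares scaling factor minimizing ||lam * scores - weights||."""
    denom = sum(s * s for s in scores)
    if denom == 0:
        return 0.0  # no informative scores; fall back to predicting zero
    return sum(s * w for s, w in zip(scores, weights)) / denom

# Hypothetical training links (unweighted here, since CN ignores weights).
train_edges = [("a", "x"), ("b", "x"), ("a", "y"), ("b", "y"), ("c", "x")]
gamma = neighbor_sets(train_edges)

# Hypothetical validation links with their true weights.
valid_links = [("a", "b", 0.4), ("a", "c", 0.2)]
s_v = [common_neighbors(gamma, u, v) for u, v, _ in valid_links]
w_v = [w for _, _, w in valid_links]

lam = fit_lambda(s_v, w_v)
predicted = [lam * s for s in s_v]
```

Note that λ is fitted here on the validation weights themselves, which mirrors the information leak discussed in the Results; fitting it on held-out training links instead would avoid that bias.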

Data description.
In this work, we consider six networks to evaluate our new weight prediction method.
1) Celegans (ref. 1): a neural network of the nematode worm C. elegans, where nodes represent neurons, links join neurons if they have synaptic contacts, and the weight stands for the number of synapses between two neurons.
2) Everglades (ref. 37): a food web network describing carbon exchanges in the Everglades during the wet season, where each node represents a taxon, and an edge denotes that one taxon uses another as food, with link weights representing trophic factors (feeding levels).
3) USAir1 (ref. 37): a network of US air transportation, where the weights of links are the frequencies of flights between airports.
4) USAir2 (ref. 38): a network of flights between US airports in 2010. The weight of a link is the number of flights between two airports.
5) Advogato (ref. 38): a trust network of the Advogato online community, where nodes represent users of Advogato, links represent trust relationships and weights indicate the trust levels between users.
6) Geom (ref. 37): a collaboration network of researchers in the area of computational geometry, where nodes represent authors, links join authors if they have coauthored a paper and weights are the numbers of joint works.
To compare the results across different data sets, all link weights are normalized to fall within the interval [0, 1] as in ref. 30.
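A simple way to bring link weights into the interval [0, 1] is sketched below. The exact normalization of ref. 30 is not restated in this section, so dividing by the largest weight is an assumption made for illustration; the function name and the toy data are likewise hypothetical, and the sketch presumes non-negative weights.

```python
def normalize_weights(weighted_links):
    """Scale non-negative link weights into [0, 1] by dividing by the maximum.

    Assumption: max-scaling is used here for illustration; the procedure of
    ref. 30 may differ (for example, min-max scaling).
    """
    w_max = max(w for _, _, w in weighted_links)
    if w_max <= 0:
        raise ValueError("at least one link weight must be positive")
    return [(u, v, w / w_max) for u, v, w in weighted_links]

# Hypothetical raw weights, e.g. numbers of flights between airports.
raw_links = [("a", "b", 2.0), ("b", "c", 8.0), ("a", "c", 4.0)]
normalized = normalize_weights(raw_links)
```

Max-scaling preserves the ratios between weights, so relative comparisons among the links of one network are unaffected.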