The reconstruction of complex networks with community structure

Link prediction is a fundamental problem with applications in many fields ranging from biology to computer science. In the literature, most effort has been devoted to estimate the likelihood of the existence of a link between two nodes, based on observed links and nodes’ attributes in a network. In this paper, we apply several representative link prediction methods to reconstruct the network, namely to add the missing links with high likelihood of existence back to the network. We find that all these existing methods fail to identify the links connecting different communities, resulting in a poor reproduction of the topological and dynamical properties of the true network. To solve this problem, we propose a community-based link prediction method. We find that our method has high prediction accuracy and is very effective in reconstructing the inter-community links.

Many complex systems can be naturally described by complex networks, which has largely deepened our understanding of the structure of real systems. For example, many topological properties, such as small-world 1 , scale-free 2 , assortativity 3 , community 4 and rich club 5 , have been uncovered in not only the social and technology systems we are using everyday [6][7][8][9][10][11] , but also the biology systems within our bodies [12][13][14] . In addition, network representation is useful from practical point of view. It allows us to optimize the systems for higher functionality [15][16][17] and predict the future evolution of real systems 18,19 . Link prediction is one of these significant research problems 20 . It aims to estimate the likelihood of the existence of a link between two nodes, based on observed links and nodes' attributes in a network. With this problem solved, a large amount of cost in lab experiment for identifying the missing data could be reduced 20 .
Link prediction methods assume that similar nodes are those that have similar connectivity patterns. Therefore, the essential problem in link prediction is to objectively estimate the similarity between nodes 21 . Up to now, many similarity metrics on link prediction have been proposed. The most straightforward method is the so-called Common Neighbor index which directly computes the number of overlapped neighbors between two nodes to determine their similarity 22 . This index, though simple, has many shortcomings. It is strongly biased to the large degree nodes and it works poorly in sparse networks. To solve these problems, many other methods, such as Jaccard 23 , Resource Allocation 24 , Local Path methods 25 etc, are designed. Recently, some attention has also been paid to study link prediction in weighted 26,27 , directed 28,29 , bipartite 30,31 networks. Moreover, some link prediction methods have been introduced to detect the spurious connections in complex networks 32 .
In order to quantify the quality of link prediction, the area under the receiver operating characteristic curve (AUC) is usually used 33 . In practice, it calculates the probability that a true link has a higher link prediction score than a nonexistent link. In the case of predicting missing links, the predicted links need to be added to the observed networks to obtain the reconstructed networks 20 . The AUC index can only reflect the fraction of correct links added to the network, but cannot capture whether the reconstructed network has the same or similar structural and dynamical properties as the true network. This is especially important in networks with community structure 34 . It can happen in such networks that a link prediction method correctly identifies many missing links, but completely neglects the links connecting different communities. These inter-community links actually play an important role in the networks. They characterize the interactions between different clusters 35 . They are also strongly related to many global network properties such as the average shortest path and the betweenness centrality 36 . Without these links, some dynamical properties such as bond percolation will be largely distorted 37 .
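The probabilistic reading of AUC given above can be made concrete with a minimal sketch. It compares every probe link with every nonexistent link. The function name `auc_exact`, the dictionary layout of the scores, and the convention of counting ties as 0.5 are assumptions made for illustration; they are not taken from this paper.

```python
def auc_exact(score, probe_links, nonexistent_links):
    """Probability that a true missing link outscores a nonexistent link.

    score: dict mapping a node pair to its similarity score.
    Ties are counted as 0.5 (a common convention, assumed here).
    """
    higher = 0
    ties = 0
    for true_pair in probe_links:
        for false_pair in nonexistent_links:
            if score[true_pair] > score[false_pair]:
                higher += 1
            elif score[true_pair] == score[false_pair]:
                ties += 1
    n_comparisons = len(probe_links) * len(nonexistent_links)
    return (higher + 0.5 * ties) / n_comparisons


# Toy scores: two true missing links and two nonexistent links.
toy_score = {("a", "b"): 3, ("a", "c"): 1, ("b", "d"): 1, ("c", "d"): 0}
toy_auc = auc_exact(toy_score, [("a", "b"), ("a", "c")], [("b", "d"), ("c", "d")])
```

For large networks the full pairwise comparison is expensive, and a random sample of comparisons is a common substitute.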
In this paper, we apply several representative link prediction methods to reconstruct complex networks, namely to add the missing links with high likelihood of existence back to the networks. Even though large AUC is achieved, the reconstructed networks from these existing methods are found to be very different from the true networks, especially in terms of the average betweenness of the predicted links. This result indicates that the missing inter-community links are seldom captured by the existing link prediction methods. To solve this problem, we propose a community-based link prediction method. Our method can effectively identify the inter-community links by slightly sacrificing the prediction accuracy. The final obtained network can thus well reproduce the structural and dynamical properties of the true network.

Results
We consider an undirected network $G(V, E)$ where $V$ is the set of nodes and $E$ is the set of links. In link prediction, the original links $E$ are first randomly divided into two parts: the training set ($E_T$) and the probe set ($E_P$). The training set contains 90% of the original links and the link prediction methods run on it. The probe set consists of the remaining 10% of the original links (the results for other division ratios are shown in SI). The probe set is used to test the accuracy of the link prediction methods. The accuracy is usually measured by the AUC value (see the Methods section for details), the higher the better. Besides accuracy, we also consider whether the link prediction methods can effectively recover the structural properties of the original network. Normally, the link prediction methods predict missing links by assigning each unconnected node pair a score which estimates the likelihood that the pair has a missing link between them. An accurate link prediction method will assign high scores to the true missing links and low scores to the nonexistent links. Unfortunately, for most of the existing link prediction methods, there is no obvious score gap between the true missing links and the nonexistent links. Therefore, in order to reconstruct the network, one has to assume that the number of true missing links $L$ is roughly known. In this fashion, one can add the $L$ top-ranking links of a link prediction method to the observed network to obtain the predicted network. This approach is widely used in the literature 38,39 . Consistent with the previous works, we also assume that we know roughly the total number of true missing links. The $L$ node pairs ($L = |E_P|$) with the highest scores (denoted as the "predicted links") will be added to the training set $E_T$ to obtain the reconstructed network $G'(V, E')$.
A well-performing link prediction method should not only aim at achieving a high AUC value, but also make the structural properties of $G'(V, E')$ close to those of $G(V, E)$.
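The split-and-reconstruct procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the helper names `split_links` and `reconstruct` are hypothetical, scores are assumed to be supplied as a dictionary over unconnected node pairs, and ties in the ranking are broken by node-pair order.

```python
import random


def split_links(edges, probe_fraction=0.1, seed=0):
    """Randomly divide the links into a training set E_T and a probe set E_P."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_probe = max(1, round(probe_fraction * len(shuffled)))
    return shuffled[n_probe:], shuffled[:n_probe]


def reconstruct(train_edges, scores, n_missing):
    """Add the n_missing top-scoring unconnected pairs to the training set.

    scores: dict mapping each unconnected node pair to its similarity score.
    Ties are broken by node-pair order (an assumption for reproducibility).
    """
    ranked = sorted(scores, key=lambda pair: (-scores[pair], pair))
    predicted = ranked[:n_missing]
    return list(train_edges) + predicted, predicted


all_edges = [(i, i + 1) for i in range(10)]
train, probe = split_links(all_edges)
rebuilt, predicted = reconstruct(
    train, {(0, 3): 2, (1, 2): 2, (0, 4): 1}, n_missing=2
)
```

Here `n_missing` plays the role of $L = |E_P|$, the number of missing links that is assumed to be roughly known.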
In this paper, we focus on networks with community structure. By definition, the nodes within a community are densely connected while the nodes across communities are much more sparsely connected. In this kind of network, the inter-community links are in general more difficult to predict. Without these inter-community links, the average shortest path length of the reconstructed networks would be much higher than that of the original networks, and the transportation dynamics 40 would be much slower and more congested in the reconstructed networks. In order to solve this problem, we propose a community-based link prediction method. We first detect the communities in the training set by using the EO algorithm 41 . Then the similarity scores between unconnected node pairs are computed by classic local similarity measures (i.e. the CN or RA methods, see the Methods section for definitions). We also consider three global link prediction methods 32,39,42 ; the results are similar to those of CN and RA (see Supplementary Information (SI)). A tunable parameter β ∈ [0, 1] is introduced to combine the information of communities and node similarity for link prediction. In practice, the node pairs are classified as intra-community pairs and inter-community pairs. Within each class, the node pairs are ranked in descending order according to the similarity measure. β controls the probability that the intra-community node pairs are ranked higher than the inter-community node pairs (see the Methods section for details). This method is inspired by ref. 43 but used here for a different goal. For convenience, when the method is combined with the common neighbor similarity, it is called the community-based CN method (CBCN). Similarly, it is called the community-based RA method (CBRA) when it is combined with the resource allocation similarity. An illustration of the method is shown in Fig. 1.
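The way β mixes the two ranked lists can be sketched as below, following the step-by-step description in the Methods section: at each step the intra-community list is picked with probability β and the inter-community list otherwise. The function name `merge_rankings` is hypothetical, and falling back to the other list once the picked one is exhausted is an assumption, since that case is not spelled out in the text.

```python
import random


def merge_rankings(r_intra, r_inter, beta, seed=0):
    """Merge two descending ranked lists of node pairs into one list R.

    At each step the intra-community list is picked with probability beta,
    otherwise the inter-community list. If the picked list is already empty,
    the other list is used instead (assumption).
    """
    rng = random.Random(seed)
    i = j = 0
    merged = []
    while i < len(r_intra) or j < len(r_inter):
        pick_intra = rng.random() < beta
        if (pick_intra and i < len(r_intra)) or j >= len(r_inter):
            merged.append(r_intra[i])
            i += 1
        else:
            merged.append(r_inter[j])
            j += 1
    return merged


intra = [("a", "b"), ("a", "c")]
inter = [("a", "x"), ("b", "y")]
all_intra_first = merge_rankings(intra, inter, beta=1.0)
all_inter_first = merge_rankings(intra, inter, beta=0.0)
mixed = merge_rankings(intra, inter, beta=0.5)
```

With β = 1 every intra-community pair precedes every inter-community pair, and with β = 0 the order is reversed, which matches the two limiting cases discussed in the text.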
Like previous works 43 , we adopt AUC to evaluate the accuracy of the link prediction. In addition, we propose to monitor the average edge-betweenness B of the predicted links (calculated by adding those predicted links to the network). If the average edge-betweenness is high, more inter-community links are predicted (for supporting evidence, see SI). In fact, measuring the average betweenness of the reconstructed network is also a good evaluation metric for this issue. Despite some quantitative differences, the results are qualitatively consistent with the results when B is used (see results in SI). We first test our method on a classical artificial network, the GN-benchmark network 35 , which is widely used in research on community structure. In the GN-benchmark network, n = 128 nodes are equally distributed over 4 communities, and each node has on average $k_{in} + k_{out} = 16$ links, where $k_{in}$ is the average number of neighbors within the same community ($8 \le k_{in} \le 15$) and $k_{out}$ is the average number of neighbors in different communities ($1 \le k_{out} \le 8$). As $k_{in}$ increases, the community structure of the network becomes clearer. Given an observed network, the similarity score between nodes is deterministic if the CN and RA similarity measures are applied. However, the community detection algorithm involves randomness. Therefore, there is some stochasticity in the link prediction process coming from the community detection algorithm. In this paper, we use the extremal optimization (EO) algorithm to detect communities. As stated in ref. 41, the performance of this algorithm is rather stable. Therefore, the stochasticity of the link prediction process is expected to be relatively small. We perform several realizations and find that the variance is much smaller than the mean value. Therefore, we mainly report the mean value over different realizations.
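For readers who want to experiment, a GN-benchmark-like network can be generated with a short sketch. The construction below, in which each node pair is linked independently with a probability chosen so that the expected intra- and inter-community degrees are $k_{in}$ and $k_{out}$, is one common way to build such networks and is an assumption here; the paper does not state its generation procedure. The function name `gn_benchmark` is hypothetical.

```python
import random
from itertools import combinations


def gn_benchmark(k_in, k_out, n_communities=4, community_size=32, seed=0):
    """Generate a planted-partition network with expected degrees k_in, k_out."""
    rng = random.Random(seed)
    n = n_communities * community_size
    community = {v: v // community_size for v in range(n)}
    # Link probabilities that give the requested expected degrees.
    p_in = k_in / (community_size - 1)
    p_out = k_out / (n - community_size)
    edges = []
    for u, v in combinations(range(n), 2):
        p = p_in if community[u] == community[v] else p_out
        if rng.random() < p:
            edges.append((u, v))
    return edges, community


gn_edges, gn_community = gn_benchmark(k_in=12, k_out=4)
average_degree = 2 * len(gn_edges) / len(gn_community)
```

Because links are drawn independently, individual node degrees fluctuate around 16; only the averages are fixed.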
In Fig. 2, we show the dependence of AUC and B on β under different $k_{in}$. CBCN and CBRA are used in Fig. 2(a-d). One can see that AUC increases with β, indicating that the links within the communities are easier to predict. The results of CBCN and CBRA are similar, and the increment of AUC is more significant when the community structure is more obvious (i.e. larger $k_{in}$). This result is consistent with a recent finding in ref. 43. In Fig. 2(a,b), the dashed lines mark the AUC of the original CN and RA methods (without β to adjust the ranking of the intra- and inter-community missing links). One can see that the AUC of CBCN and CBRA can be higher than the AUC of CN and RA, respectively, when β is large.
Fig. 2(c,d) shows that B actually decreases with β. This is natural, as a larger β means more intra-community missing links are ranked higher, so the predicted links are mainly within communities. In Fig. 2(c,d) the dashed lines mark the B of the links in the probe set. Clearly, if one only considers AUC, β = 1 is the optimal solution. However, this setting of β would make B of the predicted links smaller than that of the true missing links. A good link prediction method should not only have high AUC but also make B of the predicted links close to that of the true missing links. Interestingly, we observe that when β is large, a small change in β can result in a significant decrease in B but has little influence on AUC. This observation indicates that β can be adjusted to obtain satisfactory results in both AUC and B.
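The quantity B discussed above can be computed with a short, dependency-free sketch. It uses a Brandes-style breadth-first accumulation for unweighted, undirected networks. The function names are hypothetical, and the betweenness is left unnormalized (each unordered node pair contributes the fraction of its shortest paths that use the edge), which is an assumption; a normalized variant differs only by a constant factor.

```python
from collections import deque


def edge_betweenness(nodes, edges):
    """Unnormalized edge betweenness of an unweighted, undirected network."""
    adj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    bet = {frozenset(e): 0.0 for e in edges}
    for s in nodes:
        dist = {s: 0}
        sigma = {v: 0 for v in nodes}  # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in nodes}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += share
                delta[v] += share
    # Every unordered source-target pair was visited from both ends.
    return {e: b / 2.0 for e, b in bet.items()}


def average_betweenness_of_predicted(nodes, train_edges, predicted):
    """B: mean betweenness of the predicted links after adding them."""
    bet = edge_betweenness(nodes, list(train_edges) + list(predicted))
    return sum(bet[frozenset(e)] for e in predicted) / len(predicted)


# Two triangles: predicting the bridge (2, 3) versus an internal link (0, 1).
six = range(6)
triangles = [(0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]
b_bridge = average_betweenness_of_predicted(six, triangles + [(0, 1)], [(2, 3)])
b_inside = average_betweenness_of_predicted(six, triangles + [(2, 3)], [(0, 1)])
```

In this toy network the bridge carries all nine shortest paths between the two triangles, while the internal link carries only one, which illustrates why a high B signals inter-community links.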
We also examine our method on four real networks: ZK is the social network of the Zachary karate club 44 ; NS is the largest connected component of a co-authorship network of scientists who publish on the topic of network science 45 ; Email is the email network of a university, built by regarding each email address as a node and linking two nodes if there is an email communication between them 46 ; and C.elegans is the neural network of the worm Caenorhabditis elegans, with each neuron as a node and each synapse or gap junction as a link 47 . All of these real networks are widely used in the literature, and their basic structural properties are listed in Table 1. Here we use them to examine our methods. Figure 3 shows the performance of the community-based link prediction methods on these real networks. One can see that the results are qualitatively the same as those in the GN-benchmark networks. In these real networks, as the community structure is not as obvious as in the GN-benchmark, the effect of β on AUC is even smaller, especially for β > 0.1. However, the influence of β on B is still strong.

Fig. 1 caption: When β = 0, the inter-community missing links are ranked higher than the intra-community missing links in the prediction list. Therefore, mainly inter-community links are added to the network by the link prediction method. When β = 1, the intra-community missing links are ranked higher than the inter-community missing links in the prediction list, and mainly intra-community links are added to the network. When β = 0.05, the results are mixed: both inter- and intra-community missing links are added to the network. The similarity measure used in this toy network is CN.

Scientific Reports | 5:17287 | DOI: 10.1038/srep17287
We denote by $\beta^*$ the value of β that makes B of the predicted links the same as that of the true missing links (i.e. the links in the probe set). Accordingly, the AUC under $\beta^*$ is denoted as $AUC^*$. The quantitative results of $\beta^*$ and $AUC^*$ in the four real networks are reported in Table 1. Clearly, the $AUC^*$ of CBCN and CBRA can still be higher than the AUC of CN and RA, respectively.
To further understand the performance of each method, we compute the number of correctly predicted inter- and intra-links and the number of inter- and intra-links among the predicted links (results are shown in SI). We find that when the existing link prediction methods are used on the GN-benchmark, the number of inter-links among the predicted links is almost zero, indicating that these existing methods tend to neglect inter-links. In contrast, CBCN and CBRA have many inter-links among the predicted links. However, the number of correctly predicted inter-links in our methods is also small. This is because the inter-links are sparsely and randomly connected in the GN-benchmark (i.e. they form almost no triangles), and it is difficult for CBCN and CBRA to capture their similarity to other links. In real networks, however, the inter-links form more triangles and thus are easier to predict. We test the NS real network, which has clear community structure (a collaboration network between network scientists). We find that CN and RA can correctly predict 17.6 and 30.0 inter-links, while CBCN and CBRA can correctly predict 23.7 and 31.5 inter-links (for more detailed results on the NS network, see SI). These results indicate that CBCN and CBRA can outperform CN and RA, respectively, in real networks as well.
In Fig. 4, we further investigate the influence of $k_{in}$ on $\beta^*$ and $AUC^*$ in the GN-benchmark networks. In Fig. 4(a,b), one can see that $\beta^*$ changes abruptly for $k_{in} > 10$. Beyond this value, $\beta^*$ increases significantly with $k_{in}$. This is because when the community structure is obvious ($k_{in} > 10$), we do not have to sacrifice much AUC, and a large β can already make B close to the true value. In Fig. 4(c,d), we show the dependence of $AUC^*$ on $k_{in}$. One can see that when $k_{in}$ is large, $AUC^*$ is very close to the AUC of the original CN or RA. However, when $k_{in}$ is relatively small, $AUC^*$ can be much smaller than the AUC of CN or RA. This is because when $k_{in}$ is small, β needs to be adjusted to a very small value in order to keep B of the predicted links the same as that of the real links (as shown in Fig. 2). In this case, a large amount of AUC needs to be sacrificed for a higher B.
So far, we have shown that adjusting β in the community-based link prediction methods can indeed help the methods predict more high-betweenness links in the networks. A natural question to ask at this point is how to choose β in real use. In principle, $\beta^*$ can be chosen as the value at which B of the predicted links becomes the same as that of the real links. However, as B of the real links is unknown in practice, this strategy cannot be applied directly. To solve this problem, one has to learn the optimal $\beta^*$ from the observed data. To mimic this process, we use a so-called threefold validation where a small part (usually 10% of all links) is moved from the previously introduced training set $E_T$ to a learning set $E_L$ 48 . The threefold validation is usually used to avoid model over-fitting in machine learning. In our case, by checking at which β the links predicted from $E_T$ have the same B as the links in $E_L$, one can determine the estimated optimal parameter $\beta^*_e$. One concern for the learning process is that the missing links may largely change the structural properties. To check this, we first run the community detection algorithm (EO algorithm) on the original true network and denote the obtained communities as the "true detected communities". Then we randomly remove a fraction of links from the true network to obtain the observed network. We run the community detection algorithm again on the observed network and compute the fraction of nodes classified correctly by comparing the obtained communities with the "true detected communities". We find that the fraction of nodes classified correctly is rather high, especially when the community structure is obvious (the correct rate is over 80% when $k_{in} \ge 10$). Moreover, we compare $\beta^*_e$ with the $\beta^*$ determined with $E_P$ in Fig. 4(a,b). One can see that $\beta^*_e \approx \beta^*$ at different $k_{in}$.
The learned optimal parameter $\beta^*_e$ is then used to predict missing links based on $E_T \cup E_L$, which are then compared with the entries in $E_P$ to finally measure the link prediction accuracy $AUC^*_e$. The results are shown in Fig. 4(c,d). One can see that $AUC^*_e$ is indeed close to $AUC^*$. As discussed above, $\beta^*$ is usually too small when $k_{in} < 13$, which directly results in a low AUC in link prediction. Therefore, we propose an additional constraint in the learning process: when determining the optimal $\beta^*_e$ with the learning set $E_L$, we also monitor the prediction AUC of the links in $E_L$ (denoted as $AUC_{E_L}$). In order to make sure the optimal $\beta^*_e$ will not be too small, we assume that we can sacrifice at most 5% of the accuracy. Here, we define the AUC of the original method CN or RA as $AUC_o$. If the predicted links reach the same B as the links in $E_L$ before $AUC_{E_L}$ drops to 95% of $AUC_o$, $\beta^*_e$ is chosen as this crossover point. If not, $\beta^*_e$ is chosen as the value where $AUC_{E_L}$ equals 95% of $AUC_o$. The $\beta^*_e$ obtained in this way is denoted as the "constrained $\beta^*_e$". The results for the constrained $\beta^*_e$ and its prediction accuracy, the "constrained $AUC^*_e$", are shown in Fig. 4 as well. So far, we have discussed three parameters: $\beta^*$, $\beta^*_e$ and the constrained $\beta^*_e$. A summary of these three parameters is given in Table 2. Note that even though the number of missing links is not known, the estimation of $\beta^*_e$ and the constrained $\beta^*_e$ is not affected. This is because $\beta^*_e$ and the constrained $\beta^*_e$ are obtained from the learning process, in which the number of links in the learning set $E_L$ is known.
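The selection rule for the constrained $\beta^*_e$ can be sketched on a grid of candidate β values. The sketch assumes that B of the predicted links and $AUC_{E_L}$ have already been measured at each candidate value; the function name `choose_constrained_beta`, the grid-based scan from large to small β, and the argument layout are all assumptions made for illustration.

```python
def choose_constrained_beta(betas, b_pred, auc_el, b_target, auc_o, tolerance=0.05):
    """Pick beta on a grid under the accuracy constraint described in the text.

    betas:    candidate beta values
    b_pred:   B of the predicted links at each candidate
    auc_el:   AUC measured on the learning set E_L at each candidate
    b_target: B of the links in E_L
    auc_o:    AUC of the original CN or RA method
    """
    floor = (1.0 - tolerance) * auc_o
    # Scan from the largest beta down: B tends to rise and AUC to fall.
    candidates = sorted(zip(betas, b_pred, auc_el), reverse=True)
    chosen = candidates[0][0]
    for beta, b, auc in candidates:
        if auc < floor:
            break  # accuracy budget used up: keep the last admissible beta
        chosen = beta
        if b >= b_target:
            break  # crossover: predicted links reach the target B
    return chosen


grid = [0.0, 0.25, 0.5, 0.75, 1.0]
b_values = [10, 8, 6, 4, 2]
auc_values = [0.60, 0.80, 0.88, 0.90, 0.92]
beta_crossover = choose_constrained_beta(grid, b_values, auc_values, 3, 0.9)
beta_capped = choose_constrained_beta(grid, b_values, auc_values, 9, 0.9)
```

In the first call the crossover in B is reached while the accuracy is still acceptable; in the second the 95% accuracy floor is hit first, so the last admissible grid value is returned instead.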
Moreover, we study whether the structural and dynamical properties of the reconstructed networks from CBCN and CBRA are truly closer to those of the true networks. We take into account six indices: the average shortest path of the networks ($\langle d \rangle$), clustering coefficient ($C$) 47 , assortativity coefficient ($r$) 3 , congestibility ($D$) 49 , synchronizability ($Q$) 50 and spreading ability ($\mu_c$) 51 . The results of different link prediction methods are listed in Table 3. The original real networks are denoted as $A_0$. We first randomly divide the links in $A_0$ into three parts: the training set $E_T$ (with 80% of the links), the learning set $E_L$ (with 10% of the links) and the probe set $E_P$ (with 10% of the links). We apply the community-based link prediction methods to compute the constrained $\beta^*_e$ with $E_T$ and $E_L$. Then we take the union $E_T \cup E_L$ to obtain a complete $E_T$.
We apply the community-based link prediction methods with the constrained $\beta^*_e$ on the complete $E_T$. The $|E_P|$ links with the highest link prediction scores are then added to $E_T$ to create the reconstructed network $A^*$. We also create the reconstructed networks with β arbitrarily set to 0 and 1, and denote these networks as $A_1$ and $A_2$, respectively. For comparison, the reconstructed networks obtained with the traditional link prediction methods (e.g. CN and RA) are denoted as $A_3$. From Table 3, we can see that the reconstructed networks from the community-based link prediction methods (i.e. $A_1$, $A_2$ and $A^*$) have network properties more similar to those of the real network $A_0$ than the networks obtained by the traditional link prediction methods ($A_3$). The best results sometimes appear in $A_1$ and $A_2$. However, when $A_1$ is closest to $A_0$, $A_2$ is very different from $A_0$, and vice versa. $A^*$ keeps a reasonable trade-off between these two settings: $A^*$ best reproduces the network properties of $A_0$ in many cases, and when $A^*$ is not the best, it is the closest one to the best. These results confirm the importance of the parameter learning process.

Table 3. The properties of the reconstructed networks when different link prediction methods are applied. $A_0$ represents the original networks, and $A_1$, $A^*$, $A_2$ stand for the reconstructed networks when β = 0, β = constrained $\beta^*_e$, and β = 1, respectively. $A_3$ denotes the reconstructed networks of the traditional methods CN and RA. $\langle d \rangle$, $C$, $r$, $D$, $Q$, $\mu_c$ represent, in turn, the average shortest path, the clustering coefficient, the assortativity coefficient, congestibility, synchronizability and spreading ability of the networks. We highlight the values that are closest to the original networks in bold font. The results are averaged over 100 independent realizations.
Finally, we discuss the computational complexity of our method. The method is a combination of a local link prediction algorithm and a community detection algorithm. For local link prediction algorithms such as CN and RA, the computational complexity is $O(N k^2)$, where $N$ is the number of nodes and $k$ is the mean degree of the network. In this paper, we use the extremal optimization (EO) algorithm for community detection, with computational complexity $O(N^2 \ln N)$. The computational complexity of our method is therefore mainly determined by the community detection algorithm. If the method is applied to large networks, one can choose a faster community detection algorithm, such as the method in ref. 52 with complexity $O(N + L)$, in which $L$ is the number of edges in the network.

Discussion
Predicting the missing or future links is a very important research topic itself and has applications in many different domains. Although many link prediction methods have been proposed in the literature, they consider all the missing links homogeneous (i.e. all the missing links are considered equally important). In this paper, we argue that in the networks with community structure, the links connecting different communities are actually of more significance and more difficult to be predicted. We propose a community-based link prediction method which allows us to predict more missing inter-community links (with high edge-betweenness) in both artificial and real networks. The results show that our method can predict more high betweenness links without losing much link prediction accuracy. As the community-based link prediction method has a parameter to tune, we propose a learning process to determine the optimal parameter. We finally apply the community-based link prediction method to reconstruct networks. The results show that the reconstructed networks by our method have very similar network properties with the real networks. Even though our paper tries to solve a specific problem, it points out several long-neglected important issues in link prediction research: (i) Links in the network are not with equal importance. The algorithms should give priority to those important links. (ii) Prediction results should be evaluated not only by accuracy but also by how much the predicted links can recover the properties of the true network. (iii) The parameters in the link prediction algorithms should be estimated via a learning process before applied to real prediction. These issues will encourage researchers to reconsider the existing works in link prediction and may inspire a series of more effective algorithms in the future.
In this paper, we propose an effective method to predict the inter-community links. Compared to the existing methods, which all fail to predict the inter-community links (especially when the community structure is obvious), our method has a large proportion of inter-community links in the top ranking. We admit that the improvement in precision for these inter-community links is not large; this is because those links have a very low probability of existing. However, by including more inter-community links in the prediction list, we manage to obtain reconstructed networks whose topological properties are closer to those of the true networks. Predicting important links in networks is a scientific problem that cannot be completely solved in one paper, and it calls for more studies in the future. Our paper therefore raises some important questions for future research. The method in this paper uses the classic EO algorithm to detect communities. An interesting question would be to compare the performance of different community detection algorithms in helping link prediction algorithms identify inter-community links. In networks without clear community structure, the links with high edge-betweenness are still more important than the links with low edge-betweenness. In these networks, the method proposed in this paper cannot be directly applied, as it relies on the community detection method. Therefore, how to predict high edge-betweenness links in networks without community structure is an important extension. Finally, our study highlights the fact that the missing links are not of equal importance. Besides betweenness, the importance of links can be measured by other properties such as degree-product, clustering coefficient, link salience 53 etc. We hope the method in this paper will shed some light on designing methods to predict these kinds of important links in complex networks.

Methods
Classic link prediction algorithms. We use two representative classic link prediction algorithms in this paper: common neighbors (CN) and resource allocation (RA). After the network data is divided into the training set $E_T$ and the probe set $E_P$, these two methods generate the predicted links by estimating the similarity values between different node pairs in $E_T$. We denote the set of neighbors of node $x$ by $\Gamma(x)$. CN simply measures the similarity between node $x$ and node $y$ with the number of overlapped neighbors,

$$s_{xy}^{CN} = |\Gamma(x) \cap \Gamma(y)|.$$

RA weights each common neighbor by the inverse of its degree,

$$s_{xy}^{RA} = \sum_{z \in O_{xy}} \frac{1}{k_z},$$

where $k_z$ is the degree of node $z$ and $O_{xy} = \Gamma(x) \cap \Gamma(y)$ is the set of the common neighbors of $x$ and $y$. After obtaining $s_{xy}$ for each node pair, the missing links are ranked by sorting $s_{xy}$ in descending order.
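The two similarity indices can be sketched in a few lines. The function name `similarity_scores` and the representation of links as tuples with sorted endpoints are assumptions made for this illustration.

```python
from itertools import combinations


def similarity_scores(nodes, train_edges):
    """Return CN and RA scores for every unconnected node pair in E_T."""
    neighbors = {v: set() for v in nodes}
    for u, v in train_edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    cn, ra = {}, {}
    for x, y in combinations(sorted(nodes), 2):
        if y in neighbors[x]:
            continue  # only unconnected pairs are candidates
        common = neighbors[x] & neighbors[y]
        cn[(x, y)] = len(common)
        # Each common neighbor z contributes 1 / k_z.
        ra[(x, y)] = sum(1.0 / len(neighbors[z]) for z in common)
    return cn, ra


toy_edges = [(0, 1), (0, 2), (1, 3), (2, 3), (0, 4)]
cn_scores, ra_scores = similarity_scores(range(5), toy_edges)
```

In the toy network, nodes 0 and 3 share the two degree-2 neighbors 1 and 2, so their CN score is 2 and their RA score is 1/2 + 1/2 = 1.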
Community detection. The community detection method used in this paper is the EO method 41 . It detects communities by optimizing the modularity $Q$ with a heuristic search. The modularity $Q$ is defined as

$$Q = \frac{1}{2M} \sum_j q_j, \qquad q_j = \gamma_{c(j)} - k_j a_{c(j)},$$

where $q_j$ is the contribution of individual node $j$ given a certain partition into communities, $c(j)$ is the community to which node $j$ belongs, $\gamma_{c(j)}$ is the number of links node $j$ has with nodes in the same community $c(j)$, $k_j$ is the degree of node $j$, and $a_{c(j)}$ is the fraction of links that have one or two nodes inside the community $c(j)$. $M$ is the number of links in the network.
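As a small check of the node-level decomposition of modularity, the sketch below computes $q_j$ and $Q$ for a given partition. It does not implement the EO search itself. The function name `modularity` is hypothetical, and $a_c$ is taken here as the fraction of link ends attached to community $c$ (the degree sum of the community divided by $2M$), an assumption under which $Q$ coincides with the standard Newman modularity.

```python
def modularity(edges, community):
    """Return (Q, q) where q[j] is the contribution of node j."""
    n_links = len(edges)
    degree = {}
    internal = {}  # gamma: links of node j that stay inside its community
    for u, v in edges:
        for node in (u, v):
            degree[node] = degree.get(node, 0) + 1
            internal.setdefault(node, 0)
        if community[u] == community[v]:
            internal[u] += 1
            internal[v] += 1
    # a_c: fraction of link ends attached to community c (assumption).
    link_ends = {}
    for node, k in degree.items():
        c = community[node]
        link_ends[c] = link_ends.get(c, 0) + k
    a = {c: ends / (2.0 * n_links) for c, ends in link_ends.items()}
    q = {j: internal[j] - degree[j] * a[community[j]] for j in degree}
    return sum(q.values()) / (2.0 * n_links), q


# Two triangles joined by one bridge, split into their natural communities.
bridge_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
q_total, q_nodes = modularity(bridge_edges, two_groups)
```

The per-node values $q_j$ are what an extremal optimization search would inspect to decide which node fits its community worst, but that search is outside the scope of this sketch.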
Community-based link prediction method. After computing $s_{xy}$, the node pairs are classified into two sets according to the community detection results: intra-community node pairs and inter-community node pairs. The node pairs in each set are ranked according to $s_{xy}$ in descending order. The ranking list of intra-community node pairs is denoted as $R_{intra}$ and the ranking list of inter-community node pairs is denoted as $R_{inter}$. The parameter β is used when $R_{intra}$ and $R_{inter}$ are combined into a final ranking list $R$. Initially, $R$ is empty. The node pairs are then moved from $R_{intra}$ and $R_{inter}$ to $R$ one by one from top to bottom. In each step, $R_{intra}$ is picked with probability β and $R_{inter}$ is picked with probability $1 - \beta$. For instance, if there are already $n$ node pairs in $R$ and in the next step $R_{inter}$ is picked, the highest ranked node pair in $R_{inter}$ is removed and placed in position $n + 1$ in $R$. Note that the ranking lists $R_{intra}$ and $R_{inter}$ become shorter and shorter while the ranking list $R$ becomes longer and longer. The procedure terminates when both $R_{intra}$ and $R_{inter}$ are empty. Besides AUC, we consider another important metric called Precision. It is defined as the fraction of correctly predicted links in the top-$L$ ranking list. Here, $L$ is set to the total number of missing links. The results are shown in SI. Despite some quantitative differences, the results for precision are qualitatively consistent with those for AUC (i.e. prediction accuracy increases with β). B is defined as the average betweenness of the predicted links when they are added to the network. The predicted links are the $|E_P|$ top-ranking links in $R$. The betweenness of a link $B_{ij}$ is defined as the ratio of the shortest paths which pass through the edge $e_{ij}$ among all the shortest paths in the network,