Reconstructing propagation networks with temporal similarity

Node similarity significantly contributes to the growth of real networks. In this paper, we apply node similarity metrics to observed epidemic spreading results in order to reconstruct the underlying networks hosting the propagation. We find that the reconstruction accuracy of the similarity metrics is strongly influenced by the infection rate of the spreading process. Moreover, there is a range of infection rates in which the reconstruction accuracy of some similarity metrics drops nearly to zero. To improve the similarity-based reconstruction method, we propose a temporal similarity metric which takes into account the time information of the spreading. The reconstruction results are remarkably improved with the new method.

The "special range" arises for two reasons: the similarity degeneracy and the degree penalty of the similarity metrics. To understand why the "special range" exists when the CN similarity is used, we take a detailed look at the similarity matrix under different infection rates µ. In our method, we take the E node pairs with the highest similarity (E is the number of links in the true network) and add a link between each of these pairs to obtain the reconstructed network. When µ is in the special range, we find a serious similarity degeneracy problem when picking the top E node pairs: so many node pairs share the same similarity that a simple similarity threshold cannot yield exactly E node pairs. We sort the similarity values in descending order and consider two adjacent similarity values S_c1 and S_c2 such that E_1 < E node pairs are obtained when S_c1 is used as the threshold and E_2 > E node pairs are obtained when S_c2 is used as the threshold. In this case, our method uses S_c1 as the threshold and randomly selects the remaining E − E_1 node pairs from the node pairs with similarity S_c2. We checked the value of E_2 − E_1 and found that it is usually very large (e.g. larger than 10^4). Therefore, these E − E_1 node pairs have to be randomly selected from a large number of candidates, resulting in a low reconstruction precision. To show this, we plot the dependence of E − E_1 on µ in Fig. S7. One can see that in SW, E − E_1 indeed increases suddenly in the "special range". The increase of E − E_1 in BA is smaller, so we only observe a small drop of precision in the special range in this network in Fig. S7.
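The top-E selection with random tie-breaking described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `select_top_pairs`, the dense symmetric NumPy similarity matrix `sim`, and the returned tuple are assumptions made for the example.

```python
import numpy as np

def select_top_pairs(sim, E, rng=None):
    """Pick the E node pairs with the highest similarity.

    Pairs strictly above the cut value are always kept; the remaining
    slots are filled at random from the pairs tied at the cut value.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = sim.shape[0]
    iu, ju = np.triu_indices(n, k=1)          # each unordered pair once
    scores = sim[iu, ju]
    if not 1 <= E <= scores.size:
        raise ValueError("E must be between 1 and the number of node pairs")

    order = np.argsort(-scores, kind="stable")
    cut_value = scores[order[E - 1]]          # similarity of the E-th ranked pair
    above = np.flatnonzero(scores > cut_value)    # E_1 pairs, all kept
    tied = np.flatnonzero(scores == cut_value)    # E_2 - E_1 tied candidates
    n_random = E - above.size                     # E - E_1 pairs drawn at random

    chosen = rng.choice(tied, size=n_random, replace=False)
    picked = np.concatenate([above, chosen])
    pairs = [(int(iu[p]), int(ju[p])) for p in picked]
    return pairs, n_random, int(tied.size)
```

The second and third return values correspond to E − E_1 and E_2 − E_1. When the number of tied candidates is much larger than the number of pairs to draw, the random draw carries little information about the true links.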
The similarity degeneracy in the "special range" is related to the spreading coverage under different µ. When µ is small, the spreading can only propagate a very limited number of steps, and most node pairs have zero similarity. As µ increases, more node pairs have nonzero similarity and the similarity degeneracy becomes less serious. However, when µ is around the critical infection rate, the spreading starts to propagate locally, so that many node pairs share only one or two common news, and the similarity degeneracy becomes serious again. As µ increases further (beyond the "special range"), the spreading propagates globally and the similarity values of node pairs become well separated. This problem is less serious in Jac and LHN because their normalization effectively reduces the similarity degeneracy (although we can still observe an increase of E − E_1 in the "special range"). However, the normalization in these methods may impose a strong penalty on large degree nodes, which also contributes to the drop of precision in the "special range".
To show this, we denote the number of news that node i received as d_i and the degree of node i in the network as k_i. We first study the correlation between d_i and k_i under different infection rates µ. The result is shown in Fig. S8. In the BA network, the correlation first increases with µ when µ is small and, after reaching its highest value, starts to decrease. The "special range" occurs where the correlation is very high. The behavior of the correlation between d_i and k_i is similar in SW. However, there the "special range" occurs where the correlation is around 0.25 (not at its highest value).
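As a rough sketch of how such a correlation could be computed, assume a binary adjacency matrix `adj` and a binary node-by-news matrix `received` whose entry (i, α) is 1 if node i received news α. Both names, and the use of the Pearson coefficient, are assumptions; the text does not state which correlation coefficient is used.

```python
import numpy as np

def news_degree_correlation(adj, received):
    """Pearson correlation between d_i (news received) and k_i (degree)."""
    k = np.asarray(adj).sum(axis=1)        # degree of each node
    d = np.asarray(received).sum(axis=1)   # number of news each node received
    return float(np.corrcoef(d, k)[0, 1])
```

Repeating this for spreading data generated at each value of µ would give one point of a curve like the one described for Fig. S8.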
Based on the results in Fig. S8, we conjecture that the "special range" in Jac and LHN forms for the following reasons:
• µ smaller than the "special range": When µ is low, the spreading can only cover a small number of nodes. As the spreading provides so little information (the similarity matrix is very sparse), it is natural that the precision of the network reconstruction is low. As µ increases, the spreading covers more and more nodes and the similarity matrix becomes denser. The network reconstruction precision gradually increases with µ.
• µ within the "special range": As µ increases in BA networks, the correlation between d and k increases as well. The LHN metric computes the similarity between two nodes as the number of their common news divided by the product of the numbers of news they received (i.e. d_i d_j). When d becomes more strongly correlated with k, the large degree nodes are punished and end up with very few links in the reconstructed network. This effect is absent when µ is small because the correlation between d and k is very low in that case (e.g. lower than 0.25), so the punishment from d_i d_j is not concentrated on large degree nodes. In SW networks, as the correlation between d and k is not very high and there is no hub with extremely large degree, the effect of the punishment is weaker and the drop of precision in the "special range" is less obvious.
• µ larger than the "special range": When µ becomes even larger, the spreading reaches the small degree nodes more frequently. In this case, we can obtain meaningful similarity relations between the small degree nodes and other nodes. Therefore, the number of correctly predicted links of the small degree nodes increases significantly with µ when µ is large enough. This could be the reason why the precision increases again after the "special range".
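The degree penalty discussed in the bullets above can be made concrete with a small sketch of the three metrics computed from a binary node-by-news matrix. The LHN form follows the description in the text (common news over d_i d_j); the CN and Jaccard forms are the standard definitions applied to news sets, which is an assumption here, as are the function and variable names.

```python
import numpy as np

def news_similarities(received):
    """Return (CN, Jac, LHN) similarity matrices from a node-by-news matrix."""
    R = np.asarray(received, dtype=float)
    cn = R @ R.T                              # common news of each node pair
    d = R.sum(axis=1)                         # d_i: news received by node i
    union = d[:, None] + d[None, :] - cn      # size of the union of news sets
    prod = np.outer(d, d)                     # d_i * d_j
    with np.errstate(divide="ignore", invalid="ignore"):
        jac = np.where(union > 0, cn / union, 0.0)
        lhn = np.where(prod > 0, cn / prod, 0.0)
    for m in (cn, jac, lhn):
        np.fill_diagonal(m, 0.0)              # ignore self-similarity
    return cn, jac, lhn
```

For a pair sharing two news with d_i = 2 and d_j = 3, CN gives 2, Jac gives 2/3 and LHN gives 1/3, so the more news a node receives, the more strongly LHN discounts its pairs.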
In order to test the conjecture above, we carry out further simulations. We consider a group of small degree nodes consisting of all the nodes with the lowest degree in the network, and a group of large degree nodes consisting of all the nodes with the highest degree. As the degree distribution is homogeneous in SW, the small degree group has a similar number of nodes to the large degree group. In BA, however, there is only one node with the highest degree. Therefore, we consider the top-10 nodes with the highest degree in BA. We study the relation between the number of correctly predicted links of the different groups and the infection rate µ in Fig. S9.
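One way to count the correctly predicted links of a node group is sketched below. The helper names, the NumPy adjacency matrix `adj`, and the convention that a link counts for a group if at least one endpoint belongs to it are assumptions for illustration; the text does not spell out the counting rule.

```python
import numpy as np

def degree_groups(adj, top=None):
    """Return (lowest-degree nodes, highest-degree nodes).

    If `top` is given, the large degree group is the `top` nodes with the
    highest degree (e.g. top=10 for BA) instead of only the maximum-degree nodes.
    """
    k = np.asarray(adj).sum(axis=1)
    low = np.flatnonzero(k == k.min())
    if top is None:
        high = np.flatnonzero(k == k.max())
    else:
        high = np.argsort(-k, kind="stable")[:top]
    return low.tolist(), high.tolist()

def correct_links_by_group(adj, predicted_pairs, group):
    """Count predicted pairs that are true links and touch a node in `group`."""
    group = set(group)
    return sum(
        1
        for i, j in predicted_pairs
        if adj[i, j] and (i in group or j in group)
    )
```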
Fig. S9 supports the conjecture above. In Fig. S9(d) (i.e. LHN is used in BA), the correctly predicted links (their number is denoted as N_s) are mainly links connecting to large degree nodes when µ is small. When µ is in the special range, the large degree nodes are punished and their N_s drops to 0 due to the high correlation between k and d. Since some small degree nodes are connected to these large degree nodes, their N_s is slightly reduced as well, so we also observe a small drop of N_s for the small degree nodes. When µ is further increased, N_s for large degree nodes stays at zero, but N_s for small degree nodes increases considerably, since the spreading reaches the small degree nodes more frequently and results in a denser and more meaningful similarity matrix for prediction. The final drop in the number of correctly predicted links (when µ is very large) occurs because the spreading covers almost the whole network, so the similarity extracted from the spreading can no longer reflect meaningful structural information of the network.
The results of LHN in SW are similar (see Fig. S9(c)). As the degree is homogeneous in SW, the punishing effect is less obvious than in BA. However, we can still observe that the large degree nodes start to be punished in the "special range" and that small degree nodes start to have many correctly predicted links when µ is larger than the "special range". Since there is no penalty term in CN, there is no sudden drop of N_s for the large degree nodes in Fig. S9(a) and (d).
The results of Jac lie between those of CN and LHN, as shown in Fig. S9(b) and (e).

B. Other similarity metrics
We study the influence of different parameters (i.e. N, k, µ) on the performance of different similarity metrics in network reconstruction, as shown below in Fig. S10. We consider eight similarity metrics: the common neighbor index (CN), Jaccard index (Jac), Cosine index (Cos), Hub depressed index (HDI), Hub promoted index (HPI), Sorensen index (SSI), Resource allocation index (RA), and Leicht-Holme-Newman index (LHN). For each method, we also study its temporal version. These methods are described as follows.
(i) Cosine Index (Cos): also named the Salton index [1].
(ii) Temporal Cosine Index (TCos): the Cos index improved by T_iα.
(iii) Sorensen Index (SSI): used mainly for ecological community data [2].
(iv) Temporal Sorensen Index (TSSI): the improved Sorensen index.
(v) Hub Promoted Index (HPI): proposed for quantifying the topological overlap of pairs of substrates in metabolic networks [3].
(vi) Temporal Hub Promoted Index (THPI): the temporal version of HPI.
(vii) Hub Depressed Index (HDI): a measure with the opposite effect on hubs [4].
(viii) Temporal Hub Depressed Index (THDI): the temporal version of HDI.
(ix) Preferential Attachment (PA): based on the well-known preferential attachment mechanism [3], it gives the likelihood score for two nodes to have a link.
(x) Temporal Preferential Attachment (TPA): the temporal version of PA.
(xi) Asymmetric Index (AS): a measure for inferring the network topology in directed networks [5].
(xii) Temporal Asymmetric Index (TAS): the temporal version of AS.
We find that the temporal similarity metrics significantly outperform the corresponding traditional similarity metrics, especially when µ is large. This is consistent with the findings in the paper. When k increases, the precision of both the traditional and the temporal similarity metrics tends to increase. When N increases, the precision of both tends to decrease. However, as k and N increase, the temporal metrics consistently outperform the traditional metrics. Therefore, it is better to use the temporal similarity metrics to reconstruct networks.
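For reference, the static indices in the list above are usually written as follows in the link-prediction literature. The forms below are an assumption in two respects: they are the textbook definitions rather than equations taken from this text, and the neighbor sets are replaced by the sets of received news, with Γ_i the set of news received by node i and d_i = |Γ_i|, by analogy with the LHN description given earlier. The temporal versions and the asymmetric index are not written out, since they depend on definitions (such as T_iα) that are not given in this section.

```latex
\begin{align}
s_{ij}^{\mathrm{Cos}} &= \frac{|\Gamma_i \cap \Gamma_j|}{\sqrt{d_i d_j}}, \\
s_{ij}^{\mathrm{SSI}} &= \frac{2\,|\Gamma_i \cap \Gamma_j|}{d_i + d_j}, \\
s_{ij}^{\mathrm{HPI}} &= \frac{|\Gamma_i \cap \Gamma_j|}{\min(d_i, d_j)}, \\
s_{ij}^{\mathrm{HDI}} &= \frac{|\Gamma_i \cap \Gamma_j|}{\max(d_i, d_j)}, \\
s_{ij}^{\mathrm{PA}}  &= d_i d_j .
\end{align}
```

All of these except PA divide the number of common news by a function of d_i and d_j, which is the degree-based normalization the discussion of the "special range" refers to.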
When different similarity metrics are compared, we find that the CN and RA indices show a smaller drop of precision in the "special range" than the other similarity metrics such as LHN, SSI, HPI, HDI, Cos and Jac. This is because the latter group of metrics all have some form of punishment based on node degree. In LHN, the drop of precision in the special range is the most significant. The "special range" effect is much less obvious when the temporal similarity metrics are used. In LHN, however, an observable drop of precision in the "special range" still exists, because the degree punishment is most severe in LHN. We then compare the results of different metrics on SW and BA networks. In SW networks, all the temporal metrics can reach a very high precision (close to 1) when µ is large. However, the TRA method reaches its highest value later (i.e. a larger µ is needed) than the other methods. In BA networks, THPI reaches the highest precision.
In summary, if the time information of the spreading is unknown, it is better to use RA and CN to reconstruct the network, as their precision is not affected much by the "special range" effect. If the time information of the spreading is available, it is better to use THPI to reconstruct the network, as it performs similarly to the other metrics in homogeneous networks and performs best in heterogeneous networks.

Fig. S10. Influence of different parameters (i.e. N, k, µ) on the precision of different similarity metrics in network reconstruction. Each row of subplots corresponds to one similarity metric. In the first column, k = 10, N = 500. In the second column, µ = 0.1, N = 500. In the third column, k = 10, µ = 0.1.

Table S1. Basic properties of real undirected networks and the performance of the CN, TCN, Jac and TJac methods on these networks. The parameters are set as µ = 2/k and f = 0.5. We select a relatively large µ because the performance difference between the traditional and the temporal similarity metrics becomes more significant under large µ, as shown in Fig. 4. The similarity method with the best performance in each network is highlighted in bold font. The results are averaged over 50 independent realizations. Column headers: Degree correlation, followed by CN, TCN, Jac and TJac for each of three performance measures.

Table S3. Basic properties of real directed networks and the performance of the CN, TCN, Jac and TJac methods on these networks. The parameters are set as µ = 2/k and f = 0.5. We select a relatively large µ because the performance difference between the traditional and the temporal similarity metrics becomes more significant under large µ, as shown in Fig. 4. The similarity method with the best performance in each network is highlighted in bold font. The results are averaged over 50 independent realizations.

Table S4. Basic properties of real directed networks (N is the number of nodes, E the number of links), together with the AUC and two network reconstruction metrics (Precision and Correlation) of two additional typical similarity definitions, the LHN and RA methods, and of their versions with temporal information (TLHN and TRA). The parameters are set as µ = 2/k and f = 0.5. The similarity method with the best performance in each network is highlighted in bold font. The results are averaged over 50 independent realizations. Column headers: Degree correlation, followed by LHN, TLHN, RA and TRA for each performance measure.

Table S5. AUC of other additional typical similarity definitions and of their versions with temporal information. The parameters are set as µ = 2/k and f = 0.5. The similarity method with the best performance in each network is highlighted in bold font. The results are averaged over 50 independent realizations. Column headers: Cos, TCos, SSI, TSSI, HPI, THPI, HDI, THDI, PA, TPA, AS.

Table S6. Precision of other additional typical similarity definitions and of their versions with temporal information. The parameters are set as µ = 2/k and f = 0.5. The similarity method with the best performance in each network is highlighted in bold font. The results are averaged over 50 independent realizations. Column headers: Cos, TCos, SSI, TSSI, HPI, THPI, HDI, THDI, PA, TPA, AS.

Table S7. Correlation of other additional typical similarity definitions and of their versions with temporal information. The parameters are set as µ = 2/k and f = 0.5. The similarity method with the best performance in each network is highlighted in bold font. The results are averaged over 50 independent realizations. Column headers: Cos, TCos, SSI, TSSI, HPI, THPI, HDI, THDI, PA, TPA, AS.