Fast and accurate detection of spread source in large complex networks

Spread over complex networks is a ubiquitous process with increasingly wide applications. Locating spread sources is often important, e.g. finding patient zero in an epidemic, or the source of a rumor spreading in a social network. Pinto, Thiran and Vetterli introduced an algorithm (PTVA) to solve the important case of this problem in which a limited set of nodes act as observers and report the times at which the spread reached them. PTVA uses all observers to find a solution. Here we propose a new approach in which observers with low-quality information (i.e. with large spread encounter times) are ignored and potential sources are selected based on the likelihood gradient from high-quality observers. The original complexity of PTVA is O(N^α), where α ∈ (3,4) depends on the network topology and the number of observers (N denotes the number of nodes in the network). Our Gradient Maximum Likelihood Algorithm (GMLA) reduces this complexity to O(N² log N). Extensive numerical tests performed on synthetic networks and the real Gnutella network, under the limitation that the identities of spreaders are unknown to observers, demonstrate that for scale-free networks with such a limitation GMLA yields higher-quality localization results than PTVA does.


S.1 Pinto-Thiran-Vetterli Algorithm
The algorithm proposed by Pinto, Thiran and Vetterli is a deterministic Gaussian method for single-source identification in a network of size N. It requires full knowledge of the network structure G and of the times t_k, 1 ≤ k ≤ K, at which the information arrived at observer k (observed delays). The observers are nodes that report from which neighbor and at what time they received the information. One observer (let us say o_K) serves as a reference point. The method also assumes that the information spreads through the network along the shortest paths and that the propagation times θ_i for each edge are i.i.d. Gaussian random variables with known mean µ and variance σ². The assumption of diffusion along the shortest paths is used when the algorithm is applied to an arbitrary, non-tree graph. In this case the actual propagation tree (which is not known a priori) is approximated by a breadth-first search (BFS) tree composed of the shortest paths between the node which is supposed to be the source and all observers. The algorithm calculates a score for each node s using the density function of a (K − 1)-dimensional normal distribution, where K is the number of observers. The node with the highest score is pointed out as the rumor source. If |P(u, v)| denotes the length of the path P(u, v) connecting nodes u and v, d is the vector of observed delays relative to the reference observer, µ_s is the vector of deterministic delays and Λ_s is the delay covariance matrix, the score of node s is given by

φ(s) = exp(−(1/2) (d − µ_s)^T Λ_s^{−1} (d − µ_s)) / √((2π)^{K−1} det Λ_s),

where [µ_s]_k = µ(|P(s, o_k)| − |P(s, o_K)|) and [Λ_s]_{k,i} = σ² |P(o_K, o_k) ∩ P(o_K, o_i)| (the number of edges shared by the two paths, computed on the BFS tree rooted at s), for k, i = 1, . . . , K − 1. Despite the simplifications, the method works reasonably well under certain conditions (density of observers ρ = K/N > 10% and propagation ratio λ = µ/σ > √2).
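To make the scoring step concrete, the sketch below evaluates this Gaussian score for a single candidate node. It is a minimal illustration, not the authors' implementation: the function name `ptva_score`, the adjacency-dict input format and the edge-set representation of tree paths are our own assumptions.

```python
from collections import deque
import numpy as np

def ptva_score(adj, s, observers, delays, mu, sigma):
    """Sketch of the PTVA score of candidate source s (higher = more likely).
    adj: adjacency dict {node: [neighbors]}; the last observer acts as the
    reference o_K; delays are the observed arrival times (same order)."""
    # BFS tree rooted at s approximates the (unknown) propagation tree.
    parent = {s: None}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                q.append(v)

    def up_edges(u):  # edges on the tree path u -> s
        edges = set()
        while parent[u] is not None:
            edges.add(frozenset((u, parent[u])))
            u = parent[u]
        return edges

    ref = observers[-1]
    ref_up = up_edges(ref)
    # Tree path o_K -> o_k: symmetric difference of the two root paths
    # (the common part above their lowest common ancestor cancels out).
    path = {o: up_edges(o) ^ ref_up for o in observers[:-1]}
    # [mu_s]_k = mu * (|P(s,o_k)| - |P(s,o_K)|)
    mu_s = mu * np.array([len(up_edges(o)) - len(ref_up) for o in observers[:-1]])
    # Observed delays relative to the reference observer.
    d = np.array([delays[k] - delays[-1] for k in range(len(observers) - 1)])
    # [Lam_s]_{k,i} = sigma^2 * (shared edges of paths o_K->o_k and o_K->o_i)
    Lam = np.array([[sigma**2 * len(path[oa] & path[ob]) for ob in observers[:-1]]
                    for oa in observers[:-1]], dtype=float)
    K1 = len(observers) - 1
    diff = d - mu_s
    # Density of the (K-1)-dimensional normal distribution at d.
    return float(np.exp(-0.5 * diff @ np.linalg.solve(Lam, diff))
                 / np.sqrt((2 * np.pi) ** K1 * np.linalg.det(Lam)))
```

On a tree the BFS tree is exact, so with noiseless delays the true source attains the maximal exponent; on general graphs the BFS tree is only the approximation discussed above.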
PTVA uses the whole network for inferring the source, which makes the algorithm computationally expensive.
It calculates the score for each node in the network, which requires creating a BFS tree N times. Each BFS tree covers practically the entire network, since it reaches all observers, including those far from the root node. The complexity of creating a BFS tree is O(N²) per node s ∈ G. Unfortunately, the assumption of diffusion along the shortest paths works well only for observers which are close to the suspected node (the root of the BFS tree), since it relies on the fact that typical random networks are locally tree-like. However, this may not be true for observers far from the supposed source. Thus making use of the distant observers may be problematic: it increases the cost and decreases the correctness (because there can be many possible paths between the source and the remote observers). So far we have considered the situation when the number of observers is much smaller than the number of other nodes. In this case the overall complexity is O(N³). In general, the algorithm's complexity for arbitrary graphs is O(N(K³ + N²)). Assuming K ∼ N^γ, PTVA complexity ranges from O(N³) when γ ≤ 2/3 to O(N⁴) when γ = 1.
Figure S1: The time complexity of the matrix operations for PTVA is O(N^2.68) per node (red line), while the time complexity of tree creation is O(N^1.81) per node for PTVA (green) and O(N^1.34) for GMLA (blue). The tests were performed on a BA network with m = 3. The observers were distributed randomly over the whole network with density ρ = 0.1.
The results of selected studies on the localization quality of the Pinto-Thiran-Vetterli Algorithm executed on data with Limited Information (PTVA-LI) are presented in figures S2 and S3. The explanation and motivation of PTVA-LI can be found in Section Pinto-Thiran-Vetterli Algorithm in the main paper. Figure S2 shows how the algorithm's quality of localization depends on the density of observers. We present these results for three different values of the infection rate: β ∈ {0.2, 0.5, 0.8}. As seen in Fig. S2, increasing the density of observers improves the quality of the results only to some extent. The higher the infection rate, the faster the accuracy curve saturates at a level equal to the infection rate. This means that the accuracy of PTVA-LI does not converge to unity for ρ → 1 if β < 1. On the other hand, the computation time increases very fast with the density of observers, i.e. t ∼ ρ². Figure S3 presents the dependency of the localization quality on the infection rate for three different values of the density of observers, ρ ∈ {0.1, 0.2, 0.3}. One can see that the infection rate is a crucial factor for the quality of the results of PTVA-LI. For β < 0.5 (which corresponds to propagation ratio λ < √2) PTVA-LI has worse accuracy and distance error than the baseline method. For 0.5 ≤ β < 0.6 PTVA-LI has better accuracy than the baseline method but still a worse distance error. For β ≥ 0.6 PTVA-LI finally overtakes the baseline method on both measures. Our studies demonstrate that PTVA-LI works best with a small density of observers, ρ ≈ 0.2 (due to the short calculation time), and a high propagation ratio (the higher λ, the better the performance).

S.2 Gradient Maximum Likelihood Algorithm
GMLA always uses the K_0 observers that were informed first (which we call "the nearest" for simplicity) and skips the others. Thus, the question arises whether the observers which receive the signal later provide valuable additional information. Figure S4 shows how the score changes when the algorithm takes into account an increasing number of observers, sorted in ascending order of the time of receiving the signal. It shows that the first dozen or so nearest observers provide enough information to estimate the score of a node; taking more of them does not substantially affect the result. The question remains whether the nearest observers are the best choice. Perhaps taking into account only later observers would provide equally good or better results? Figure S5 presents tests performed using various groups of observers. The total number of observers is 200 and they are partitioned into 10 non-overlapping groups. The first group contains the 20 observers with the lowest delays, the second group consists of the observers at positions 21-40, and the last, tenth group contains the 20 observers which were informed last. These results show that only the nearest observers provide sufficiently accurate information to locate the true source.
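The observer-selection step described above amounts to sorting the observers by arrival time and keeping the first K_0. A minimal sketch (the helper name is our own, hypothetical choice):

```python
def nearest_observers(observers, delays, k0):
    """Return the k0 observers that received the signal earliest.
    GMLA discards the remaining, later-informed observers."""
    ranked = sorted(zip(delays, observers))  # ascending arrival time
    return [obs for _, obs in ranked[:k0]]
```

Sorting costs O(K log K), which is negligible next to the score computations it saves.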
GMLA rests on two fundamental ideas: smartly limiting the number of observers and a gradient-like selection of suspected nodes. We have studied the effect of each of them on the performance of the algorithm. The basic premise behind the selection of suspected nodes is saving the time spent on calculating the scores. The number of suspected (and checked) nodes in GMLA is proportional to the logarithm of the network size (Fig. S6a), and therefore much smaller than the number of suspected nodes in PTVA-LI. Figure S7 shows the quality of the localization results of the algorithm with the gradient-like selection. These results show that the gradient-like selection not only decreases the computation time (Fig. S7d) but also improves the accuracy, the rank of the true source and the distance error. Table S1 explains why the gradient-like selection improves the quality of the source detection. The first column shows that PTVA-LI always finds the true source if the score of the source is the global maximum (23.9% of all cases); otherwise, PTVA-LI is always wrong (76.1% of all cases). In comparison, GMLA (but without limiting the number of observers) sometimes misses the true source even though the source's score is the highest in the network (1.5% of all cases), but in some cases it detects the true source even though its score is not the maximum (5.8% of all cases). Figure S8 shows the quality of the localization results of the algorithm which uses the limited number of the nearest observers but does not use the gradient-like selection of suspected nodes. The accuracy of such an algorithm increases with the network size and outperforms PTVA-LI for networks of size N = 500 or larger. However, this is true only for scale-free networks. As we write in the Discussion in the main paper, the performance improvement from limiting the number of observers is due to rejecting observers which are "behind the hubs".
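The gradient-like selection can be sketched as a hill climb on the score: start from the earliest-informed observer, score only the current node's neighbors, and move to the best-scoring neighbor until no neighbor improves. This is an illustrative sketch under our own assumptions (the function name and the plug-in `score` callable are not from the paper; any scoring function, such as the PTVA likelihood, can be passed in).

```python
def gradient_selection(adj, start, score):
    """Climb from `start` to a local maximum of `score`, evaluating only
    neighbors of the current node at each step.
    adj: adjacency dict {node: [neighbors]}; score: callable node -> float.
    Returns the local maximum and the dict of all computed scores
    (the suspected nodes)."""
    scores = {start: score(start)}
    current = start
    while True:
        # Score every not-yet-scored neighbor of the current node.
        for v in adj[current]:
            if v not in scores:
                scores[v] = score(v)
        best = max(adj[current], key=lambda v: scores[v])
        if scores[best] <= scores[current]:
            return current, scores  # no neighbor improves: local maximum
        current = best
```

Because each step only widens the scored set by one neighborhood, the number of evaluated nodes grows with the climb length rather than with N, which is the source of the logarithmic count of suspected nodes reported above.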
The substantial computation-time reduction from limiting the number of observers is due to a large acceleration of the matrix operations (see Fig. S1).

Table S1: The results of 1000 realizations of PTVA-LI/GMLA on a BA network (N = 1000, k = 6). The numbers on the left side of the cells refer to PTVA-LI. The symbol φ(s*) denotes the score of the true source, while max(φ) is the highest score in the network in a given realization.
Figures S9 and S10 present the results of the performance tests of GMLA on BA networks. A substantial difference between PTVA and GMLA is that GMLA computes and assigns scores to a limited number of nodes, and therefore it can sometimes skip the real source. When this happens, the true source is left without a score, so its rank equals the size of the network (it occupies the last place in the score ranking, ex aequo with the other skipped nodes). Figures S9a and S10a show how often GMLA assigns a score to the true source. The probability P(s* ∈ V_s) is a monotonically increasing function of ρ and β; it equals 0.85 for ρ = 0.2 and β = 0.5 (the typical values used in our studies). The fact that P(s* ∈ V_s) < 1 weakens the performance of GMLA, but our studies show that this is the case only when ρ or β is very small. On the other hand, when the propagation ratio is very high (β close to 1), PTVA-LI performs better than GMLA. However, in real life the propagation ratio is rather low, because the spreading process is highly nondeterministic. For low and medium β, GMLA dominates over PTVA-LI despite the fact that GMLA sometimes rejects the true source at the selection stage. (Caption of Figs. S9-S10: the observers were distributed randomly over the whole network with density ρ = 0.2; the SI model was used for propagation with infection rate β = 0.5; the results are averaged over 100 realizations; the standard errors are smaller than the symbols in the plots.)

S.3 Optimal number of the nearest observers for GMLA
GMLA requires the optimization of one crucial parameter, namely the number of selected nearest observers K_0. This parameter controls both the precision and the speed of the algorithm. Figure S11 shows a non-trivial dependency between K_0 and the performance of GMLA for various sizes of the BA network. Three types of behavior controlled by K_0 are visible. The first is seen in Figs. S11a and S11b.
The probability that the true source is among the nodes with the highest score (Fig. S11a) decreases when K_0 increases. This is because the average size of the group of top scorers (Fig. S11b) also decreases with K_0: the more observers we use, the better score resolution we have. The second type of dependency on K_0 can be seen in Fig. S11c,d,e,f and g. These are: the accuracy, the rank of the true source, the distance error, the number of suspected nodes, and the computation time. For each of those an optimal number of the nearest observers K_0* exists. However, this optimal value can differ between performance measures and, moreover, it depends on the size of the network (see Fig. 2d in the main article). Interestingly, too small a value of K_0 increases the calculation time (Fig. S11f). This is easily understood once we notice that the number of suspected nodes (Fig. S11g) also grows for small K_0. It follows that when K_0 is too small, GMLA needs more steps before it reaches a local maximum of the score and therefore computes scores for more nodes (which takes time). The last type of behavior is a linear dependency between the size of the propagation tree and the number of the nearest observers (see Fig. S11h). Regardless of the network size, the tree size is about 1.4 times larger than K_0. It is also worth mentioning that the effect of K_0 on the probability that GMLA includes the true source while computing scores is rather small (Fig. S11i). Another way to find the optimal number of the nearest observers is presented in Fig. S12. We test the performance of GMLA versus the size of the BA network for various forms of the dependency between K_0 and N. The best results in terms of accuracy, rank and distance error were obtained using K_0 = 0.5√N, which is in agreement with Fig. 2d in the main article.
We conducted the same study for the Erdős-Rényi (ER) network (Figs. S13-S14). Figures S13c,d,e show that for the ER graph K_0* is a saturation point rather than a peak. However, the minimum of the computation time is still visible in Fig. S13f. In this case the optimal number of the nearest observers is the minimal K_0 with which the algorithm achieves maximal performance. Figure S14 presents the performance of GMLA versus the size of the ER network for various forms of the dependency between K_0 and N. Again, the best results were obtained for K_0 = 0.5√N, as in the case of the Barabási-Albert network.
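Given the empirical fit above, choosing K_0 for a network of size N reduces to a one-liner; the rounding convention and the lower bound of one observer are our own assumptions:

```python
import math

def optimal_k0(n):
    """Heuristic number of nearest observers, K_0 ≈ 0.5 * sqrt(N),
    following the empirical fit for BA and ER networks (Figs. S12, S14).
    Rounded to the nearest integer, with at least one observer kept."""
    return max(1, round(0.5 * math.sqrt(n)))
```

For example, a network of N = 1000 nodes would use on the order of 16 nearest observers out of the ρN available.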
We also checked whether other factors can affect the optimal number of the nearest observers. Figures S15-S16 present the dependency between K_0 and the performance of GMLA for BA and ER networks with various average degrees (between 4 and 10). Although Figs. S15c,e and S16c,e may suggest some influence of the average degree on K_0*, it is too small to be significant. Moreover, Figs. S15a,b,d,f,g and S16a,b,d,f,g do not confirm the existence of such an influence. The results of the simulations with various infection rates β (Figs. S17-S18) show that the optimal number of the nearest observers does not depend on the propagation ratio.

S.4 Performance of GMLA for various networks
In addition to the tests performed on Erdős-Rényi and Barabási-Albert networks (Figs. 3-4 in the main paper), we present the results for Random Regular Graphs, Exponential Random Graphs (in two versions: generated from the configuration model and evolved by random node attachment) and scale-free networks with degree distribution P(k) ∼ k⁻³ (created using the configuration model). Both algorithms achieve their best results for Random Regular Graphs (Fig. S19), but PTVA-LI performs better than GMLA, except for the computation time. For Exponential Random Graphs the algorithms have very similar source-localization quality, with a slight advantage of PTVA-LI for the configuration model (Fig. S20) and a small advantage of GMLA for the evolving networks (Fig. S21). Both algorithms have the lowest performance for scale-free networks, but in this case GMLA substantially outperforms PTVA-LI. The deterioration of the detection quality results from the existence of hubs in the network (see Discussion in the main paper). Observers which are "behind the hubs" are not reliable (we call them "noisy observers") and distort the estimation of the scores. GMLA uses only the K_0 nearest observers, among which the fraction of noisy observers is lower than among all observers (Fig. 6 in the main paper). This, in combination with the gradient-like selection of suspected nodes, improves the quality of the algorithm's results.