Abstract
Spread over complex networks is a ubiquitous process with increasingly wide applications. Locating spread sources is often important, e.g. finding patient zero in an epidemic, or the source of a rumor spreading in a social network. Pinto, Thiran and Vetterli introduced an algorithm (PTVA) to solve the important case of this problem in which a limited set of nodes act as observers and report the times at which the spread reached them. PTVA uses all observers to find a solution. Here we propose a new approach in which observers with low quality information (i.e. with large spread encounter times) are ignored and potential sources are selected based on the likelihood gradient from high quality observers. The original complexity of PTVA is O(N^{α}), where α ∈ (3,4) depends on the network topology and number of observers (N denotes the number of nodes in the network). Our Gradient Maximum Likelihood Algorithm (GMLA) reduces this complexity to O(N^{2}log(N)). Extensive numerical tests performed on synthetic networks and on the real Gnutella network, with the limitation that the identities of spreaders are unknown to observers, demonstrate that for scale-free networks under this limitation GMLA yields higher quality localization results than PTVA does.
Introduction
We live in a networked society. Every second we interact with many networks from which we collect, process and transmit a huge amount of information, which increases exponentially each year^{1,2,3,4}. The increasing interconnectivity of the world exposes us to a worldwide range of pathogens, viruses both physical and virtual, misinformation and rumors, often with grievous consequences^{5,6,7,8}. A good example is the fake tweet about an explosion in the White House in 2013, which caused a $130 billion loss on the stock market^{9}. Another example is the United States presidential election of 2016, when many rumors and fake news stories went viral on Facebook and Twitter and might have affected the election^{10}. Many papers seek the best conditions for spreading^{11,12,13,14,15,16} or sets of optimal spreaders^{17,18,19,20}, but here we investigate an inverse problem. It has become clear that one of the major challenges facing network and data scientists is to develop effective methods for detecting and suppressing the spread of dangerous viruses, pathogens, misinformation or gossip. The basic component of such a system is undoubtedly a fast algorithm for finding the source of such a spread. The first widely discussed research on this subject was done by Shah and Zaman^{21} and Pinto, Thiran and Vetterli^{22}. In social networks, Shah and Zaman introduced the rumor centrality of a node as the number of distinct ways a rumor can spread in the network starting from that node. They showed that the node with maximum rumor centrality is the Maximum Likelihood Estimator of the rumor source if the underlying graph is a regular tree. They also studied the detection performance for irregular geometric trees, small-world networks and scale-free networks. This method assumes that we know all the connections between nodes and additionally the infection states of all nodes. Pinto et al. relaxed some of these constraints, since their algorithm requires information about the state of only a fraction of the nodes, called observers, rather than of every node. A further description of this algorithm is given in the next section and in Supplementary Information. After these two publications, the topic of source detection became popular and many other variants of this problem have been studied. We can distinguish two main approaches to this issue: snapshot-based^{21,23,24,25} and detector-based^{22,26,27} source detection. The first requires a snapshot of the entire network at a certain time instant, whereas the second needs to monitor only a small subset of nodes, but all the time. Regardless of the above division, researchers have also considered different epidemic models^{25,28}, spreading on weighted or time-varying graphs^{29,30,31,32} and multi-source detection problems^{33,34}. In 2014 Jiang et al. described the state of the art and conducted comparative studies^{35}. One of their conclusions is that current methods are too computationally expensive and cannot be used for a quick identification of the propagation source. The main goal of our research was to find a method which executes in reasonable time on large complex networks and at the same time delivers high quality localization results.
Results
Before demonstrating our main results, we present a brief description of the Pinto-Thiran-Vetterli algorithm (PTVA)^{22}. Then, we introduce our approach in which observers with low quality information (i.e. with large spread encounter times) are ignored and potential sources are selected based on the likelihood gradient from high quality observers. In order to measure the performance of the algorithms we use three different measures of localization quality: the accuracy, the rank and the distance error. The accuracy is the empirical probability that a source found by the algorithm is the true source. The rank is the position of the true source on the list of nodes sorted in descending order by likelihood of being the source. The distance error is the shortest path distance between the real source and the source found by the algorithm. Details on these measures can be found in the section Methods.
Pinto-Thiran-Vetterli Algorithm
Pinto, Thiran and Vetterli^{22} proposed a general framework for the localization of the spread source in which some of the nodes in the network act as observers and report from which neighbor and at what time they received the information. However, in real life the identity of the neighbor that sent the message to the observer is not always available (as in the case of gossip spreading in a public square). For this reason, and for the sake of greater generality and applicability of our studies, we do not require data received by observers to contain the identities of the nodes from which the spread came. We refer to tests in which PTVA is applied to such data as the Pinto-Thiran-Vetterli Algorithm executed on data with Limited Information (PTVA-LI). This lowering of the requirements on input data increases the applicability of the methods but reduces detection accuracy; it does not, however, affect the algorithm’s complexity or speed. Thus, PTVA-LI tests for the speed and complexity are valid also for PTVA.
PTVA calculates the likelihood of each node being the source (which we call the score, see Eq. 1 in Sup. Inf. Section S.1) using the reported times (observed delays) from all available observers. For this purpose, PTVA assumes that information spreads through the network along the shortest paths and therefore uses a breadth-first search (BFS) tree in place of the actual but unknown propagation tree. The method also assumes that the propagation times θ_{ i } for each edge are i.i.d. Gaussian random variables, for which the mean μ and the variance σ^{2} are known. The algorithm’s complexity for arbitrary graphs is O(N(K^{3} + N^{2})), where N is the size of the network and K is the number of observers. If K ~ N^{γ}, PTVA complexity ranges from O(N^{3}) when γ ≤ 2/3 to O(N^{4}) when γ = 1. For more details on PTVA see Sup. Inf. Section S.1.
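The two limiting exponents quoted above follow from substituting K ~ N^{γ} into the general bound. A short worked step, using only the symbols already defined (α is the exponent of the resulting power law):

```latex
\begin{align}
N\left(K^{3}+N^{2}\right) &\sim N\left(N^{3\gamma}+N^{2}\right) = N^{1+3\gamma}+N^{3},\\
\alpha &= \max\left(3,\; 1+3\gamma\right).
\end{align}
```

For γ ≤ 2/3 the term N^{3} dominates and α = 3, while for γ = 1 the matrix term dominates and α = 4.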
Gradient Maximum Likelihood Algorithm
Description
Compared to the framework introduced by PTVA-LI we propose two improvements: a limited number of observers, and a gradient-like selection of suspected nodes. The first idea takes advantage of the fact that observers which are very far from the spread source make a very small contribution to the score in comparison to the nearest observers (Fig. S4 in Sup. Inf. Section S.2). On the other hand, those distant observers greatly increase the cost of information processing. Since the distance between any observer o_{ k } and the true source should increase (on average) with the arrival time t_{ k }, we can use only a small number K_{0} ≪ K of the nearest observers and drastically shorten the time needed for computing the score. A limited number of observers was used in earlier work^{22,36}, where the search algorithm was run before all K observers were infected in order to limit the outbreak. In contrast, here we focus on the optimization of the algorithm’s complexity for large complex networks.
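Because the arrival time t_{ k } is used as a proxy for the distance to the source, picking the K_{0} nearest observers amounts to sorting observers by arrival time. A minimal sketch, assuming the reports are held in a dictionary; the function name `nearest_observers` and the data layout are illustrative and not taken from the original implementation:

```python
def nearest_observers(arrival_times, k0):
    """Return the k0 observers with the smallest arrival times.

    arrival_times maps an observer id to the time t_k at which the spread
    reached it (hypothetical layout chosen for this sketch).
    """
    # Sort observer ids by their reported time, earliest first.
    ordered = sorted(arrival_times, key=arrival_times.get)
    return ordered[:k0]
```

The first element of the returned list is the earliest observer, from which the selection of suspected nodes starts.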
The second idea introduces a procedure for selecting the nodes for which the score is calculated. It is very likely that the spread source is in close proximity to the observer with the smallest time at which the spread was observed (the first observer). The procedure starts by calculating the scores of the nearest neighbors of the first observer and then selects the neighbor with the highest score. Next, the algorithm jumps to this node and calculates the scores of its nearest neighbors in order to find one which has a score greater than or equal to the current maximum. The process is gradient-like and continues until all neighbors have a score lower than the current maximum (see Fig. 1). Each calculated score is remembered (along with the node), which allows the algorithm to avoid double calculation and to prepare a ranking of nodes suspected to be the source. The number of suspected nodes N_{0} = |V_{ s }| depends primarily on the size of the network and the average degree 〈k〉. Empirical studies show that \({N}_{0}\sim \langle k\rangle \,\mathrm{log}(N)\) (Fig. S6 in Sup. Inf. S.2). It is worth noting that the algorithm does not guarantee that the true source s^{*} will be selected for score calculation, i.e. P(s^{*}∈V_{ s }) < 1 (see Fig. S9a and S10a in Sup. Inf. S.2).
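The selection procedure can be sketched as a greedy walk over the graph. This is an illustrative sketch rather than the authors' implementation: the names `gradient_select`, `adj` (an adjacency dictionary) and `score` (a caller-supplied function standing in for the likelihood ϕ(v), whose formula is given in Sup. Inf.) are assumptions. To guarantee termination when scores tie, the sketch also assumes that the walk never returns to a node it has already stood on, a rule not stated in the text.

```python
def gradient_select(adj, first_observer, score):
    """Greedy, gradient-like selection of suspected nodes.

    adj            -- dict mapping each node to an iterable of its neighbors
    first_observer -- observer with the smallest arrival time
    score          -- callable returning the score of a node (assumed given)
    Returns (ranking, scores): nodes sorted by descending score, and the cache.
    """
    scores = {}  # cache, so that no node is scored twice

    def cached(v):
        if v not in scores:
            scores[v] = score(v)
        return scores[v]

    visited = {first_observer}  # nodes the walk has already stood on
    current = first_observer
    best = float("-inf")
    while True:
        candidates = [v for v in adj[current] if v not in visited]
        if not candidates:
            break
        nxt = max(candidates, key=cached)  # scores every candidate once
        if cached(nxt) < best:
            break  # all neighbors score lower than the current maximum
        best = cached(nxt)
        visited.add(nxt)
        current = nxt
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores
```

The cache `scores` plays the role of the set of suspected nodes: only nodes adjacent to the walk are ever scored, which is why the true source may be missed.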
The Gradient Maximum Likelihood Algorithm (GMLA) is summarized in Algorithm 2. \({\mathscr{G}}\) denotes the underlying graph, μ and σ^{2} denote the mean and the variance of the random propagation delay associated with one edge, {o_{ k }} is the set of observers and {t_{ k }} are the times at which they observed the spread. The score of a node is the likelihood that this node is the true source. We denote the score of a node v as ϕ(v). The formulas for ϕ(v), μ_{ v } and Λ_{ v } are given by equations (1,3,4) in Sup. Inf.
Complexity
Using the symbols K_{0} and N_{0} we reformulate the time complexity of GMLA as \(O({N}_{0}({K}_{0}^{3}+{N}^{2}))\) in the worst case. Assuming \({N}_{0}\sim \,\mathrm{log}(N)\) and K_{0} ≪ N, which is true for our method, the complexity can be further simplified into O(log(N)N^{2}).
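To get a feeling for the size of the reduction, the two worst-case bounds can be compared numerically. The sketch below only evaluates the bounds N(K^{3} + N^{2}) and N_{0}(K_{0}^{3} + N^{2}); it says nothing about actual running times. The function names are illustrative, all proportionality constants are set to one, the logarithm is taken as natural, and the parameter values (ρ = 0.2, K_{0} = 0.5√N, 〈k〉 = 6) are borrowed from the tests on synthetic networks described in this paper.

```python
import math

def ptva_cost(n, k):
    """Worst-case operation bound N(K^3 + N^2) for PTVA."""
    return n * (k**3 + n**2)

def gmla_cost(n, k0, n0):
    """Worst-case operation bound N0(K0^3 + N^2) for GMLA."""
    return n0 * (k0**3 + n**2)

n = 10_000                # network size
k = int(0.2 * n)          # all observers, density rho = 0.2
k0 = 0.5 * math.sqrt(n)   # nearest observers used by GMLA
n0 = 6 * math.log(n)      # suspected nodes, N0 ~ <k> log(N) with <k> = 6

# Ratio of the two bounds (dimensionless, constants ignored).
ratio = ptva_cost(n, k) / gmla_cost(n, k0, n0)
```

With K_{0} growing like √N, the matrix term K_{0}^{3} grows like N^{1.5} and stays below N^{2}, so the N^{2} term dominates the GMLA bound.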
Fine-tuning and performance
The number of the nearest observers K_{0} is a crucial parameter of GMLA and should be carefully selected. If K_{0} is too small, the accuracy of the algorithm decreases. On the other hand, a large K_{0} increases the time of computation. The optimal number of the nearest observers \({K}_{0}^{\ast }\) is the minimal number of the nearest observers K_{0} needed to achieve maximal quality of the spread source localization. We test how \({K}_{0}^{\ast }\) depends on the network size, the average degree and the propagation ratio for Erdős-Rényi (ER) and Barabási-Albert (BA) networks^{37} (see Sup. Inf. Section S.3). No substantial relationship was found between \({K}_{0}^{\ast }\) and the average degree of the network or the propagation ratio (Figs S15–S18 in Sup. Inf. S.3). Figure 2 presents how the number of the nearest observers affects the performance of GMLA for various sizes of BA network with the minimum degree m = 3 (m is the initial degree of each attached node, thus 〈k〉 = 2m = 6). The accuracy shows a clear peak, and the rank and the distance error show corresponding valleys. Figure 2d shows the estimates of \({K}_{0}^{\ast }\) for different sizes of BA network. In the case of the Erdős-Rényi network, no peak of the accuracy is observed, but the saturation point is clearly visible (Fig. S13c in Sup. Inf. S.3). This also applies to the rank of the true source and the distance error (Fig. S13d,e). The fact that we observe a peak of the accuracy for BA networks (and not only a saturation point, as for ER graphs) has substantial consequences, because it means that taking only the K_{0} ≪ K nearest observers not only shortens the computation time, but may also improve the quality of the source localization under certain circumstances. As we show further in the Discussion, such a circumstance is the occurrence of hubs in the BA network.
In the next paragraphs we present a numerical estimation of the complexity of GMLA as well as its performance in terms of the quality of results in comparison to PTVA-LI.
Tests on synthetic networks
We tested GMLA and PTVA-LI for various sizes of Erdős-Rényi (ER) random graphs and Barabási-Albert (BA) networks. We used the Susceptible-Infected model (see details in section Methods) for the spread with the infection rate β = 0.5 (\(\lambda =\sqrt{2}\)). The observers were distributed randomly over the whole network with the density ρ = 0.2. In order to maintain a high efficiency of GMLA, we set the number of the nearest observers as a function of the network size \({K}_{0}=0.5\sqrt{N}\) (see Fig. 2d and Fig. S12 in Sup. Inf. S.3). For comparative purposes, we also introduce a baseline method. The baseline method is very naive: according to it, the true source is always the first observer (the one with the smallest delay t_{ k }). Details on the baseline method are given in section Methods.
The most important feature of GMLA is a remarkable reduction of the computation time. Figures 3d and 4d show that the empirical complexity decreases from O(N^{3.46}) to O(N^{1.15}) for the ER graph and from O(N^{3.49}) to O(N^{1.32}) for the BA network. Furthermore, one can observe an initial difference between GMLA and PTVA-LI computing times for networks of size 200, which is a factor of 4.4 for the ER graph and 3.6 for the BA network.
The quality of the source localization clearly depends on the network topology. In general, both algorithms achieve better results for ER graphs than for BA networks. In the case of ER graphs, the accuracy of both algorithms is almost the same (Fig. 3a), but PTVA-LI is characterized by a lower rank and distance error (Fig. 3b,c). On the other hand, for BA networks larger than 300 nodes GMLA outperforms PTVA-LI in every test of the quality of the results (Fig. 4a–c). Moreover, the advantage of GMLA increases with the size of the BA network and is especially high for large networks, for which the computation of PTVA-LI takes too long to collect sufficiently large statistics.
Tests on real social network
Another test was performed on Gnutella, a real peer-to-peer network. This kind of network is used for the direct exchange of data between users via the Internet and therefore can be used to spread malware. The graph obtained from SNAP Datasets^{38,39,40} contains N = 6299 nodes and has the average degree 〈k〉 = 6.6 (more details on the data are in the section Methods). We examine the algorithms for different densities of the observers, but we keep a constant number of the nearest observers in GMLA (K_{0} = 30). During the tests we use the simple SI model to simulate spreading. The results are shown in Fig. 5. For densities of the observers below 10% the outcomes of both methods are very similar: GMLA has slightly better accuracy but visibly worse rank than PTVA-LI. The situation changes when the density of the observers is equal to or greater than 10%: GMLA performs better according to all efficiency measures. However, the main difference between these algorithms lies in the computation time (Fig. 5b). Initially, for ρ = 2.5% the computation time differs by a factor of 61.5, and this factor increases with the density of observers since the computation time for PTVA-LI increases with ρ (see Fig. S2d in Sup. Inf. Section S.1).
Discussion
We introduce a new algorithm (GMLA) for the spread source localization in the well-known Pinto-Thiran-Vetterli limited observers formulation. The main drawback of the Pinto-Thiran-Vetterli Algorithm (PTVA) is its time complexity. For large networks with many observers the complexity of PTVA is defined by the complexity of matrix operations, which is O(K^{3}) per node in the worst case (where K denotes the number of observers). We avoid this drawback in our algorithm by reducing the number of the observers used to determine the score (the likelihood of being the source) and by limiting the number of suspected nodes. The latter is performed by the selection procedure which starts from the neighbors of the first observer and follows the gradient of the score. As a result of the selection, we get a limited number of suspected nodes \({N}_{0}=|{V}_{s}|\sim \,\mathrm{log}\,N\), in contrast to PTVA where each node is checked (V_{ s } = V). Thanks to this approach, the complexity of the Gradient Maximum Likelihood Algorithm (GMLA) is O(log(N)N^{2}) in the worst case, and as far as we know this is the fastest algorithm for spread source detection in generic networks with incomplete observations.
We test GMLA and PTVA-LI on Erdős-Rényi, Barabási-Albert and Gnutella networks and compare the performance of these algorithms using three measures: the accuracy, the rank of the true source, and the distance error. Both algorithms work noticeably better for ER graphs than for BA networks. For ER graphs, the quality of source localization by both algorithms is similar (with a minimal advantage of PTVA-LI), but for BA networks GMLA achieves much better results. The additional tests performed on Regular Random Graphs (Fig. S19 in Sup. Inf. Section S.4), Exponential Random Graphs (Figs S20, S21 in Sup. Inf. S.4) and the Configuration Model with a degree distribution which follows a power law (Fig. S22 in Sup. Inf. S.4) confirm that GMLA outperforms PTVA-LI for scale-free networks. As is well known, the essential property of a scale-free network is the existence of hubs, i.e. nodes with a very high degree (here we consider nodes with \(k\geqslant \sqrt{N}\) to be the hubs). The hubs are usually responsible for a very rapid spread in the network, but can their presence hinder detection of the source? Fig. 6a shows the accuracy of PTVA-LI for 4 special sets of observers in a BA network. All sets have the same size (15 nodes) and contain only observers which are second order neighbors of the true source. In addition, the first set (black triangles) consists solely of observers which are “behind” the hubs. We say that an observer is “behind” a hub (or is noisy) if the shortest path between this observer and the true source passes through any hub. This also applies to observers which are themselves hubs. The second set (gold triangles) is the opposite of the first set: it contains only non-noisy observers which are not “behind” any hub. The third set (dark red squares) is a random mixture of the first two. The last set (purple diamonds) consists of the observers which have the smallest times at which the spread reached them (the quickest observers).
This is the same criterion for the selection of observers as that which GMLA uses. As Fig. 6a shows, using the observers “behind” the hubs substantially worsens the accuracy of PTVA-LI. It means that information is degraded after passing through a hub. This is the main reason why PTVA-LI and GMLA are less effective for scale-free networks. The highest accuracy of PTVA-LI is achieved when using only non-noisy observers. However, the quality of the source localization of the algorithm with the quickest observers is only slightly lower. Since GMLA uses the quickest observers, it achieves better results than PTVA-LI in scale-free networks with hubs, because the nearest observers are infrequently “behind” the hubs for sufficiently large networks, as is confirmed by Fig. 6b. Moreover, this conclusion is supported by the results obtained for the Gnutella network, which also contains some hubs (0.4% of nodes have degree \(k\geqslant \sqrt{N}\)).
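The notion of a noisy observer can be made concrete with a breadth-first search from the true source. The following is a sketch under stated assumptions, not the procedure used to produce Fig. 6: when several shortest paths exist, only the one found by the BFS tree is inspected, the source itself is not counted as a hub, and the names `noisy_observers` and `adj` are hypothetical.

```python
import math
from collections import deque

def noisy_observers(adj, source, observers):
    """Return the observers that are hubs or lie "behind" a hub.

    adj maps each node to a list of neighbors; a hub is a node whose
    degree is at least sqrt(N), following the threshold used in the text.
    """
    hub_degree = math.sqrt(len(adj))
    is_hub = {v: len(adj[v]) >= hub_degree for v in adj}
    # behind[v] is True when the BFS-tree path from the source to v
    # contains a hub (v included, the source excluded).
    behind = {source: False}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in behind:
                behind[v] = behind[u] or is_hub[v]
                queue.append(v)
    return {o for o in observers if behind.get(o, False)}
```

Observers unreachable from the source are treated as non-noisy here, which is a further simplification.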
Although GMLA does not use information from all observers, as PTVA-LI does, it achieves better results for scale-free networks in quality of localization tests based on three measures: the accuracy, the rank of the true source, and the distance error. This is because GMLA acts like a filter and rejects low quality information from distant observers which are often “behind” the hubs.
In summary, we proposed a new method for fast and accurate detection of the spread source with incomplete observations, which is capable of processing large networks consisting of tens of thousands of nodes in a timely manner. Our algorithm is much faster than the Pinto-Thiran-Vetterli algorithm and, for scale-free networks, provides higher quality localization results. The key to this success is limiting the information sources to the most important observers, while ignoring excessive and noisy information from far observers, as well as the use of the likelihood gradient for the selection of potential spread sources. The phrase “less is more” once again turned out to be true here.
Methods
Propagation ratio
For spreading process we define the propagation ratio λ as the ratio between the mean μ and the standard deviation σ of time delay associated with an edge in the network.
Susceptible-Infected (SI) model
We simulate the spread through the network using the discrete Susceptible-Infected (SI) model^{41}. In this model each node can be in one of two states: susceptible or infected. At t = 0 only one random node is infected. We call this node the true source. At each subsequent time step each infected node has a chance to pass the information to each of its neighbors. The number of chances per time step is equal to the number of neighbors and for each neighbor the probability of success β is the same. The parameter β is called the infection rate. Since the number of time steps needed to pass the information from one node to its neighbor is equal to the number of independent trials (with the probability β) needed for the first occurrence of success, it is described by the geometric distribution, and therefore the mean propagation time per edge is μ = 1/β and the variance is σ^{2} = (1−β)/β^{2}. It follows that the propagation ratio λ = μ/σ for the SI model is \(\lambda =1/\sqrt{1-\beta }\).
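The model described above can be sketched in a few lines. This is an illustrative sketch, not the authors' Java code: the names `simulate_si`, `propagation_ratio` and `adj` are hypothetical, the graph is assumed to be connected (otherwise the loop would not terminate), and newly infected nodes are assumed to start transmitting only from the following time step.

```python
import math
import random

def propagation_ratio(beta):
    """lambda = mu / sigma for geometric per-edge delays (requires 0 < beta < 1)."""
    mu = 1.0 / beta
    sigma = math.sqrt(1.0 - beta) / beta
    return mu / sigma

def simulate_si(adj, source, beta, rng=None):
    """Run a discrete-time SI spread and return {node: infection time}.

    adj maps each node to a list of neighbors; the graph must be connected.
    """
    rng = rng or random.Random()
    infected_at = {source: 0}
    t = 0
    while len(infected_at) < len(adj):
        t += 1
        newly = set()
        # Each infected node gets one independent chance per neighbor.
        for u in list(infected_at):
            for v in adj[u]:
                if v not in infected_at and rng.random() < beta:
                    newly.add(v)
        # Apply new infections only after the step is complete.
        for v in newly:
            infected_at[v] = t
    return infected_at
```

The arrival times reported by observers are then simply the entries of the returned dictionary restricted to the observer set.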
Efficiency measures
Accuracy
The accuracy of a single realization is \({a}_{i}=1/|{V}_{top}|\) if s^{*}∈ V_{ top } and a_{ i } = 0 otherwise, where s^{*} is the true source and V_{ top } is the set of nodes with the highest score (top scorers). The total accuracy a is the average of a_{ i } over many realizations, therefore a ∈ [0,1]. This measure takes into account the fact that there might be more than one node with the highest score (ties are possible).
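A minimal sketch of this measure, assuming the scores of one realization are stored in a dictionary from node to score; the function names are illustrative:

```python
def realization_accuracy(scores, true_source):
    """Accuracy a_i of one realization: 1/(number of top scorers) or 0."""
    top = max(scores.values())
    top_scorers = [v for v, s in scores.items() if s == top]
    return 1.0 / len(top_scorers) if true_source in top_scorers else 0.0

def accuracy(realizations):
    """Average a_i over (scores, true_source) pairs."""
    values = [realization_accuracy(s, src) for s, src in realizations]
    return sum(values) / len(values)
```

A true source that was never scored (possible in GMLA) is simply absent from the top scorers and contributes zero.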
Rank
The rank is the position of the true source on the node list sorted in descending order by the score. In other words, this measure shows how many nodes are, according to an algorithm, better candidates for the source than the true source. If the real source has exactly the same score as some other node (or nodes), the true source is always placed below that node (these nodes) on the score list sorted in descending order. The rank takes into account the fact that an algorithm which is very poor at pointing out the source exactly (low accuracy) can be very good at pointing out a small group of nodes which contains the source.
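With the pessimistic tie rule described above, the rank can be computed by counting nodes whose score is not lower than that of the true source. The sketch assumes a 1-based position and that the true source has been scored; both are assumptions, as is the function name:

```python
def rank_of_source(scores, true_source):
    """1-based position of the true source; tied nodes are placed above it."""
    s_true = scores[true_source]
    # Counts the source itself plus every node with an equal or higher score.
    return sum(1 for s in scores.values() if s >= s_true)
```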
Distance error
The distance error is the number of hops (edges) between the true source and the node designated as the source by an algorithm. If |V_{ top }| > 1, which means that an algorithm found more than one candidate for the source, the distance error is computed as the mean shortest path distance between the real source and the top scorers.
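A sketch of this measure for an unweighted graph, using a breadth-first search from the true source to obtain hop distances. The names `hop_distances`, `distance_error` and `adj` are illustrative, and all top scorers are assumed to be reachable from the true source:

```python
from collections import deque

def hop_distances(adj, start):
    """Shortest-path hop counts from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def distance_error(adj, scores, true_source):
    """Mean hop distance between the true source and the top scorers."""
    top = max(scores.values())
    top_scorers = [v for v, s in scores.items() if s == top]
    dist = hop_distances(adj, true_source)
    return sum(dist[v] for v in top_scorers) / len(top_scorers)
```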
Baseline method
The baseline method serves as the benchmark for accuracy and distance error tests. It assumes that the real source is the first observer reporting the spread. The baseline method requires negligible computation time and its accuracy is expected to be equal to the density of observers; this follows from the fact that if the true source is among the observers, it has to be the observer with the smallest arrival time. One can expect a quite low value of the mean distance error in this case, because the baseline method never makes big mistakes in terms of distance from the true source. Apart from its poor accuracy, the baseline method does not assign scores to the nodes, which means that it cannot be used to find the rank of the real source.
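The baseline reduces to a single minimum over the reported arrival times. A minimal sketch, with an illustrative function name and the reports assumed to be stored in a dictionary from observer to arrival time:

```python
def baseline_source(arrival_times):
    """Return the observer with the smallest arrival time as the source estimate."""
    return min(arrival_times, key=arrival_times.get)
```

If several observers share the smallest time, this sketch returns the first one encountered; the text does not specify how such ties are handled.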
Gnutella peer-to-peer network
We used the data from SNAP Datasets^{38,39,40}. This dataset consists of a snapshot of the Gnutella peer-to-peer file sharing network from 8 August 2002. Nodes represent hosts in the Gnutella network topology and edges represent connections which were established on 8 August 2002. The data was anonymized by the researchers from Stanford University before it was made available. The graph contains N_{ tot } = 6301 nodes and E_{ tot } = 20777 edges, but we use the largest connected component, which consists of N = 6299 nodes and E = 20776 edges (〈k〉 = 6.6). The diameter of the network is 9, the average path length is 3.7 and the average clustering coefficient is 0.0109.
Testbed
The time tests were performed in Java 7 using an AMD FX-8350 4 GHz processor. We used jblas v.1.2.4^{42} as a fast linear algebra library for Java.
References
 1.
Barabási, A.-L. Linked: How Everything Is Connected to Everything Else and What It Means for Business, Science, and Everyday Life (Plume, 2003).
 2.
Newman, M. E. J. The structure and function of complex networks. SIAM Review 45, 167–256, https://doi.org/10.1137/S003614450342480 (2003).
 3.
Helbing, D. & Balietti, S. From social data mining to forecasting socioeconomic crises. The European Physical Journal Special Topics 195, 3, https://doi.org/10.1140/epjst/e2011014018 (2011).
 4.
Giannotti, F. et al. A planetary nervous system for social mining and collective awareness. The European Physical Journal Special Topics 214, 49–75, https://doi.org/10.1140/epjst/e2012016889 (2012).
 5.
Pastor-Satorras, R. & Vespignani, A. Epidemic spreading in scale-free networks. Phys. Rev. Lett. 86, 3200–3203, https://doi.org/10.1103/PhysRevLett.86.3200 (2001).
 6.
Moya, I., Chica, M., Saez-Lozano, J. L. & Cordon, O. An agent-based model for understanding the influence of the 11-M terrorist attacks on the 2004 Spanish elections. Knowledge-Based Systems 123, 200–216, https://doi.org/10.1016/j.knosys.2017.02.015 (2017).
 7.
Sun, M., Zhang, H., Kang, H., Zhu, G. & Fu, X. Epidemic spreading on adaptively weighted scale-free networks. Journal of Mathematical Biology 74, 1263–1298, https://doi.org/10.1007/s0028501610576 (2017).
 8.
Fu, F., Christakis, N. A. & Fowler, J. H. Dueling biological and social contagions. Scientific Reports 7. https://doi.org/10.1038/srep43634 (2017).
 9.
Strauss, G., Shell, A., Yu, R. & Acohido, B. SEC, FBI probe fake tweet that rocked stocks. USA Today https://www.usatoday.com/story/news/nation/2013/04/23/hackattackonassociatedpressshowsvulnerablemedia/2106985/ (2013).
 10.
Allcott, H. & Gentzkow, M. Social Media and Fake News in the 2016 Election. Journal of Economic Perspectives 31, 211–236, https://web.stanford.edu/gentzkow/research/fakenews.pdf (2017).
 11.
Lind, P. G., da Silva, L. R., Andrade, J. S. & Herrmann, H. J. Spreading gossip in social networks. Phys. Rev. E 76, 036117, https://doi.org/10.1103/PhysRevE.76.036117 (2007).
 12.
Stegehuis, C., van der Hofstad, R. & van Leeuwaarden, J. S. H. Epidemic spreading on complex networks with community structures. Scientific Reports 6, 29748 https://www.nature.com/articles/srep29748 (2016).
 13.
Wang, J., Sun, E., Xu, B., Li, P. & Ni, C. Abnormal cascading failure spreading on complex networks. Chaos, Solitons & Fractals 91, 695–701 http://www.sciencedirect.com/science/article/pii/S0960077916302442. https://doi.org/10.1016/j.chaos.2016.08.007 (2016).
 14.
Liu, Q.H., Wang, W., Tang, M., Zhou, T. & Lai, Y.C. Explosive spreading on complex networks: The role of synergy. Phys. Rev. E 95, 042320, https://doi.org/10.1103/PhysRevE.95.042320 (2017).
 15.
Czaplicka, A., Hołyst, J. A. & Sloot, P. M. A. Stochastic resonance for information flows on hierarchical networks. The European Physical Journal Special Topics 222, 1335–1345, https://doi.org/10.1140/epjst/e2013019295 (2013).
 16.
Czaplicka, A., Holyst, J. A. & Sloot, P. M. A. Noise enhances information transfer in hierarchical networks. Scientific reports 3, 1223 https://www.nature.com/articles/srep01223. https://doi.org/10.1038/srep01223 (2013).
 17.
Ash, C. Superspreaders are local and disproportionate. Science 355, 1036 LP–1036 http://science.sciencemag.org/content/355/6329/1036.1.abstract (2017).
 18.
Morone, F. & Makse, H. A. Influence maximization in complex networks through optimal percolation. Nature 524, 65–68 http://www.nature.com/nature/journal/v524/n7563/abs/nature14604.html (2015).
 19.
Jankowski, J. et al. Balancing Speed and Coverage by Sequential Seeding in Complex Networks. Scientific Reports 7, 891 http://www.nature.com/articles/s41598017009378., https://doi.org/10.1038/s41598017009378 (2017).
 20.
Singh, P., Sreenivasan, S., Szymanski, B. K. & Korniss, G. Threshold-limited spreading in social networks with multiple initiators. Scientific Reports 3, 2330 http://www.nature.com/srep/2013/130731/srep02330/full/srep02330.html. https://doi.org/10.1038/srep02330 (2013).
 21.
Shah, D. & Zaman, T. Rumors in a network: Who’s the culprit? IEEE Transactions on Information Theory 57, 5163–5181, https://doi.org/10.1109/TIT.2011.2158885 (2011).
 22.
Pinto, P. C., Thiran, P. & Vetterli, M. Locating the source of diffusion in large-scale networks. Physical Review Letters 109, 1–5, https://doi.org/10.1103/PhysRevLett.109.068702 (2012).
 23.
Prakash, B. A., Vreeken, J. & Faloutsos, C. Spotting culprits in epidemics: How many and which ones? Proceedings - IEEE International Conference on Data Mining, ICDM 11–20. https://doi.org/10.1109/ICDM.2012.136 (2012).
 24.
Lokhov, A. Y., Mézard, M., Ohta, H. & Zdeborová, L. Inferring the origin of an epidemic with a dynamic message-passing algorithm. Physical Review E - Statistical, Nonlinear, and Soft Matter Physics 90, 1–9, https://doi.org/10.1103/PhysRevE.90.012801 (2014).
 25.
Zhu, K. & Ying, L. Information Source Detection in the SIR Model: A Sample-Path-Based Approach. IEEE/ACM Transactions on Networking 24, 408–421, https://doi.org/10.1109/TNET.2014.2364972 (2016).
 26.
Rumor source detection under probabilistic sampling. IEEE International Symposium on Information Theory  Proceedings 2184–2188. https://doi.org/10.1109/ISIT.2013.6620613 (2013).
 27.
Luo, W., Tay, W. P. & Leng, M. How to identify an infection source with limited observations. IEEE Journal on Selected Topics in Signal Processing 8, 586–597, https://doi.org/10.1109/JSTSP.2014.2315533 (2014).
 28.
Brockmann, D. & Helbing, D. The Hidden Geometry of Complex, Network-Driven Contagion Phenomena. Science 342, 1337–1342, https://doi.org/10.1126/science.1245200 (2013).
 29.
Antulov-Fantulin, N., Lančić, A., Šmuc, T., Štefančić, H. & Šikić, M. Identification of Patient Zero in Static and Temporal Networks: Robustness and Limitations. Physical Review Letters 114, 1–5, https://doi.org/10.1103/PhysRevLett.114.248701 (2015).
 30.
Shen, Z., Cao, S., Wang, W. X., Di, Z. & Stanley, H. E. Locating the source of diffusion in complex networks by time-reversal backward spreading. Physical Review E - Statistical, Nonlinear, and Soft Matter Physics 93, 1–9, https://doi.org/10.1103/PhysRevE.93.032301 (2016).
 31.
Braunstein, A. & Ingrosso, A. Inference of causality in epidemics on temporal contact networks. Scientific Reports 6, 27538 http://www.nature.com/articles/srep27538. https://doi.org/10.1038/srep27538 (2016).
 32.
Jiang, J., Wen, S., Yu, S., Xiang, Y. & Zhou, W. Rumor Source Identification in Social Networks with Timevarying Topology. IEEE Transactions on Dependable and Secure Computing 5971, 1–1 http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=7393814. https://doi.org/10.1109/TDSC.2016.2522436 (2016).
 33.
Fu, L., Shen, Z. S., Wang, W. X., Fan, Y. & Di, Z. R. Multi-source localization on complex networks with limited observers. EPL 113, 18006, https://doi.org/10.1209/0295-5075/113/18006 (2016).
 34.
Fioriti, V., Chinnici, M. & Palomo, J. Predicting the sources of an outbreak with a spectral technique. Applied Mathematical Sciences 8, 6775–6782 http://arxiv.org/abs/1211.2333. https://doi.org/10.12988/ams.2014.49693 (2014).
 35.
Jiang, J., Wen, S., Yu, S., Xiang, Y. & Zhou, W. Identifying Propagation Sources in Networks: State-of-the-Art and Comparative Studies. IEEE Communications Surveys and Tutorials X, 1–17, https://doi.org/10.1109/COMST.2016.2615098 (2014).
 36.
Spinelli, B., Celis, L. E. & Thiran, P. Observer Placement for Source Localization: The Effect of Budgets and Transmission Variance. In 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton) 743–751, https://doi.org/10.1109/ALLERTON.2016.7852307 (2016).
 37.
Albert, R. & Barabási, A. L. Statistical mechanics of complex networks. Reviews of Modern Physics 74, 47–97, https://doi.org/10.1103/RevModPhys.74.47 (2002).
 38.
Leskovec, J. & Krevl, A. Gnutella peer-to-peer network: snapshot from August 8, http://snap.stanford.edu/data/p2p-Gnutella08.html. Accessed: 2017-11-30 (2002).
 39.
Ripeanu, M., Iamnitchi, A. & Foster, I. Mapping the Gnutella network. IEEE Internet Computing 6, 50–57, https://doi.org/10.1109/4236.978369 (2002).
 40.
Leskovec, J., Kleinberg, J. & Faloutsos, C. Graph evolution: Densification and shrinking diameters. ACM Trans. Knowl. Discov. Data 1 https://doi.org/10.1145/1217299.1217301 (2007).
 41.
Bailey, N. T. J. The Mathematical Theory of Infectious Diseases and its Applications. (Hafner Press, New York, 1975).
 42.
Braun, N. L., Schaback, J. & Jugel, M. L. jblas - Linear Algebra for Java. http://jblas.org/.
Acknowledgements
The work was partially supported as RENOIR Project by the European Union Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 691152, by Ministry of Science and Higher Education (Poland), grant Nos. 34/H2020/2016, 329025/PnH/2016, and by National Science Centre, Poland Grant No. 2015/19/B/ST6/02612. J.A.H. was partially supported by the Russian Scientific Foundation, Agreement #17-71-30029 with co-financing of Bank Saint Petersburg. X.L. and B.K.S. were partially supported by the Army Research Laboratory under Cooperative Agreement Number W911NF-09-2-0053 (the ARL Network Science CTA) and by the Army Research Office grant W911NF-16-1-0524. B.K.S. was partially supported by the National Science Centre, Poland, project no. 2016/21/B/ST6/01463. This research was also supported in part by PLGrid Infrastructure.
Author information
Contributions
R.P., K.S., B.K.S. and J.A.H. designed the research; R.P. implemented and performed numerical experiments and simulations; R.P., X.L., K.S., B.K.S. and J.A.H. analyzed data and discussed results; R.P., X.L., K.S., B.K.S. and J.A.H. wrote and reviewed the manuscript.
Corresponding author
Correspondence to Robert Paluch.
Ethics declarations
Competing Interests
The authors declare that they have no competing interests.
Additional information
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Paluch, R., Lu, X., Suchecki, K. et al. Fast and accurate detection of spread source in large complex networks. Sci Rep 8, 2508 (2018). https://doi.org/10.1038/s41598-018-20546-3