Locating multiple diffusion sources in time varying networks from sparse observations

Data-based source localization in complex networks has a broad range of applications. Despite recent progress, locating multiple diffusion sources in time varying networks remains an outstanding problem. Bridging structural observability and sparse signal reconstruction theories, we develop a general framework to locate diffusion sources in time varying networks based solely on sparse data from a small set of messenger nodes. A general finding is that large degree nodes produce more valuable information than small degree nodes, in contrast to the result for static networks. Choosing large degree nodes as the messengers, we find that sparse observations from a few such nodes are often sufficient for any number of diffusion sources to be located for a variety of model and empirical networks. Counterintuitively, sources in more rapidly varying networks can be identified more readily, with fewer required messenger nodes.

To our knowledge, there has been no solution to the problem of locating multiple diffusion sources associated with general dynamical processes on arbitrary time varying networks from local observations 24. The purpose of this paper is to provide an optimal solution. In particular, exploiting a combination of the structural observability and sparse signal reconstruction theories, we develop a general source localization framework that is applicable to arbitrarily time varying networks with any number of sources. We demonstrate that sparse data from a small set of messenger nodes are capable of identifying multiple diffusion sources accurately and efficiently, even in the absence of detailed information about the network structure, such as link weights, and in the presence of measurement noise. The framework is established analytically and validated through extensive numerical tests of model and empirical networks.

Framework of locating multiple sources on time-varying networks. A time-varying network with $N$ nodes is generally defined by a node set $V = \{v_1, v_2, \ldots, v_N\}$ with a set $E$ of time varying edges, where $(v_i, v_j, w_{ji}, t) \in E$ denotes a directed edge pointing from node $v_i$ to node $v_j$ with link weight $w_{ji}$ at activation time $t$. In this paper, we consider the following class of discrete-time diffusion processes on such time varying networks:
$$x_i(t+1) = x_i(t) + \beta \sum_{j=1}^{N} \left[ w_{ij}(t+1)\, x_j(t) - w_{ji}(t+1)\, x_i(t) \right], \tag{1}$$
where $x_i(t)$ is the state of node $i$ at time $t$, capturing, e.g., the fraction of infected individuals or the concentration of a water or air pollutant at place $i$; $\beta$ is the constant diffusion coefficient; and $w_{ij}(t)$ is the link weight at time $t$, where self loops are a result of the diffusion process 2. For an undirected network, we have $w_{ij}(t) = w_{ji}(t)$. (Diffusion dynamics in continuous time can be treated similarly; see Sec. S1 in Supplemental Information (SI).) The nodes from which observations are made are the messenger nodes. When the outputs from the messenger nodes are taken into account, the system becomes
$$x(t+1) = A(t+1)\, x(t), \qquad y(t) = C\, x(t), \tag{2}$$
where the state vector $x(t) \in \mathbb{R}^N$ comprises all nodes in the network at time $t$ and $A(t+1) = I + \beta L(t+1)$, with $L(t+1)$ the Laplacian-type matrix whose off-diagonal elements are $L_{ij} = w_{ij}(t+1)$ and whose diagonal elements are $L_{ii} = -\sum_{j \neq i} w_{ji}(t+1)$. In Eq. (2), $y(t) \in \mathbb{R}^q$ represents the $q$ measurable outputs from $q$ messengers at time $t$, and $C \in \mathbb{R}^{q \times N}$ is the output matrix that selects the messenger nodes.
The basic difference between source nodes and passive nodes is that, initially ($t = t_0$), the states of the former and latter are nonzero and zero, respectively. Without loss of generality, we set $t_0 = 0$. Thus, if the initial states of all nodes can be recovered from the measurements of the messenger nodes at a later time ($t > 0$), all sources can be identified. A solution to this problem can be obtained by exploiting the observability condition in canonical control theory. Specifically, we consider instants of time $t = 0, 1, \ldots, T$ and rewrite Eq. (2) as
$$Y = O\, x(0), \tag{3}$$
where $Y = [y(0)^{\top}, y(1)^{\top}, \ldots, y(T)^{\top}]^{\top}$ collects the measurements, $x(0) \in \mathbb{R}^N$ is the initial state vector, $q$ is the number of messenger nodes, and $O \in \mathbb{R}^{q(T+1) \times N}$, with row blocks $C$, $CA(1)$, $CA(2)A(1)$, ..., $CA(T) \cdots A(1)$, is the observability matrix. To be able to accurately locate the diffusion sources, a unique solution of Eq. (3) is needed, which requires $\mathrm{rank}(O) = N$.
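As a concrete illustration of the rank condition, the following minimal Python sketch builds $A(t) = I + \beta L(t)$ and the observability matrix $O$ for a toy three-node chain and computes $\mathrm{rank}(O)$ with exact rational arithmetic. The helper names (`step_matrix`, `observability_matrix`, `rank`), the toy schedule, $\beta = 1/4$ and the unit weights are illustrative assumptions, as is the convention that an edge `(i, j, w)` moves an amount $\beta w x_i$ from node $i$ to node $j$.

```python
from fractions import Fraction

def step_matrix(n, edges, beta):
    """A = I + beta*L for one time step.

    Assumed convention: an edge (i, j, w) moves beta*w*x_i from node i to node j.
    """
    A = [[Fraction(int(r == c)) for c in range(n)] for r in range(n)]
    for i, j, w in edges:
        A[j][i] += beta * w  # inflow to j from i
        A[i][i] -= beta * w  # matching outflow from i
    return A

def matmul(P, Q):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def observability_matrix(n, schedule, messengers, beta):
    """Stack C, C A(1), C A(2) A(1), ...; schedule[t-1] lists the edges active at time t."""
    C = [[Fraction(int(c == m)) for c in range(n)] for m in messengers]
    phi = [[Fraction(int(r == c)) for c in range(n)] for r in range(n)]
    rows = [row[:] for row in C]
    for edges in schedule:
        phi = matmul(step_matrix(n, edges, beta), phi)  # A(t) ... A(1)
        rows.extend(matmul(C, phi))
    return rows

def rank(M):
    """Matrix rank by exact Gaussian elimination."""
    M = [row[:] for row in M]
    r = 0
    for c in range(len(M[0])):
        pivot = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, len(M)):
            f = M[i][c] / M[r][c]
            M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
        if r == len(M):
            break
    return r

# Toy chain 0 - 1 - 2: link 0-1 active at t = 1 and t = 3, link 1-2 at t = 2.
beta = Fraction(1, 4)
link01 = [(0, 1, 1), (1, 0, 1)]
link12 = [(1, 2, 1), (2, 1, 1)]
schedule = [link01, link12, link01]
print(rank(observability_matrix(3, schedule, [0], beta)))  # 3: fully observable
print(rank(observability_matrix(3, schedule, [2], beta)))  # 2: not fully observable
```

In this toy schedule, observing node 0 alone gives $\mathrm{rank}(O) = 3 = N$, whereas observing node 2 alone gives rank 2, because node 2 receives new information at only one time step.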
Since information about the link weights may not be available, a direct calculation of $\mathrm{rank}(O)$ is not feasible. A resolution is to analyze the structural observability 25-28, which is a highly nontrivial task for time varying networks. Our idea is to exploit the independent paths in static mappings of the underlying network 29, as shown in Fig. 1(b). In particular, a mapping from a time varying network to a static network can be obtained by cloning all nodes into different layers that correspond to different times $t$. If an edge is active at $t$ [as shown in Fig. 1(a)], the two nodes at both ends of the edge in the corresponding layers in Fig. 1(b) will be connected. Note that the direction of links in Fig. 1(b) is reversed with respect to the actual direction of diffusion in Fig. 1(a), a consequence of the duality relation between structural observability and controllability 28. Figure 1(c) indicates the quantity $N_{\mathrm{OR}}(\{a\})$ when node $a$ is chosen as a messenger node. There is a single independent path, i.e., $a \to c$, such that $N_{\mathrm{OR}}(\{a\}) = 2$ (one independent path and $a$ itself). If $a$ and $d$ are messengers [Fig. 1(d)], there are two independent paths and $N_{\mathrm{OR}}(\{a, d\}) = 4$ (including the two messengers themselves). In this case, the network is fully observable. The key to source localization is thus to identify all independent paths from messenger nodes in the static mappings of the original time varying network.
In this paper, to generate a time-varying network, we propose a uniform activation network model in which random activations are imposed on a static network. Specifically, let $z$ be the number of times (activations) an edge is active in a time interval, which is randomly selected from a uniform distribution $U(1, z_{\max})$, with $z_{\max}$ denoting the maximum number of activations. After $z$ is given for each edge, the active time associated with each activation is uniformly chosen from the distribution $U(1, T)$ under the constraint that a link cannot be activated twice (or more) at one active time.
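The uniform activation model can be sketched in a few lines of Python. This is a minimal illustration, assuming integer-valued uniform distributions on $\{1, \ldots, z_{\max}\}$ and $\{1, \ldots, T\}$ and $z_{\max} \le T$; the function name `uniform_activation_network`, the five-node ring and the parameter values are hypothetical.

```python
import random

def uniform_activation_network(static_edges, T, z_max, seed=None):
    """Impose random activations on a static edge list.

    Returns temporal contacts (u, v, t), sorted by activation time t.
    """
    if not 1 <= z_max <= T:
        raise ValueError("need 1 <= z_max <= T so that active times can be distinct")
    rng = random.Random(seed)
    temporal = []
    for u, v in static_edges:
        z = rng.randint(1, z_max)  # number of activations of this edge
        # Sampling without replacement enforces at most one activation per time.
        for t in rng.sample(range(1, T + 1), z):
            temporal.append((u, v, t))
    temporal.sort(key=lambda e: e[2])
    return temporal

# Example: a five-node ring observed over T = 10 time steps.
ring = [(i, (i + 1) % 5) for i in range(5)]
contacts = uniform_activation_network(ring, T=10, z_max=3, seed=1)
print(contacts)
```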
Estimate of observable range. For a set $Q$ of messenger nodes, $N_{\mathrm{OR}}(Q)$ is exactly the number of independent paths plus the number of messengers, which can be calculated by using a maximum-flow algorithm. Here, we provide a theoretical estimate of the number of independent paths. Since every node has a self-loop, if there exists a link in a certain layer ($t > 0$), there must exist a path from that layer to the top layer ($t = 0$), as shown in Fig. 1(d). Moreover, there exists at most one independent path starting from one node in a given layer ($t > 0$). Thus, for a messenger node $v$, the maximum number of independent paths from $v$ over all layers is the number of layers in which $v$ has a link that points to other nodes. This number is simply the number $l_v$ of distinct activations of $v$, where each activation (active time) corresponds to a layer with a link going out from $v$ (see Sec. S2 in SI for more details). Thus, if the overlap among independent paths from $v$ is negligible, we have $n_{\mathrm{OR}}(\{v\}) \approx (l_v + 1)/N$, based on which the quantity $n_{\mathrm{OR}}(Q)$ of node set $Q$ can be estimated as
$$n_{\mathrm{OR}}(Q) \approx \frac{1}{N} \sum_{v \in Q} (l_v + 1),$$
with $n_{\mathrm{OR}}(Q) \le 1$. The fraction of messenger nodes is $p = q/N$, where $q$ is the number of messengers.
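The estimate can be evaluated directly from a list of temporal contacts. The sketch below is illustrative only: it assumes undirected contacts given as `(u, v, t)` triples, so that a contact counts as an activation of both end nodes, and it caps the estimate at 1; the function names and the toy contact list are hypothetical. It does not compute the exact $N_{\mathrm{OR}}$, which requires the independent paths.

```python
def distinct_activations(temporal_edges):
    """l_v: number of distinct times at which node v has an active link."""
    times = {}
    for u, v, t in temporal_edges:
        times.setdefault(u, set()).add(t)
        times.setdefault(v, set()).add(t)
    return {node: len(ts) for node, ts in times.items()}

def estimated_n_or(messengers, l, n_nodes):
    """Estimate n_OR(Q) as sum_v (l_v + 1) / N, capped at 1 (path overlap ignored)."""
    return min(1.0, sum(l.get(v, 0) + 1 for v in messengers) / n_nodes)

# Toy contact list on a network assumed to have N = 10 nodes.
contacts = [("a", "b", 1), ("a", "c", 1), ("a", "b", 3), ("c", "d", 2)]
l = distinct_activations(contacts)
print(l)                                  # {'a': 2, 'b': 2, 'c': 2, 'd': 1}
print(estimated_n_or(["a"], l, 10))       # 0.3
print(estimated_n_or(["a", "c"], l, 10))  # 0.6
```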
For the uniform activation network model, if the number of distinct activations, $l_v$, cannot be directly measured, we can use the distribution $U(1, z_{\max})$ of the number of activations and the active time distribution $U(1, T)$ to estimate the average number $\langle l \rangle$ of distinct activations. Specifically, for a node with $k$ edges, we denote their numbers of activations by $z_1, \ldots, z_k$. The probability that the number of distinct activations equals $l$ for one node with $z_1, \ldots, z_k$ is derived in Sec. S2 in SI; averaging over this distribution gives the average number of distinct activations for given $z_1, \ldots, z_k$, and a further average over the $z_i$ gives the average number $\langle l \rangle$ of distinct activations for a node of degree $k$. Given $\langle l \rangle$ for each node, for the entire messenger set $Q$, the normalized observable range can be approximated as
$$n_{\mathrm{OR}}(Q) \approx \frac{1}{N} \sum_{v \in Q} \left( \langle l \rangle_v + 1 \right),$$
where $\langle l \rangle_v$ is the average for the degree of node $v$.
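For the mean alone, an elementary argument gives closed forms that can be checked by simulation. This is an independent sketch, not the distribution derived in the SI: assuming integer-valued uniform distributions and $z_{\max} \le T$, an edge with $z$ activations is active at a given time step with probability $z/T$, and different edges are independent, so a node with $z_1, \ldots, z_k$ has on average $T[1 - \prod_i (1 - z_i/T)]$ distinct active times. Averaging over the $z_i$ gives $T[1 - (1 - (z_{\max}+1)/(2T))^k]$. The function names and parameter values below are hypothetical.

```python
import random

def mean_distinct_given_z(zs, T):
    """Mean number of distinct active times for a node whose edges have activations zs."""
    miss = 1.0
    for z in zs:
        miss *= 1 - z / T  # probability that this edge is inactive at a given time
    return T * (1 - miss)

def mean_distinct_degree_k(k, T, z_max):
    """Mean over z_i drawn uniformly from {1, ..., z_max}, for a node of degree k."""
    return T * (1 - (1 - (z_max + 1) / (2 * T)) ** k)

def simulate(k, T, z_max, trials, seed=0):
    """Monte Carlo estimate of the same mean under the uniform activation model."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        active = set()
        for _ in range(k):
            z = rng.randint(1, z_max)
            active.update(rng.sample(range(1, T + 1), z))
        total += len(active)
    return total / trials

print(mean_distinct_degree_k(4, 20, 5))
print(simulate(4, 20, 5, trials=5000))
```

Both expressions approach $T$ as $k$ grows, consistent with an upper limit of $T + 1$ for the observable centrality of a single node.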

Messenger selection.
Considering the cost of allocating messengers for monitoring the state of the whole network, finding a minimum set of messengers through independent paths represents the most efficient way to locate sources. Moreover, the set can be used to characterize the source locatability of the network. The difficulty is that this task is NP-complete 30. We employ an alternative approach, using a greedy optimization algorithm to maximize the observable range $n_{\mathrm{OR}}$ through selection of the messenger set (see Sec. S3 in SI). In addition, sub-modularity 31,32 is exploited to reduce the computational cost, and it guarantees a solution whose value is at least a fraction $(1 - 1/e) \approx 0.63$ of the global optimum.
We test our framework using model and empirical networks. Figure 2 shows the observable centrality of nodes for Erdős-Rényi (ER) 33 random and scale-free (SF) 34 networks. Three features are found that do not occur for static networks 35. First, nodes of larger degree $k$ have a higher observable centrality $N_{\mathrm{OR}}$, in sharp contrast to what happens in a static network, where both driver and messenger nodes tend to avoid large degree nodes due to their small controllable and observable range. Second, $N_{\mathrm{OR}}$ gradually approaches the upper limit $T + 1$ as $k$ increases. Third, $N_{\mathrm{OR}}$ is nearly independent of the network structure and depends mainly on $T$ and $z_{\max}$. The theoretical prediction [Eq. (9)] and numerical results agree well with each other.
The results in Fig. 2 suggest that large-degree nodes be chosen as the messengers (denoted as the max-deg strategy). To validate this strategy, we compare it with the more elaborate strategy of greedy optimization. As shown in Fig. 3, $n_{\mathrm{OR}}$ resulting from the max-deg strategy is quite close to that from the greedy strategy, especially for larger values of $z_{\max}$. The great advantage of the max-deg strategy is that it is based on local information only, whereas the greedy strategy requires global information about the network. Another remarkable finding is that a very small fraction $p$ of messenger nodes is sufficient to fully locate multiple sources ($n_{\mathrm{OR}} = 1$) for both ER and SF networks. We also test our framework using three empirical time varying networks, as shown in Fig. 4. It should be noted that the number of distinct activations $l$ of every node is available. We see that a quite small value of $p$ can ensure a complete localization of diffusion sources in all the empirical networks. For both model and empirical networks, numerical calculations are in good agreement with theoretical predictions (see Sec. S3 in SI for more details). A counterintuitive phenomenon is that, in both model and real networks, it is easier to locate diffusion sources in more rapidly changing (more frequently updating) networks, as the set of required messenger nodes is smaller (e.g., comparing $z_{\max} = 5$ with $z_{\max} = 30$ in Fig. 3 and hour with day in Fig. 4). A heuristic explanation is that more rapid changes in the network structure in fact limit the spreading patterns from sources, facilitating source localization from a smaller number of messenger nodes.
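A minimal sketch of the max-deg strategy is given below. It is a simplification under stated assumptions: nodes are added in order of decreasing degree until the estimated observable range $\sum_{v}(l_v + 1)/N$ reaches 1, which ignores overlap among independent paths, so a selected set would still need to be checked against the exact $N_{\mathrm{OR}}$. The function name and the toy degree and activation data are hypothetical.

```python
def max_degree_messengers(degree, l, n_nodes):
    """Pick messengers by descending degree until the estimated n_OR reaches 1.

    degree: dict node -> degree; l: dict node -> number of distinct activations.
    Returns the chosen nodes and the (capped) estimate of n_OR.
    """
    chosen, covered = [], 0
    for v in sorted(degree, key=degree.get, reverse=True):
        if covered >= n_nodes:
            break
        chosen.append(v)
        covered += l[v] + 1  # estimated contribution of messenger v
    return chosen, min(1.0, covered / n_nodes)

# Toy five-node example.
degree = {"a": 3, "b": 2, "c": 1, "d": 1, "e": 1}
l = {"a": 2, "b": 1, "c": 1, "d": 1, "e": 1}
print(max_degree_messengers(degree, l, n_nodes=5))  # (['a', 'b'], 1.0)
```

Only node degrees and activation counts enter the selection, which reflects the local-information character of the max-deg strategy.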
Actual localization of multiple diffusion sources. We articulate an efficient and robust method to actually locate the sources based on the already identified messenger set. In a realistic situation, the number of sources is much smaller than the network size, so the vector $x(0)$ in Eq. (3) has many zero elements. The sparsity of $x(0)$ can be exploited to greatly reduce the required measurements from messengers by using the compressive sensing (CS) paradigm for sparse signal reconstruction 36,37. Specifically, Eq. (3) can be solved and accurate reconstruction of $x(0)$ can be achieved through solutions of the following convex-optimization problem:
$$\min \|x(0)\|_1 \quad \text{subject to} \quad Y = O\, x(0),$$
where $\|x(0)\|_1 = \sum_i |x_i(0)|$ is the $L_1$ norm of $x(0)$. Here $M$ is the number of consecutive measurements made by messengers. Because of the linear independence of the rows in matrix $O$ and the sparsity of $x(0)$, it is feasible to reconstruct $x(0)$ even when $M$ is much smaller than $T + 1$. We define $n_M \equiv M/(T + 1)$ to compare with the data amount $T + 1$ required by the conventional solution for $x(0)$. To be more realistic, we include both measurement noise and uncertainties in the link weights in Eq. (2) (see Table S1 and Sec. S4 in SI).
The measurement $y(t)$ is contaminated by white truncated Gaussian noise of zero mean and variance $\sigma^2$; here $\mathbf{0} \in \mathbb{R}^q$ is the zero vector and $\mathbf{1} \in \mathbb{R}^q$ is the one vector. We assume that the uncertainties in the link weights $W$ are also truncated Gaussian, characterized by $\sigma'$. The random noise is restricted to positive values to make sure that the values of measurements and link weights are nonnegative. We use multiplicative noise to ensure that, on average, the ratio of the measurements remains the same with or without noise during the dynamics. To quantify the performance of source localization, we use the standard AUROC (area under the receiver operating characteristic curve) metric 37, where AUROC = 1 indicates the existence of a threshold that fully distinguishes between sources and passive nodes, whereas AUROC = 0.5 indicates that the two types of nodes cannot be distinguished (Sec. S7 in SI).
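The AUROC can be computed from the reconstructed initial states without choosing a threshold, as the probability that a randomly chosen source receives a higher score than a randomly chosen passive node (ties counted as one half). The sketch below uses this standard pairwise definition; the function name `auroc` and the toy scores are illustrative assumptions, and the evaluation procedure of the SI may differ in detail.

```python
def auroc(scores, is_source):
    """Area under the ROC curve via pairwise comparison of sources and passive nodes."""
    pos = [s for s, y in zip(scores, is_source) if y]
    neg = [s for s, y in zip(scores, is_source) if not y]
    if not pos or not neg:
        raise ValueError("need at least one source and one passive node")
    # Each (source, passive) pair scores 1 if ranked correctly, 0.5 if tied.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Reconstructed |x_i(0)| for four nodes; nodes 0 and 2 are the true sources.
print(auroc([0.9, 0.1, 0.8, 0.2], [True, False, True, False]))  # 1.0
print(auroc([0.5, 0.5], [True, False]))                          # 0.5
```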
We use empirical networks (as in Fig. 4) to test the performance of our CS based source localization method. As shown in Fig. 5(a) and (b), AUROC increases with $n_M$. When $n_M$ is small, AUROC shows a large deviation, indicating that the location of the sources strongly affects the accuracy of source localization for a given set of messengers; once $n_M$ exceeds some value, say 0.5, AUROC is close to 1 and the standard deviation decreases markedly, implying that sources at any locations can be accurately located. We also compared the performance of source localization for different messenger selection strategies (see Sec. S4 and Fig. S5 in SI). Figure 5(c) and (d) show the localization accuracy versus measurement noise $\sigma$ and weight uncertainty $\sigma'$. We see that relatively high accuracy can still be achieved even when the noise variance approaches unity. Nonetheless, in some simulations the AUROC is small (see Sec. S5 and Fig. S6 in SI for the distributions of AUROC), and performance in these cases may be improved by increasing the number of messengers or the length of the observation time. Further efforts are needed to determine how to balance the cost of adding more messengers against that of increasing the observation time.
In real systems, we cannot know the time-varying network structure in advance, which prevents us from selecting the optimal messengers. However, if the network structure evolves periodically or follows some pattern, e.g., the activation dynamics of each edge remain stable for a long period, we can construct a rough network based on the past interactions and select messengers using its structural properties, e.g., nodal degree and estimated observable range. To test the effectiveness of our method in such a situation, we divide the time-varying network into two parts according to the order of each edge's activation time: the first part, from which a rough network is constructed and a set of messengers is selected, and the second part, on which source localization is performed. Figure 6(a-c) displays the activation time distributions of the three empirical networks, which indicate circadian rhythms, and illustrates the dividing time point used in the simulation. Messengers are selected using the greedy algorithm and the max-deg strategy so as to ensure full observability of the first-part network, and are then used to locate the sources on the second-part network. As shown in Fig. 6(d), our source localization method performs well for both strategies on the empirical networks.

Discussions
Source localization is significant for preventing negative diffusion processes and reducing damage. Combining structural observability theory with sparse signal reconstruction, we succeed in developing a general framework for locating multiple diffusion sources in time varying networks, an extremely challenging problem in complex dynamical systems. The framework allows us to define an observable centrality for each node and to locate any number of sources by observing a small number of messenger nodes with larger values of observable centrality and exploiting the natural sparsity of sources. Appealing features of our framework include the requirement of only small amounts of measurements and robustness against noise and uncertainties in system parameters. We offer analytic formulas for the observable centrality and the minimum number of messenger nodes, which are validated using model and empirical networks. A general finding based on our framework is that large degree nodes produce more valuable information than small degree nodes, the opposite of the result for static networks based on structural observability theory. As a result, choosing larger degree nodes as messenger nodes is more efficient for locating multiple sources in time varying networks; in contrast, small degree nodes are often selected as messenger nodes in static networks. A counterintuitive finding is that sources in a more rapidly varying network can be located more readily than in a slowly changing network. A heuristic explanation for this phenomenon is that frequent changes of the network structure in general produce more independent paths in the static mapping of the original time varying network. As a result, the number of necessary messenger nodes is reduced and the sources become easier to locate.
When dealing with time-varying networks, the forward-planning problem is unavoidable, because in many real systems the future structure of the time-varying network cannot be obtained in advance. If the network structure evolves periodically or follows some pattern, we can select messengers by fully exploiting the structural information embedded in the past interactions; if the evolution of the time-varying network is totally random, then selecting messengers randomly may be the only way. In this paper, multiplicative noise is considered to test the robustness of our method. Although the average performance is still satisfactory, the worst cases are even worse than random guessing (AUROC < 0.5) when the noise is strong. Therefore, it is very important to develop a more robust and efficient inference framework that can deal with different noise settings. One possible improvement is to relax the constraint $Y = O \cdot X(0)$ and instead minimize $\|Y - O \cdot X(0)\|_2 + \lambda \|X(0)\|_1$, at the cost of adding a tuning parameter $\lambda$. Another possible way is to develop a probabilistic approach that utilizes the distribution of the noise to give a maximum likelihood estimate of the sources.
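The relaxed objective can be explored with a simple proximal-gradient (ISTA) iteration. The sketch below is one possible solver, not a method used in this paper, and it assumes the common squared-error variant $\tfrac{1}{2}\|Y - O X\|_2^2 + \lambda \|X\|_1$; the function names, the toy matrix `O`, the value of `lam` and the iteration count are all hypothetical choices.

```python
def soft_threshold(v, thr):
    """Proximal operator of thr * |v|."""
    return max(v - thr, 0.0) - max(-v - thr, 0.0)

def ista(O, Y, lam, iters=5000):
    """Minimize 0.5 * ||Y - O x||_2^2 + lam * ||x||_1 by iterative soft thresholding."""
    m, n = len(O), len(O[0])
    # 1 / ||O||_F^2 never exceeds 1 / lambda_max(O^T O), so this step size is safe.
    step = 1.0 / sum(a * a for row in O for a in row)
    x = [0.0] * n
    for _ in range(iters):
        resid = [sum(O[i][j] * x[j] for j in range(n)) - Y[i] for i in range(m)]
        grad = [sum(O[i][j] * resid[i] for i in range(m)) for j in range(n)]
        x = [soft_threshold(x[j] - step * grad[j], step * lam) for j in range(n)]
    return x

# Two measurements, three unknowns; the true sparse state is x(0) = [0, 0, 1].
O = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
Y = [1.0, 1.0]
x_hat = ista(O, Y, lam=0.02)
print([round(v, 4) for v in x_hat])  # [0.0, 0.0, 0.99]
```

In this toy problem the penalty shrinks the recovered amplitude slightly (to $1 - \lambda/2$) while identifying the correct support, which illustrates the trade-off introduced by the tuning parameter $\lambda$.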
Our framework has potential applications in addressing many problems relevant to source localization, such as consensus, synchronization on power grid networks, and locating the sources of epidemic spreading and rumor spreading in society, online social communities, and computer networks. Moreover, our work has implications for disease diagnosis and therapy, such as identifying the focal sources of epilepsy and tumors in the human body. Because of the significance and broad application potential of the source localization problem, we expect that the theory and practical algorithms presented in this work will stimulate further efforts, e.g., a more efficient and accurate algorithm to identify a minimum set of messenger nodes and a new framework applicable to systems with strong nonlinear properties.
Data availability statement. Data can be accessed at http://www.sociopatterns.org/datasets.