The shape of memory in temporal networks

How to best define, detect and characterize network memory, i.e. the dependence of a network’s structure on its past, is currently a matter of debate. Here we show that the memory of a temporal network is inherently multidimensional, and we introduce a mathematical framework for defining and efficiently estimating the microscopic shape of memory, which characterises how the activity of each link intertwines with the activities of all other links. We validate our methodology on a range of synthetic models, and we then study the memory shape of real-world temporal networks spanning social, technological and biological systems, finding that these networks display heterogeneous memory shapes. In particular, online and offline social networks are markedly different, with the latter showing richer memory and memory scales. Our theory also elucidates the phenomenon of emergent virtual loops and provides a novel methodology for exploring the dynamically rich structure of complex systems.

Here we show that network memory is inherently multidimensional and cannot be meaningfully reduced to a single scalar quantity.Accordingly, we introduce a mathematical framework for defining and efficiently estimating the microscopic shape of memory, which fully characterises how the activity of each link intertwines with the activities of all other links.We validate our methodology on a wide range of synthetic models of temporal networks with tuneable memory, and subsequently study the heterogeneous shapes of memory emerging in various real-world networks.
A temporal network is a graph whose structure changes over time.A temporal network G over N nodes can be formalised as a set of L discrete-time stochastic processes G = {E α } α=1,2,...,L , where L ≤ N (N − 1) 2 is the number of different pairs of nodes that can be connected by links over time.Each E α = {E α t } t=1,2,... is the stochastic process governing the dynamics of link α, with the random variable E α t taking the value 1 if link α is present at time t, and 0 otherwise.Note that, in general, these stochastic processes are not independent.Indeed, the properties of a temporal network not only depend on the patterns of activities of each of its links, but also on the ways in which these patterns influence each other across the network.Since the set of every possible graph with N nodes is finite, it is in principle possible to enumerate all the configurations of a temporal network, build an alphabet accordingly, and transform the temporal network into a time series of symbols from this alphabet.A straightforward way to define a scalar memory Ω(G) of G is then, by direct analogy to the case of a scalar time series, as the order p of the lowest-order Markov chain that is able to reproduce the sequence of symbols generated by G (see SI Section I-A and B and II-A for details).This approach can only work in practice for very small numbers of nodes N , as the size of the alphabet grows extremely rapidly (∼ 2 1 2 (N 2 −N ) ) and very long time series would be required for an accurate estimate.
There is, however, a more fundamental problem with this approach.Not only is the scalar memory order Ω(G) hard to estimate, but it also fails to capture fundamental microscopic differences between temporal networks.As we will show below, each temporal network is characterised by a precise pattern of memories at a microscopic scale, that we name the shape of the memory.Links can heterogeneously influence the activity of other links, and the entangled temporal dependencies among these can even bring about virtual memory resonances in the activity of each link which are systematically undetected by Ω(G), yet have real and measurable physical effects on e.g.spreading dynamics.Overall, memory is indeed a heterogeneous, multidimensional fingerprint which is not reducible to a scalar quantity.
Theory -In order to fully characterise the shape of the memory of a temporal network, we propose to define the memory co-order Ω(E α E β ) of a pair of links α and β as the furthest point in the past history of {E β t } which has influence on the current evolution of {E α t } (see SI section IIB for details).Notice that for α = β we have Ω(E α E β ) = Ω(E α E α ) = Ω(E α ).The evaluation of the whole L × L co-memory matrix M, whose element m αβ = Ω(E α E β ) is the memory co-order of the pair of links α and β, allows us then to describe, at a microscopic level, the type of memory present in a network.As an example, Fig. 1(a,b) displays the co-memory matrices M we have extracted in the case of a synthetic temporal network with N = 10 nodes and L = 45 links with two different types of correlated dynamics.To compute the values of m αβ here (and throughout this work) we have used a modified version of the Efficient Determination Criterion (EDC) [31,32] as this performs well as an estimator, is strongly consistent, and allows for optimised implementations (for full details see SI sections FIG. 1. Shape of memory and emergence of virtual loops in temporal networks with correlated link dynamics A temporal fully connected network G with N = 10 nodes and L = 45 links whose dynamics are both autocorrelated and heterogeneously cross-correlated, generated from an eCDARN(p) model with parameters q = 0.9, y = 0.5, c = 0.7 and a set of memory lengths p randomly sampled with uniform (panel (a)) or a bimodal (panel (b)) probability from {0, 1, . . ., 6} (see SI section IV-A for details).(a) The 45 × 45 entries of the co-memory matrix M (shown with a color code) display the shape of the network memory at the microscopic scale of pairs of links.In this specific case the eCDARN(p) model is chosen such that the causal structure of link dependencies is restricted in a Bayesian ring topology of L = 45 nodes, so that when link α samples its activity from the past history of other links, it randomly samples from α ± 1.The scalar memory of the network is Ω(G) = 6.Pairs of neigbouring or close links in the Bayesian ring exhibit high memory co-order, often above Ω(G), due to the onset of virtual loops (see the text), whereas distant links are seldom cross-correlated and thus display low co-order memory.(b) Similar to (a), but where the link's causal structure is given by a different Bayesian graph (see SI section IV-A for full details).A notably different memory shape emerges, however the scalar memory of the network is still Ω(G) = 6.(c) Distribution of memory co-orders in both examples, showing different heterogeneous profiles which in both cases are not well characterized by Ω(G).

I-C and II-C
).The network's memory exhibits a very peculiar shape induced by both the specific link dependence that we have planted and the pre-specified set of memory length parameters.This is further highlighted in the heterogeneous distribution of memory co-orders reported in Fig. 1(c), where it is clear that memory cannot indeed be characterised by the value of the scalar memory Ω(G) alone, which in this case is equal to 6 in both networks.The long memory contributions (above order 6) that we see in this distribution are a manifestation of what we call virtual loops (VLs).These emerge e.g. when link α depends on the past of link β, and link β in turn depends on the past of link α, inducing a long-memory loop in the activity of each link separately.While being virtual in the sense that they are not pre-specified by the model nor captured by Ω(G), they do indeed play an important role in the dynamics of the network and affect processes occurring on it.These virtual loops typically emerge when causal dependencies of link activities are described by a cyclic Bayesian network, and are indeed reminiscent of other forms of causal loops appearing in various fields of physics and modern science (see SI section III-B and C for a discussion).To better illustrate this, Fig. 2(a) and (b) show an example of a toy model of a temporal network with only three nodes and two links.The model allows us to tune the shape of the memory, while the scalar memory Ω(G) of the network is kept fixed.The adopted causal depen-dencies between the two links (each link can copy from the past of the other link) induce virtual loops.These govern the two diagonal terms of the co-order matrix M and have measurable and important effects on dynamical processes taking place over the network.Fig. 2(c) and (d) show that the time taken for an infection to spread over the entire network can indeed be very different in networks with the same value of Ω(G), but with different memory shapes (see SI sections III and IV for other models, thorough mathematical analysis of virtual loops, and additional details).Furthermore, it is easy to prove (see theorem 1 in SI section II-A and B) that Ω(G) ≤ max α,β {m αβ } ∶= Ω eff (G), that is, the scalar memory is bounded from above by the maximum co-order over all link pairs, which we term the effective memory of the network.Fig. 2(d) shows that Ω eff (G) accounts for the virtual loops in the toy network model and thus captures the measurable differences in the spreading times.Of course, Ω eff (G) is still not able to account for the rich memory heterogeneity of a temporal network (see panel (c) of Fig. 1), but is (i) better conceptually defined than Ω(G) as it captures the effect of virtual loops, and (ii) can be computed efficiently from M. In those cases where virtual loops are absent or they are decoherent, Ω(G) indeed approaches Ω eff (G) (see SI section IV for a thorough exposition of virtual loop decoherence).
Validation in synthetic networks -We have tested FIG. 2. Virtual loops and their effects on spreading processes in temporal networks (a) A sketch of five steps of the temporal evolution of a network with three nodes and two links evolving according to the displayed equation (for details see SI section IIIA).If Q t = 0 then link at time t is generated randomly.Conversely, if Q t = 1 link copies a state from the past, namely link 1 copies the value of link 2 at time t − 2, while link 2 copies the value of link 1 at time t − 3. (b) Bayesian graph of the causal dependencies between the two links.Link 1 copies from the past of link 2 (p1), whereas link 2 copies from the past of link 1 (p2), thereby inducing first-order virtual loops in the memory of links 1 and 2, which virtually copy from their own past, p1 + p2 steps back.The co-memory matrix M whose entries are the co-memory order of each pair of links is also shown.The scalar memory of the process can be proved to be Ω(G) = max{p1, p2}, whereas the effective memory Ω eff (G), obtained as the largest entry of the co-memory matrix, is p1 + p2, which differs from Ω(G) due to the existence of the virtual loops.(c) A SI (Susceptible-Infectious) epidemic spreading is defined over the temporal network.Each node can either be in the infected (red) or susceptible (green) state.If at time t there is a link E t = 1 between an infected node and a susceptible one, then the infection will be passed with some probability (if random variable Λt = 1) and the susceptible node will become infected (see SI section IIIE for details).(d) Analytical and numerical results for the average time taken for every susceptible to become infected, as a function of the difference between the scalar memory Ω(G) and the effective memory Ω eff (G) (this latter being extracted from M).For each curve, the value of Ω(G) is fixed to a given value p1, while the value of Ω eff (G) = p1 + p2 is varied by changing p2 ≤ p1.We find that the spreading times depend on the value of the effective memory Ω eff (G), which is then a better descriptor of the effects of memory than Ω(G), this latter quantity being unable to detect any of these effects.Numerical results are obtained as averages over 10 7 realisations of the network, and are in perfect agreement with the analytical prediction (see SI section IIIF for the full analysis).
the accuracy of our memory shape estimator in four generative temporal network models, each of varying complexity and with differing memory shapes.Whereas we lack analytical expressions for the memory shape, in all these models a ground truth for Ω(G) can be obtained analytically, so our analysis can examine the measurable effect of virtual loops.(i) First, we consider the DARN(p) and eDARN(p j ) models [23], where all links have independent yet autocorrelated dynamics.By design these network models are free from virtual loops, thus we expect Ω eff (G) ≈ Ω(G).(ii) Then, we consider the CDARN(p) and eCDARN(p) models [37], where link dynamics are not only autocorrelated but also cross-correlated, since links can sample their next state from either their own history or from the history of other chosen links.Virtual loops are expected to emerge in these cases, inducing Ω eff (G) > Ω(G) (see Fig. 1 for an illustration of the eCDARN(p) model and SI section IV-A for details of all four models and precise theorem statements and analytical derivations of the scalar memory).In every case we generate 10 3 realisations of each temporal network model (fully connected backbone of N = 10 nodes with randomly chosen ground truth scalar memory and a range of different parameter configurations) and count the hit rate (percentage of correct predictions) between the estimated Ω eff (G) and the analytical value FIG. 3. The shape of memory of six real-world temporal networks: (EM) emails between the employees at a construction company [33], (CM) text messages between college students [34], (RM) social contacts at a US university from the Reality Mining experiment [35], routes taken by (PB) buses, (PT) overground trains and (PU) underground trains in the Paris public transport system [36].For each temporal network, we estimate the shape of the memory M restricted to the 100 most frequently active links, and plot the respective heat maps (lighter colour means higher memory).From these we extract the distributions of memory co-orders.The effective network memory Ω eff (G) is also highlighted by hollow circles, and the actual values are reported below the plots.Networks have been sampled at two different temporal resolutions ∆t, namely every 1 and 10 minutes (heatmaps only show the ∆t = 1 min resolution).In the two online social networks (EM and CM) the distributions of memory co-orders concentrate around zero and decay rapidly, indicating very short memory overall, except for a few pairs of links.In the offline university social network (RM) we find instead two clear peaks corresponding to the presence of memory at two time-scales of about 5 and 40-50 minutes (corresponding to interaction during lecture room changes and during whole lectures) respectively.Two peaks are also observed in the three engineered networks.However, both peaks are compatible with a time scale of 5-7 minutes in PT and PU, suggesting that such systems exhibit only one effective timescale, due to enforced planning and scheduling.The bus network (PB) in addition to the 5-7 minutes also shows a memory timescale of about 30 minutes, possibly due to external phenomena such as collective delays induced by traffic jams.
of Ω(G).Results are reported in SI Figure S8 and Section IV-B.For long enough temporal series the hit rate is consistently 100% in models which are free from virtual loops, suggesting that not only is our estimator accurate, but that in these cases Ω eff (G) = Ω(G).In those models where virtual loops emerge, the hit rate decays as expected.Interestingly, in a variety of cases a high hit rate is maintained due to the phenomenon of virtual loop decoherence (see SI Section IVB-D for full details and a more in depth analysis of the performance of our estimator).
Real-world temporal networks -We have then studied the shape of memory in various real-world temporal networks, including online and offline social interactions and different types of transportation systems.Results are shown in Fig. 3 for the six networks.Only the 10 2 most active links for each network, i.e. those with largest value of ∑ t E α t , have been considered when constructing the co-memory matrix M. The coloured heat maps in the top row indicate that memory shapes vary across networks, overall being notably longer in offline networks than in online ones.The middle panels show the distribution of memory co-orders.In order to detect memory at different timescales, we have sampled the networks at two temporal resolutions, namely ∆t = 1 and ∆t = 10 min (see SI Section V-A for details).The results should be interpreted accordingly: notice for instance, that order 2 at the ∆t = 1 resolution is equivalent to a memory length of 2 minutes, whereas order 2 at the ∆t = 10 resolution is equivalent to a memory length of 20 minutes.We have found that, while in the transportation networks with tight scheduling only one memory timescale flags up, in the case of the bus network, whose scheduling can be more affected by external factors such as traffic jams, and in the case of the offline social network, at least at least two different memory timescales show up.The sit-uation is particular clear in the case of human contacts at university, which show memory lengths of 5 and 40-50 minutes corresponding to different types of mechanisms of recurrent social interactions during lectures and in between lecture room changes).
As a complement, a different projection of M into the so-called (⟨Ω⟩ i in , ⟨Ω⟩ i out ) plane is considered in SI section V-C, confirming that there is a notable difference between online and offline temporal networks, with the former having on average weaker and more homogeneous memory profiles than the latter, as well as a difference between social and engineered ones.
Conclusions -Our approach, based on the evaluation of the co-memory matrix, not only provides a sound and efficient approximation of the memory of a temporal network, but also offers a comprehensible description of its microscopic shape.Our results unveil previously hidden rich and heterogeneous memory shapes and indicate that fully considering this microscopic structure is of capital importance when it comes to understanding how epidemics spread or information diffuses in time-varying systems.We hope our work will prompt further studies and will find useful applications in areas such as urban mobility, epidemiology or information processing in neuroscience.
Data Availability.Data associated with this study can be found via the following links: manufacturing company e-mail communicationwww.ii.pwr.edu.pl/~michalski/, text messages between college students snap.stanford.edu/data/CollegeMsg.html, reality mining experimenthttp://realitycommons.media.mit.edu/realitymining.html,public transport datawww.nature.com/articles/sdata201889Code Availability.Complete implementations of our general method and all examples are available in C++, Python 2.7, Python 3.6, Java, MatLab and Rust at github.com/oewilliams/temp-net-memory(see also SI section VI).
Let us consider a discrete random variable X with sample space S and probability mass function P(x) = Prob{X = x} with x ∈ S. The entropy H(X) of the random variable X is defined in terms of the probabilities P(x) of observing x ∈ S as: The definition of entropy can be extended to a pair or more discrete random variables.Let Y be a second discrete random variable with sample space S ′ .The joint entropy of the pair X, Y is given by: where P(x, y) = Prob (X = x, Y = y) is the joint distribution of the two random variables.We can also define the entropy of the random variable X when it is conditioned on the second discrete random variable Y as: x,y P (x, y) log P (x y) .

(S3)
Entropy and memory of a time series A time series T = {X t } t=0,1,... or simply {X t } is a time-discrete stochastic process in which, at each time step t, with t = 0, 1, 2, . .., the random variable X t takes values in state space S. We will indicate as x t the realization of random variable X t , i.e. the value taken at time t by time series.
The entropy rate H of the time series T can be defined as: and the conditional entropy of T as: If the process T is strongly stationary, i.e. if its joint probability distribution does not change when shifted in time [38], then it can be proven that H (T ) = H ′ (T ).For our purposes we will assume that this is always the case, allowing us to study only conditional entropies.Here and in the following, for the sake of simplicity, we introduce the following notation.We denote the sequence of random variables X 0 , X 1 , ....., X n as X 0,n , and similarly for the realisations x 0 , x 1 , ..., x n we write x 0,n .Since x i ∈ S ∀i, then we have x 0,n ∈ S n+1 .We then define the n th order block entropy H n of the process T as the entropy associated with the first n + 1 random variables X 0 , . . ., X n : Note that in particular H 0 coincides with the entropy of the marginal distribution of the first random variable X 0 .Since T is stationary, H 0 is then the entropy associated with the marginal distribution of any of the random variables, in other words H 0 = H(X i ), ∀i.
Similarly, for n > 0, H n is the entropy of blocks of n + 1 consecutive random variables (i.e.we are not required to consider the first n + 1 random variables differently to any other set of n + 1 consecutive random variables).Of course, the entropy rate H of the process T is then: Analogously, we can define the n th order conditional entropy h n as: P (x n x 0,n−1 ) P (x 0,n−1 ) log P (x n x 0,n−1 ) .
(S8) so that: We are now ready to define the memory of the time series T .Informally, the memory of T can be thought of as the number of times steps into the past which have an influence on the next observed value.More formally, we can define the memory length, memory order, or simply memory Ω(T ) of stochastic process T = {X t } as the order p of the lowest-order Markov chain that is able to reproduce the process, i.e. such that the conditional probability mass functions satisfy: for each x 0 , x 1 , . . ., x t ∈ S t+1 , or in compact form P(x t x 0,t−1 ) = P(x t x t−p,t−1 ) for each x 0,t ∈ S t+1 .This is equivalent to say that T can be identified as a p-th order Markov chain, and we write: Ω (T ) = p: It is now possible to relate Ω (T ) to the entropies we have introduced above.Observe first that h n is a monotonically non decreasing function of n, i.e. h n ≥ h n−1 ∀n.Second, h n increases with n until when n is precisely equal to p, and remains constant thereafter, i.e. h p+d (T ) = h p (T ) for any positive integer d.This can be easily demostrated, since for a p th order Markov chain T , and for n = p + d with d some positive integer, we can write: x 0,p+d P (x 0 x 1,p+d ) P (x 1,p+d ) log P (x 0 x 1,p+d ) = − x 0,p+d P (x 0 x 1,p ) P (x 1,p+d ) log P (x 0 x 1,p ) = − x0,p P (x 0 x 1,p ) log P (x 0 x 1,p ) x p+1,p+d These two conditions together mean that the n th order conditional entropy will be at a maximum when we take n = Ω(T ), i.e.Ω(T ) coincides with the minimum value of n which maximises h n (T ).How should we compute the conditional entropy, and hence the memory, of a time series in practice?Ideally we would need an infinitely long realisation of T to be able to accurately estimate these values.For finite size time series, simply maximising h n will not give a true estimate of the memory.Various methods have been developed to overcome such a limitation, and thus present consistent estimators for the order of a process when observing a finite time series, which we will briefly discuss in further sections.

Estimating the memory
We will now detail three examples of how to estimate the memory of a stochastic process from the information stored in finite time series.These estimators all make use of the information theoretic framework that we have here detailed.The aim here is to overcome the problems associated with only having a finite amount of data from which to estimate the memory via the introduction of a "penalty term".Given an observed time series (i.e. a realisation of T ) with values (x 0 , x 1 , ..., x T ), where x k ∈ S and the set has S = m symbols, we start by to counting how many blocks of symbols of a given size are found in the time series.We label n i1,...is as the total number of times a block of s consecutive symbols with a specific arrangement for each of the s entries appears in the time series, where i j ∈ S, ∀j = 1, . . ., s. Specifically we have where I is the indicator function, so that I(x k = i 1 , ..., x k+s−1 = i s ) = 1 when x k = i 1 , ..., x k+s−1 = i s is true, and zero otherwise.We then define the following log likelihood function log L t = i1,...,it+1 n i1,...,it+1 log n i1,...,it+1 n i1,...,it . (S14) With this we can define the Akaike information criterion (AIC) [39], Bayesian information criterion (BIC) [40], and the optimal form of the efficient determination criterion (EDC) [31,32], as follows: Given some upper bound K on the order of the time series we define the corresponding estimators as This gives us three alternatives for the estimation of the true order of the observed process.While the AIC and, to a lesser extent the BIC, are commonly used to estimate the order of a Markov chain given observed data, the EDC has been shown to be, in some sense, optimal.By optimal we mean that it is the strongly consistent estimator with the fastest convergence to the true order.It should be noted that the AIC, while the most popular, is not a consistent estimator, but is included here for completeness.In light of this, when estimating the order of a given stochastic process from the observation of a time series realisation, in this work we will always use the EDC (for completeness, we remind that there are also other possible alternatives, as the memory can be estimated in practice via different approaches, each with their own advantages [31,[39][40][41][42][43], but in this work we stick with the EDC for what said above).
So far all that is written above is well-known; in what follows we provide details of our proposed framework to extend the concept of memory to temporal networks.

DEFINING AND QUANTIFYING THE MEMORY OF A TEMPORAL NETWORK
The scalar memory Ω(G) A temporal network G is formally defined by the stochastic processes that generate the time evolution of its nodes and links.For simplicity, we assume here that the set of nodes is fixed, so that only the links of the network can change over time.We indicate the number of nodes with N , and we label with the index α ∈ {1, 2, . . .L} each of the L node pairs that can be connected over time.In the most general case L = N (N − 1) 2, however smaller values of L are enough for the adequate description of the system if further topological restrictions are imposed to the backbone of the temporal network.A temporal network G over the N nodes can then be written as a set of L discrete-time stochastic processes G = {E α } α=1,2,...,L .For each link α, with α = 1, 2, . . ., L, E α = {E α t } t=1,2,... is the stochastic process which describes its dynamics.From now on, we will therefore indicate the network as either G = {E α } α=1,2,...,L , We can introduce a first definition of the memory of a temporal network G = {E α t } by direct analogy to the case of a scalar time series discussed in Section .The scalar memory order, or scalar memory length, or simply scalar memory Ω(G) of the temporal network G can be defined as the order p of the lowest-order Markov chain able to reproduce the process, i.e. the minimum value of p such that: P(e t e 0 , e 1 , . . ., e t−1 ) = P(e t e t−p , . . ., e t−1 ) (S21) for each e 0 , e 1 , . . ., e t .In compact form, this can be written as P(e t e 0,t−1 ) = P(e t e t−p,t−1 ) for each e 0,t , where by e 0,t we indicate e 0 , e 1 , . . ., e t .We can finally write: [p ∶ P(e t e 0,t−1 ) = P(e t e t−p,t−1 )] (S22) Since the space of every possible graph with N nodes is finite, it is in principle possible to enumerate all these graphs, build an alphabet accordingly, and transform a realization G into a time series of symbols from this alphabet, from which the same methods to extract the memory order of a scalar time series can be used.
The shape of the memory: the co-memory matrix M and the effective memory However, since a temporal network is the set of stochastic processes describing the dynamics of its L links, together with the memory order Ω(G) of the network as a whole, we may want also to characterize the memory of each link separately, or even the memory in the influence between two links.Such a profile of the memory at the microscopic levels of the links of the network is what we will name the shape of the memory of the temporal network.It is therefore convenient to introduce a novel concept to capture the length of the memory of the mutual influence of two different links.In order to do this, let us consider two stationary time series X = {X t } and Y = {Y t }.We can define the memory co-order, or simply co-memory Ω(X Y) = Ω({X t } {Y t }) of the random process {X t } with respect to {Y t }, as the furthest point in the past history of {Y t } which has influence on the value taken by X t at time t, i.e. as the lowest value of p such that P(x t y t−1,t−p ) = P(x t y t−1,−∞ ) where y 0,t is shorthand for the sequence y 0 , y 1 , ..., y t .We can thus write: for each y 0,t−1 and each x t .Since we consider processes to be stationary, this value is invariant of t.Note that if the two processes X and Y are the same, then Ω(X Y) = Ω(X X ) = Ω(X ).
We can now define the memory co-order or simply co-memory Ω(E α E β ) of the random process E α = {E α t } t=1,2,... , representing edge α, with respect to a second random process E β = {E β t } t=1,2,... , representing edge β, as the furthest point in the past of E β t which has influence on the future evolution of E α t : where the subscript e β t−1,t−n is shorthand for the sequence e β t−1 , e β t−2 , ..., e β t−n .The memory co-order can be evaluated for each couple of links α and β, so that the whole memory shape of the temporal network can be fully characterised by the co-memory matrix M, a L×L matrix with entries M αβ = Ω(E α E β ), which accounts for the set of all possible co-orders.Notice that, by construction, the memory co-order of a link α with itself is precisely the memory order of the link, that is Ω(E α E α ) = Ω({E α t } t ), so that matrix M contains in the diagonal entries information on the memory present in the evolution of each link considered as an independent dynamical process from the rest of the network.
Based on matrix M it is now possible to introduce another scalar projection of the network's memory, which we label effective memory Ω eff (G), as the maximum value of the co-orders over the network link pairs: It is easy to prove the following theorems, which relate the effective memory to the memory of the network as a whole.
Theorem 1.Given a temporal network G with L edges and edge processes {E α } for α = 0, ..., L, we have: where Ω(G) is the memory of the temporal network, Ω(E α E β ) are the memory co-orders and Ω eff (G) is the effective memory.
Proof.It is useful to introduce the following compact notation: E 1,L 0,n ≡ E 1 0,n , ..., E L 0,n and e 1,L 0,n = e 1 0,n , ..., e L 0,n .The n th order conditional entropy h n (G) = h n ({E α t }) can be written as: Expanding the joint probabilities in terms of conditional probabilities over the different links: we can now write an expression for h n (G) in terms of the contributions from the different links: We now note that the memory p must be consistent with equations S10 and S21 and hence must be the minimum value of n which maximises h n (G).Hence, let us define our prospective effective memory p, which we will then show to be an upper bound, as: To test that this value of p is indeed a point for which we obtain the maximum value for our conditional entropy we take a value of n ≥ p so that n = p + d.The entropy is then given by

(S31)
This shows us that h n (G) = h p (G), and so Ω(G) ≤ p.Since we defined p to be the maximum value of the possible coorders of the network, and hence the effective memory, we have now proved that Ω(G) That is to say, the memory of the temporal network is bounded above by the furthest time into the past of any link that has influence on the evolution of any other link.
Corollary 2. In the most general case of a temporal network G, the memories of its links as given by the values Ω(E α ), with α = 1, . . ., L, do not provide an upper or lower bound for Ω(G).That is to say, in general: except in special cases, such as the one considered in the previous corollary.
Proof.It is sufficient to provide two examples: one in which the maximum memory of the links is less than the memory of the network, and one in which it is greater.
First we look at a case in which the maximum link memory is less than the network memory.Consider G formed by two links ruled by the stochastic processes , for some values of q and p.Then Ω({E Second, we consider a case in which the maximum link memory is greater than the network memory.Consider again G with two links ruled by {E 1 t } and {E 2 t }.Assume now As in the previous case the memory of the network is p. however, if we substitute our expression for E 2 t into our formula for E 1 t , then we obtain , which is greater than Ω(G).We will further explore the reasoning for this in later sections.
Conceptually speaking, observe that the effective memory is similar to Granger causality in that it considers the influence of past states of a process on the present evolution of another process [44].However, they are not the same.Indeed, in the toy models that we introduce later on to demonstrate the cases where the scalar memory Ω (G) does not capture the influence of memory on spreading processes, it can also be seen that Granger causality suffers from the same issue, as here the extent of any causality between the two links is precisely the scalar memory.

Estimating the co-memory matrix M
We have shown that the memory of a temporal network can be understood in terms of the co-orders of its links, which represent the memory that one link has of another.As in the main text, we define the co-order Ω(X Y) of a link process X composed of random variables X t with realisations x t , with respect to link process Y composed of random variables Y t with realisations y t , as What remains to be found however is a way of estimating this value.We will here adapt the form of the efficient determination criterion (EDC) given in Eq.S17 and S20.Firstly, it is clear that both the number of states m and the number of observations T are consistent with their applications to sequences in general.Specifically m will be 2 (since links are either present or not) and T is defined by the data being used.This means that we now need only focus our attentions on the log-likelihood function log L k .Again, given a pair of sequences {X t } and {Y t } with realisations x t and y t respectively, and where t = 0, ..., T , the likelihood of observing the full sequence Where, similarly to before, y i−1,i−k is the joint y i−1 , y i−2 , ...y i−k , and the counting function n(x i , y i−1,i−k ) is defined by Taking the empirical estimate for the conditional we can then write the log-likelihood as Now, precisely as before, we obtain the estimator giving co-order estimate An example where we show explicitly the value taken by the estimator EDC(k) for a concrete co-order Ω(E 1 E 1 ) is depicted in Fig. S4.In that figure we also plot the autocorrelation function of the signal {E 1 t }, defined in the usual way ACF(τ )∝ ⟨E 1  t ⋅ E 1 t+τ ⟩ t .
Pair Memory Ω pair (G) Here we introduce an alternative approach to estimating the scalar memory of a temporal network.We do this by taking pairs of links and analysing the memory of the pair as if it were its own temporal network.By this we mean that, if we had two links E α and E β (α ≠ β) then rather than looking at the set we look at the order of the random vector directly: Ω ((E α , E α )).
An advantage to this alternative approach is that, while it is possible to measure the co-orders of pairs of links directly, this does not immediately allow us to make use of the great body of work that has been done on the estimation of the memory of a general symbolic sequence.We also show that in general this is a better estimate of the scalar memory than the effective memory.
From S26 we know that the memory of a temporal network G with generating edge processes E α is bounded above by Now, consider two links from this network: E 1 and E 2 .In isolation they form their own temporal network G 1,2 whose scakar memory is bounded above by max{Ω(E 1 ), Since this new temporal network contains only two links, it can only exist in 4 possible states (e 1 t , e 2 t ) ∈ {(0, 0), (0, 1), (1, 0), (1, 1)}.This is not an unreasonably large state space, and so we can estimate the memory directly.All that remains to do is index this state space.To do this let us first detail a more general concept: given two time series {X t }, {Y t } ∈ {0, 1} with realisations x t and y t respectively, define the product time series {Z t } ∈ {0, 3} with realisations z t , as z t = f (x t , y t ) We know that, since f is a bijection, the possible states of Z t are simply labels for the possible states of (e 1 t , e 2 t ).As such Ω(Z t ) = Ω(G 1,2 ), and hence we can get the memory of the two link sub-network directly.What remains to be seen is how this translates to the memory of the temporal network as a whole.
If we now consider each possible pair of links E i t , E j t , and their product sequence , then we can obtain the following result: Theorem 2. For a temporal network G with link processes E α and where the product of pairs ) are given by some bijection f ∶ {0, 1}×{0, 1} → [0, 3], where the effective memory is given by and the pair memory is given by Ω pair (G) = max α,β Ω({Z αβ t }) , the following inequality holds: Proof.Let us prove the second of these inequalities first: Ω pair (G) ≤ Ω eff (G).We know that for any pair of links ) .

(S46)
As required.Now the first of the inequalities.Assume without loss of generality that Ω(G) = p.Then there must exist at least one link E m with p = min n (n ∶ P(e m t e 1,L t−1,t−n ) = P(e m t e 1,L t−1,−∞ )), i.e. there must be at least one link which remembers some part of the network p time steps ago.We then see that there is at least one link E which is remembered at least p time steps ago, i.e.P(e m t e 1,L t−1,t−p ) is a function of E (and other links and time indices).Taking the pair process {Z m t } we must hence have that Ω({Z m t }) ≥ p, since its conditional must be a function containing terms at least p steps into the past.Hence there exists some (α, β) such that Ω({Z αβ t }) ≥ Ω(G), and hence we must have that Ω(G) ≤ Ω pair (G).This concludes our proof.
Notably, we must also have that Ω({Z αβ t }) = Ω({Z βα t }), and so for a network with L links only L(L − 1) 2 pairs of values (α, β) must be checked to find the maximum, and hence the estimated pair memory of the network, meaning that this approach may be faster to implement than finding the co-orders directly in some cases (though, possibly because of the choice of estimator used in this work, this is not the case here).

WHY THE CO-MEMORY MATRIX M AND THE EFFECTIVE MEMORY Ω eff (G) ARE BETTER DEFINED AND MORE USEFUL CONCEPTS THAN THE SCALAR MEMORY Ω(G)
Two toy models and their virtual loops The main reason why it is important to properly define and extract the memory of a temporal network is to unveil its influence on the dynamics of processes occurring over the network.We will show here that the co-orders M αβ = Ω(E α E β ) and the associated effective memory Ω eff (G) of a temporal network G are better quantities to characterize spreading processes occurring over G, than a traditional definition such as Ω(G), which is not only computationally difficult to estimate, but also suffers from fundamental drawbacks.We will illustrate this by means of the following two temporal network models.
Model 1 -The first temporal network we consider is a chain graph with three nodes and two links with binary states 0, 1.The temporal activity of the two links is ruled by the two coupled stochastic processes E 1 = {E 1 t } t=1,2,... and E 2 = {E 2 t } t=1,2,... defined as: This model also induces virtual loops of memory p1 + p2, although, in practice the estimation method may not necessarily be able to identify these virtual loops due to virtual loop decoherence (see text for details). with and p 1 and p 2 two positive integers.This means that, at each time step t, each link will either be sampled from a Bernoulli trial, or it will copy from the past states of the other link.In the latter case, link 1 will copy the state of link 2 at time t − p 1 , i.e. exactly p 1 steps back in the past, while link 2 will copy the state of link 1 at time t − p 2 .The model is illustrated in the left panel of Figure S1.
We can now state and prove the following theorem on the memory Ω(G) of model 1: Proof.By construction we have that P(e Interestingly, when we look at the different entries of the matrix M, we get some intriguing results.In fact, in this case the matrix is two-dimensional and, together with the terms Ω(E 1 E 2 ) = p 1 and Ω(E 2 E 1 ) = p 2 we must also evaluate the diagonal terms Ω(E 1 E 1 ) and Ω(E 2 E 2 ).Let us first consider Ω(E 1 E 1 ).We can treat the coupled system in Eq.S48 by re-writing an expression for E 1 t containing only terms related to link 1 directly.We obtain: which clearly shows that Ω(E 1 E 1 ) = p 1 + p 2 .Similarly we can prove that Ω(E 2 E 2 ) = p 1 + p 2 .These results have been confirmed by numerically simulating the model and measuring the co-orders directly.By construction, this means that the effective memory of this system is Ω eff (G) = p 1 + p 2 , which is possibly up to twice the value of the scalar memory Ω(G).If, without any loss of generality, we set p 1 > p 2 and then fix p 1 and let p 2 vary, we can thereby construct a variety of temporal networks with exactly the same scalar memory Ω(G) = p 1 , but with different co-memory matrices and a tunable effective memory Ω eff (G) = p 1 + p 2 .In this case, the difference between Ω eff (G) and Ω(G) is the result of an induced virtual loop (VL) in the dynamics of the temporal network: when link 1 draws from the memory of link 2, it may effectively be drawing from its own, more distant, past.
Importantly, these contributions to the memory of the network are intrinsically indirect: the source of memory present microscopically at link E 1 t and E 2 t is induced by the coupling of the link activities.Of course, these effects are obtained as we restrict the observation state space to single links.These mechanisms are similar to what happens when we project a low-order Markov chain defined on a given state space onto a smaller dimensional state space.While these effects can be seen as an artefact of such a projection, they turn out to have important consequences on dynamical processes taking place on temporal networks, as we will show in the next section.But first let us introduce a slightly more realistic model of a network with two links, in which the virtual loops have a different structure.
Model 2 -In order to study the effects of virtual loops in a slightly more realistic -but still controlled-setting, we introduce a second toy model in which, given the same values of p 1 and p 2 from our first model, we obtain exactly the same scalar memory Ω(G), but with a different memory kernel.The link evolution of this model is illustrated in the right panel of Figure S1 and is specified by a coupled pair of slightly modified DAR(p) processes: where As in model 1, at each time step, each link in model 2 will either be sampled from a Bernoulli trial, or it will copy from the past states of the other link.However, the first link will not exactly copy the state of link 2 at time t − p 1 , but it will copy one of the previous t − 1, ..., t − p 1 states of the second link selected at random with uniform probability.The scalar memory of model 2 is clearly Ω(G) = max{p 1 , p 2 }, as this is the furthest point into the past of the network that is required to when generating the next step of its evolution.A formal proof of this will be given in theorem 5, as this model is a specific case of the eCDARN(p) model described in section IVA.Again, as in the previous model we find interesting patterns when we look at the different elements of the co-memory matrix M. As in model 1 we have Ω(E 1 E 2 ) = p 1 and Ω(E 2 E 1 ) = p 2 .To evaluate Ω(E 1 E 1 ) we rewrite E 1 t using only terms related to link 1, obtaining: From this we can then see that the dynamics of the first link can equivalently be described by the following process: where Qt ∼ Bernoulli(q 2 ) and Y t ∼ Bernoulli(y).Hence the stochastic process {E 1 t } t=1,2,... is a form of DAR(p) process of order p 1 + p 2 [45].Implicitly, this tells us that Ω(E 1 E 1 ) = p 1 + p 2 , and analogously we can get that Ω(E 2 E 2 ) = p 1 + p 2 .Again these results are confirmed by measuring the co-orders directly through numerical simulations of the model.Summing up, the effective memory of this system is, as expected In exactly the same way as before, and without loss of generality we can fix p 1 > p 2 and let p 2 vary to construct a variety of temporal networks with exactly the same scalar memory Ω(G) = p 1 , but with different co-memory matrices and tunable effective memories Ω eff (G).

Virtual loops of arbitrary order and conditions for VLs to emerge
The temporal network toy models discussed above consist of a linear chain of three nodes and two links.As illustrated in figure S2, a convenient way of representing these temporal networks is by a graphical model with two nodes, describing the two links (stochastic processes) of the temporal networks, connected by directed arrows that express the temporal dependence structure between different network links.This representation is known in the literature as a Bayesian network (BN), and when links indeed describe causal relationships, it is often termed as a Bayesian or causal network.While BNs are usually directed acyclic graphs (DAGs) by definition, it is easy to see that the BN associated to our toy temporal network models is indeed cyclic (CBN), and this is indeed a sufficient condition for virtual loop effects to emerge in memory.Models 1 and 2 are among the simplest temporal networks to provide a nontrivial virtual loop (VL) structure.Because VLs in this case are induced via a causal path involving only a pair of links ( Order 2 Order 3 Order 4 Examples of cyclic Bayesian networks describing the link dependencies from which virtual loops emerge.Virtual loops produce long memory contributions in the diagonal entries of the co-memory matrix.These loops are induced by appropriately coupling the dynamics of the links.Shown are possible configurations where VLs of order 1, 2, 3 and 4 can emerge.The nodes in each diagram denote the stochastic process E α associated to link α = 1, 2, . . ., L of a temporal network.Solid arrows denote temporal dependencies between links, whereas dashed arrows indicated induced VLs. as a VL of order 1.Now, it is easy to construct 'higher order' virtual loops e.g.simply by building temporal networks with an underlying Bayesian ring topology where link 1 dynamically depends on link 2, link 2 dynamically depends on link 3, etc., and link n dynamically depends on link 1. CBNs with VLs of orders 2, 3 and 4 are shown in figure S2 together with one of order 1.In theory, VLs of higher order induce contributions to the co-memory matrix with longer memory, yielding a longer effective memory Ω eff (G).In practice, virtual loops with high memory are difficult to observe due to extremely long time series being required to capture the effect.More importantly, these VLs are stable as long as the temporal dependencies between links are fine tuned, and quickly dissipate otherwise, meaning that the practical relevance of VLs in the context of real-world temporal networks is less clear.In the next subsection, by comparing models 1 and 2 in more detail, we will give a closer look to this mechanism of "virtual loop decoherence".
Before that, we should also highlight that the underlying BN needs not to be cyclic for virtual loops to emerge: this is a sufficient, not necessary condition.Indeed, when links are also auto-correlated (meaning that on top of the link temporal dependencies, we prescribe that the link has also an internal dynamics which is auto-correlated), then virtual loops can also emerge even in the case that the underlying BN is a priori acyclic.The reason is because in that case, the interplay between the auto and cross-correlated dynamics induce virtual Bayesian links.This can be better explained with an example.Suppose that link α has an internal auto-correlated activity, and on top of that, also depends on the past of link β.The interplay between the two dynamics is captured in the auto-correlated nature of α, which displays signs of memory which are a mix of the ones obtained from its own internal dynamics and from the memory displayed by β, hence the actual memory of α (as measured by the co-order Ω(E α E α )) is in general larger than the memory of the internal dynamics of α.This effect is indistinguishable from a virtual loop, hence we can confidently label it in a similar way.Accordingly, the following table summarises when virtual loops emerge in the system.

Virtual loops in other areas of physics and beyond
The virtual loops discussed above are indeed particular cases of "causal loops" and therefore share some similarities with important concepts arising at the heart of a number of key challenges in several areas of modern science.For instance, when finding the marginal distributions of a collection of random variables it is common to use the message passing, or belief propagation, algorithm [46,47].Such methods are important in the study of Gaussian graphical models in machine learning [48], signal processing [49], and a plethora of other such inference problems [50][51][52].These message passing algorithms also have uses in the context of statistical physics, where they can be related to the Bethe-Peierls approximation (or the replica symmetric cavity method in the context of spin glasses), and in turn the Thouless, Anderson, Palmer equations for local magnetisations [53][54][55][56].However, this approach becomes inexact precisely when the Bayesian graphs that underly these problems have loops.Because of this the study of how to best overcome this problem has been seen as deeply important [57][58][59], and indeed has been a subject of recent attention [60].In a different vein, causal loops have been a subject of interest in the study of Feynman diagrams."One-loop" diagrams, in which there is a single causal loop, have historically presented challenges to study [61,62].However when these challenges have been overcome they have helped to explain such phenomena as the Casimir effect [63,64], Hawking radiation [65], and the Lamb shift [66].

The phenomenon of virtual loop decoherence
The difference between the two toy models relies in the way in which the state of the system depends from the past states.Even though in theory the scalar memory Ω(G), the effective memory Ω eff (G) and the co-memory matrices are identical in the two models, in practice the estimation of these quantities is different.In our first model, the extent of the memory is localised: each link will, when referencing from the past, always look at the state of the other link a fixed number of time steps into the past (p 1 or p 2 ).In our second model each link will, when referencing from the past, uniformly pick from among the past p 1 (or p 2 ) states of the other link.In other words, links in the second model will seldom copy the past state exactly p 1 or p 2 time steps ago.While theoretically the local co-memory of each link is still p 1 + p 2 due to the presence of virtual loops, whether we can accurately estimate this quantity is less obvious.In theory we would require a very large observed time series to estimate the theoretical effective memory p 1 + p 2 consistently.While this is a finite size effect, it is however important with regards to processes running on top of these networks, and therefore will affect e.g. the behaviour of spreading processes we run on them, as we will show in the next section.It is therefore in this second model that we would expect to observe what we term "virtual loop decoherence": each link will not always utilise the full extent of its potential memory, in that the virtual loops will not always reference a point at time t− (p 1 + p 2 ) in their past history.Because of this we expect the influence of the virtual loops to be limited in comparison to our first toy model, and hence the theoretical co-order Ω(E 1 t E 1 t ) = p 1 + p 2 will be difficult to detect.The effect of this is that the estimated values of the diagonal terms in the co-memory matrix will in some cases be smaller than the effective memory of the network, or in more extreme cases, less than Ω(G).In either of these cases virtual loops will not contribute to Ω eff (G) (virtual loop decoherence).As a byproduct, decoherence would cause the estimation of Ω eff (G) to approach the scalar memory Ω(G), meaning that in those practical scenarios where VL decoherence does emerge, then the scalar memory Ω(G) might after all be a good approximation to the effective memory operating underneath.In other words, conceptually the correct scalar quantity under study is Ω eff (G) and not Ω(G), but when the system is free from VLs or these decohere, then both quantities tend to be close.
To analyse this we generate 10 4 instances of both models with randomly selected values of p 1 and p 2 and a range of values for the memory strength q and the link probability y, along with the number of time steps each network is generated for.For each instance we estimate the effective memory Ω eff (G), and record a "hit" if this estimate is precisely p 1 + p 2 .We plot in Fig. S3 this hit rate, as a function of the parameters q and y.As expected, we see that in all but the few cases where the memory in each model is removed (q ≈ 0), that the hit rate for our first model is markedly higher than for the second.In particular, the hit rate of model 1 is consistently 100% for a large range of the model parameters.On the other hand, the hit rate for model 2 is typically smaller, hovering around 50% for the same parameter range, suggesting that the virtual loops which are FIG.S3.Hit rates and hisrograms for both toy models.(Left and middle panels) Hit rate measuring the fraction of sampled toy model networks for which the estimation of Ω eff (G) coincides with the local co-memory induced by the virtual loops, as given by Ω(E 1 t E 1 t ) = p1 + p2, for model 1 (left panel) and model 2 (middle panel).For each realisation of the temporal network, we sample p1, p2 uniformly from a range 1, 2, . . ., 10.Both models depend on parameters q (memory length) and y (link probability, see the text for an explanation of these parameters).The red curve displays the dependence of the hit rate on q for a fixed y = 0.25, whereas blue curve displays the dependence on y for a fixed q = 0.9.In every case we let the network evolve for T = 10 5 time steps, and every point is the average of 10 4 realisations (each realisation having a different p1, p2).In model 1 the virtual loops govern Ω eff (G), whereas in model 2 virtual loop decoherence sets in and the estimated Ω eff (G) is no longer systematically governed by the virtual loops.(Right panel) Dispersion histogram where we count the frequency of each estimated Ω eff (G) (averaged over 10 4 realisations of the temporal network).We compare this quantity to p1 + p2 by measuring the normalised dispersion d = [Ω eff (G) − (p1 + p2)] 2 for models 1 and 2, with fixed parameters q = 0.9 and y = 0.25.In model 1 the dispersion is systematically zero (a result of a 100% hit rate), i.e. virtual loops systematically govern the estimation of the effective memory.In model 2 virtual loops are decoherent and the estimated value of the effective memory tends to be smaller than p1 + p2 accordingly.We do not however reach the asymptotic case d = −1 that that we associate with the effective memory matching the scalar memory, and hence virtual loops not contributing.
present only cause the estimated co-memory to be p 1 + p 2 in at most half of the sampled cases, while for the rest these loops are decoherent.This effectively causes the estimation of Ω eff (G) to approach the scalar memory Ω(G) = p 1 .
To further complement this analysis, we have computed, for each temporal network realisation of models 1 and 2, the normalised frequency histogram of the dispersion d = [Ω eff (G) − (p 1 + p 2 )] p 2 .For a given set of realisations of each temporal network model, d accounts for how well the estimated effective memory approximates the theoretical one (p 1 + p 2 , induced by the virtual loop), normalized over p 2 .Finding d = 0 means that the estimation matches the theory and virtual loops govern the effective memory.For d ≠ 0, virtual loop decoherence sets in.Typically, we expect that in this (and indeed most) scenarios we will observe d < 0, meaning that the estimated memory contribution of any virtual loops is smaller than p 1 + p 2 .When this contribution gets smaller, d approaches its minimum value d = −1, the case associated with virtual loops being completely decoherent and the effective memory coinciding with the scalar memory Ω(G) = p 1 .In the right panel of Figure S3 we depict the histograms of d for an ensemble of 10 4 realisations of models 1 and 2, with parameters q = 0.9, y = 0.25 and with p 1 and p 2 samples uniformly randomly from the range 1, .., 5 (inclusive).We systematically find d = 0 for model 1, as expected given that the hit rate is 100% for this model.In the case of model 2, we find that the histogram is more scattered, favouring situations with d < 0. This means that the memory contribution to the effective memory of the virtual loops is decreased, and accordingly Ω eff (G) gets closer to Ω(G), although this trend is never reached as VLs never completely decohere (see however theorem 6 for a rigorous proof that full virtual loop decoherence can take place in some systems when the size of the temporal network is large enough).Interestingly, we also find that in a small percentage of the ensemble we find d = 1.This apparent paradox can be explained by exploring the EDC curves for these cases (see Fig. S4).In every case where we find d = 1, the assignments happen to be p 1 = 2, p 2 = 1, and while the estimator selects Ω(E 1 t E 1 t ) = 4 instead of 3, notice that the EDC curve is essentially flat at that neighbourhood.
Epidemic Spreading defined on top of models 1 and 2 Models 1 and 2 above provide examples where the scalar memory Ω(G) of a temporal network is different from the effective memory Ω eff (G), due to the emergence of virtual loops that affect the local memory structure of the for one realisation of model 1 (left column) and two realisations of model 2 (middle and right columns).In every case the temporal networks have 10 5 time steps.EDC estimates that the memory corresponds to the minimum of log(EDC(n)).For model 1, the ACF clearly shows a peak at the correct memory p1 + p2 and a succession of additional harmonics with smaller amplitude, and the EDC curve clearly captures the correct memory order.For model 2 the situation is less obvious: the ACF cannot be so easily interpreted.The two examples depict one where the theoretical memory p1 +p2 = 5 is identified, and another where the estimation is different (note in this case that the EDC curve is relatively flat around p1 + p2).
network.Now the question is, to what extent do the virtual loops have a truly measurable effect and are therefore relevant in practice?Here, we consider a spreading process over a temporal network.We show that the dynamics of this spreading process are indeed highly sensitive to the shape of memory, and in the event a representative scalar quantity had to be used, we demonstrate accordingly that Ω eff (G) is better suited to quantifying the real effects of memory than Ω(G), this last quantity being blind to any virtual loop contribution.
We have implemented a Susceptible-Infected (SI) model for spreading dynamics on our two network toy models.In the SI model a node can be in one of two states: Infected (I) and Susceptible (S).At each time step an infected node has a probability λ of passing an infection to any other node that it is connected to via a link.Once a node is infected, it cannot become susceptible again, the change is permanent.In our set-up we start the infection at node 1.The infection will then be transmitted over link 1 at time t with a probability λ if E 1 t = 1, while it will not be transmitted if E 1 t = 0. Hence node 1 can infect node 2, then from node 2 the infection can cross link 2 and finally infect the third node.To quantify the speed of the spreading process we will measure the expected time taken to infect node 3, starting from node 1, and call this the spreading time.This quantity can be evaluated either via Monte Carlo simulations (averaging over several realisations of the process), and also analytically.As we will discuss below, both numerical simulations and analytical results reveal that the expected spreading time does indeed depend on the virtual loops, i.e. on the precise structure of the co-memory matrix.Furthermore, we will show that these effects are well accounted for by the effective memory Ω eff (G), while conversely they are not captured by the network memory Ω(G).

Analytical solutions of the SI dynamics on network models 1 and 2
We here present the analytical derivation of the expected spreading time for an SI infection in our toy models.We will explicitly consider model 2.However, the same approach, with small differences, which will be noted below, also works for model 1.The stochastic processes E 1 = {E 1 t } t=1,2,... and E 2 = {E 2 t } t=1,2,... are higher-order Markov chains in the state-space {0, 1}.They can then be transformed into first order Markov chains in an expanded state space.The temporal network G = {E 1 , E 2 } = {E 1  t , E 2 t } t=1,2,... has realisation (e 1 t , e 2 t ) at time t.Since each link contains memory of the other, let us build this into a pair of "state variables" α 1 and α 2 , such that at a time t α 1 = {e 1  t , e 1 t−1 , ..., e 1 t−(p2−1) } and similarly for α 2 .Any pair (α 1 , α 2 ) then captures all of the useful past states of the network.Let the set of all such realisations be denoted by S, and the sets of possible values for α 1 and α 2 be S 1 and S 2 respectively.Link 1 has memory of the last p 1 steps of link 2, and link 2 has memory of the last p 2 steps of link 1, hence, since the link has two possible states at any one time, S 1 = 2 p2 and S 2 = 2 p1 .For this to be useful we must additionally introduce some concept of ordering to the values of α 1 and α 2 by means of a labelling function.The simplest form of this function, which we will here use, is and similarly for l(α 2 ).This is essentially taking the set of 0's and 1's that represent the link histories contained in α and converting them to a decimal number as if they were in binary.We will implicitly assume that wherever we use α, or any state in S 1 or S 2 , we are referring to the label l(α).
We can use this to describe the probabilistic evolution of each link over time.For initial states (α 1 , α 2 ) ∈ S and target state β 1 , the probability P(α 1 → β 1 α 1 , α 2 ) that link 1 goes from state α 1 to state β 1 given α 1 and β 1 defines a "transition tensor" (as opposed to the traditional transition matrix) in the following way: where h is the Hamming weight function, which counts the number of 1's in the binary representation of its argument.Note that in the case of model 1 this would be instead where s p (α) is the value of the p th most significant entry in the binary representation of α, which is the state of the link at p time steps in the past.It is now a simple task to incorporate the spreading an of infection across a link: we simply associate to each link another state ι which governs the infection.ι 1 = 1 if link 1 has passed an infection, and 0 if it has not.In this way the two pairs (α 1 , ι 1 ), (α 2 , ι 2 ) completely describe the state of both the links and the infection passage over the system.What remains is to find the probabilities of an infection passing over each link given the states of the links.That is . Denoting by H the set of link states where a link is present, i.e.H ∶= {α i ∶ l(α i ) ≥ 2 pj −1 j ≠ i}, and L ῑi = λδ(ῑ i , 1) + (1 − λ)δ(ῑ i , 0), we can then write: where χ H (α) is the indicator function for α in set H. Using the transition tensor method outlined in [23] we can then find the expected spreading times τ α 1 ,α 2 ,ι 1 ,ι 2 for an infection given a starting state α 1 , α 2 , ι 1 , ι 2 as the minimal solution to the following set of linear equations: This can then be averaged over some set of initial conditions to give the expected spreading time.In our case we will take the initial conditions to be the steady state of the network.In practice this can be viewed as the state of the network after a large number of steps.For numerical simulations we always allow the network to evolve to equilibrium before any spreading process is started, whereas for analytical calculations we take the steady state to be the left eigenvalue of the transition matrix for the system corresponding to eigenvalue 1.
To summarise, we fix the scalar memory Ω(G) of the temporal network in our toy models to be p by fixing p 1 = p (implicitly then p 1 ≥ p 2 ).We then allow the shape of the memory in the network to vary by changing the value of p 2 , and measure the expected time taken for an infection to spread over the three nodes in the network as a function of p 2 .This has been done for a number of values of p, for both model 1 and model 2. The results are reported in Fig. S5 where we plot the spreading times of the SI epidemics as a function of ∆ = Ω eff (G) − Ω(G).This value ∆ is used so that curves are aligned on the x-axis for any value of p 1 .Analytical results are in excellent agreement with Monte Carlo simulations and show that the quantity Ω eff (G) is able to describe well differences in the relevant quantities that describe the dynamics of the SI process.Conversely Ω(G) is not able to account for the different values of spreading times obtained in networks with the same p 1 and different p 2 .Indeed, in the event Ω(G) was well suited, then spreading rates should remain constant, as Ω(G) is actually constant for all temporal networks corresponding to each curve.Results indicate that the spreading time is actually not constant, and this variation correlates with the effective memory Ω eff (G).This demonstrates that the shape of the memory, as defined by the co-order matrix M -and hence the effective memory-are far better at characterising the spreading rate than the scalar memory of the temporal network.

Inter-event time statistics
With our toy models we have demonstrated that the scalar memory of a temporal network Ω(G) is not necessarily the right quantity to characterise the way that memory influences the spread of an infection over the network.In order to investigate further this situation we can take a different, complementary approach.It is well known that a memoryless (Poisson) stochastic process has an exponential inter-event time distribution.Processes with memory must have inter-event times which deviate from an exponential distribution.A so-called burstiness parameter has been proposed in [67] to quantify such deviations.The burstiness parameter B of a the time series is defined as: where ⟨τ ⟩ and σ are respectively mean and standard deviation of the inter-event times.The expression above is equal to zero when the time series corresponds to a (memoryless) Poisson process, since for an exponential distribution the mean and standard deviation coincide.When the time series is regular, σ = 0 and thus B = −1, so values of B in the range (0, −1) denote more regular behavior than Poisson.On the other end, for B > 0 the fluctuations in the inter-event times are larger than Poisson, denoting an increase of burstiness.Finally, in the limit of large σ, e.g. when the interevent time series is power law distributed, the value of B tends to 1.
Given our previous example of a spreading process, it is not clear that burstiness of the link evolution processes should a-priori be associated with the scalar memory Ω(G) of the temporal network.Indeed we show now that this is not the case.If the behaviour of a link in a network is bursty, then that link displays memory, and if it is not bursty then it does not have memory.Hence, we argue that for a measure of the memory of a network to accurately capture the behaviour of inter event time statistics it must reflect this: if links are not bursty then the network should not have memory, and if they are then the network should have memory.We explore this in the context of our two toy models.
We consider model 1 with two parameter settings, namely: (i) p 1 = 1 and p 2 = 0 and (ii) p 1 = p 2 = 1.Notice that for both choices of parameters, models 1 and 2 are equivalent.
In case (i) it is easy to see that the scalar memory and effective memory coincide, Ω(G) = Ω eff (G) = 1, and so we expect to see little or no burstiness, whereas in case (ii) the temporal network is still first-order Markov, i.e.Ω(G) = 1, but the effective memory is larger, Ω eff (G) = 2.In both cases, we have measured the burstiness parameter B as a function of the memory strength q of a link.The results reported in Fig. S6.show that case (i) is non-bursty as expected, while FIG.S6.Burstiness of link dynamics in model 1 as a function of the memory strength q.Two different cases have been considered (i) p1 = 1, p2 = 0, for which Ω(G) = Ω eff (G) = 1), where B ≈ 0 for the whole range, and (ii) p1 = p2 = 1), for which we still have Ω(G) = 1 but the effective memory is larger Ω eff (G) = 2, where for a large range of values of q we find B > 0 which denotes burstiness in the link activity, hence highlighting the presence of memory in the internal activity of the links, a signature which is captured by Ω eff (G) but not by Ω(G), as the effective memory places the networks in a regime where we can observe burstiness.Each case is simulated for 10 8 time steps.
case (ii) displays varying degrees of burstiness, and therefore denoting the presence of a non-trivial memory shape, even if the scalar memory is still Markov.This is another indication that the memory Ω(G) of the network does not provide a good description of dynamical dependencies and memories at the level of links and link pairs.Which links are bursty, this being an indication of the presence of memory, is far better captured by the effective memory.

VALIDATING THE FRAMEWORK ON SYNTHETIC TEMPORAL NETWORKS
In this section we provide more details on the models for generating temporal networks with specific memory characteristics that we have discussed in the main text.In particular, we have considered four different types of generators of synthetic temporal networks: the so-called "Discrete Auto-Regressive Network of order p" (DARN(p)), extended DARN(p), "Correlated Discrete Auto-Regressive Network of order p" (CDARN(p)) and extended CDARN(p) models.We will first introduce the models, find their scalar memories, and then present a full account of their memory estimation.We will then discuss the presence of virtual loops, and show that in some cases these loops become completely decoherent for large network sizes.
Model definitions and ground truth proofs for Ω(G) All of the models here are derived from the so called "Discrete Auto-Regressive process of order p", or DAR(p) process, introduced by Jacobs and Lewis [45].This time series incorporates a dependence on past states with random generation of new states in a simple way.In words, at each step of the process one first decides if the new state will be random or drawn from memory, if it is random then we decide what it will be, if not then we copy a single value chosen among the past p states.Formally Here Q t ∈ {0, 1} and Z t ∈ {1, ..., p}.For our purposes we only consider X t ∈ {0, 1} and so we must fix Y t ∈ {0, 1}.This gives us a random process with memory p that can be extended to generate temporal networks.
• DARN(p): This model, introduced in [23], represents the simplest possible extension of the DAR(p) process to a temporal network: for a network with N nodes we assign to each of the L = N (N − 1) 2 possible links an independent DAR(p) process.For each link α = 1, . . ., L with probability q the state of the link is copied from its past, sampling uniformly at random from its own history up to p steps in the past.With probability 1 − q the link state will be drawn at random following a Bernoulli process with probability y.Summing up, the model depends on three parameters {p, q, y}.The first parameter y controls the density of the network.The second, q, tunes the strength of the memory term in the process with respect to the memoryless term.The final parameter, p, controls the length of the memory, which can be thought of as the number of time steps before the autocorrelation function decays exponentially [23].Since each link in this model is independent we can make use of corollary 1, and see that we must have Ω(G) = p.
• eDARN(p): The extension to the DARN(p) model, as invented for this work, is then the case where each link is allowed to have a different memory length, rather than there being a single fixed value of p. Hence we specify that each link is governed by an independent DAR(p) process but the value of p is allowed to vary for each link.This gives us a way of generating temporal networks with independent links, but with varying link memories.This model therefore depends on parameters {ρ(p), q, y}, where ρ(p) is the distribution of memory lengths from which one samples the memory length of each link.If we define p as the maximum value for the memory which is drawn from the distribution ρ(p), then, again, since links are independent in this model, we must have Ω(G) = p by corollary 1.
• CDARN(p): This model, as introduced in [37], represents a simple way of extending the DARN(p) model to include correlations between the dynamics of links.Similarly to the DARN(p) model, however when the link is to copy its state from memory (with probability q), it does not necessarily copy this from its own past history: with probability 1 − c it will copy from its self, and with probability c it will choose one of the other L − 1 links uniformly at random and copy that link state uniformly at random from the near past up to p steps in the past.
Clearly the DARN(p) model is the special case of the CDARN(p) model in which c = 0.This model depends on parameters {p, q, y, c}.Formally, we can define the CDARN(p) model in terms of random variables as follows: the time varying adjacency matrix A t = {a ij t } is given by where ) and M ij t randomly picks among the available links in the network, so that for a link process a ij t , P M ij t = (i, j) = 1 − c, and for (i, j) ≠ (k, l) and L = N (N − 1) 2. We now state and prove a theorem on the scalar memory of this temporal network process: Theorem 4. Let G be a temporal network generated by CDARN(p).Then Ω(G) = p.
Proof.To prove that the scalar memory of a CDARN(p) network is indeed p, let us consider the conditional probability of observing a link directly.First, fix link (i, j), and let us associate with each link a linear label (i, j) → ∈ 1, ..., L, then the conditional probability of observing a link at time t given the past p steps is given by Since, by construction, and the dynamics of each link are symmetric under any relabelling of links in the temporal network, it is enough for us to consider a single link.Now consider the same conditional, but with some number extra past steps: P(a t = 1 A t−1 , ..., A t−(p+ ) ).Since the memory term t − Z t will never take the values t − (p + 1) to t − (p + ), the conditional in Eq.S64 will be unchanged, hence Ω(G) ≤ p.Now we look at what happens when we remove some number d < p of past state from the conditional, defining δ = p − d: From this not only do we see that P(a t = 1 A t−1 , ..., A t−δ ) ≠ P(a t = 1 A t−1 , ..., A t−p ), for any such δ, assuming that q, y ≠ 1 or 0, in which case there is no memory.Hence we must have Ω(G) ≥ p.The only remaining option then is Ω(G) = p, concluding our proof.
• eCDARN(p): The extension to the CDARN(p) model, as invented for this work, is one in which the memory kernel from which we sample when the link copies from its own history can be different to the memory kernel when the link copies from another link.That is, not only are links allowed to have different memory lengths, but also the memory lengths used when a link refers to its self are allowed to be different to the memory length used when referring to other links.In this way the random variable Z ij t now becomes dependent on the value of M ij t .If for a link (i, j) the value , for two, possibly different, values of p ij self and p ij other .The following statement can now be proved: Theorem 5. Let G be generated by the eCDARN(p) above.Then Ω(G) = max ij max(p ij self , p ij other ).
Proof.As before, we first write down the conditional probability: Again, by construction in this model we have ) for all and p = min n (P(a t A t−1 , ..., A t−n ) = P(a t A t−1 , ..., A t−∞ )) for at least one value of .Hence, if for each we have min n (P(a All that then remains to prove is that min n (P(a t A t−1 , ..., A t−n ) = P(a t A t−1 , ..., A t−∞ )) = max(p self , p other ).This is a trivial extension of the proof for the CDARN(p) model, but with the conditional probability now being of the form in Eq.S67.Hence we must have that Ω(G) = max(p self , p other ), as required.
Finally, each model in practice also depends on an additional variable: the length of time series T for which it is sampled.While this does not influence the dynamics of the model it will influence any estimated value for the memory, and so we consider it here.Altogether, these models give us a wide range of test cases with a number of features that we might expect from real world networks.
Remark 1.It's important to note that, in theory, virtual loops cannot emerge in the DARN(p) or the eDARN(p) models, but in principle should emerge in the CDARN(p) and eCDARN(p) models as in these latter cases we are probabilistically coupling links, and this coupling can induce casual loops (possibly of different orders) among sets of links.This means that we expect Ω eff (G) to coincide with Ω(G) for the DARN(p) or the eDARN(p) models, but we should find a difference in the CDARN(p) and eCDARN(p) models.In the next subsection we will investigate the extent of that mismatch, and the role played by virtual loop decoherence.
Remark 2. In general, when a given link α ∈ [1, 2, . . ., L] samples its future state from the past of a different link in an eCARN(p) model, one can specify which is the set of links from which α will sample from.As discussed in section IIIb, a natural way to encode this is by building up a Bayesian causal graph of L nodes, where each of the nodes corresponds to a link in the original temporal network.The L × L adjacency matrix C = {c αβ } is such that c αβ = 1 if β is in the set from which α can sample its future from, and 0 if β is not in this set.Different Bayesian causal graphs can thus be specified as to describe the set of links' past from which a given link copy its future state.In

Estimator accuracy on synthetic networks
For each of the four synthetic models we have described, we have studied two key quantities that help us to compare the estimated values of both Ω eff and Ω pair with the analytical value of Ω(G).These are the hit rate and the average distance, as earlier presented in Fig. S3.The hit rate gives the probability that a given estimate for the value Ω eff (or Ω pair ) is precisely the scalar memory of the network, while the average distance gives the value of either Ω eff (G) − Ω(G) or Ω pair (G) − Ω(G) , averaged over several realisations of the network model.In each case the networks generated have a fixed number of nodes N = 10.We then allow in turn one of the parameters q, y, T (and, where applicable c) to vary, while fixing the others to the following values: q = 0.9, y = 0.1, c = 0.1, T = 10 6 .For each set of parameters 10 3 realisations of the model are generated, each with randomly chosen values for the memory lengths (p for DARN(p) and CDARN(p), p ij for eDARN(p) and (p ij self , p ij other ) for eCDARN(p)) from the range 1, ..., 10.We then plot the hit rates and average distances for each model as a function of each free parameter in Fig. S8.FIG.S8.Hit rates to scalar memory Ω(G) for synthetic models.In each case we compute the percentage of the times within an ensemble of 10 3 realisations that the estimated effective memory Ω eff (G) and estimated pair memory Ωpair(G) exactly match the scalar memory Ω(G) (the memory parameter p is randomly sampled from Uniform{1,...,10} for each realisation).Models depend on parameters q, y and (where applicable) c, so each curve scans hit rates for the whole range of a given parameter and fix the values of the other parameters to q = 0.9, y = 0.1, c = 0.1 (in every case, time series size is T = 10 6 ).In DARN(p) and eDARN(p) models where virtual loops are by construction absent, Ω eff (G) = Ωpair(G) and their estimation typically coincide with Ω(G) for a large range of model parameters, as expected.In CDARN(p) and eCDARN(p) models, (probabilistic) virtual loops are expected to kick in, inducing a mismatch between Ω eff (G) and Ω(G) (the mismatch is notably smaller for Ωpair(G) as this quantity disregards diagonal entries of the co-memory matrix).
Let us discuss these results in detail.Considering the DARN(p) and eDARN(p) models: • First, we observe that both Ω eff (G) and Ω pair (G) coincide with Ω(G) and provide a very good estimates for a wide range of parameters.Indeed they are both 100% accurate when q = 0.9 and y is between ∼ 0.2 and ∼ 0.8.
• Notice that the ranges where the performance is worse are actually expected: when y is either very small or very large, then each link series E i t will be dominated by either 0 or 1, in the limit of all 1's or 0's we would observe no memory as the system is deterministic, and indeed close to this we would expect it to be hard to observe any memory.This manifests as a sharp drop in both hit rate and average distance.
• When q is small we would also expect memory to be harder to detect, as it is used less often, and hence any correlations with the past are less significant.However, the increase in hit rate is a more smooth function of q, as is the average distance.
• The two memories (Ω eff (G) and Ω pair (G)) also perform identically for both models.This is again expected: by construction there are no virtual loops of any kind in these network models, and so Ω eff (G) = Ω pair (G).
In the case of the CDARN(p) and eCDARN(p) models: • We observe that Ω eff (G) ≠ Ω pair (G).This is again expected: note that by construction the CDARN(p) and eCDARN(p) models should have a considerable number of virtual loops, as induced by the cross-correlations that are present between each link, and indeed we would expect these virtual loops to be of a variety of orders.By definition, Ω pair (G) cannot capture the effects of virtual loops of order greater than one, hence the mismatch between these two measures.
• Interestingly, we also observe that under a range of parameters, both Ω eff and Ω pair remain quite close to Ω(G), manifesting a strong virtual loop decoherence in these cases.
• Similarly to before, there are also ranges of both y and q where 'performance' is worse, and this is also expected.
• We also see that as c → 1 the hit rate significantly drops.This can be explained by an increase in the significance of virtual loops.When c = 0 then links are independent, and so there are no virtual loops to influence Ω eff (G).
As we increase c we allow for virtual loops, however they will be decoherent, and so the hit rate will still initially be high.As c approaches 1, we have made the influence of virtual loops as strong as possible, and so the hit rate will be at its lowest.
Virtual loops in the synthetic networks Let us further emphasise here the role played by virtual loops in the four synthetic models discussed above.The hit rate and average distance between Ω(G) and Ω eff (G) demonstrate a number of features of the effects of virtual loops on memory in temporal networks.For both the DARN(p) and eDARN(p) models, where loops are not present, the values of both Ω eff (G) and Ω pair (G) are almost identical, and for most parameters give a very good approximation of Ω(G).For the CDARN(p) and eCDARN(p) models this is not the case.The two correlated models should clearly display virtual loops, as they are inherent in the structure of the models in the same way as the toy models described earlier.Indeed, we know that when calculating Ω pair (G) then a number of the virtual loops will be accounted for, and so the significant differences between the hit rates and distances of Ω eff (G) and Ω pair (G) for both the CDARN(p) and eCDARN(p) models can confidently be ascribed to the presence of these loops.It is unexpected however that Ω pair (G) accounts for the full extend of these loops; virtual loops of varying orders should be present in this system.Furthermore, in both cases we observe that Ω eff (G) and Ω pair (G) are reasonable estimates for the scalar memory Ω(G) for a range of parameter values.This can only be the case if, in the same way as for our toy models, these virtual loops display decoherence, and are hence the resulting increase in pair and effective memory is masked.
A theorem on full virtual loop decoherence effect for large network size The CDARN(p) model, as we have presented here, maintains a high degree of symmetry: each link has the same memory strength q, link density y and memory length p.Each link also, with the same probability c, can look into the past p states of every other link, and will do so uniformly.In summary, the network would behave in exactly the same way if the links were swapped (in general if there were any isomorphism).One might be tempted to think that this "synchronicity" might bring out memory and virtual loops in an extreme way, since virtual loops of every order should be present in the system.Indeed we see that in our relatively small synthetic networks, when c is large, and hence there is a strong reliance on virtual loops, the scalar and effective memory of the network seldom coincide.When these networks get larger however, we observe something unexpected: all virtual loops become increasingly decoherent.As the number of links grows larger, so does the pool of links from which a past state can be drawn, because of this we find that the memory in virtual loops "averages out", a concept which we formally detail in the theorem below.Theorem 6.Consider a CDARN(p) network with L links, memory strength q, link density y and memory length p.The conditional probability of a link occurring at time t, given the past p states of the network is as follows: where φ self and φ other represent the contributions to the conditional from the past p states of the link and every other link respectively.As L → ∞, φ other tends to a constant, and hence the link has no memory of the past states of any other link.
Proof.The conditional probability of observing a link at time t given the past p states of the network is as follows: We can therefor see that our memory kernels φ self and φ other are given as: We need only focus on φ other .First, let us consider the average value The CDARN(p) network is taken to be in a stationary state, and so the symmetry of the links under isomorphism guarantees us that P (a ′ t−k ) is the same for each link ′ and for each time t − k.Hence we can write P (a ′ t−k ) = ā for some constant ā.Then we must have, for any of the L − 1 possible values of ′ , Now, φ other can be re-written as follows: Then, by the law of large numbers we can express this in terms of the sample average: Hence φ other → ā as L → ∞.Since there are no terms containing links other than in φ self , then we can conclude that the conditional probability is such that, in the same limit L → ∞, and so any memory of other links is lost.
To summarise, in this section we have validated that the co-memory matrix correctly displays the memory of synthetic temporal networks.The maximum of this co-memory matrix, defined as the effective network memory Ω eff (G) coincides with the scalar memory Ω(G), and so is a good estimate of it, when virtual loops are not present.In the cases where these loops are present, the two quantities in principle differ, and we know that one should consider Ω eff (G) rather than Ω(G) if in this case.Interestingly, virtual loop decoherence is also clearly present in these networks.Moreover, when the network size (number of links) gets larger, theorem 6 suggests that we asymptotically should expect full virtual loop decoherence.This is particularly important when considering real-world temporal networks.We can hence conclude that while virtual loops play a role in the memory of these networks, in practice virtual loop decoherence might limit that role, and as such we would expect that Ω eff (G) does not substantially differ from Ω(G) in real, complex temporal networks.

APPLICATIONS TO REAL-WORLD TEMPORAL NETWORKS Processing empirical network data
We have here presented temporal networks taken from 6 data sets, and at two different temporal resolutions.A varying amount of work has to be done to each of the used data sets before they can be fed into our memory estimator as a discrete time temporal network.As part of the software developed for this research, we have developed a method to convert a time stamped edge list of the form (v 1 , v 2 , t) into a set of time series X v1,v2 t which represents the edge process connecting nodes v 1 and v 2 .Once we have this time series then we can make use of our memory estimators.Hence the aim is now to take each data set and convert it into this time stamped edge list form.We will here give a brief overview of each dataset and the steps required to process it.
The first two datasets correspond to a type of network which we can label as online social communication networks: • Text message interactions between college students (CM) [34].This data represents messages sent between (anonymised) student users of an online communication platform at the University of California, Irvine, over a period of 7 months.Processing this is a simple task as it comes in what is almost a suitable format; the raw data is given as triplets (I.D 1 , I.D 2 , t) for each pair of individuals I.D 1 and I.D 2 interacting at time t measured in seconds from some starting value.Here we simply index the individuals from 0 to N and shift the t values so that the first link in the data occurs at time 0.
• Email communications (EM) data set [33].This data set covers internal e-mail communications between employees of a mid-sized manufacturing company over a period of nine months.The data is (ignoring other irrelevant data) in the form (I.D 1 , I.D 2 , DT ) where now DT is in a date-time format to a resolution of seconds.Hence in constructing our time stamped edge list we again index the I.D values appropriately, and again convert the date-time to the number of seconds elapsed since the first link in the data set occurred.
The third dataset is a social interaction or contact network (RM), that is to say, it is also a social network like the first two cases, but it is an 'offline' one and can also be seen as a mobility network.This data is taken from the "Reality Mining" data set [35], collected from the interaction of 94 students at MIT over 8 months.The data we have used here was taken from bluetooth interactions between the phones carried by the subjects of the study.Each interaction indicates that the two individuals were within at least 5-10 meters of each other.Each device scans for other devices in its proximity every 5 minutes, however because a pair of devices can recognise each other independently this can produce two interactions every 5 minutes with an average inter interaction duration of 2.5 minutes.The data is (ignoring other irrelevant data) in the form (I.D 1 , I.D 2 , DT ) where now DT is in a date-time format to a resolution of seconds.Hence in constructing our time stamped edge list we again index the I.D values appropriately, and again convert the date-time to the number of seconds elapsed since the first link in the data set occurred.
The three remaining datasets correspond to what we could label as engineered transportation networks designed for the public transport in Paris [36] via bus (PB), train (PT) and underground (PU), i.e. these are engineered, offline infrastructure networks.These are comprised of a weeks worth of records for the movements of public transport systems from stop to stop.This data is structured in a different way; each set contains the times taken for each journey occurring between any two stops on the bus (PB), train (PT) and underground (PU) public transport systems in Paris.This data (again ignoring irrelevant data) is of the form (I.D 1 , I.D 2 , t start , t stop ), where the values I.D 1 and I.D 2 represent the origin an destination stops respectively, and t start and t stop represent the date-times at which the origin was left and the destination was reached respectively, to a resolution of one second.
We here index the origin and destinations as expected, and convert the date-time values to the number of seconds elapsed since the first date-time value in the data set.We then take the link between the origin and destination to be present in every second which elapses between t start and t stop .If two journeys overlap then the link is kept active until the latter of the two journeys is completed.
Coarse-graining at different resolution timescales -Once we have a time stamped edge list for a data set we filter out the 100 links which occur the most.We then produce two temporal networks by, for each link, integrating over two different timescales ∆t = 1 minute (60 seconds) and ∆t = 10 minutes (600 seconds).That is to say, given a temporal network G with edge processes E i t , and with a unit timescale, which extends from time t = 0 to t = T , we define a new temporal network G with edge processes Ẽi k , and with timescale ∆t = 60, 600 seconds, which extends from time k = 0 to k = T ∆t − 1.The links in this second network are then drawn from the first in the following way: Ẽi k = 1 if for any t ∈ {k∆t, ..., (k + 1)∆t}, E i t = 1.In this way we effectively "integrate" the time series for each link over our time scale ∆t.

Heterogeneity of co-memory histograms: entropy and kurtosis
In Figure 3 of the main manuscript we have plotted the co-memory matrix histograms for the six temporal networks at the two resolution timescales.Here we give a further exploration about the shape of these histograms by computing their entropy and kurtosis.The entropy of a probability distribution function p(x) is defined as − ∑ p(x) log p(x) and characterises how 'uneven' the distribution is, reaching a maximum when the distribution is uniform and reaching the minimum (zero) when the probability is fully concentrated.Accordingly, entropy describes in a scalar metric how heterogeneous the microscopic memory kernel of a given temporal network is.The kurtosis is the fourth standardised moment of p(x), and is a measure of its "tailedness".
Results are shown in Fig. S9.The fact that transportation networks tend to have large entropy suggest that many different memory co-orders are detected, i.e. these networks display a highly heterogeneous memory kernel.On the other hand, we find that the online social networks have a strong kurtosis, meaning that even if most of the links have a weak memory kernel, there are a few whose co-order is large.Since Ω eff (G) is defined as the maximum over all co-orders, from the kurtosis analysis we can conclude that online social communication networks analysed in this work display large memory but this only comes from a handful of links.
Constructing memory communities and comparing networks in the (⟨Ω⟩ i in , ⟨Ω⟩ i out ) plane Given the memory heterogeneity displayed in real temporal networks, we wish to further understand how a given links activity influences the memory of the whole network, and, in turn, how much the whole network has an influence on it.To quantify this we define two quantities: given the time series E l t representing the evolution of each link l ∈ 1, ..., L, the average outgoing co-order of a link is defined as This quantity characterises the net memory effect that the network as a whole has on link i.On the other hand, we define the average incoming co-order characterising the net memory effect of the link i on the whole network.The duple (⟨Ω⟩  Let us start by comparing, for each temporal network, these scatter plots for the two different resolution times.Results are plotted in the six panels of Fig. S10.The first observation is that transportation networks (bus (PB), train (PT) and underground (PU)) display very different memory properties at the ∆t = 1 and ∆t = 10 min resolution timescales, and hence their links systematically cluster apart.More particularly, for the ∆t = 10 the average co-orders are notably lower than for the ∆t = 1 scale, suggesting indeed that all the memory structure is captured at the ∆t = 1 scale, i.e.only one memory scale manifests, as expected as expected due to strong planning and scheduling restrictions.At the other extreme, many links in the two online social communication networks overlap in the scatterplots for the two resolution timescales, and memory is systematically weak.This effect is more acute in the college text message (CM) network than in the email (EM) network.Incidentally, for the CM network we find that there is a single link which, for both ∆t = 1 and 10 min timescales, has a significantly larger ⟨Ω⟩ i out than the rest, meaning that there is one specific link whose activity is driven by the global activity of the network.The offline social contact network (RM) somehow interpolates the behaviour of the previous two groups: we can see that while links for the two timescales are closer together, they still cluster apart and the two different timescales are clearly visible.The memory of this network is strong at the two different timescales, concluding that we are indeed detecting two different memory timescales.
Secondly, we explore how these temporal networks compare to each other when projected in the (⟨Ω⟩ i in , ⟨Ω⟩ i out ) plane.In Fig. S11 we scatterplot the top 100 most active links for all six temporal networks at ∆t = 1 min (panel (a)) and ∆t = 10 min (panel (d)).The first observation is that at the ∆t = 1 min, the two social communication networks cluster together, and the same is true for the three engineered transportation networks.Furthermore, there is a very clear separation between these two groups, indicating that the memory structure is very different: the online networks clearly display weaker memory than the offline ones.Perhaps unexpectedly, the social contact network (RM) clusters together with the engineered transportation networks (in particular, there is a large overlap with the bus network).Note however that, even if RM is a social network, it is (i) offline, like the second group, and also (ii) it is a mobility network, and therefore it is not unreasonable to find that its memory structure is more akin to the one displayed by a transportation one.When we switch the resolution scale to ∆t = 10 min, we observe that the engineered transportation networks are still grouped together, however they lose memory and therefore come closer to the social communication networks.As discussed before, the social contact network retains its high memory at the larger timescale, this being a manifestation of the second memory timescale present in this network.
We turn now to explore and quantify the extent to which links are clustered in the (⟨Ω⟩ i in , ⟨Ω⟩ i out ) plane, by finding a 2-dimensional equivalent to their standard deviation σ of the set of L points representing each data set: where Ωin and Ωout are the sample averages of ⟨Ω⟩ i in and ⟨Ω⟩ i out respectively.If this value is large then there is a high average distance between points, and so we can think of them as being less clustered.If this value is small then the points are close together and we can think of them as being highly clustered.Results for this metric for each network are shown in table II.This shows us that for ∆t = 1 min the online social interaction networks (EM, CM) are far more clustered than the engineered networks, with the highest value of σ for the social networks being 0.132, while the lowest value for engineered networks being 0.461, a fact which remains true for ∆t = 10 min.The offline social contact network (RM) displays systematic spread, highlighting large memory link heterogeneity at different resolution times.
Finally we consider the input-output memory balance, which measures how balanced the influence of the network is on each link compared to each links influence on the network.For instance, if for all links we have ⟨Ω⟩ i in ≈ ⟨Ω⟩ i out then we know that no link has a strong influence on the memory of the network, and equally no link has long memory of the rest of the network.To assess this we take the average distance from the line ⟨Ω⟩ i out = ⟨Ω⟩ i in for each network.More concretely, for each empirical network we find the average distance D between the points in the scatter plot to the line ⟨Ω⟩ i in = ⟨Ω⟩ i out , given L sampled links, as follows: The values of these metrics for each network are shown again in table II.Here we find that, as before, online social networks are far more balanced than transportation networks, while the offline social network sits between the other two.
Altogether, analysis of the co-memory matrix (and its projection in the co-memory histogram and the (⟨Ω⟩ i in , ⟨Ω⟩ i out ) plane) at different resolutions provide insights on the microscopic memory heterogeneity displayed by real temporal networks, providing a simple method for discrimination between self-organised, online and offline engineered networks, and highlighting different and interpretable memory timescales.In particular, our results certify that while all networks considered display memory as estimated via Ω eff (G), online social networks have a weaker and more homogeneous microscopic memory kernel than infrastructure transportation networks, and in between we find the case of the social contact network (RM) -a mix case which is 'social', like the first group, but 'offline' like the second-, and this latter unveils the presencee of two different memory timescales, probably related to two distinct mechanisms of social interaction between the university students: casual social interaction vs sharing a class together.

METHOD IMPLEMENTATIONS
As part of this work we have provided a number of implementations of our approach to finding the memory of a temporal network, or the distribution of its co-orders, as based on the above mentioned efficient determination criterion (EDC) given either a file containing a time stamped edge list or the input from a suitable function.The intention here is to remove the complexity associated with implementing our method in an efficient way, and thus allow any future studies in this area to forge ahed without such an overhead.To this end we have provided versions in the following languages, each with their own advantages and disadvantages: • Python 2.7

• Rust
The versions written in Python, and to some extent Java are intended to be used for testing of smaller data sets and prototyping of other experiments.This is because though the languages are common and their implementations are hopefully easy to understand, they (again with the possible exception of Java) lack the raw speed and low memory overhead of the other languages.The C++ version is intended to be as fast as possible while maintaining the lowest possible memory overhead, and so is well suited to handling larger networks or the running of multiple experiments at once.The Rust version (provided with both parallel and serial approaches) is also intended to be as fast as possible, but does have a higher memory overhead when compared to the C++ version (though this is still the second lowest memory usage), as such it is suited to larger networks, but may not be as good as C++ for running multiple experiments at once.We have however provided a parallel implementation in Rust, which, provided there is no problem with memory requirements, is the fastest available.The Java version is intended to be usably fast in any situation and with a moderate memory overhead, while being easy to work with.
We have tested the run-time of each implementation by finding the effective memory of the edge list associated with the (EM) data set at a resolution of ∆t = 60seconds.This comprises of 100 links over 390507 time steps.These tests were run on a desktop PC (Ubuntu 18.04.3LTS (64-bit)), with a intel Core i7-6700k (4.00GHz, 8 core) processor, and 32GB FIG.S1.Models of temporal networks with two links and tunable memories.In both models the temporal activity of each link depends on the past of the other link.However the memory kernel is different for the two models.In model 1, link 1 can copy the t − p1 state of link 2, while link 2 can copy the t − p2 state of link 1, thereby inducing a virtual loop of memory p1 + p2 in the dynamics of each link.In model 2, the first link can copy a state of the second link, choosing uniformly at random from the previous t − 1, ..., t − p1 available states.Analogouly the second link can copy one of the t − 1, ..., t − p2 states of the first link.This model also induces virtual loops of memory p1 + p2, although, in practice the estimation method may not necessarily be able to identify these virtual loops due to virtual loop decoherence (see text for details).
FIG. S4.Autocorrelation function and Efficient Determination Criterion curves for models 1 and 2. Each column depicts the autocorrelation function ACF(τ ) (top) and the efficient determination criterion curve EDC(n) (bottom) of Ω(E 1 t E 1 t ),for one realisation of model 1 (left column) and two realisations of model 2 (middle and right columns).In every case the temporal networks have 10 5 time steps.EDC estimates that the memory corresponds to the minimum of log(EDC(n)).For model 1, the ACF clearly shows a peak at the correct memory p1 + p2 and a succession of additional harmonics with smaller amplitude, and the EDC curve clearly captures the correct memory order.For model 2 the situation is less obvious: the ACF cannot be so easily interpreted.The two examples depict one where the theoretical memory p1 +p2 = 5 is identified, and another where the estimation is different (note in this case that the EDC curve is relatively flat around p1 + p2).
FIG.S5.Spreading times for SI dynamics on temporal networks generated by model 1 (left) and model 2 (right).Each curve represents networks with the same value of the scalar memory Ω(G).For each of these networks the effective memory Ω eff (G) derived from the co-memory matrix is different due to the emergence of virtual loops.Spreading times are shown to vary and this variation is well captured by changes in ∆ = Ω eff (G) − Ω(G), i.e. in the co-memory matrix.Solid symbols are the results of Monte Carlo simulations, whereas hollow symbols are the theoretical predictions obtained by solving Eq.S60.
FIG. S7.Bayesian causal graph of the specific eCDARN(p) model used in the panel (b) of Fig.1 of the main manuscript.Nodes in this graph correspond to the links in the original temporal network, and two nodes are connected by a directed edge if the associated pair of links in the original network are causally connected.In this particular examples, only a subset of links in the original temporal network can actually update their state from the past of other links, where some others evolve independently.
FIG. S9.Assessing co-memory histogram heterogeneity Entropy (panel a) and kurtosis (panel b) of the co-memory histogram for each of the six real-world temporal networks, at ∆t = 1 min (blue) and ∆t = 10 min (red).

i
out ) is therefore a compact representation of the role played by each link i, in this section we explore scatterplots of ⟨Ω⟩ i out vs ⟨Ω⟩ i in .More concretly, for each of the six empirical temporal networks we have considered in this work, we focus on the top 100 most active links and make scatter plots of ⟨Ω⟩ i out vs ⟨Ω⟩ i in for the two different resolution timescales ∆t = 1 and 10 minutes.

FIG. S10 .
FIG. S10.Comparison between the two resolution timescales.For the six real-world temporal networks we scatter plot the average outgoing co-order ⟨Ω⟩ i out vs the average incoming co-order ⟨Ω⟩ i in of the 100 most active links, and we systematically compare these for two resolution timescales: ∆t = 1 min (red diamonds) and ∆t = 10 min (blue dots).The online social interaction networks (EM, CM) have weak memory kernels which are maintained at the two resolution timescales.The offline social contact network (RM) have strong memory at both resolutions, but memory is overall larger at the lower resolution, thereby making timescale separation easy.All three engineered transportation networks (PB, PT, PU) display strong memory kernels only at the lower resolution timescale.

FIG. S11 .
FIG. S11.Scatter plot of the average outgoing co-order ⟨Ω⟩ i out vs the average incoming co-order ⟨Ω⟩ i in of the 100 most active links for the six real-world temporal networks at the (a) ∆t = 1 min and (b) ∆t = 10 min resolution timescales.At the lower resolution timescale online and offline temporal networks cluster together, and offline networks evidence stronger microscopic memory kernels.At the larger resolution timescale, only the (offline) social contact network (RM) displays consistently strong memory.
In this way e t completely characterizes the state of the graph at time t.

TABLE I .
Summary of the situations where virtual loops are expected to emerge.