Modelling sequences and temporal networks with dynamic community structures

In evolving complex systems such as air traffic and social organisations, collective effects emerge from their many components’ dynamic interactions. While the dynamic interactions can be represented by temporal networks with nodes and links that change over time, they remain highly complex. It is therefore often necessary to use methods that extract the temporal networks’ large-scale dynamic community structure. However, such methods are subject to overfitting or suffer from effects of arbitrary, a priori-imposed timescales, which should instead be extracted from data. Here we simultaneously address both problems and develop a principled data-driven method that determines relevant timescales and identifies patterns of dynamics that take place on networks, as well as shape the networks themselves. We base our method on an arbitrary-order Markov chain model with community structure, and develop a nonparametric Bayesian inference framework that identifies the simplest such model that can explain temporal interaction data.


INTRODUCTION
To reveal the mechanisms of complex systems, researchers identify large-scale patterns in their networks of interactions with community-detection methods [1].Traditionally, these methods describe only static network structures without taking into account the dynamics that take place on the networks, such as people travelling by air, or the dynamics of the networks themselves, such as new routes in air traffic networks.While the dynamics on and of networks contain crucial information about the systems they represent, only recently have researchers showed how to incorporate higher-order Markov chains to describe dynamics on networks [2][3][4][5] and higher-order temporal structures to describe dynamics of networks [6][7][8][9][10][11][12][13][14][15][16][17][18].However, both avenues of research have encountered central limitations: First, methods that use higher-order memory to describe dynamics on networks rely on extrinsic methods to detect the appropriate memory order [2,19].Second, methods that attempt to describe dynamics of networks adapt static descriptions by aggregating time windows into discrete layers [13-15, 17, 20-22], and ignore dynamics within the time windows.Thus, both methods for dynamics on and of networks require or impose ad hoc timescales that can obscure essential dynamic community structure.
Furthermore, when trying to determine the timescales solely from data, the curse of dimensionality strikes: the large number of degrees of freedom makes the higherorder descriptions prone to overfitting when random fluctuations in high-dimensional data are mistaken for actual structure [23].Without a principled method with effective model selection to counteract this increasing complexity, it becomes difficult to separate meaningful dynamic community structure from artefacts.
To overcome these model selection and arbitrary timescale problems, we present a general and principled data-driven method by simultaneously tackling dynamics on and of networks (see Fig. 1).In contrast to approaches that incorporate temporal layers in methods for static network descriptions, we build our approach on describing the actual dynamics.We first formulate a generative model of discrete temporal processes based on arbitraryorder Markov chains with community structure [24][25][26][27].Since our model generates event sequences, it does not aggregate data in time windows [13][14][15]17], and, other than the Markov model assumption, needs no a priori imposed timescales.This model can be used to describe dynamics taking place on network systems that take into account higher-order memory effects [2,3] of arbitrary order.We then use the model to describe temporal networks, where the event sequence represents the occurrence of edges in the network [10].
In both cases, we employ a nonparametric Bayesian inference framework that allows us to select, according to the statistical evidence available, the most parsimonious model among all its variations.Hence we can, for example, identify the most appropriate Markov order and the number of communities without overfitting.In particular, if the dynamics on or of a network are random, our method will not identify any spurious patterns from noise but conclude that the data lack structure.As we also show, the model can be used to predict future network dynamics and evolution from past observations.Moreover, we provide publicly available and scalable code with log-linear complexity in the number of nodes independent of the number of groups. .Unified modelling of dynamics on and of networks.Our modeling framework simultaneously describes: (a) Arbitrary dynamics taking place on networks, represented as a sequence of arbitrary tokens that are associated with nodes, in this example {xt} = "It␣was␣the␣best␣of␣times".(b) Dynamics of networks themselves, where the tokens are node-pair edges that appear in sequence, in this example {xt} = {(1, 2), (4, 3), (5,2), (10,8), (7,2), (9,3), (3,4)}.

Inference of Markov chains
Here we consider general time-series composed of a sequence of discrete observations {x t }, where x t is a single token from an alphabet of size N observed at discrete time t, and x t−1 = (x t−1 , . . ., x t−n ) is the memory of the previous n tokens at time t (see Fig. 1).An nth-order Markov chain with transition probabilities p(x t |x t−1 ) generates such a sequence with probability where a x,x is the number of transitions x → x in {x t }.Given a specific sequence {x t }, we want to infer the transitions probabilities p(x|x).The simplest approach is to compute the maximum-likelihood estimate, that is where a x = x a x,x , which amounts simply to the frequency of observed transitions.Putting this back into the likelihood of Eq. 1, we have ln P ({x t }|p) = x,x a x,x ln a x,x a x .
This can be expressed through the total number of observed transitions E = x,x a x,x and the conditional entropy H(X|X) = − x p(x) x p(x|x) ln p(x|x) as ln P ({x t }|p) = −EH(X|X).Hence, the maximisation of the likelihood in Eq. 1 yields the transition probabilities that most compress the sequence.There is, however, an important caveat with this approach.It cannot be used when we are interested in determining the most appropriate Markov order n of the model, because the maximum likelihood in Eq. 3 increases with n.In general, increasing number of memories at fixed number of transitions leads to decreased conditional entropy.Hence, for some large enough value of n there will be only one observed transition conditioned on every memory, yielding a zero conditional entropy and a maximum likelihood of 1.This would be an extreme case of overfitting, where by increasing the number of degrees of freedom of the model it is impossible to distinguish actual structure from stochastic fluctuations.Also, this approach does not yield true compression of the data, since it does not describe the increasing model complexity for larger values of n, and thus is crucially incomplete.To address this problem, we use a Bayesian formulation, and maximise instead the complete evidence which is the sum of all possible models weighted according to prior probabilities P (p) that encode our a priori assumptions.This approach gives the correct model order for data sampled from Markov chains as long as there are enough statistics that balances the structure present in the data with its statistical weight, as well as meaningful values when this is not the case [28].
Although this Bayesian approach satisfactorily addresses the overfitting problem, it misses opportunities of detecting large-scale structures in data.As we show below, it is possible to extend this model in such a way as to make a direct connection to the problem of finding communities in networks, yielding a stronger explanatory power when modelling sequences, and serving as a basis for a model where the sequence itself represents a temporal network.

Markov chains with communities
Instead of directly inferring the transition probabilities of Eq. 1, we propose an alternative formulation: We assume that both memories and tokens are distributed in disjoint groups (see Fig. 2).
Here θ x is the relative probability at which token x is selected among those that belong to the same group, and λ rs is the overall transition probability from memory group s = b x to token group r = b x .The parameter θ x plays an analogous role to degree-correction in the SBM [29], and is together with the Bayesian description the main difference from the sparse Markov chains developed in Refs.[26,27].In the case n = 1, for example, each token appears twice in the model, both as token and memory.An alternative and often useful approach for n = 1 is to consider a single unified partition for both tokens and memories, as shown in Fig. 2b and described in detail in the Methods section The unified first-order model.In any case, the maximum likelihood parameter estimates are where e rs is the number of observed transitions from group s to r, e s = t e ts is the total outgoing transitions from group s if s is a memory group, or the total incoming transition if it is a token group.The labels r and s are used indistinguishably to denote memory and token groups, since it is only their numerical value that determines their kind.Finally, k x is the total number of occurrences of token x.Putting this back in the likeli-hood, we have ln P ({x t }|b, λ, θ) = r<s e rs ln e rs e r e s + x This is almost the same as the maximum likelihood of the degree-corrected stochastic block model (DCSBM) [29], where a x,x plays the role of the adjacency matrix of a bipartite multigraph connecting tokens and memories.The only differences are constant terms that do not alter the position of the maximum with respect to the node partition.This implies that for undirected networks without higher-order memory, there is no difference between inferring the structure directly from its topology or from dynamical processes taking place on it, as we show in detail in the Methods section Equivalence between structure and dynamics.As before, this maximum likelihood approach cannot be used if we do not know the order of the Markov chain, otherwise it will overfit.In fact, this problem is now aggravated by the larger number of model parameters.Therefore, we employ a Bayesian formulation and construct a generative process for the model parameters themselves.We do this by introducing prior probability densities for the parameters D r ({θ x }|α) and D s ({λ rs }|β) for tokens and memories, respectively, with hyperparameter sets α and β, and computing the integrated likelihood where we used b as a shorthand for {b x } and {b x }.Now, instead of inferring the hyperparameters, we can make a Table I.Summary of inference results for empirical sequences.Description length Σ = − log 2 P ({xt}, b) in bits, as well as inferred number of token groups BN and memory groups BM for different data sets and Markov order n (for detailed descriptions, see Methods section Datasets).The value Σ = − log 2 P ({xt}) corresponds to the direct Bayesian parametrisation of Markov chains of Ref. [28], with noninformative priors.Values in grey correspond to the minimum of each column.The bottom rows show the compression obtained with gzip and LZMA, two popular variations of Lempel-Ziv [30,31].noninformative choice for α and β that reflects our a priori lack of preference towards any particular model [32].Doing so in this case yields a likelihood (for details, see Methods section Bayesian Markov chains with communities), P ({x t }|b, {e s }) = P ({x t }|b, {e rs }, {k x }) × P ({k x }|{e rs }, b)P ({e rs }|{e s }), (9) where P ({x t }|b, {e rs }, {k x }) corresponds to the likelihood of the sequence {x t } conditioned on the transitions counts {e rs } and token frequencies {k x }, and the remaining terms are the prior probabilities on the discrete parameters {e rs } and {k x }.Since the likelihood above still is conditioned on the partitions {b x } and {b x }, as well as the memory group counts {e s }, we need to include prior probabilities on these as well to make the approach fully nonparametric.Doing so yields a joint likelihood for both the sequence and the model parameters, P ({x t }, b, {e s }) = P ({x t }|b, {e s })P (b)P ({e s }).(10) It is now possible to understand why maximizing this joint likelihood will prevent overfitting the data.If we take its negative logarithm, it can be written as The quantity Σ is called the description length of the data [33,34].It corresponds to the amount of information necessary to describe both the data and the model simultaneously, corresponding to the first and second terms in Eq. 12, respectively.As the model becomes more complex-either by increasing the number of groups or the order of the Markov chain-this will decrease the first term as the data likelihood increases, but it will simultaneously increase the second term, as the model likelihood decreases.The second term then acts as a penalty to the model likelihood, forcing a balance between model complexity and quality of fit.Unlike approximative penalty approaches based solely on the number of free parameters such as BIC [35] and AIC [36], which are not to valid for network models [37], the description length of the model is exact and fully captures its flexibility.Because of the complete character of the description length, minimizing it indeed amounts to achieving true compression of data, differently from the parametric maximum likelihood approach mentioned earlier.Because the whole process is functionally equivalent to inferring the SBM for networks, we can use the same algorithms [38] (for a details about the inference method, see Methods section Bayesian Markov chains with communities).Before we continue, we point out that the selection of priors in Eq. 9 needs to be done carefully to avoid underfitting the data.This happens when strong prior assumptions obscure structures in the data [39].We tackle this by using hierarchical priors, where the parameter themselves are modelled by parametric distributions, which in turn contain more parameters, and so on [40,41].Besides alleviating the underfitting problem, this allows us to represent the data in multiple scales by a hierarchical partition of the token and memories.We describe this in more detail in the Methods section Bayesian Markov chains with communities.
This Markov chain model with communities succeeds in providing a better description for a variety of empirical sequences when compared with the common Markov chain parametrisation (see Table I).Not only do we systematically observe a smaller description length, but we also find evidence for higher order memory in all examples.We emphasise that we are protected against overfitting: If we randomly shuffle the order of the tokens in each dataset, with dominating probability we infer a fully random model with n = 1 and B N = B M = 1, which is equivalent to an n = 0 memoryless model.We have verified that we infer this model for all analysed datasets.Accordingly, we are not susceptible to the spurious results of nonstatistical methods [23].
To illustrate the effects of community structure on the Markov dynamics, we use the US flight itineraries as an example.In this dataset, the itineraries of 1, 272, 696 passengers were recorded, and we treat each airport stop as a token in a sequence (for more details, see Methods section Datasets).When we infer our model, the itinerary memories are grouped together if their destination probabilities are similar.As a result, it becomes possible, for example, to distinguish transit hubs from destination hubs [2].We use Atlanta and Las Vegas to illustrate: Many roundtrip routes transit through Atlanta from the origin to the final destination and return to it two legs later on the way back to the origin.On the other hand, Las Vegas often is the final destination of a roundtrip such that the stop two legs later represents a more diverse set of origins (Fig. 3).Resembling the results of the map equation for network flows with memory [2], this pattern is captured in our model by the larger number of memory groups that involve Las Vegas than those that involve Atlanta.Moreover, the division between transit and destinations propagates all the way to the upper hierarchical levels of the memory partition.
In addition to this itinerary memory clustering, the co-clustering with airport tokens also divides the airports into hierarchical categories.For example, Atlanta is grouped with nearby Charlotte at the first hierarchy level, and with Detroit, Minneapolis, Dallas and Chicago at the third level.This extra information tells us that these airports serve as alternative destinations to itineraries that are similar to those that go through Atlanta.Likewise, Las Vegas is grouped together with alternative destinations Phoenix and Denver.This type of similarity between airports-which is not merely a reflection of the memory patterns-is not expressed with the assortative approach of the map equation, which solely clusters densely connected memories with long flow persistence times [2].A more direct comparison between our Bayesian inference framework and the map equation is not meaningful, since these two approaches represent the network divisions differently (for a detailed discussion, see Methods section Comparison with the map equation for network flows with memory).Indeed, it is the simultaneous division of memories and tokens that effectively reduce the overall complexity of the data, and provide better compression at higher memory order.Con-sequently, the community-based Markov model can capture patterns of higher-order memory that conventional methods obscure.

Temporal networks
A general model for temporal networks treats the edge sequence as a time series [6,42,43].We can in principle use the present model without any modification by considering the observed edges as tokens in the Markov chain, that is, x t = (i, j) t , where i and j are the endpoints of the edge at time t (see Fig. 1b).However, this can be suboptimal if the networks are sparse, that is, if only a relatively small subset of all possible edges occur, and thus there are insufficient data to reliably fit the model.Therefore, we adapt the model above by including an additional generative layer between the Markov chain and the observed edges.We do so by partitioning the nodes of the network into groups, that is, c i ∈ [1, C] determines the membership of node i in one of C groups, such that each edge (i, j) is associated with a label (c i , c j ).Then we define a Markov chain for the sequence of edge labels and sample the actual edges conditioned only on the labels.Since this reduces the number of possible tokens from O(N 2 ) to O(C 2 ), it has a more controllable number of parameters that can better match the sparsity of the data.We further assume that, given the node partitions, the edges themselves are sampled in a degree-corrected manner, conditioned on the edge labels, where κ i is the probability of a node being selected inside a group, with i∈r κ i = 1.The total likelihood conditioned on the label sequence becomes Since we want to avoid overfitting the model, we once more use noninformative priors, but this time on {κ i }, and integrate over them, Combining this result with Eq. 9, we have the complete likelihood of the temporal network, terms.One recovers the static DCSBM exactly by choosing B N = B M = 1-making the state transitions memoryless.
Finally, to make the model nonparametric, we again include the same prior as before for the node partition c, in addition to token and memory partition b, such that the total nonparametric joint likelihood is maximised, In this way, we again protect against overfitting, and we can infer not only the number of memory groups B N and token groups B M , as before, but also the number of groups in the temporal network itself, C. If, for example, the temporal network is completely randomthat is, the edges are placed randomly both in the aggregated network as well as in time-we again infer B N = B M = C = 1 with the largest probability.We refer to the Methods section Temporal networks for a complete derivation of the final likelihood.We employ this model in a variety of dynamic network datasets from different domains (for details, see Table II and Methods section Datasets).In all cases, we infer models with n > 0 that identify many groups for the tokens and memories, meaning that the model succeeds in capturing temporal structures.In most cases, models with n = 1 best describe the data, implying that there is not sufficient evidence for higher-order memory, with exception of the network of chess moves, which is best described by a model with n = 2.This result is different from the results for the comparably long nonnetwork sequences in table I, where we identified higherorder Markov chains.Again, this is because the alphabet size is much larger for temporal networks-corresponding to all possible edges that can be encountered.Hence, for the datasets in Table II the size of the alphabet is often comparable with the length of the sequence.In view of this, it is remarkable that the method can detect any structure at all.The intermediary layer where the Markov chain generates edge types instead of the edges directly is crucial.If we fit the original model without this modification, we indeed get much larger description lengths and we often fail to detect any Markov structure (not shown).
To illustrate how the model characterises the temporal structure of these systems, we focus on the proximity network of high school students, which corresponds to the voluntary tracking of 327 students for a period of 5 days [44].Whenever the distance between two students fell below a threshold, an edge between them was recorded at that time.In the best-fitting model for these data, the inferred groups for the aggregated network correspond exactly to the known division into 9 classes, except for the PC class, which was divided into two groups (Fig. 4).The groups show a clear assortative structure, where most connections occur within each class.The clustering of the edge labels in the second part of the model reveals the temporal dynamics.We observe that the edges connecting nodes of the same group cluster either in single-node or small groups, with a high incidence of self-loops.This means that if an edge that connects two students of the same class appears in the sequence, the next edge is most likely also inside the same class, indicating that the students of the same class are clustered in space and time.The remaining edges between students of different classes are separated into two large groups.This division indicates that the different classes meet each other at different times.Indeed, the classes are located in different parts of the school building and they typically go to lunch separately [44].Accordingly, our method can uncover the associated dynamical pattern from the data alone.

Temporal prediction
Using generative models to extract patterns in data yields more than a mere description, since they generalise observations and predict future events.For our Bayesian approach, the models can even be used to predict tokens and memories not previously observed.This ability is in strong contrast to more heuristic methods that are only designed to find partitions in networks or time series, and cannot be used for prediction.Furthermore, the predictive performance of a model is often used on its own to evaluate it, and serves as an alternative approach to model selection: since an overly complicated model incorporates noise in its description, it yields less accurate predictions.Thus, maximizing the predictive performance also amounts to a balance between quality of fit and model complexity, similarly to the minimum description length approach we have used so far.It is important, therefore, to determine if these two different criteria yield consistent results, which would serve as an additional verification of the overall approach.
We show this consistency by considering a scenario where a sequence is divided into two equal-sized contiguous parts, That is, a training set {x * t } and a validation set {x t }.We then evaluate the model by fitting it to the training set and use it to predict the validation set.If we observe only the training set, the likelihood of the validation set is bounded below by P ({x t }|{x * t }, b * ) ≥ exp(−∆Σ), where b * = argmax b P (b|{x * t }) is the best partition given the training set and ∆Σ is the description length difference between the training set and the entire data (for a proof, see Methods section Predictive held-out likelihood).This lower bound will become tight when both the validation and training sets become large, and hence can be used as an asymptotic approximation of the predictive likelihood.Table II shows empirical values for the same datasets as considered before, where n = 0 corresponds to using only the static DCSBM to predict the edges, ignoring any time structure.The temporal network model provides better prediction in all cases, and the best Markov order always coincides with the one that yields the minimum description length, thus confirming a full agreement between both criteria in these cases.

DISCUSSION
We presented a dynamical variation of the degreecorrected stochastic block model that can capture long pathways or large-scale structures in sequences and temporal networks.The model does not require the optimal Markov order or number of groups as inputs, but infers them from data because the underlying arbitraryorder Markov chain model is nonparametric.Its nonparametric nature also evades a priori imposed timescales.We showed that the model successfully finds meaningful large-scale temporal structures in real-world systems and that it predicts their temporal evolution.Moreover, in the Methods section we extend the model to situations where the dynamics take place in continuous time or is nonstationary.In contrast to approaches that force network-formation dynamics into discrete time windows, and require a priori knowledge about the appropriate amount of dynamical memory, our approach provides a principled and versatile alternative.
As described in the main text, a Bayesian formulation of the Markov model consists in specifying prior probabilities for the model parameters, and integrating over them.In doing so, we convert the problem from one of parametric inference where the model parameters need to be specified before inference, to a nonparametric one where no parameters need to be specified before infer-ence.In this way, the approach possesses intrinsic regularisation, where the order of the model can be inferred from data alone, without overfitting [32,45].
To accomplish this, we rewrite the model likelihood, using Eqs. 1 and 5, as and observe the normalisation constraints x∈r θ x = 1, and r λ rs = 1.Since this is just a product of multinomials, we can choose conjugate Dirichlet priors probability densities D r ({θ x }|{α x }) and D s ({λ rs }|{β rs }), which allows us to exactly compute the integrated likelihood, where A r = x∈r α x and B s = r β rs .We recover the Bayesian version of the common Markov chain formulation (see Ref. [28]) if we put each memory and token in their own groups.This remains a parametric distribution, since we need to specify the hyperparameters.However, in the absence of prior information it is more appropriate to make a noninformative choice that encodes our a priori lack of knowledge or preference towards any particular model, which amounts to choosing α x = β rs = 1, making the prior distributions flat.If we substitute these values in Eq. 19, and re-arrange the terms, we can show that it can be written as the following combination of conditional likelihoods, P ({x t }|b, {e s }) = P ({x t }|b, {e rs }, {k x }) × P ({k x }|{e rs }, b)P ({e rs }|{e s }), (20) where with being the multiset coefficient, that counts the number of m-combinations with repetitions from a set of size n.
The expression above has the following combinatorial interpretation: P ({x t }|b, {e rs }, {k x }) corresponds to the likelihood of a microcanonical model [41] where a random sequence {x t } is produced with exactly e rs total transitions between groups r and s, and with each token x occurring exactly k x times.In order to see this, consider a chain where there are only e rs transitions in total between token group r and memory group s, and each token x occurs exactly k x times.For the first transition in the chain, from a memory x 0 in group s to a token x 1 in group r, we have the probability Now, for the second transition from memory x 1 in group t to a token x 2 in group u, we have the probability Proceeding recursively, the final likelihood for the entire chain is which is identical to Eq. 21.
The remaining terms in Eqs.22 and 23 are the prior probabilities on the discrete parameters {k x } and {e rs }, respectively, which are uniform distributions of the type 1/Ω, where Ω is the total number of possibilities given the imposed constraints.We refer to Ref. [41] for a more detailed discussion on those priors.
Since the integrated likelihood above gives P ({x t }|b, {e s }), we still need to include priors for the node partitions {b x } and {b x }, as well as memory group counts, {e s }, to make the above model fully nonparametric.This is exactly the same situation encountered with the SBM [39][40][41].Following Refs.[40,41], we use a nonparametric two-level Bayesian hierarchy for the partitions, P ({b i }) = P ({b i }|{n r })P ({n r }), with uniform distributions where n r is the number of nodes in group r, M = r n r , which we use for both {b x } and {b x }, that is, P (b) = P ({b x })P ({b x }).Analogously, for {e s } we can use a uniform distribution The above priors make the model fully nonparametric with a joint and marginal probability P ({x t }, b) = P ({x t }, b, {e s }) = P ({x t }|b, {e s })P (b)P ({e s }).This approach successfully finds the most appropriate number of groups according to statistical evidence, without overfitting [34,39,40,46].This nonparametric method can also detect the most appropriate order of the Markov chain, again without overfitting [28].However, in some ways it is still sub-optimal.The use of conjugate Dirichlet priors above was primarily for mathematical convenience, not because they closely represent the actual mechanisms believed to generate the data.Although the noninformative choice of the Dirichlet distribution (which yields flat priors for {e rs } and {e s }) can be well justified by maximum entropy arguments (see Ref. [32]), and are unbiased, it can in fact be shown that it can lead to underfitting of the data, where the maximum number of detectable groups scales sub-optimally as √ N [39].As shown in Ref. [40], this limitation can be overcome by departing from the model with Dirichlet priors, and replacing directly the priors P ({e rs }|{e s }) and P ({e s }) of the microcanonical model by a single prior P ({e rs }), and noticing that {e rs } corresponds to the adjacency matrix of bipartite multigraph with E edges and B N + B M nodes.With this insight, we can write P ({e rs }) as a Bayesian hierarchy of nested SBMs, which replaces the resolution limit above by N/ ln N , and provides a multilevel description of the data, while remaining unbiased.Furthermore, the uniform prior in Eq. 8 for the token frequencies P ({k x }|{e rs }, b) intrinsically favours concentrated distributions of k x values.This distribution is often skewed.We therefore replace it by a two-level Bayesian hierarchy P ({k x }|{e rs }, b) = r P ({k x }|{n r k })P ({n r k }|e r ), with and P ({n r k }|e r ) = q(e r , n r ) −1 , where q(m, n) is the number of restricted partitions of integer m into at most n parts (see Ref. [41] for details).
As mentioned in the main text, in order to fit the model above we need to find the partitions {b x } and {b x } that maximise P ({x t }, b), or fully equivalently, minimise the description length Σ = − ln P ({x t }, b) [33].Since this is functionally equivalent to inferring the DCSBM in networks, we can use the same algorithms.In this work we employed the fast multilevel MCMC method of Ref. [38], which has log-linear complexity O(N log 2 N ), where N is the number of nodes (in our case, memories and tokens), independent of the number of groups.

The unified first-order model
The model defined in the main text is based on a coclustering of memory and tokens.In the n = 1 case, each memory corresponds to a single token.In this situation, we consider a slight variation of the model where we force the number of groups of each type to be the same, that is, B N = B M = B, and both partitions to be identical.Instead of clustering the original bipartite graph, this is analogous to clustering its projection into a directed transition graph with each node representing a specific memory and token simultaneously.When considering this model, the likelihoods computed in the main text and above remain exactly the same, with the only difference that we implicitly force both memory and token partitions to be identical, and omit the partition likelihood of Eq. 27 for one of them.We find that for many datasets this variation provides a slightly better description than the co-clustering version, although there are also exceptions to this.
We used this variation of the model in Fig. 4 because it yielded a smaller description length for that dataset, and simplified the visualisation and interpretation of the results in that particular case.

Temporal networks
Here we show in more detail how the likelihood for the temporal network model is obtained.As we discuss in the Results section Temporal networks, the total likelihood of the network conditioned on the label sequence is where d i is the degree of node i, and m rs is the total number of edges between groups r and s.Maximum likelihood estimation gives κi = d i /e ci .But since we want to avoid overfitting the model, we once more use noninformative priors, this time on {κ i }, integrate over them, henceforth omitting the trivial Kronecker delta term above and obtain with P ({d i }) = r nr er −1 . Combining this with Eq. 9 as P ({(i, j) t }|c, b) = P ({(i, j) t }|{(r, s) t }, c)P ({(r, s) t }|b), we have the complete likelihood of the temporal network ×P ({d i }|c)P ({m rs }) u<v e uv !u e u !v e v !P ({e uv }).(33) This likelihood can be rewritten in such a way that makes clear that it is composed of one purely static and one purely dynamic part, The first term of Eq. 34 is precisely the nonparametric likelihood of the static DCSBM that generates the aggregated graph with adjacency matrix if Stirling's approximation is used.The second term in Eq. 34 is the likelihood of the Markov chain of edge labels given by Eq. 9 (with {x t } = {(r, s) t }, and {k x } = {m rs }).This model, therefore, is a direct generalisation of the static DCSBM, with a likelihood composed of two separate static and dynamic terms.One recovers the static DCSBM exactly by choosing B N = B M = 1-making the state transitions memoryless-so that the second term in Eq. 34 above contributes only with a trivial constant 1/E! to the overall likelihood.Equivalently, we can view the DCSBM as a special case with n = 0 of this temporal network model.

Predictive held-out likelihood
Given a sequence divided in two contiguous parts, {x t } = {x * t } ∪ {x t }, that is, a training set {x * t } and a validation set {x t }, and if we observe only the training set, the predictive likelihood of the validation set is where b * = argmax b P (b|{x * t }) is the best partition given the training set.Moreover, we have where b corresponds to the partition of the newly observed memories (or even tokens) in {x t }.Generally we have P (b |b * ) = P (b , b * )/P (b * ), so that where b = argmax b P ({x t } ∪ {x * t }|b * , b )P (b * , b ) and ∆Σ is the difference in the description length between the training set and the entire data.Hence, computing the minimum description length of the remaining data by maximising the posterior likelihood relative to the partition of the previously unobserved memories or tokens, yields a lower bound on the predictive likelihood.This lower bound will become tight when both the validation and training sets become large, because then the posterior distributions concentrate around the maximum, and hence can be used as an asymptotic approximation of the predictive likelihood.

Continuous time
So far, we have considered sequences and temporal networks that evolve discretely in time.Although this is the appropriate description for many types of data, such as text, flight itineraries and chess moves, in many other cases events happen instead in real time.In this case, the time series can be represented-without any loss of generality-by an embedded sequence of tokens {x t } placed in discrete time, together with an additional sequence of waiting times {∆ t }, where ∆ t ≥ 0 is the real time difference between tokens x t and x t−1 .Employing a continuous-time Markov chain description, the data likelihood can be written as P ({x t }, {∆ t }|p, λ) = P ({x t }|p) × P ({∆ t }|{x t }, λ) (39) with P ({x t }|p) given by Eq. 1, and where is a maximum-entropy distribution governing the waiting times, according to the frequency λ.Substituting this in Eq. 40, we have where ∆ x = t ∆ t δ xt,x .To compute the nonparametric Bayesian evidence, we need a conjugate prior for the frequencies λ x , where α and β are hyperparameters, interpreted, respectively, as the number and sum of prior observations.A fully noninformative choice would entail α → 0 and β → 0, which would yield the so-called Jeffreys prior [48], P (λ) ∝ 1/λ.Unfortunately, this prior is improper because it is not a normalised distribution.In order to avoid this, we use instead α = 1 and β = x λ x /M , taking into account the global average.While this is not the only possible choice, the results should not be sensitive to this prior since the data will eventually override any reasonable assumption we make.Using this prior, we obtain the Bayesian evidence for the waiting times as Hence, if we employ the Bayesian parametrisation with communities for the discrete embedded model as we did previously, we have with P ({x t }, b) given by Eq. 10.
Since the partition of memories and tokens only influences the first term of Eq. 46, corresponding to the embedded discrete-time Markov chain, P ({x t }, b), the outcome of the inference for any particular Markov order will not take into account the distribution of waiting timesalthough the preferred Markov order might be influenced by it.We can change this by modifying the model above, assuming that the waiting times are conditioned on the group membership of the memories, where η r is a frequency associated with memory group r.The Bayesian evidence is computed in the same manner, integrating over η r with the noninformative prior of Eq. 43, yielding where ∆ r = t ∆ t δ bx t ,r .Since this assumes that the waiting times will be sampled from the same distribution inside each group, the inference procedure will take the waiting time into account, and will place memories with significantly different delays into different groups.
As an example of the use of this model variation, we consider a piano reduction of Beethoven's fifth symphony (extracted in MIDI format from the Mutopia project at http://www.mutopiaproject.org),represented as a sequence of E = 4, 223 notes of an alphabet of size N = 63.We consider both model variants where the timings between notes are discarded, and where they are included.If individual notes are played simultaneously as part of a chord, we order them lexicographically and separate them by ∆t = 10 −6 seconds.The results of the inference can be seen in Table III.The discrete-time model favours an n = 1 Markov chain, whereas the continuous-time model favours n = 2.This is an interesting result that shows that the timings alone can influence the most appropriate Markov order.We can see in more detail why by inspecting the typical waiting times conditioned on the memory groups, as shown in Fig. 5.For the discretetime model, the actual continuous waiting times (which are not used during inference) are only weakly correlated with the memory groups.On the other hand, for the continuous-time model we find that the memories are divided in such a way that they are strongly correlated with the waiting times: There is a group of memories for which the ensuing waiting times are always ∆t = 10 −6 -corresponding to node combinations that are always associated with chords.The remaining memories are divided into further groups that display at least two distinct timescales, that is, short and long pauses between notes.These statistically significant patterns are only visible for the higher order n = 2 model.In the above model the waiting times are distributed according to the exponential distribution of Eq. 41, which has a typical timescale given by 1/λ.However, one often encounters processes where the dynamics are bursty, that is, the waiting times between events lack any characteristic scale, and are thus distributed according to a power-law for ∆ > ∆ m , otherwise P (∆|β) = 0.One could in principle repeat the above calculations with the above distribution to obtain the inference procedure for this alternative model.However, this is in fact not necessary, since by making the transformation of variables we obtain for Eq.49 which is the same exponential distribution of Eq. 41.
Hence, we need only to perform the transformation of Eq.50 for the waiting times prior to inference, to use the bursty model variant, while maintaining the exact same algorithm.

Nonstationarity and hidden contexts
An underlying assumption of the Markov model proposed is that the same transition probabilities are used for the whole duration of the sequence, that is, the Markov chain is stationary.Generalisations of the model can be considered where these probabilities change over time.Perhaps the simplest generalisation is to assume that the dynamics is divided into T discrete epochs, such that one replaces tokens x t by a pair (x, τ ) t , where τ ∈ [1, T ] represents the epoch where token x was observed.In fact, τ does not need to be associated with a temporal variable-it could be any arbitrary covariate that describes additional aspects of the data.By incorporating this type of annotation into the tokens, one can use a stationary Markov chain describing the augmented tokens that in fact corresponds to a nonstationary one if one omits the variable τ from the token descriptors-effectively allowing for arbitrary extensions of the model by simply incorporating appropriate covariates, and without requiring any modification to the inference algorithm.Another consequence of this extension is that the same token x can belong to different groups if it is associated with two or more different covariates, (x, τ 1 ) and (x, τ 2 ).Therefore, this inherently models a situation where the group membership of tokens and memory vary in time.
As an illustration of this application of the model, we consider two literary texts: an English translation of "War and peace", by Leo Tolstoy, and the French original of "À la recherche du temps perdu", by Marcel Proust.First, we concatenate both novels together, treating it as a single text.If we fit our Markov model to it, we obtain the n = 3 model shown in Fig. 6a.In that figure, we have highlighted tokens and memories that involve letters that are exclusive to the French language, and thus most of them belong to the second novel.We observe that the model essentially finds a mixture between English and French.If, however, we indicate in each token to which novel it belongs, for example (x, wp) t and (x, temps) t , we obtain the model of Fig. 6b.In this case, the model is forced to separate between the two novels, and one indeed learns the French patterns differently from English.Since this nonstationary model possesses a larger number of memory and tokens, one would expect a larger description length.However, in this cases it has a smaller description length than the mixed alternative, indicat-ing indeed that both patterns are sufficiently different to warrant a separate description.Therefore, this approach is capable of uncovering change points [49], where the rules governing the dynamics change significantly from one period to another.
The above extension can also be used to uncover other types of hidden contexts.For example, in a temporal network of student proximity, we know that pairs of individuals that are far away are unlikely to be conditionally dependent on each other.If this spatial information is not available in the data, it may be inferred in same way it was done for language above.If the information is available, it can be annotated on the transitions, yielding a multilayer version of the model, similar to the layered SBM [15].

Equivalence between structure and dynamics
The likelihood of Eq. 4 in the main text is almost the same as the DCSBM [29].The only exceptions are trivial additive and multiplicative constants, as well as the fact that the degrees of the memories do not appear in it.These differences, however, do not alter the position of the maximum with the respect to the node partition.This allows us to establish an equivalence between inferring the community structure of networks and modelling the dynamics taking place on it.Namely, for a random walk on a connected undirected graph, a transition i → j is observed with probability A ij p i (t)/k i , with p i (t) being the occupation probability of node i at time t.Thus, after equilibration with p i (∞) = k i /2E, the probability of observing any edge (i, j) is a constant: p i (∞)/k i + p j (∞)/k j = 1/E.Hence, the expected edge counts e rs between two groups in the Markov chain will be proportional to the actual edge counts in the underlying graph given the same node partition.This means that the likelihood of Eq. 4 in the main text (for the n = 1 projected model described above) and of the DCSBM will differ only in trivial multiplicative and additive constants, such that the node partition that maximises them will be identical.This is similar to the equivalence between network modularity and random walks [50], but here the equivalence is stronger and we are not constrained to purely assortative modules.However, this equivalence breaks down for directed graphs, higher order memory with n > 1 and when model selection chooses the number of groups.

Comparison with the map equation for network flows with memory
Both the community-based Markov model introduced here and the map equation for network flows with memory [2] identify communities in higher-order Markov chains based on maximum compression.However, the two approaches differ from each other in some central aspects.The approach presented here is based on the Bayesian formulation of a generative model, whereas the map equation finds a minimal modular entropy encoding of the observed dynamics projected on a node partition.Thus, both approaches seek compression, but of different aspects of the data.
The map equation operates on the internal and external transitions within and between possibly nested groups of memory states and describes the transitions between physical nodes [x t is the physical node or token in memory states of the form x = (x t , x t−1 , x t−2 , . ..)].The description length of these transitions is minimised for the optimal division of the network into communities.By construction, this approach identifies assortative modules of memory states with long flow persistence times.Moreover, for inferring the most appropriate Markov order, this dynamics approach requires supervised approaches to model selection that uses random subsets of the data such as bootstrapping or cross validation [51].
On the other hand, the model presented here yields a nonparametric log-likelihood for the entire sequence as well as the model parameters, with its negative value corresponding to a description length for the entire data, not only its projection into groups.Minimizing this description length yields the optimal co-clustering of memories and tokens, and hence assumes no inherent assortativity.Therefore it can be used also when the underlying Markov chain is disassortative.Moreover, the description length can also be used for unsupervised model selection, where the Markov order and number of groups are determined from the entire data, obviating the need for bootstrapping or cross validation.Furthermore, the present approach can be used to generate new data and make predictions based on past observations.These distinctions mean that the two different approaches can give different results and that the problem at hand should decide which method to use.

Datasets
Below we give a description of the datasets used in this work.Taxi movements: This dataset contains GPS logs from 25, 000 taxi pickups in San Francisco, collected by the company Uber (retrieved from http://www.infochimps.com/datasets/uber-anonymized-gps-logs, also available at https://github.com/dima42/uber-gps-analysis).The geographical locations were discretised into 416 hexagonal cells (see Ref. [2] for details), and the taxi rides were concatenated together in a single sequence with a special separator token indicating the termination of a ride.In total, the sequence has an alphabet of N = 417 and a length of 819, 172 tokens.

US flight itineraries:
RockYou password list: This dataset corresponds to a widely distributed list of 32, 603, 388 passwords from the RockYou video game company (retrieved from http://downloads.skullsecurity.org/passwords/rockyou-withcount.txt.bz2).The passwords were concatenated in a single sequence, with a special separator token between passwords.This yields a sequence with an alphabet of size N = 215 (letters, numbers and symbols) and a total length of 289, 836, 299 tokens.
High school proximity: This dataset corresponds to a temporal proximity measurement of students in a french high school [44].A total of N = 327 students were voluntarily tracked for a period of five days, generating E = 5, 818 proximity events between pairs of students.
Enron email: This dataset corresponds to timestamped collection of E = 1, 148, 072 emails between directed pairs of N = 87, 273, senders and recipients of the former Enron corporation, disclosed as part of a fraud investigation [52].
Internet AS: This dataset contains connections between autonomous systems (AS) collected by the CAIDA project (retrieved from http://www.caida.org).It corresponds to a time-stamped sequence of E = 500, 106 directed connections between AS pairs, with a total of N = 53, 387 recorded AS nodes.The time-stamps correspond to the first time the connection was seen.
APS citations: This dataset contains E = 4, 262, 443 time-stamped citations between N = 425, 760 scientific articles published by the American Physical Society for a period of over 100 years (retrieved from http://journals.aps.org/datasets).
Chess moves: This dataset contains 38, 365 online chess games collected over the month of July 2015 (retrieved from http://ficsgames.org/ download.html).The games were converted into a bipartite temporal network where each piece and position correspond to different nodes, and a movement in the game corresponds to a time-stamped edge of the type piece → position.The resulting temporal network consists of N = 76 nodes and E = 3, 130, 166 edges.
Hospital contacts: This dataset corresponds to a temporal proximity measurement of patients and health care workers in the geriatric unit of an university hospital [53].A total of N = 75 individuals were voluntarily tracked for a period of four days, generating E = 32, 424 proximity events between pairs of individuals.
Infectious sociopatterns: This dataset corresponds to a temporal proximity measurement of attendants at a museum exhibition [54].A total of N = 10, 972 participants were voluntarily tracked for a period of three months, generating E = 415, 912 proximity events between pairs of individuals.
Reality mining: This dataset corresponds to a temporal proximity measurement of university students and faculty [55].A total of N = 96 people were voluntarily tracked for a period of an entire academic year, generating E = 1, 086, 404 proximity events between pairs of individuals.
Beethoven's fifth symphony: A piano reduction of Beethoven's fifth symphony, extracted in MIDI format from the Mutopia project at http://www.mutopiaproject.org,represented as a sequence of E = 4, 223 notes of an alphabet of size N = 63.

Figure 2 .
Figure 2. Schematic representation of the Markov model with communities.The token sequence {xt} = "It␣was␣the␣best␣of␣times" represented with nodes for memories (top row) and tokens (bottom row), and with directed edges for transitions in different variations of the model.(a) A partition of the tokens and memories for an n = 1 model.(b) A unified formulation of an n = 1 model, where the tokens and memories have the same partition, and hence can be represented as a single set of nodes.(c) A partition of the tokens and memories for an n = 2 model.

Figure 3 .
Figure 3. Selection of US flight itineraries for a thirdorder model.The itineraries contain stops in Atlanta or Las Vegas.Edges incident on memories of the type x = (xt−1, Atlanta, xt−3) in red and x = (xt−1, Las Vegas, xt−3) in blue.The node colours and overlaid hierarchical division derive from the n = 3 model inferred for the whole dataset.

Figure 4 .
Figure 4. Inferred temporal model for a high school proximity network [44].(a) The static part of the model divides the high school students into C = 10 groups (square nodes) that almost match the known classes (text labels).(b) The dynamic part of the model divides the directed multigraph group pairs in a into BN = BM = 9 groups (grey circles).The model corresponds to an n = 1 unified Markov chain on the edge labels, where the memory and tokens have identical partitions, as described in detail in the Methods section The unified first-order model.

Figure 5 .
Figure 5. Waiting times for a discrete-and a continuous-time Markov model.Results inferred from Beethoven's fifth symphony.(a) n = 1 discrete-time model, ignoring the waiting times between notes.(b) n = 2 continuous-time model, with waiting times incorporated into the inference.The error bars correspond to the standard deviation of the mean.

Figure 6 .
Figure 6.Markov model fit for a concatenated text.'The texts are 'War and peace" by Leo Tolstoy and "À la recherche du temps perdu", by Marcel Proust.Edge endpoints in red and blue correspond to token and memories, respectively, that involve letters exclusive to French.(a) Version of the model with n = 3 where no distinction is made between the same token in the different novels, yielding a description length − log 2 P ({xt}, b) = 7, 450, 322.(b) Version with n = 3 where each token is annotated with its novel, yielding a description length − log 2 P ({xt}, b) = 7, 146, 465.

Table II .
) conditioned only on the partitions.As we show in detail in the Methods section Temporal networks, this model is a direct generalisation of the static DCSBM, with a likelihood composed of two separate static and dynamic Summary of inference results for empirical temporal networks.Description length Σ = − log 2 P ({(i, j)t}, c, b) in bits as well as inferred number of node groups C, token groups BN, and memory groups BM for different data sets and different Markov order n (see Methods section Datasets).The value −∆Σ ≤ ln P ({x t }|{x * t }, b * ) is a lower-bound on the predictive likelihood of the validation set {x t } given the training set {x * t } and its best parameter estimate.Values in grey correspond to the minimum of each column.

Table III .
Joint likelihoods for discrete-and continuous-time Markov models.Results inferred from Beethoven's fifth symphony for different Markov orders n.Values in grey correspond to the maximum likelihood of each column.
This dataset corresponds to a sample of flight itineraries in the US during 2011 collected by Bureau of Transportation Statistics of the United States Department of Transportation (http://www.transtats.bts.gov/).The dataset contains 1, 272, 696 itineraries of varied lengths (airport stops).We aggregate all itineraries into a single sequence by concatenating the individual itineraries with a special separator token that marks the end of a single itinerary.There are 464 airports in the dataset, and hence we have an alphabet of N = 465 tokens, and a single sequence with a total length of 83, 653, 994 tokens.War and Peace: This dataset corresponds to the entire text of the english translation of the novel War and Peace by Leo Tolstoy, made available by the Project Gutenberg (extracted verbatim from https://www.gutenberg.org/cache/epub/2600/pg2600.txt).This corresponds to a sequence with an alphabet of size N = 84 (including letters, space, punctuation and special symbols) and a total length of 3, 226, 652 tokens.