Superstatistical analysis and modelling of heterogeneous random walks

Stochastic time series are ubiquitous in nature. In particular, random walks with time-varying statistical properties are found in many scientific disciplines. Here we present a superstatistical approach to analyse and model such heterogeneous random walks. The time-dependent statistical parameters can be extracted from measured random walk trajectories with a Bayesian method of sequential inference. The distributions and correlations of these parameters reveal subtle features of the random process that are not captured by conventional measures, such as the mean-squared displacement or the step width distribution. We apply our new approach to migration trajectories of tumour cells in two and three dimensions, and demonstrate the superior ability of the superstatistical method to discriminate cell migration strategies in different environments. Finally, we show how the resulting insights can be used to design simple and meaningful models of the underlying random processes.

, these statistical measures are computed by averaging certain quantities of interest over the time index t. An additional average may be performed over an ensemble of different measured time series s. In particular, the SWD in x-direction is defined as and the MSD is defined as In the case of an uncorrelated random walk, the SWD is a Gaussian distribution with respect to ∆x, with a variance that increases linearly as a function of the lag-time ∆t. The MSD also increases linearly for all lag-times. For a homogeneous correlated random walk with correlation time τ , the behavior of the SWD and MSD depend on the lag-time ∆t. In the regime of short lag-times, ∆t τ , the SWD is non-Gaussian and the MSD grows quadratically (ballistic motion). In the regime of long lag-times, ∆t τ , the SWD is Gaussian and the MSD grows linearly (diffusive motion), just as in an uncorrelated random walk. The correlation time τ thus defines the cross-over time between the slope 2 and slope 1 in a double-logarithmic plot of the MSD versus lag-time (colored lines in Supplementary Figure 1A). Step width distributions (SWD, colored lines) at a sufficiently long lag-time ∆t 0 are Gaussians with a variance proportional to D s . Averaging over an ensemble of narrow and wide normal distributions results in a leptocurtic distribution (black dashed line).
Next, we consider a heterogeneous ensemble R (s) t of temporally homogeneous correlated random walks, each with a different correlation time τ s . In this case, the SWD in the regime of long lag-times is an average over Gaussians of different variance (colored curves in Supplementary Figure 1B), which results in a leptocurtic and therefore anomalous distribution (black dashed curve in Supplementary Figure 1B). Similarly, for a sufficiently wide and dense distribution of τ s , the individual cross-over times between ballistic (slope 2) and diffusive (slope 1) regimes average out, and the MSD assumes an anomalous shape. In particular, it can be shown that heterogeneity of this kind can lead to an almost exponential SWD and a MSD that resembles a power law with a fractional exponent (dashed line in Supplementary Figure 1A). For an analytical derivation, see arXiv:1207.2242).
Finally, note that a temporally heterogeneous random walk in which the correlation time τ varies slowly as a function of time t has a similar effect as the heterogeneous ensemble above. This demonstrates that anomalous features emerge naturally in the SWD and MSD if these statistics are applied to heterogeneous random walks.

Random walk models in the superstatistical framework
In a superstatistical framework, random time series are described, on the lowest hierarchical level, by a simple stochastic 'base model' with a small set of parameters. In this paper, we have chosen the AR-1 process with a persistence parameter q and an activity parameter a. Keeping these superstatistical model parameters constant corresponds to a temporally homogeneous random process. For example, the AR-1 process with constant parameters q and a corresponds to a homogeneous persistent random walk (Supplementary Figure 2).
In order to describe more complex time series, in particular measured random walks that show obvious temporal heterogeneity, it is possible to resort to higher order base models, such as the class of AR-n processes with n > 1. However, this approach is misguided in cases where the temporal heterogeneity is caused by time-or positiondependent external influences that cannot be easily incorporated into the base model. A standard example of a random walk that becomes heterogeneous due to external effects is a particle diffusing in a medium with different local temperatures.
The superstatistical framework acknowledges external time-dependent effects and models them directly as a higher level stochastic process that affects the parameters of the base model. In our case, the two parameters q t and a t of the AR-1 process are allowed to change at any time step, according to some higher order process. By this way, heterogeneous random walks of arbitrary complexity can be described by a hierarchy of low-order stochastic models. A simple example that highlights externally controlled heterogeneity is the intermittent random walk, in which a particle switches randomly between phases of ballistic and diffusive motion. This heterogeneous process can be described by a two-level superstatistical model with an AR-1 process as the base model and a Markov switching process as the second level model (Supplementary Figure 3).

Supplementary Note 3
Step width distributions In the main part of the paper, we have argued that the traditional, globally averaging statistical measures, such as the mean squared displacements (MSD) and the step width distributions (SWD), are often masking important differences between distinct types of random walks. In particular, we have shown that the SWD of MDA-MB-231 breast carcinoma cells migrating in a collagen network and on a plastic dish are practically indistinguishable on a time scale of 120 min. Here, we demonstrate that the SWDs in the two environments are also nearly identical for all lag times between 5 min and 500 min (Supplementary Figure 4, blue and red lines). The form of the distributions remains approximately exponential for the whole range of lag-times. In contrast, the SWD for migration on fibronectin coated surfaces (green lines) is significantly different from the other two environments for all lag times larger than 10 min. Step width distributions (SWDs). Shown are SWDs (for six different lag-times) of MDA-MB-231 breast carcinoma cells in a collagen network (blue), on uncoated plastic (red) and on plastic coated with the adhesive ligand fibronectin. The distributions in collagen and on plastic are similar for all lag-times, although the migration mechanism of the cells is drastically different in these two environments.

Comparison of Bayesian parameter inference with sliding window-based Maximum Likelihood estimation
In the main text, we have shown that the presented Bayesian method yields accurate results for simulated randomwalk trajectories with abruptly switching (Fig. 3a,b) as well as linearly changing parameter values (Fig. 3c,d).
Here, we provide an additional test case by combining both abrupt and gradual, sinusoidal parameter changes (Supplementary Figure 5). Based on these three test cases (regime-switching, linear, and sinusoidal), we compare our Bayesian method to an alternative inference method for heterogeneous random walks, namely the Maximum Likelihood estimation using a sliding window. The time-averaged posterior distribution corresponding to a single trajectory reveals the complex correlations between the two parameters, persistence and activity.

Sliding window analysis
The standard method for inferring time-dependent parameter changes in time series is the sliding window method, whereby the time series is subdivided into overlapping segments. For each segment, the parameters (activity and persistence in our case) are assumed to be constant and can be estimated using a maximum likelihood approach. Considering a homogeneous AR(1)-process for times t within the interval I t = {t − w/2, ..., t + w/2} of length w, centered around some point in time t, with persistence q t and activity a t , one can state the log-likelihood for these parameters as follows: Here, we consider a two-dimensional trajectory. Maximizing the log-likelihood with respect to q t and a t yields the following Maximum Likelihood estimators: where ( . ) denotes the dot product. Note that the estimator of a t depends on the estimated value of q t . To analyze a heterogeneous time series using these estimators, the time series is partitioned into overlapping segments of length w -thus the term 'sliding window'. This segmentation allows us to estimate the local persistence and activity over the entire trajectory.

Method comparison
We use the mean squared error (mse) of the estimated parameter sequences with respect to the true parameter sequences to assess the quality of the two methods: where N denotes the length of the estimated parameter sequence. The ratio of the errors between both methods allows for a direct comparison: Supplementary Figure 6: Comparison of Bayesian inference with the Maximum Likelihood (ML) approach. Based on the mean squared error ratio, the Bayesian method is directly compared to the ML approach for a variety of window sizes. With an average mse ratio smaller than one for all three test cases, the Bayesian method is found to be superior, regardless of the window size for the ML approach.
We estimate r for different window widths ranging from 3 to 200 data points. For all three test cases (regimeswitching, linear, and sinusoidal), and regardless of the chosen width of the sliding window, the Bayesian algorithm attains a smaller error on average (r < 1, see Supplementary Figure 6). Apart from the smaller estimation error, the Bayesian approach provides an advantage especially for small data sets. For the sliding window analysis, no parameters can be estimated during a time period of w/2 at the beginning and the end of the time series, due to the finite width of the sliding window. By contrast, the Bayesian method yields N − 1 estimates for N data points. Furthermore, the assumption of constant parameters q and a within each window of length w is an inherent limitation of the sliding window approach, since it is violated for most heterogeneous time series. Finally, the choice of the window size w strongly affects the resulting reconstruction of the time-varying parameters. For small window sizes, the Maximum Likelihood method can detect sudden parameter changes but shows large, erroneous fluctuations in the estimated parameter sequence (Supplementary Figure 7,

Supplementary Note 5
Choice of the kernel parameters p min and R.
In the main text, we have presented a sequential Bayesian method to reconstruct the superstatistical parameters q t and a t from a measured random walk time series. In each time step, the posterior distribution P t (q, a) is given a minimum probability p min = 10 −7 , which accounts for the possibility of abrupt parameter jumps to any value within the grid. Furthermore, the posterior distribution is convolved with a box kernel of radius R = 0.03, which assigns small (gradual) parameter changes a larger probability compared to large jumps.
In general, the two control parameters p min and R do not require a precise fine tuning, but setting them outside a certain viable range leads to predictable problems: In general, too small values for the control parameters lead to a blurring of rapid parameter changes, whereas too large values lead to overfitting. We illustrate the effects of wrongly chosen control parameters p min and R in Supplementary Figures 8 and 9. Optimum values for the two control parameters p min and R can be found systematically by generating synthetic data sets that resemble the actual data and by quantifying how well the known superstatistical parameter changes of (q t , a t ) are reconstructed by the method. The values for p min and R given in the paper have been obtained in this way.
If the dynamics of the superstatistical parameters (q t , a t ) is entirely unknown, the generation of realistic synthetic data sets is impossible. In such cases, the choice of the control parameters p min and R can be viewed as a Bayesian model selection problem. Using measured data only, the plausibility of a given set (p min , R) can then be quantified by the global marginal likelihood L: For each time step t, the momentary marginal likelihood L t is obtained by integrating the likelihood function over the (q t , a t )-plane. This integral can be evaluated readily within our grid based approach. The global marginal likelihood L is then given by the product of all L t . The optimal set of control parameters, given the measured data, can then be found by maximizing L.

Persistence time and cell speed
In a superstatistical data analysis, the parameters q t and a t used in this work can be replaced by different variables that may be more appropriate for specific questions. For example, persistence can also be quantified by a persistence time τ t , defined as the half decay time of the velocity correlations. Assuming 0 < q t < 1, and measuring the correlation time in units of the sampling time, τ t is monotonically increasing with q t , according to Furthermore, rather than investigating superstatistical parameter correlations between q t and a t , new insights can also be gained from the 'inter-level' correlations between superstatistical parameters and directly measured variables, such as the momentary cell speed u t .
The relationship between cell speed, activity and cell persistence can be seen most easily in the case of a homogeneous AR-1 process in 2D. Assuming 0 ≤ q t < 1, the momentary speed u t is then Rayleigh distributed with the most probable valueû t being given byû In a hypothetical situation with a = constant, there is always a positive correlation betweenû t and q t (or, equivalently, between u t and τ t ). However, in the case of migrating cells, the situation is more complicated as the activity a t is also a strongly time-varying quantitiy that may be correlated positively or negatively with the persistence q t . On uncoated plastic, where persistence is considerably lower, τ t also increases with u t , but without clear signs of saturation (Supplementary Figure 11