Abstract
Scientific inference involves obtaining the unknown properties or behavior of a system in the light of what is known, typically without changing the system. Here we propose an alternative to this approach: a system can be modified in a targeted way, preferably by a small amount, so that its properties and behavior can be inferred more successfully. For the sake of concreteness we focus on inferring the future and past of Markov processes and illustrate our method on two classes of processes: diffusion on random spatial networks, and thermalizing quantum systems.
Introduction
Much of science revolves around inference, reconstructing the unknown from what is known1,2,3. Observable patterns here and now inform us of inaccessible patterns out and away. For example, using inferential techniques, one can reconstruct the history of life from available fossils4,5,6, or predict the fate of the universe by observing the present night sky7,8,9; one can infer hidden states and transition probabilities10,11,12, connections and weights of neural networks13,14,15 or parameters, initial states and interaction structures of complex systems16,17,18,19,20,21,22,23,24,25.
Ordinarily, inference is a passive, non-disruptive process. Unlike engineering, natural sciences are motivated by knowing nature, rather than changing it. However, knowing and changing are not necessarily mutually exclusive. Earlier, it was established that attempting to describe and predict a system can inadvertently influence it, potentially even rendering it indescribable and unpredictable26,27. Here we study the converse case of how the intrinsic properties of a system can be purposefully modified so that its past or future is more inferable.
A number of authors addressed the problem of predicting the future and retrodicting the past of a stochastic process28,29,30,31. In this study, we are concerned not with finding strategies or algorithms to predict or retrodict stochastic systems, but rather with optimally modifying systems so that their predictability or retrodictability increases.
An engineer might use control theory to balance a bipedal robot, stabilize the turbulent flows surrounding a wing, or maximize the signal to noise ratio in an electric circuit32. Here we do the same, but optimize the susceptibility of a system to the inquiry of its past and future.
Forward in time, the entropy associated with the probability distribution of the system state will increase monotonically, as per H-theorems33,34,35. A similar trend also holds backwards in time31. Here we determine how transition rates should be perturbed infinitesimally so as to minimize the generation of inferential entropy in either temporal direction. After establishing a general theoretical framework, we apply these ideas to two specific example systems. The first is a diffusion process taking place on a spatial random network. The second is a quantum harmonic oscillator with a time-dependent temperature.
Results
Quantifying predictability and retrodictability
The past and future of a stochastic system with a concentrated, sharply peaked probability distribution can be inferred with high certainty. Accordingly, we use the Gibbs–Shannon entropy to quantify the inferability of a system36, and later on show that this indeed is a good measure. Given a stochastic process, Xt, characterized by its transition matrix Tα(ω) = Pr(Xt = ω|X0 = α), and initial state α, the entropy of the process at a final time t is
When Xt is the state of a thermodynamic system, this is the standard thermodynamic entropy. In the present information-theoretical context, we refer to ST as the “prediction entropy”.
Naturally, the average entropy generated by a process depends on how it is initialized—the prior distribution P(0). To characterize the process itself, we marginalize over the initial state, α, \(\langle S_{\mathrm{T}}\rangle = \mathop {\sum}\limits_\alpha {P^{(0)}} (\alpha )S_{\mathrm{T}}(\alpha ),\) where P(0)(α) is the probability of starting at α. Likewise, we quantify the retrodictability of a process by a “retrodiction entropy”, \(\langle S_{\mathrm{R}}\rangle = \mathop {\sum}\limits_\omega {P^{(t)}} (\omega )S_{\mathrm{R}}(\omega )\). Here, Rω(α) is the probability that the system started in state α given that the observed final state was ω, SR is its entropy analogous to Eq. (1), and P(t)(ω) is the probability that the process is in state ω at time t unconditioned on its initial state.
Interestingly, the predictability and retrodictability of a system are tightly connected: Since ST and SR are related by Bayes’ theorem, \(R_\omega (\alpha ) = T_\alpha (\omega )P^{(0)}(\alpha )/\mathop {\sum}\limits_{\alpha \prime } {T_{\alpha \prime }} (\omega )P^{(0)}(\alpha \prime )\), it follows that 〈ST〉 and 〈SR〉 are also related31,
where S0 is the entropy of the prior probability distribution P(0), and St is the entropy of \(P^{(t)}(\omega ) = \mathop {\sum}\limits_\alpha {P^{(0)}} (\alpha ){\mkern 1mu} T_\alpha (\omega )\).
We use 〈ST〉 and 〈SR〉 to measure how well we can predict the future and retrodict the past of a stochastic process. The higher the entropies, the less certain the inference will be.
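For a finite-state chain these quantities are straightforward to compute. The following sketch (our own NumPy illustration; the function names are not from the paper) evaluates ⟨S_T⟩ and ⟨S_R⟩ from the transition matrix and the prior, together with the entropies S0 and St of the prior and final distributions that enter the Bayesian identity of Eq. (2):

```python
import numpy as np

def shannon(p):
    """Gibbs-Shannon entropy, with the convention 0 log 0 = 0."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mean_entropies(T, p0, t):
    """Average prediction and retrodiction entropies of a Markov chain.
    T[j, i] is the one-step probability of j -> i; p0 is the prior."""
    Tt = np.linalg.matrix_power(T, t)          # t-step transition matrix
    pt = p0 @ Tt                               # final distribution P^(t)
    # <S_T>: entropy of each forward distribution, averaged over the prior
    S_T = sum(p0[a] * shannon(Tt[a]) for a in range(len(p0)))
    # Bayes: R_w(a) = Tt[a, w] p0[a] / pt[w]; column w is the posterior over a
    R = Tt * p0[:, None] / np.where(pt > 0, pt, 1.0)
    # <S_R>: entropy of each posterior, averaged over the final distribution
    S_R = sum(pt[w] * shannon(R[:, w]) for w in range(len(pt)))
    return S_T, S_R, shannon(p0), shannon(pt)
```

On any chain and prior, the outputs satisfy the relation of Eq. (2), ⟨S_R⟩ = ⟨S_T⟩ + S0 − St, which follows directly from substituting Bayes' theorem into the definition of ⟨S_R⟩.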
Variations of Markov processes
In a Markov process, the state of a system fully determines its transition probability to other states. Markov processes accurately describe a number of phenomena ranging from molecular collisions through migrating species to epidemic spreads37,38,39,40,41.
Consider such a Markov process defined by a transition matrix T, with elements Tji, which we will visualize as a weighted network. We assume that we know the transition rates and prior distribution over states at t = 0 with perfect accuracy, but do not know what state the system is in, except at the initial (final) time. From this data we will predict (retrodict) the final (initial) state of the system.
A system initialized in state i with probability \(P_i^{(0)}\), upon evolving for t steps, will follow a new distribution \(P_i^{(t)} = \mathop {\sum}\limits_j {P_j^{(0)}} (T^t)_{ji}\). Accordingly,
Thus both entropies depend on the duration of the process t. Note that probability is normalized \(\mathop {\sum}\limits_i {(T^t)_{ji}} = 1\) for all j, t.
Suppose that it is somehow possible to change the physical parameters of a system slightly, so that the probability of transitions are perturbed, \(T_{ji} \to T_{ji}^\prime = T_{ji} + \epsilon {\mkern 1mu} q_{ji}\), where \(\epsilon\) is a small parameter. For now, we do not assume any structure on q, other than implicitly demanding that it retains probabilities within [0, 1] and preserves the normalization of rows. This variation leads to a change in the t-step transition matrix,
where ξk = qTk. The superscripts of η(t,p) refer to the power of the transition matrix, t, and the order of the contribution, p, which is analogous to the order of the derivative of a function. So η(t,p) is the p-th order contribution to the varied t-step transition matrix. This defines a set of p-th order effects for the t-th power of the transition matrix. In what follows we study only first variations; therefore, we will only need
The difference between the entropies of the perturbed and the original systems is Δ〈ST,R〉 ≡ 〈ST,R(T′)〉 − 〈ST,R(T)〉. Whenever Δ〈ST,R〉 is of order \(\epsilon\) and higher, we can evaluate the variation
which in essence is the derivative of 〈SR〉 or 〈ST〉 in the q “direction”.
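The first-order term η(t,1) can be checked concretely: expanding (T + εq)^t and collecting terms linear in ε gives the sum of T^k q T^(t−1−k) over the t possible positions of the single factor q. The sketch below (our own illustration) computes this sum and compares it against a finite difference:

```python
import numpy as np

def eta_t1(T, q, t):
    """First-order contribution to (T + eps*q)^t: one factor q inserted
    at each of the t possible positions among the factors of T."""
    P = [np.linalg.matrix_power(T, k) for k in range(t)]
    return sum(P[k] @ q @ P[t - 1 - k] for k in range(t))
```

Here q must have zero row sums so that T + εq stays row-normalized; the finite difference [(T + εq)^t − T^t]/ε then agrees with eta_t1 up to O(ε).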
With a little algebra, we can show that the first-order perturbations of the t-step entropies 〈SR〉 and 〈ST〉 are
The Kronecker functions \({\mathbb{1}}_{{\rm{T}}}\) and \({\mathbb{1}}_{\mathrm{T}}^c\) which implicitly depend on the indices i, j, and the time, t, are defined to be \({\mathbb{1}}_{{\rm{T}}}\) = 0 if (Tt)ji = 0 and to equal 1 otherwise, and \({\mathbb{1}}_{\mathrm{T}}^c = 1 - {\mathbb{1}}_{\mathrm{T}}\).
While these equations ostensibly require a summation over all paths of the system, path integration in this setting is simply matrix multiplication, e.g., Tt. This allows the calculation to be easily defined and accomplished in polynomial time, \({\cal{O}}(n^3\log t)\). The sums over states in Eqs. (7) and (8) are also of polynomial complexity, \({\cal{O}}(n^2)\).
As we see, the \(\epsilon {\mathrm{log}}\epsilon\) terms can cause the limit, Eq. (6), to diverge, producing a sharp, singular change in entropy generation. This is expected: the divergence happens only when the perturbation enables a path between two states where there was none. Indeed, (Tt)ji = 0 only if i could not be reached from j in t steps, and if this remains true after the perturbation, the ηji term is zero.
On the other hand, if the perturbation does not enable a path between two isolated states, but preserves the topology of the transition matrix, then Eq. (7) simplifies considerably; the divergent \({\mathbb{1}}^{c}\) terms vanish, and we take the limit,
Having established a very general theoretical framework, we now implement these ideas on two broad classes of stochastic systems for which the structure of the perturbing matrix q is specified further. We first consider random transition matrices drawn from a matrix ensemble. Second, we study a physical application—we enhance the predictability and retrodictability of thermalizing quantum mechanical systems by means of an external potential.
Improving the inferability of Markov processes
We start by studying a general class of perturbations that can be applied to an arbitrary Markov process, and evaluate the associated entropy gradient, which can be thought of as the direction in matrix space that locally changes 〈SR〉 or 〈ST〉 the most (Fig. 1). As we climb up or down the entropy gradient, we show how the transition matrix evolves (Fig. 2).
We consider a family of perturbations that vary the relative strength of any transition rate. This involves changing one element in the transition matrix while reallocating the difference to the remaining nonzero rates so that the total probability remains normalized. In other words,
To first order in \(\epsilon\), this is the same as adding \(\epsilon\) to the (i, j) element, and then dividing the row by \(1 + \epsilon\) to normalize, so it is a natural choice for a perturbation operator. It also obeys \(\Delta ^{(\epsilon )}\Delta ^{( - \epsilon )}{\mathbf{T}} = {\mathbb{1}} + {\cal{O}}(\epsilon ^2)\). We define the perturbation acting on a zero element to be zero if \(\epsilon < 0\) since elements of the transition matrix must be non-negative. From Eqs. (10) and (4), we obtain the perturbed matrices and perturbed 〈SR〉, 〈ST〉.
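In code, the operator of Eq. (10) takes a particularly simple first-order form (our own sketch; we index the j → i rate as T[j, i]):

```python
import numpy as np

def perturb(T, j, i, eps):
    """Eq. (10) to first order: add eps to T[j, i], then renormalize
    row j by 1/(1 + eps).  Acting on a zero element with eps < 0 is
    defined to do nothing, so rates stay non-negative."""
    if eps < 0 and T[j, i] == 0.0:
        return T.copy()
    Tp = T.copy()
    Tp[j, i] += eps
    Tp[j] /= 1.0 + eps
    return Tp
```

Applying the operator with strength +ε and then −ε recovers the original matrix up to O(ε²), matching the stated property \(\Delta ^{(\epsilon )}\Delta ^{( - \epsilon )}{\mathbf{T}} = {\mathbb{1}} + {\cal{O}}(\epsilon ^2)\).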
To study the effect of successive perturbations of the form Eq. (10), we carry out a gradient ascent algorithm in matrix space. At each iteration, we change the transition rates infinitesimally to maximally increase or decrease retrodiction or prediction entropy. We parameterize the gradient ascent by L2 distance in matrix space, i.e., \({\mathrm{d}}(A,B) = \left\| {A - B} \right\|_2 = \left[ {\mathop {\sum}\limits_{i,j} {(A_{ji} - B_{ji})^2} } \right]^{1/2}\).
In a gradient descent algorithm, one descends a function f(r) over a parameter t (time) by solving \({\dot{\mathbf{r}}} = \nabla f({\mathbf{r}})/\left\| {\nabla f({\mathbf{r}})} \right\|\), where the normalization ensures that ∥∂tr(t)∥ = 1, so the total distance traveled along the path r(t) is just t.
Similarly, we define our gradients to be either Δji〈SR(T)〉 or Δji〈ST(T)〉, depending on whether we are optimizing retrodiction or prediction. We parameterize our path, T(λ), so that the total distance of the path (in L2 matrix space) is λ,
Since we carry out this scheme numerically, using a finite difference method, it does not matter whether the limit in Eq. (11) exists; when it does not, our numerical scheme simply returns large, finite jumps.
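A minimal numerical version of one step of this scheme (our own sketch: finite-difference gradients, the row-renormalizing perturbation of Eq. (10), and prediction entropy as the objective; step sizes are illustrative) looks like:

```python
import numpy as np

def perturb(T, j, i, eps):
    # Eq. (10): add eps to T[j, i], renormalize row j by 1/(1 + eps)
    if eps < 0 and T[j, i] == 0.0:
        return T.copy()
    Tp = T.copy()
    Tp[j, i] += eps
    Tp[j] /= 1.0 + eps
    return Tp

def mean_prediction_entropy(T, p0, t):
    Tt = np.linalg.matrix_power(T, t)
    logTt = np.zeros_like(Tt)
    np.log(Tt, out=logTt, where=Tt > 0)        # convention 0 log 0 = 0
    return -np.sum(p0[:, None] * Tt * logTt)

def descent_step(T, p0, t, eps=1e-4, dlam=0.02):
    """One finite-difference gradient-descent step on <S_T> in matrix
    space; the step has total length ~dlam in the L2 sense."""
    n = T.shape[0]
    S0 = mean_prediction_entropy(T, p0, t)
    G = np.empty_like(T)
    for j in range(n):
        for i in range(n):
            G[j, i] = (mean_prediction_entropy(perturb(T, j, i, eps),
                                               p0, t) - S0) / eps
    G /= np.linalg.norm(G)                     # unit-speed parameterization
    Tp = T.copy()
    for j in range(n):                         # apply all operators; order
        for i in range(n):                     # matters only at O(dlam^2)
            Tp = perturb(Tp, j, i, -dlam * G[j, i])
    return Tp
```

Iterating descent_step traces out a path T(λ); ascending (to degrade inferability) uses +dlam instead of −dlam.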
To illustrate our formalism in action, we solve Eq. (11) for a particular example system: diffusion taking place on a directed spatial random network. We build a spatial network such that neighboring nodes are placed at regular intervals on a circle, and are also cross connected with probability \(P(S_{ji} = 1) = {\mathrm{e}}^{ - \beta {\kern 1pt} d_{ij}}\), which decays with distance dij42. The transition matrix T is obtained by normalizing the rows of S. For our prior, we use a uniform distribution over all states.
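The ensemble can be generated as follows (our own sketch; where the text leaves details open, the exclusion of self-loops and the guaranteed nearest-neighbour ring edge are our assumptions):

```python
import numpy as np

def spatial_random_network(n=30, beta=0.5, seed=0):
    """Directed spatial random network on a ring: nodes sit at regular
    intervals on a circle; S[j, i] = 1 with probability exp(-beta*d_ij),
    d_ij being the circular distance.  Rows of S are normalized to give
    the transition matrix T."""
    rng = np.random.default_rng(seed)
    idx = np.arange(n)
    d = np.abs(idx[:, None] - idx[None, :])
    d = np.minimum(d, n - d)                   # circular (ring) distance
    S = (rng.random((n, n)) < np.exp(-beta * d)).astype(float)
    np.fill_diagonal(S, 0.0)                   # assumption: no self-loops
    S[idx, (idx + 1) % n] = 1.0                # assumption: keep ring edges
    return S / S.sum(axis=1, keepdims=True)    # row-normalize -> T
```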
An example is shown in Fig. 2, where the 3-step (t = 3) predictability and retrodictability change as the transition matrix is perturbed iteratively.
Snapshots of the perturbed matrix, for different λ values, when predictability is extremized can be seen in Fig. 2a–d, represented as networks with edge thicknesses proportional to the transition rates. Similarly, sample transition networks when retrodictability is optimized can be seen in Fig. 2e–h. The behavior of the prediction and retrodiction entropies as λ is varied is shown in Fig. 2i, j, where dotted lines mark the λ values corresponding to the networks in Fig. 2a–h. We observe that inferential success can be improved by up to 10–20% with only minor changes in the network structure. This is quantified in further detail below.
We now interpret our results to ensure that our theoretical framework makes qualitative sense and works as expected. First, we observe that perturbations that maximize both 〈ST〉 and 〈SR〉 displace the transition matrix toward the same point: in both cases T evolves to a point where (Tt)ji = pi, a probability vector. In other words, the probability of transition does not depend on what state the system is currently in. Taking the 3rd power of the T matrix for large values of λ confirms that this is indeed the case, although, of course, T itself can retain some complex structure. As expected, when a system moves from any state to any other state with equal likelihood, it is most difficult to infer its past or future.
In contrast, minimizing entropy produces two very different transition matrices, for λ ≪ 0, depending on the type of entropy we minimize. The global minima of the prediction entropy are transition matrices in which all probability in each connected component flows toward a single node, reachable in t steps. A process where the initial state uniquely determines the final state is indeed trivial to predict. As we increase the number of time steps over which we optimize predictability, the number of “layers” of the transition network increases (Fig. 3a, b).
On the other hand, minimizing retrodiction entropy tends to eliminate branches and fragment the network into linear chains (including isolated nodes). The start of this process can be seen in Fig. 2e. When the splitting is complete, probability flows through these fragments unidirectionally, thus retrodiction involves nothing more than tracing back a linear path (Fig. 3c, d).
This also explains why 〈SR〉 tends to stay the same in the λ < 0 direction when minimizing 〈ST〉. If St = 〈ST〉 = 0, then Eq. (2) implies 〈SR〉 = S0, which is the maximum possible value for 〈SR〉. This can also be understood intuitively: if, whenever the final measurement is made, the system is always found in a unique accumulating state, the measurement yields no information about what state the system started in. If, however, the minimal 〈ST〉 network instead has multiple connected components and collector nodes, {kj}, then there can be a decrease in 〈SR〉 since \(R_{k_0}\), \(R_{k_1}\),… are different distributions.
So far, we have only extremized entropy, but have not shown that this leads to a significant difference in our ability to infer the past or future. We will do so by reporting how often, on average, we can identify the correct initial (final) state of the system, given the final (initial) state. Note that this metric is not extensive; with an increasing number of states, the probability mass for even the “best guess” approaches zero. Nevertheless, we adopt this stringent metric. For both predicting the final state and retrodicting the initial state, we perform maximum likelihood inference; we pick the state with highest probability, conditioned on the observed final or initial state, to be our guess. From the transition probability, Tji, and retrodiction probability, Rji, we can calculate the probability that our guess at the initial or final state will be correct (cf. “inference performance” in the Methods section).
We plot how inference performance changes as we manipulate transition rates in Fig. 4. The transition matrices used in this test are the same as those shown in Fig. 2. The success rates of predicting final states and retrodicting initial states while optimizing either 〈SR〉 or 〈ST〉 are plotted. Since there are 30 states in our network, the baseline accuracy is 1/30 = 3.3%, which is marked with a dashed gray line. Our success rate aligns well with the entropy in Fig. 2i. If we were to continue further along λ, we would reach almost 100% accuracy when minimizing 〈ST〉 or 〈SR〉.
The improvement in retrodictability always lags behind predictability. This is because 〈SR〉 must be greater than 〈ST〉, as per Eq. (2).
Naturally, descending an entropy landscape all the way returns transition matrices with trivial structure and dynamics. In our diffusion example, one could have guessed from the beginning that a network with only inward branches, or one with disconnected linear chains, would be much more predictable than an all-to-all network with equally distributed weights. However, our formulation is useful not because it eventually transforms every network into a trivial one, but first, because it provides the steepest direction toward a trivial network, and second, because, among the many trivial networks, it moves us toward the closest one. Thus, we must determine the effectiveness of small perturbations well before the system turns into a trivial one.
Indeed, we find that significant differences in inferential success can be achieved with relatively small changes to the transition matrix. Table 1 quantifies how much the transition matrix has been modified versus how much our retrodictive (top three rows) and predictive (bottom three rows) success has improved. For example, the fifth row shows that if we would like to be spot-on correct in predicting the final state of a stochastic process with 30 states and 900 transitions, our success rate can be improved by ~5% by modifying only 8 out of 900 transition rates by more than 0.1, with none changing by more than 0.2. The cumulative change in all transition rates for this perturbation totals 4.34, the equivalent of adding about four edges. The changes required to improve our success rate by 10% are not much larger (Table 1).
As a final point of interest, we see that for all the ±5% in Table 1, λ and the L2 distance are almost identical. This means that to get from the initial matrix to the perturbed matrices, one could follow the gradient calculated at the initial matrix in a straight line—the path is roughly straight in matrix space for at least that distance.
Improving the inferability of quantum systems via external fields
In a physically realistic scenario, it is unlikely that one has full control over individual transitions. An experimentalist can only tune physical parameters, such as external fields or temperature, which influence the transition matrix indirectly. Furthermore, it is often not practical to vary physical parameters by arbitrarily large amounts. Thus, ideally, we should improve predictability and retrodictability optimally while only applying small fields.
To meet these goals, we consider a class of quantum systems in or out of equilibrium with a thermal bath. These systems are fully characterized by eigenstates ψ1, …, ψn with energies E1, …, En, undergoing Metropolis–Hastings dynamics43 in which the system attempts to transition to the energy level above or below with equal probability; an attempted decay always succeeds, while an attempted excitation succeeds with probability exp[−β(Ek+1 − Ek)].
Furthermore we assume that the ground state E0 cannot decay, and the highest state En is unexcitable. For the regime of validity of Markovian descriptions of thermalized quantum systems, we refer to refs. 44,45.
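Under these rules the transition matrix is tridiagonal. A sketch (our own code, for arbitrary energy levels) that builds it and verifies detailed balance with respect to the Boltzmann distribution:

```python
import numpy as np

def metropolis_matrix(E, beta):
    """Metropolis-Hastings transition matrix over energy levels E[0..n-1]:
    with probability 1/2 attempt a move up or down; decays are always
    accepted, excitations accepted with prob exp(-beta*(E[k+1]-E[k])).
    The bottom level cannot decay, the top level cannot be excited."""
    n = len(E)
    T = np.zeros((n, n))
    for k in range(n):
        if k > 0:
            T[k, k - 1] = 0.5                          # decay always succeeds
        if k < n - 1:
            T[k, k + 1] = 0.5 * np.exp(-beta * (E[k + 1] - E[k]))
        T[k, k] = 1.0 - T[k].sum()                     # failed attempts: stay
    return T
```

By construction the chain satisfies detailed balance, π_k T[k, k+1] = π_{k+1} T[k+1, k] with π ∝ e^(−βE), so its stationary state is the thermal equilibrium distribution.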
We now determine the effects of a small perturbing potential v(x). The perturbation will shift the energy levels, which changes the transition matrix, which in turn changes the average prediction and retrodiction entropies of the system. Our goal is to identify what perturbing potential would maximally change these entropies. Since we are concerned with the first order variation in entropy, it will suffice to also use first order perturbation theory to calculate energy shifts.
The perturbed k-th energy level is \(E_k = E_k^{(0)} + \epsilon \cdot \delta E_k\). When the perturbation is applied, the exponential terms in T change as
From this, we can find our first order change \(T_{ji}^\prime = T_{ji} + \epsilon {\mkern 1mu} q_{ji}\) in terms of the change in energy levels, δEk,
Now we will write the prediction and retrodiction entropy δ〈ST,R〉 variations as a functional of a perturbing potential, and then use calculus of variations to obtain the extremizing potential. For clarity, we will derive our equations in one dimension; the generalization to higher dimensions is straightforward.
We partition the spatial domain, Ω, into N intervals, [xi, xi+1), of width Δx and let our potential be a piecewise constant function of the form \(v(x) = \mathop {\sum}\limits_{i = 0}^{N - 1} {v_i} {\mkern 1mu} {\mathbb{1}}_{x \in [x_i,x_{i + 1})}.\) As N → ∞, the first order change in the k-th energy level is
since \({\int}_{x_i}^{x_{i + 1}} {v_i} |\psi (x)|^2\sim v_i|\psi (x_i)|^2\Delta x\). We substitute the δEs, Eq. (14), into Eq. (13) to get the q matrix,
we substitute this into Eq. (5) to get
and therefore,
where δ2〈ST,R〉/δxδv is Eq. (9) with \(\tilde \eta _{ji}^{(t)}(x)\) substituted in for \(\eta _{ji}^{(t)}\).
Last, we ensure the smallness of the perturbation by introducing a penalty functional, \(C[v] = \frac{1}{2}\gamma {\int} v (x)^2dx\) and ask what potential v(x) extremizes
We take a variational derivative with respect to v(x) and set it to zero to obtain the extremizing potential,
This vT,R is the external potential that extremizes the gradient of entropy minus the penalty functional.
Improving inferability for a thermalizing quantum oscillator
We can now ask what perturbing external field should be applied to a quantum harmonic oscillator that is in the process of warming up or cooling down, in order to improve its predictability or retrodictability. For this system \(V(x) = \frac{1}{2}m\omega ^2x^2\), and \(E_k = (k + \frac{1}{2})\hbar \omega\). The stationary eigenfunctions are \(\psi _k(x) = \frac{1}{{\sqrt {2^kk!} }}\pi ^{ - 1/4}\exp \left( { - \frac{{x^2}}{2}} \right)H_k(x)\) where Hk is the k-th Hermite polynomial, \(H_k(x) = ( - 1)^k{\mathrm{e}}^{x^2}\frac{{d^k}}{{dx^k}}{\mathrm{e}}^{ - x^2}\). For concreteness, we must also choose a prior distribution on states; we take it to be an equilibrium distribution at a (possibly different) temperature, \(P_k \propto {\mathrm{e}}^{ - \beta _2E_k}\). We truncate the transition matrix at an energy En ≫ 1/β1, 1/β2 so that edge effects are negligible. We take m = ħ = ω = 1, and choose U to be the negative of Eq. (18) so that adding it to V(x) decreases the corresponding entropy and increases inference performance.
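The eigenfunctions and first-order shifts are easy to evaluate numerically. The sketch below (our own, using physicists' Hermite polynomials from NumPy and a plain Riemann sum on a uniform grid) computes δE_k = ∫ v(x)|ψ_k(x)|² dx for the oscillator in units m = ħ = ω = 1:

```python
import numpy as np
from math import factorial, pi
from numpy.polynomial.hermite import hermval

def psi(k, x):
    """Stationary state of the harmonic oscillator (m = hbar = omega = 1)."""
    c = np.zeros(k + 1)
    c[k] = 1.0                                  # select H_k
    norm = 1.0 / np.sqrt(2.0**k * factorial(k)) * pi**(-0.25)
    return norm * np.exp(-x**2 / 2) * hermval(x, c)

def delta_E(v, k, x):
    """First-order energy shift dE_k = integral of v(x)|psi_k(x)|^2 dx,
    evaluated on a uniform grid x by a Riemann sum."""
    dx = x[1] - x[0]
    return np.sum(v(x) * psi(k, x)**2) * dx
```

As a sanity check, the quadratic perturbation v(x) = x² gives δE_k = ⟨x²⟩_k = k + 1/2 in these units.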
The initial and final temperatures determine the flow of probability. The equilibrium distribution at a high temperature has much more probability mass at higher energy states than an equilibrium distribution at a low temperature, so if we start with a high temperature and quench to a low temperature, there will tend to be a flow of probability from high states to low states. The opposite will happen when we quench from a low to a high temperature. We use T = 1 as the low temperature, and T = 10 as the high temperature.
To ensure that each perturbing potential, U(x), actually increases or decreases predictability/retrodictability (depending on whether it is added to or subtracted from V(x)), we calculate the “Δ%” for retrodiction and prediction: the percent difference in how often we can correctly guess the initial or final state, upon perturbing the system. The performance is obtained similarly to that in Fig. 4 (cf. Methods section). The perturbation potential is normalized to up(x) = U(x)/∥U∥ so that the L2 norm of up(x) is 1, and the strength, λ, with which up is applied is varied so that the total potential is V(x) + λup(x).
Figure 5 shows some extremizing potentials for a system that was at one temperature, and is then suddenly quenched to a different temperature. Figure 5a shows optimal potentials for a system quenched from a high temperature to a low temperature while optimizing 〈SR〉 for t = 1, 3, and 5 time steps. Figure 5b, c shows the change in inference success as the potential is applied at varying strengths, λ. Figure 5d–f shows the same quantities (extremizing potential and change in inference) for a system at a low temperature quenched to a high temperature, while optimizing 〈SR〉. This potential is also optimal for 〈ST〉. Finally, Fig. 5g–i shows a high temperature system quenched to a low temperature while optimizing 〈ST〉.
To quantify how significantly the perturbations change the quantum system, we keep track of the L1 difference in eigenvalue spacing, i.e., \({\cal{S}} \equiv \mathop {\sum}\limits_{k = 1}^{30} {|E_k^\prime - E_{k - 1}^\prime - \hbar \omega |}\). The largest value \({\cal{S}}\) attains for any potential and applied strength shown in Fig. 5 is ~1.2. In other words, we can obtain a few percent change in the success rate by introducing a change to all energy levels that amounts, in total, to one level spacing. Note that this is a single-step perturbation along a single direction, rather than an iterated one.
This example illustrates how to combine real, physical, continuous quantities, such as perturbation potentials, with the more abstract formalism of evaluating the entropy of Markov transition matrices with discrete states. The general procedure we outlined in this section can also be applied to other thermal systems, quantum or otherwise.
Discussion
We developed a formalism to describe exactly how predictability and retrodictability change in response to small changes in a transition matrix, and used it to descend entropy landscapes to optimally improve the accuracy with which the past or future of a stochastic system can be inferred. Our main results are the equations relating perturbations of Markov processes to the change in the average prediction and retrodiction entropies of the system, Eqs. (4), (7), and (9).
We specifically focused on Markov processes, not only because they yield to mathematical analysis, but also because many important processes in physical, biological and social sciences are Markovian. That being said, the general principle outlined here can also be used in systems with memory, or in other inference problems such as the determination of unknown boundary conditions, system parameters, or driving forces.
As examples of manipulating predictability and retrodictability, we studied two specific types of perturbations, Eqs. (10) and (13), and used these to study how certain types of transition matrices evolve as they flow along the trajectory of maximal increase in retrodictability and predictability. We found that the transition networks tend to cull their connections and split into cycles and chains when we try to minimize retrodiction entropy. Conversely, the transition networks become fully connected when we attempt to maximize either inferential entropy. If one does not have full control over transition rates, one can steer a system toward the direction of either extreme by a small amount. Finally, as a physical example, we studied how to find the perturbing potential that extremally changes the predictability and retrodictability of a thermalizing quantum system.
Our formulas lead us to intuitive results such as the divergence of entropy generation when a path between two otherwise isolated states is enabled. However, they also lead us to less obvious conclusions, such as how predictability changes when retrodictability is optimized (and vice versa); or the shape of optimal potentials perturbing a thermalizing quantum system.
Our basic equations, Eqs. (4), (7), and (9), are very generally applicable to any discrete-time Markov process. The type of transition matrix perturbations we chose to study, namely those in Eqs. (10) and (13) are natural and practical choices, but of course, they are not the only two possibilities. For example, an operator that takes two matrix elements 0 < Tja, Tjb < 1 and “transfers” probability between them, changing them to \(T_{ja} + \epsilon ,T_{jb} - \epsilon\) would make an interesting future study.
In our work we observed an intriguing asymmetry between prediction and retrodiction. In particular, we observe that predictability is more easily improved than retrodictability. This is a byproduct of how we set up our problem: we took the initial distribution, the probability vector P(0), and the forward dynamics, T, as givens, and found the final probability, P(t), by propagating P(0) with T. Had we instead picked the final distribution P(t) and the backwards dynamics, \(\widetilde {\mathbf{T}}\), and obtained P(0) as the back-evolved distribution, our results would be reversed.
An experimenter only has control over the prior distribution at the current time, P(0), but cannot in general decide what distribution she wants at an arbitrary future time, P(t), and pick a P(0) that results in a specified P(t). The fact that we set up the problem so that t = 0 is the “controlled” time, and the state at the final time is the result of the choices made at t = 0, ultimately leads to the seeming emergence of an “arrow of time”46.
Since our method makes changes to a system to extremize the average of a function over a set of trajectories, it could well be considered within the domain of stochastic control theory47,48. However, there are various elements in our approach that depart from classical stochastic control, which typically deals with problems of the form
where Xt is the system trajectory, \(\hat \xi\) is a Wiener process, v is a control parameter, C is a cost function, and ϕ and R are the target cost and some function that quantifies cost-of-control, cost-of-space, cost-of-dynamics, etc49. The goal is to find the \(\tilde v\) that minimizes C.
One difference is that we do not restrict ourselves to a Wiener process, but allow any valid transition matrix. The control parameter, v, could be the perturbation to the original transition matrix, or it could be some other external parameter which indirectly results in a change in the transition matrix, as in the thermalizing quantum oscillator example.
The second difference is the structure of our cost function. In our case, the cost is an average weighted over priors. For prediction entropy,
For a delta function prior, this reduces to the standard control theory cost function, which depends on the initial condition of the system. For retrodiction entropy SR(XT) the cost depends on the final state, and is then averaged over the posterior distribution of XT,
The third difference is a philosophical one. Standard stochastic control aims to find a control protocol that is a global minimum of the cost function: one seeks the field v for which the first-order variation vanishes, C[v + δv] − C[v] = O(δv²), for all δv. In contrast, we look for the variation δv for which the first-order change in C is maximal, where C is 〈SR〉 or 〈ST〉. Our method descends entropy gradients in a space of system parameters, and is only guaranteed to be optimal locally. This could then be paired with a stochastic gradient descent algorithm or simulated annealing to find optima in a larger neighborhood. In passing, we note that for systems with a very large number of states, it would probably be computationally advantageous to use a stochastic algorithm even to compute the local gradient.
There is still plenty of room to make our framework more useful and general. Currently, we assume constant transition rates, and perturb the transition matrix at a single instant. However, transition rates can be time-dependent, in which case we would have to perturb the transition rates differently at different times. A second interesting avenue would be to further explore the costs associated with changing the transition probabilities. Another natural generalization is to extend the problem to continuous time.
Methods
Extremization of entropy
We start with a random geometric graph, T(λ = 0), from the ensemble described in the text, in which nodes i and j are connected with probability e−βd(i,j). We use n = 30 node graphs with β = 0.5. The extremization is carried out numerically and iteratively, as outlined in Eq. (11). The entropy extremized is that of a t = 3 step process, with perturbation size \(\epsilon = 0.05\) and step size dλ = 0.05.
At each step, we calculate the matrix of the change in entropy (per \(\epsilon\)) due to perturbing each element, \(S_{ji} = \frac{1}{\epsilon }\Delta _{ji}^{(\epsilon )}\langle S[{\mathbf{T}}(\lambda )]\rangle\), where the S in the angled brackets is whichever entropy we seek to extremize T over, either 〈SR〉 or 〈ST〉. To obtain the updated transition matrix, the j, i element of T is perturbed using the standard perturbation operator, Eq. (10), with strength \(\epsilon \prime = d\lambda /\left\| {S_{ji}} \right\|\). The order in which we apply these operators is irrelevant up to order \((\epsilon \prime )^2\). The updated transition matrix is the result of applying all the perturbation operators, one for each element of T. At each step, the prediction and retrodiction entropies of the Markov process are calculated and saved, along with the matrix Tji(λ), for plotting. The change in λ at each step is the L2 distance between the previous matrix and the new, perturbed one.
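As an illustration, the iteration above can be sketched in a few lines of Python. This is a simplified reconstruction, not the paper's code: the elementary perturbation here (nudge one element of a column-stochastic matrix, then renormalize the column) stands in for the operator of Eq. (10), and the cost is the prior-weighted Shannon entropy of the t-step transition columns.

```python
import numpy as np

def perturb(T, j, i, eps):
    # Simplified stand-in for the perturbation operator of Eq. (10):
    # nudge element (j, i) by eps, then renormalize column i so it
    # remains a valid (column-stochastic) set of transition probabilities.
    P = T.copy()
    P[j, i] = max(P[j, i] + eps, 0.0)
    P[:, i] /= P[:, i].sum()
    return P

def prediction_entropy(T, prior, t):
    # Prior-weighted Shannon entropy of the t-step distributions
    # P(x_t | x_0 = i), which are the columns of T^t.
    Tt = np.linalg.matrix_power(T, t)
    safe = np.where(Tt > 0, Tt, 1.0)          # avoid log(0)
    logs = np.where(Tt > 0, np.log(safe), 0.0)
    return float(prior @ (-(Tt * logs).sum(axis=0)))

def gradient_step(T, prior, t=3, eps=0.05, dlam=0.05, sign=-1.0):
    # One iteration of the extremization: finite-difference entropy
    # gradient over every matrix element, then apply all elementary
    # perturbations with a total step of L2 size dlam
    # (sign = -1 descends the entropy, +1 ascends).
    n = T.shape[0]
    S0 = prediction_entropy(T, prior, t)
    G = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            G[j, i] = (prediction_entropy(perturb(T, j, i, eps), prior, t) - S0) / eps
    step = sign * dlam / np.linalg.norm(G)
    Tnew = T.copy()
    for i in range(n):
        for j in range(n):
            Tnew = perturb(Tnew, j, i, step * G[j, i])
    return Tnew
```

With dλ small, each descent step lowers the chosen entropy; λ itself can then be tracked as the cumulative L2 distance between successive matrices, as described above.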
Inference performance
The inference performance can also be calculated analytically as long as we have the transition matrix, Tji, and the prior, P(0), which we do to generate Figs. 4 and 5. Since we guess that the maximally likely state is the correct one, the formulas are

\[ C_T = \frac{1}{n}\mathop{\sum}\limits_i \mathop{\max}\limits_j ({\mathbf{T}}^t)_{ji}, \qquad C_R = \frac{1}{n}\mathop{\sum}\limits_j \mathop{\max}\limits_i ({\mathbf{R}})_{ij}. \]

These formulas give the expected fraction of times we correctly guess the final state given the initial state (CT), or the initial state given the final state (CR). The expression \(\mathop{\max}\limits_j ({\mathbf{T}}^t)_{ji}\) is the probability of guessing the final state correctly given that the initial state is i, and the normalized sum averages this performance across all possible initial states. The CR formula is analogous, with the retrodiction probability matrix R substituted for the transition matrix.
As expected, the performance obtained via random trials fits CT and CR almost exactly, since we use a large number of trials.
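A minimal sketch of this check, assuming a uniform prior over the n initial states as in the normalized sum above (function names are illustrative, not from the paper's code):

```python
import numpy as np

def c_t(T, t):
    # Expected fraction of correct maximum-likelihood guesses of the
    # final state: C_T = (1/n) * sum_i max_j (T^t)_{ji}.
    Tt = np.linalg.matrix_power(T, t)
    return float(Tt.max(axis=0).mean())

def c_t_monte_carlo(T, t, trials_per_state=2000, rng=None):
    # Empirical counterpart: sample the t-step outcome from every
    # initial state and always guess the most likely final state.
    if rng is None:
        rng = np.random.default_rng(0)
    n = T.shape[0]
    Tt = np.linalg.matrix_power(T, t)
    guesses = Tt.argmax(axis=0)   # ML final state for each initial state i
    correct = 0
    for i in range(n):
        finals = rng.choice(n, size=trials_per_state, p=Tt[:, i])
        correct += int(np.sum(finals == guesses[i]))
    return correct / (n * trials_per_state)
```

CR follows the same pattern with the retrodiction probability matrix in place of T^t.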
Thermalizing quantum harmonic oscillator
While it would be difficult to solve Eqs. (18) and (9) analytically to find the extremal change in potential, it is a simple matter to calculate it numerically.
For the harmonic oscillator, \(V(x) = \frac{1}{2}m\omega ^2x^2\) and \(E_k = (k + \frac{1}{2})\hbar \omega\). We take m = ħ = ω = 1 for simplicity. The stationary eigenfunctions are then \(\psi _k(x) = \frac{1}{{\sqrt {2^kk!} }}\pi ^{ - 1/4}\exp \left( { - \frac{{x^2}}{2}} \right)H_k(x)\), where Hk is the k-th Hermite polynomial, \(H_k(x) = ( - 1)^ke^{x^2}\frac{{d^k}}{{dx^k}}e^{ - x^2}\). As mentioned in the text, we choose the prior distribution to be an equilibrium distribution at a given temperature, \(P_k \propto e^{ - \beta _2E_k}\). Since we can only store finite vectors on a computer, we track only the first n = 30 energy eigenstates, which is enough that the probability mass (the sum of Gibbs factors) the prior misses by truncation is less than 5% of the total. The perturbation η(x) can be calculated numerically; it is a high-order (order 60) polynomial in x times \(e^{ - x^2}\). Substituting it into Eq. (9) gives the (negative) extremal potential, U(x). The potential is then normalized by its L2 norm, up(x) = U(x)/∥U∥, where \(\left\| U \right\| = \left( {{\int}_{ - \infty }^\infty U (x)^2\,dx} \right)^{1/2}\).
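The truncation criterion can be verified in closed form: for Ek = k + 1/2 in natural units, the Gibbs mass missed by keeping n levels is a geometric tail, exactly e−β₂n. A small sketch (β₂ = 0.2 is an illustrative value, not necessarily the one used in the paper):

```python
import numpy as np

def truncated_mass_fraction(beta, n):
    # Fraction of the Gibbs probability mass lost by keeping only the
    # first n oscillator levels, E_k = k + 1/2 (units m = hbar = omega = 1).
    k = np.arange(n)
    kept = np.exp(-beta * (k + 0.5)).sum()
    total = np.exp(-beta * 0.5) / (1.0 - np.exp(-beta))  # full geometric sum
    return 1.0 - kept / total  # equals exp(-beta * n) exactly
```

With n = 30, the 5% criterion is met whenever β₂ > ln(20)/30 ≈ 0.1.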
Inference performance for quantum harmonic
We solve for the energy eigenvalues of the harmonic oscillator potential plus the perturbation potentials using the shooting method. For each up and for each strength γ, we numerically solve Schrödinger’s equation for the potential \(\frac{1}{2}m\omega ^2x^2 + \gamma \,u_p(x)\) at different trial energies, Etrial. We pick our shooting point to be far outside our region of interest, at x = 15, and evaluate whether the numerical solution is positive or negative there. Near an energy eigenvalue, the sign of ψtrial is (without loss of generality) negative for energies a little below the true eigenvalue and positive for energies a little above it. We use the bisection method of root finding to approximate each energy eigenvalue to arbitrary precision; our eigenvalues are accurate to within 10−6.
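A stripped-down sketch of this shooting-plus-bisection procedure, under stated assumptions: semi-implicit Euler integration from x = 0 with even-parity initial data (odd levels would instead start from ψ(0) = 0, ψ′(0) = 1), and the paper's actual integrator and step size may differ.

```python
import numpy as np

def psi_at_shoot(E, V, x_max=15.0, dx=1e-3):
    # Integrate psi'' = 2 (V(x) - E) psi outward (units m = hbar = 1)
    # with even-parity initial data psi(0) = 1, psi'(0) = 0, and return
    # psi at the shooting point x_max.
    psi, dpsi = 1.0, 0.0
    for xi in np.arange(0.0, x_max, dx):
        dpsi += 2.0 * (V(xi) - E) * psi * dx
        psi += dpsi * dx
    return psi

def eigenvalue(V, E_lo, E_hi, tol=1e-6):
    # Bisection on the sign of psi(x_max): the sign flips as E crosses
    # a true eigenvalue, so a bracketing interval [E_lo, E_hi] that
    # contains exactly one eigenvalue converges to it.
    f_lo = psi_at_shoot(E_lo, V)
    while E_hi - E_lo > tol:
        E_mid = 0.5 * (E_lo + E_hi)
        f_mid = psi_at_shoot(E_mid, V)
        if f_mid * f_lo > 0:
            E_lo, f_lo = E_mid, f_mid
        else:
            E_hi = E_mid
    return 0.5 * (E_lo + E_hi)
```

For the unperturbed oscillator, bracketing the ground level recovers E₀ ≈ 1/2 in natural units, up to the discretization error of the integrator.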
Once we have found the first n eigenvalues, we compute the transition matrix using Eq. (12), which is determined by the final temperature, and the prior distribution on states, \(P_j^{(0)} = {\mathrm{e}}^{ - \beta _iE_j}/Z\), which is determined by the initial temperature (\(Z = \mathop {\sum}\limits_{j = 1}^n {{\mathrm{e}}^{ - \beta _iE_j}}\)). Using Eq. (19), we then calculate the average percentage of times the final state can be correctly inferred from the initial state after t = 1, 3, 7 steps.
Data availability
Both the data and the code used to create and analyze the data during the current study are available in the Github repository, https://github.com/nrupprecht/Retrodiction-Data/.
References
Anderson, T. W. An Introduction to Multivariate Statistical Analysis 2 (Wiley, New York, 1958).
Le Cam, L. Maximum likelihood: an introduction. Int. Stat. Rev. 58, 153–171 (1990).
Box, G. E. & Tiao, G. Bayesian Inference in Statistical Analysis (John Wiley & Sons, New York, 2011).
Turner, D. The functions of fossils: inference and explanation in functional morphology. Stud. Hist. Philos. Sci. Part C: Stud. Hist. Philos. Biol. Biomed. Sci. 31, 193–212 (2000).
Slater, G. J., Harmon, L. J. & Alfaro, M. E. Integrating fossils with molecular phylogenies improves inference of trait evolution. Evol.: Int. J. Org. Evol. 66, 3931–3944 (2012).
Gavryushkina, A., Welch, D., Stadler, T. & Drummond, A. J. Bayesian inference of sampled ancestor trees for epidemiology and fossil calibration. PLoS Comput. Biol. 10, e1003919 (2014).
Krauss, L. M. & Starkman, G. D. Life, the universe, and nothing: life and death in an ever-expanding universe. Astrophys. J. 531, 22 (2000).
Ulanowicz, R. E. Increasing entropy: heat death or perpetual harmonies? Int. J. Des. Nat. Ecodynamics 4, 83–96 (2009).
Frautschi, S. Entropy in an expanding universe. Science 217, 593–599 (1982).
Baum, L. E. & Petrie, T. Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Stat. 37, 1554–1563 (1966).
Nasrabadi, N. M. Pattern recognition and machine learning. J. Electron. imaging 16, 049901 (2007).
Fine, S., Singer, Y. & Tishby, N. The hierarchical hidden Markov model: analysis and applications. Mach. Learn. 32, 41–62 (1998).
Boyen, X. & Koller, D. Tractable inference for complex stochastic processes. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 33–42 (Morgan Kaufmann Publishers Inc., San Francisco, 1998).
Stevenson, I. H., Rebesco, J. M., Miller, L. E. & Körding, K. P. Inferring functional connections between neurons. Curr. Opin. Neurobiol. 18, 582–588 (2008).
Nguyen, H. C., Zecchina, R. & Berg, J. Inverse statistical problems: from the inverse Ising problem to data science. Adv. Phys. 66, 197–261 (2017).
Ghonge, S. & Vural, D. C. Inferring network structure from cascades. Phys. Rev. E 96, 012319 (2017).
Besag, J. Spatial interaction and the statistical analysis of lattice systems. J. R. Stat. Soc. Series B 36, 192–225 (1974).
Cocco, S. & Monasson, R. Reconstructing a random potential from its random walks. EPL (Europhys. Lett.) 81, 20002 (2007).
Iba, H. Inference of differential equation models by genetic programming. Inf. Sci. 178, 4453–4468 (2008).
Gomez Rodriguez, M., Leskovec, J. & Krause, A. Inferring networks of diffusion and influence. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1019–1028 (ACM, New York, 2010).
Lenglet, C., Deriche, R. & Faugeras, O. Inferring White Matter Geometry from Diffusion Tensor MRI: Application to Connectivity Mapping. European Conference on Computer Vision, 127–140 (Springer, Heidelberg, 2004).
Haas, K. R., Yang, H. & Chu, J.-W. Expectation-maximization of the potential of mean force and diffusion coefficient in Langevin dynamics from single-molecule FRET data photon by photon. J. Phys. Chem. B 117, 15591–15605 (2013).
Ghahramani, Z. & Hinton, G. E. Parameter estimation for linear dynamical systems. Technical Report CRG-TR-96-2, Department of Computer Science, University of Toronto (1996).
Lokhov, A. Y., Mézard, M., Ohta, H. & Zdeborová, L. Inferring the origin of an epidemic with a dynamic message-passing algorithm. Phys. Rev. E 90, 012801 (2014).
Altarelli, F., Braunstein, A., Dall’Asta, L., Ingrosso, A. & Zecchina, R. The patient-zero problem with noisy observations. J. Stat. Mech.: Theory Exp. 2014, P10016 (2014).
Vural, D. C. When models interact with their subjects: the dynamics of model aware systems. PLoS One 6, e20721 (2011).
Rupprecht, N. & Vural, D. C. Collective motion of predictive swarms. PloS One 12, e0186785 (2017).
Crutchfield, J. P., Ellison, C. J. & Mahoney, J. R. Time’s barbed arrow: irreversibility, crypticity, and stored information. Phys. Rev. Lett. 103, 094101 (2009).
Ellison, C. J., Mahoney, J. R. & Crutchfield, J. P. Prediction, retrodiction, and the amount of information stored in the present. J. Stat. Phys. 136, 1005 (2009).
Tatem, A. J., Rogers, D. J. & Hay, S. Global transport networks and infectious disease spread. Adv. Parasitol. 62, 293–343 (2006).
Rupprecht, N. & Vural, D. C. Limits on inferring the past. Phys. Rev. E 97, 062155 (2018).
Golnaraghi, F. & Kuo, B. C. Automatic Control Systems (John Wiley & Sons, Hoboken, 1972).
Carnevale, G., Frisch, U. & Salmon, R. H theorems in statistical fluid dynamics. J. Phys. A: Math. Gen. 14, 1701 (1981).
Ramshaw, J. D. H-theorems for the Tsallis and Rényi entropies. Phys. Lett. A 175, 169–170 (1993).
Shiino, M. Free energies based on generalized entropies and H-theorems for nonlinear Fokker–Planck equations. J. Math. Phys. 42, 2540–2553 (2001).
Shannon, C. E. A mathematical theory of communication. ACM SIGMOBILE Mob. Comput. Commun. Rev. 5, 3–55 (2001).
Lemons, D. S. & Langevin, P. An Introduction to Stochastic Processes in Physics (JHU Press, Baltimore, 2002).
Prinz, J.-H. et al. Markov models of molecular kinetics: generation and validation. J. Chem. Phys. 134, 174105 (2011).
Urban, D. L. Modeling ecological processes across scales. Ecology 86, 1996–2006 (2005).
Black, A. J. & McKane, A. J. Stochastic formulation of ecological models and their applications. Trends Ecol. Evol. 27, 337–345 (2012).
Rohlf, K., Fraser, S. & Kapral, R. Reactive multiparticle collision dynamics. Comput. Phys. Commun. 179, 132–139 (2008).
Barthélemy, M. Spatial networks. Phys. Rep. 499, 1–101 (2011).
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. & Teller, E. Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087–1092 (1953).
Gardiner, C. & Zoller, P. Quantum Noise: a Handbook of Markovian and Non-Markovian Quantum Stochastic Methods with Applications to Quantum Optics, vol. 56 (Springer Science & Business Media, Heidelberg, 2004).
Kapral, R. Progress in the theory of mixed quantum-classical dynamics. Annu. Rev. Phys. Chem. 57, 129–157 (2006).
Coveney, P. & Highfield, R. The Arrow of Time: A Voyage Through Science to Solve Time’s Greatest Mystery (Fawcett Columbine, New York, 1992).
Åström, K. J. Introduction to Stochastic Control Theory (Academic Press, Inc., New York, 1970).
Forte, G. & Vural, D. C. Iterative control strategies for nonlinear systems. Phys. Rev. E 96, 012102 (2017).
Chernyak, V. Y., Chertkov, M., Bierkens, J. & Kappen, H. J. Stochastic optimal control as non-equilibrium statistical mechanics: calculus of variations over density and current. J. Phys. A: Math. Theor. 47, 022001 (2013).
Acknowledgements
This study was supported by National Science Foundation grant CBET-1805157.
Author information
Contributions
N.R. and D.C.V. conceived the problem, interpreted the results, and wrote the paper. N.R. carried out the calculations and simulations.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Rupprecht, N., Vural, D.C. Enhancing the predictability and retrodictability of stochastic processes. Commun Phys 2, 57 (2019). https://doi.org/10.1038/s42005-019-0159-z