Enhancing the predictability and retrodictability of stochastic processes

Scientific inference involves determining the unknown properties or behavior of a system in the light of what is known, typically without changing the system. Here we propose an alternative to this approach: a system can be modified in a targeted way, preferably by a small amount, so that its properties and behavior can be inferred more successfully. For the sake of concreteness we focus on inferring the future and past of Markov processes, and illustrate our method on two classes of processes: diffusion on random spatial networks, and thermalizing quantum systems.

Ordinarily, inference is a passive, non-disruptive process. A scientist is not typically motivated to change nature, but to describe it. However, describing and changing are not necessarily mutually exclusive. It has been established that attempting to describe and predict a system can inadvertently influence it, potentially even rendering it indescribable and unpredictable [26,27]. Here we study the converse case: how the intrinsic properties of a system can be purposefully modified so that it becomes more inferable. One could call this approach "active inference." A number of authors have addressed the problem of predicting the future and retrodicting the past of stochastic processes [28-31]. Forward in time, the entropy associated with the probability distribution of the system state increases monotonically, as per H-theorems [32-34]. A similar trend also holds backwards in time [31].
In this study, we are concerned not with finding ways to optimally predict or retrodict stochastic systems, but rather with finding ways to optimally modify stochastic systems so that they become more predictable or retrodictable. To this end, we determine how transition rates should be perturbed infinitesimally so as to minimize the generation of inferential entropy. After establishing a general theoretical framework, we apply these ideas to two specific systems. The first is a diffusion process taking place on a spatial random network. The second is a quantum harmonic oscillator with a time-dependent temperature.
An engineer might use control theory to balance a bipedal robot, stabilize the turbulent flow surrounding a wing, or maximize the signal-to-noise ratio in an electric circuit [35]. Our approach will be mathematically similar, except that we will be controlling the susceptibility of a system to inquiry into its past and future.

QUANTIFYING PREDICTABILITY AND RETRODICTABILITY
Concentrated, sharply peaked probability distributions over the future or past states of a stochastic system have high information content. Accordingly, we use the Gibbs-Shannon entropy to measure the degree of inferability [36], and later show that this is indeed a good measure. Given a stochastic process, X_t, characterized by its transition matrix T_α(ω) = Pr(X_t = ω | X_0 = α), and initial state α, the entropy of the final state at time t is

S_T(α) = -Σ_ω T_α(ω) log T_α(ω).   [1]

When X_t is the state of a thermodynamic system, this is the standard thermodynamic entropy. In the present information-theoretic context we refer to S_T as the "prediction entropy." Naturally, the entropy generated by a process depends on how it is initialized, that is, on the prior distribution P^(0). To characterize the process itself, we marginalize over the initial state α,

S_T = Σ_α P^(0)(α) S_T(α),

where P^(0)(α) is the probability of starting at α. Likewise, we quantify the retrodictability of a process by a "retrodiction entropy,"

S_R = Σ_ω P^(t)(ω) S_R(ω).

Here R_ω(α) is the probability that the system started in state α given that the observed final state was ω, S_R(ω) is its entropy, defined analogously to [1], and P^(t)(ω) is the probability that the process is in state ω at time t, unconditioned on its initial state.
Interestingly, the predictability and retrodictability of a system are tightly connected. Since T and R are related by Bayes' theorem,

R_ω(α) = T_α(ω) P^(0)(α) / Σ_α' T_α'(ω) P^(0)(α'),

it follows that S_T and S_R are also related [31],

S_R = S_T + S_0 - S_t,   [2]

where S_0 is the entropy of the prior probability distribution P^(0), and S_t is the entropy of P^(t)(ω) = Σ_α P^(0)(α) T_α(ω). We use S_T and S_R to measure how well we can predict the future and retrodict the past of a stochastic process. The higher the entropies, the less certain the inference will be.
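As a concrete check of these definitions, the sketch below (our own illustration, not part of the original analysis) computes S_T, S_R, and the Bayes posterior for a small random transition matrix, and verifies the relation S_R = S_T + S_0 - S_t numerically:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random 5-state transition matrix T[j, i] = Pr(j -> i); rows sum to 1.
n = 5
T = rng.random((n, n))
T /= T.sum(axis=1, keepdims=True)

P0 = np.full(n, 1.0 / n)           # uniform prior over initial states
t = 3
Tt = np.linalg.matrix_power(T, t)  # t-step transition matrix
Pt = P0 @ Tt                       # marginal distribution at time t

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Prediction entropy: entropy of each row of T^t, averaged over the prior.
S_T = sum(P0[j] * entropy(Tt[j]) for j in range(n))

# Retrodiction posterior R[w, a] = Pr(X_0 = a | X_t = w) via Bayes' theorem,
# and retrodiction entropy averaged over final states.
R = (Tt * P0[:, None]).T / Pt[:, None]
S_R = sum(Pt[w] * entropy(R[w]) for w in range(n))

S_0, S_t = entropy(P0), entropy(Pt)
print(S_R, S_T + S_0 - S_t)        # equation [2]: the two agree
```

With a uniform prior, S_0 is maximal, so the identity also implies S_T ≤ S_R.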

VARIATIONS OF MARKOV PROCESSES
Consider a Markov process defined by a transition matrix T_ji, which we view as a weighted network. A system initialized in state i with probability P^(0)_i, upon evolving for t steps, will follow a new distribution

P^(t)_i = Σ_j (T^t)_ji P^(0)_j.

Thus both entropies depend on the duration of the process, t. Note that probability is normalized, Σ_i (T^t)_ji = 1 for all t. Suppose that it is somehow possible to change the physical parameters of a system slightly, so that the transition probabilities are perturbed, T_ji → T'_ji = T_ji + ε q_ji, where ε is a small parameter. For now, we do not assume any structure on q, other than implicitly demanding that it keeps the probabilities within [0, 1] and preserves the normalization of rows. This variation leads to a change in the t-step transition matrix,

(T')^t = T^t + Σ_{p=1}^t ε^p η^(t,p),   η^(t,1) = Σ_{k=0}^{t-1} T^{t-1-k} ξ_k,

where ξ_k = q T^k. The superscripts of η^(t,p) refer to the power of the transition matrix, t, and the order of the contribution, p, which is analogous to the order of the derivative of a function. Thus η^(t,p) is the p-th order contribution to the varied t-step transition matrix. In the sequel we will be studying first variations; therefore we will only need η^(t) ≡ η^(t,1). The difference between the entropies of the perturbed and the original systems is ∆_ε S_{T,R} ≡ S_{T,R}(T') − S_{T,R}(T). Whenever ∆_ε S_{T,R} is of order ε and higher, we can evaluate the variation

δS_{T,R} = lim_{ε→0} ∆_ε S_{T,R} / ε,

which in essence is the derivative of S_T or S_R in the q "direction." With a little algebra, we can show that the first-order perturbations of the t-step entropies S_R and S_T consist of terms of the form η^(t)_ji log (T^t)_ji, sorted by the indicator functions 1_T and 1_T^c, which implicitly depend on the indices i, j and the time t: 1_T = 0 if (T^t)_ji = 0 and 1 otherwise, while 1_T^c = 1 − 1_T. As we see, the log terms can cause the limit [6] to diverge, causing a sharp, singular change in entropy generation. This is expected.
The divergence will happen only when the perturbation enables a path between two states where there was none. This is because (T^t)_ji = 0 only if i could not be reached from j in t steps, but if this is still true after the perturbation, the η^(t)_ji term will be zero.
On the other hand, if the perturbation does not enable a path between two isolated states, but preserves the topology of the transition matrix, then [7] simplifies considerably: the divergent 1_T^c terms vanish, and taking the limit yields

δS_T = −Σ_{i,j} P^(0)_j η^(t)_ji log (T^t)_ji,   [9]

with an analogous expression for δS_R. Having established a very general theoretical framework, we now apply these ideas to two broad classes of stochastic systems for which the structure of the perturbing matrix q is specified further. We first consider arbitrary transition matrices where only a single matrix element is varied at a time. Second, we study a physical application: we enhance the predictability and retrodictability of thermalizing quantum mechanical systems by means of external perturbing fields.
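The first-order machinery above is easy to verify numerically. The sketch below (our own illustration, using a row-sum-preserving perturbation q so that the topology and normalization are preserved) compares η^(t,1) against a finite difference of (T + εq)^t, and checks the topology-preserving variation of S_T against a central difference:

```python
import numpy as np

rng = np.random.default_rng(1)
n, t, eps = 4, 3, 1e-6

# Strictly positive transition matrix, so no new paths can be created.
T = rng.random((n, n)) + 0.1
T /= T.sum(axis=1, keepdims=True)

# Perturbation direction q whose rows sum to zero (normalization preserved).
q = rng.standard_normal((n, n))
q -= q.mean(axis=1, keepdims=True)

# First-order term of (T + eps*q)^t:  eta = sum_k T^(t-1-k) q T^k.
mp = np.linalg.matrix_power
eta = sum(mp(T, t - 1 - k) @ q @ mp(T, k) for k in range(t))
exact = mp(T + eps * q, t) - mp(T, t)
assert np.allclose(exact / eps, eta, atol=1e-4)

# First variation of the prediction entropy for a uniform prior,
# delta S_T = -sum_{j,i} P0_j * eta_ji * log (T^t)_ji.
P0 = np.full(n, 1.0 / n)
Tt = mp(T, t)
dS_T = -np.sum(P0[:, None] * eta * np.log(Tt))

def pred_entropy(M):
    Mt = mp(M, t)
    return -np.sum(P0[:, None] * Mt * np.log(Mt))

# Central difference along the q "direction" for comparison.
dS_numeric = (pred_entropy(T + eps * q) - pred_entropy(T - eps * q)) / (2 * eps)
print(dS_T, dS_numeric)
```

Because the rows of η^(t) sum to zero, the "+1" term coming from differentiating p log p drops out, leaving only the log term shown above.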

Improving the predictability and retrodictability of Markov processes through targeted perturbations
We start by studying a general class of perturbations that can be applied to an arbitrary Markov process, and evaluate the associated entropy gradient, which can be thought of as the direction in matrix space that locally changes S_R or S_T the most (Fig. 1). As we climb up or down the entropy gradient, we show how the transition matrix evolves (Fig. 2).

FIG. 1. Ascending the space of transition matrices to maximize predictability and retrodictability. Each point in the space of Markov transition matrices, represented by the x,y plane, has an associated prediction and retrodiction entropy. Equation [11] allows us to find the direction in network space, parameterized by the transition rates T_ji, in which entropy locally increases (or decreases) the most. Perturbations can then be applied to move the network in that direction, leading to a system that is more susceptible to inference. Red dots represent different starting networks which climb along the black paths, via gradient ascent, to an entropy maximum, represented by a green dot.
We consider perturbations that vary the relative strength of only a single transition rate. This involves changing one element in the transition matrix while reallocating the difference to the remaining nonzero rates so that the total probability remains normalized. In other words,

(∆^(ε)_{ij} T)_jk = T_jk + ε (δ_ik − T_jk).   [10]

To first order in ε, this is the same as adding ε to the (i, j) element and then dividing the row by 1 + ε to normalize, so it is a natural choice for a perturbation operator. It also obeys ∆^(ε)_{ij} ∆^(−ε)_{ij} = 1 + O(ε²). We define the perturbation acting on a zero element to be zero if ε < 0, since elements of the transition matrix must be non-negative. From [10] and [4], we obtain the perturbed matrices and the perturbed S_R. In a gradient descent algorithm, one descends a function f(r) over time t by solving dr/dt = −∇f. In the same fashion, we will descend or ascend entropy gradients over some fictitious "time" λ, which quantifies how many infinitesimal perturbations have been added to the transition rates. At λ = 0 the system is unperturbed, in its original state. Perturbations of the matrix in the positive λ direction optimally climb up the prediction or retrodiction entropy landscape, ultimately maximizing them; perturbations in the negative λ direction do the opposite.
The gradient descent equation for a transition matrix is then

dT/dλ = Σ_{β,α} δS^{(βα)}_{T,R} ∆_βα T,   [11]

where δS^{(βα)}_{T,R} is the entropy variation generated by the elementary perturbation ∆_βα. Our perturbation operator ∆_βα thus acts as a displacement operator in matrix space, increasing T_βα by an amount proportional to the corresponding increase in S_R or S_T. To illustrate our formalism in action, we solve [11] for a particular process: diffusion taking place on a directed spatial random network. Nodes are placed at regular intervals on a circle, and have additional connections, occurring with probability P(S_ji = 1) = e^{−β d_ij}, that decay with the distance d_ij [37]. The transition matrix T is obtained by normalizing the rows of S.
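A minimal sketch of one step of this procedure, assuming a simple ring-distance version of the geometric network and estimating the entropy gradient by finite differences rather than by the closed-form variation (all parameter choices here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, beta, t = 12, 0.5, 5

# Directed geometric network on a ring: edges with probability exp(-beta*d_ij).
idx = np.arange(n)
d = np.abs(idx[:, None] - idx[None, :])
d = np.minimum(d, n - d)                       # ring distance
S = (rng.random((n, n)) < np.exp(-beta * d)).astype(float)
S[idx, (idx + 1) % n] = 1.0                    # ensure a connected backbone
T = S / S.sum(axis=1, keepdims=True)

P0 = np.full(n, 1.0 / n)

def pred_entropy(M):
    Mt = np.linalg.matrix_power(M, t)
    logs = np.where(Mt > 0, np.log(np.where(Mt > 0, Mt, 1.0)), 0.0)
    return -np.sum(P0[:, None] * Mt * logs)

def perturb(M, j, i, eps):
    """Single-element operator [10]: add eps to M[j, i], renormalize row j."""
    out = M.copy()
    out[j, i] = max(out[j, i] + eps, 0.0)      # stay inside the simplex
    out[j] /= out[j].sum()
    return out

# One descent step on S_T: push each nonzero rate against its finite-difference
# entropy gradient (topology-preserving, so only existing rates are varied).
eps, lr = 1e-6, 0.002
S0 = pred_entropy(T)
grad = np.zeros_like(T)
for j in range(n):
    for i in range(n):
        if T[j, i] > 0:
            grad[j, i] = (pred_entropy(perturb(T, j, i, eps)) - S0) / eps
T_new = T.copy()
for j in range(n):
    for i in range(n):
        if T[j, i] > 0:
            T_new = perturb(T_new, j, i, -lr * grad[j, i])
print(S0, pred_entropy(T_new))                 # second value is smaller
```

Iterating this step with positive or negative learning rate traces out the λ > 0 and λ < 0 branches of Fig. 2.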
An example is shown in Fig. 2, where the 5-step (t = 5) predictability and retrodictability change as the transition matrix is perturbed iteratively. We use perturbation operators that extremize retrodictability (left column) or predictability (right column). The transition matrices are displayed as networks, shown at three different λ values (λ = −15, 0, 4), corresponding to the minimal-entropy state, the initial state, and the maximal-entropy state. For clarity's sake, we display only edges with weights above 0.025.
We now interpret our results to ensure that our theoretical framework makes qualitative sense and works as expected. First, we observe that perturbations that maximize S_T and S_R displace the transition matrix towards the same point: in both cases T evolves to a point where (T^t)_ji = p_i, a probability vector. In other words, the probability of transition does not depend on what state the system is currently in. Taking the 5th power of the maximum-entropy T reveals that this is indeed the case, although, of course, T itself can retain some complex structure (see Fig. 2C, F). As expected, when a system's destination is independent of its current state, it is most difficult to infer its past or future.
In contrast, minimizing entropy produces two very different transition matrices depending on which entropy we minimize. Minimizing retrodiction entropy tends to eliminate branches, fragmenting the network into linear chains. Probability flows through these unidirectionally, so retrodiction involves nothing more than tracing back a linear path. For example, in Fig. 2D, the network has fragmented into nine pieces that are in the process of becoming chains. On the other hand, the global minimum of the prediction entropy yields transition matrices in which all probability flows towards a single node that is reached in t steps. A process that takes all initial states to the same final state is indeed trivial to predict. It can happen, however, that the network instead evolves into a local minimum of S_T where all probability flows towards one of two or more nodes. For example, in Fig. 2A, the network leads all initial states to one of three nodes that serve as terminal states, two in one of the connected components, and one in the other.

FIG. 2. Entropy extremization of a Markov process. Entropy during the evolution of a Markov network according to the extremization procedures [11]. The parameter λ corresponds to how many times the perturbation operator has been applied, i.e., how far along the gradient curve, ∇S, we have pushed the transition network. The graphs in panels (A-F) are pictorial representations of the Markov transition matrices; the points in the evolution from which they are sampled are marked with red lines and a letter. We show the minimal-entropy graph (λ = −15), the initial graph (λ = 0), and the maximal-entropy graph (λ = 4), both for prediction entropy and for retrodiction entropy. The entropy curves S_T and S_R quantify how easy it is to predict the final state or retrodict the initial state of the Markov process. The network is a geometric network as described above, with β = 0.5 and n = 30 states. We optimize the entropies for a t = 5 step process. Left: how the entropies change as we extremize S_R, with three sample transition probability networks. Right: how the entropies change as we extremize S_T, with three sample transition probability networks.
This also explains why S_R tends to stay the same, or even increase, in the λ < 0 direction when minimizing S_T: if S_t = S_T = 0, then [2] implies S_R = S_0, which is the maximum possible value for S_R. This can also be understood intuitively: when a final measurement is made, the system is always found to be in the accumulating state, and this yields no information about what state the system started in. If, however, the minimal-S_T network instead gets trapped in a local minimum with a set of more than one collector node, {k_j}, then there can be a decrease in S_R, since R_{k_0}, R_{k_1}, ... are in general different distributions. This is what occurs in Fig. 2 when we extremize predictability: there are three collector nodes, so the retrodiction entropy is able to decrease.
Observing the entropy curves, there are obvious non-differentiable points in the entropies at λ = 0. This is because there are sites that are not connected to one another after the requisite number of time steps, so the log terms in [7] are non-zero. It is also apparent that there are other non-differentiable points for λ < 0. These are due to matrix elements vanishing, essentially hitting the boundary of the simplex in which the elements can travel (the space of possible matrix elements for each row is a simplex because of the conditions T_ji ≥ 0 and Σ_i T_ji = 1).
Lastly, we observe in the figures that S_T ≤ S_R. This follows from [2]: S_0 is maximal for our prior, a uniform distribution, so S_t − S_0 ≤ 0.
So far, we have only extremized entropy, but have not shown that this leads to a significant difference in our ability to infer the past or future. To do so, we must run stochastic processes, predict/retrodict the final/initial states, and report how often our inference is correct. For systems with a very large number of states, the probability mass per state will be very small. Nevertheless, we will adopt the "hard" metric of how often our inferred state, which we take to be the state with the highest likelihood, is correct.
For predictions we pick an initial state, i, with a uniform prior, evolve it forward stochastically, and then make our inference by picking the state with the highest probability T(j|i) and checking whether we are correct. Retrodictions are carried out similarly, except we observe the outcome of the stochastic simulation, j, and infer the initial state by picking the state with the highest probability R(i|j). For both inferences we ran the stochastic simulation and inference many times and counted how often our estimate was correct.

FIG. 3. Inference success rates for the networks of Fig. 2. The four cases plotted are either correct inferences of the initial state (retrodiction) or correct inferences of the final state (prediction), while either optimizing S_R or optimizing S_T. As a baseline, a strategy making random guesses would obtain the initial or final state correctly 3.3% of the time (since there are 30 states). This baseline is depicted as a dashed gray line.
The results can be seen in Fig. 3. The transition matrices we used for this test are the same as those shown in Fig. 2. The success rate of predicting final states and retrodicting initial states while extremizing either S_R or S_T is plotted. Since there are 30 states in our network, the baseline accuracy is 1/30 = 3.3%, as marked with a dashed gray line. Our success rate aligns nicely with the entropy in Fig. 2, left column. We reach almost 100% accuracy when we minimize S_T (Fig. 3, purple line), which corresponds to the point where S_T nearly vanishes.
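The inference test can be sketched as follows. We use an illustrative nearly deterministic 30-state chain in place of the optimized networks of Fig. 2; the prediction and retrodiction rules (argmax of T^t and of the Bayes posterior R) are as described above:

```python
import numpy as np

rng = np.random.default_rng(3)
n, t, trials = 30, 5, 5000

# Illustrative low-entropy process: mostly a cyclic shift, plus a little noise.
T = np.full((n, n), 0.02 / (n - 1))
for j in range(n):
    T[j, (j + 1) % n] = 0.98

P0 = np.full(n, 1.0 / n)
Tt = np.linalg.matrix_power(T, t)
Pt = P0 @ Tt
R = (Tt * P0[:, None]).T / Pt[:, None]    # Bayes posterior over initial states

hits_pred = hits_retro = 0
for _ in range(trials):
    i = rng.integers(n)                   # initial state from the uniform prior
    j = i
    for _ in range(t):                    # stochastic forward evolution
        j = rng.choice(n, p=T[j])
    hits_pred += (j == np.argmax(Tt[i]))  # predict: most likely final state
    hits_retro += (i == np.argmax(R[j]))  # retrodict: most likely initial state

print(hits_pred / trials, hits_retro / trials)  # both far above the 1/30 baseline
```

For this low-entropy chain both rates land near 90%, while a maximal-entropy matrix would pin them to the 3.3% baseline.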
The improvement in retrodictability always lags behind that of predictability. This is because S_R must be greater than S_T, as per [2]. Since S_R always remains fairly high, our accuracy in retrodicting the initial state is always low, though it does improve when we minimize S_R (Fig. 3, red line).
Improving the predictability and retrodictability of a physical system through external fields

In a physically realistic scenario, one is unlikely to have full control over individual transitions. An experimentalist can only tune physical parameters, such as external fields or temperature, which influence the transition matrix indirectly. Furthermore, it is often not practical to vary physical parameters by arbitrarily large amounts. Thus, ideally, we should improve predictability and retrodictability optimally while only applying small fields.
To meet these goals, we consider a class of physical systems in or out of equilibrium with a thermal bath. These systems are fully characterized by eigenstates ψ_0, ..., ψ_n with energies E_0, ..., E_n, undergoing Metropolis-Hastings-like dynamics [38] in which the system attempts to transition to the energy level above or below with equal probability; an attempt to decay always succeeds, while an attempt to excite succeeds with probability exp[−β(E_{k+1} − E_k)]. Furthermore, we assume that the ground state E_0 cannot decay, and the highest state E_n cannot be excited.
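A sketch of this transition matrix (our own construction following the rules above; the energy ladder is illustrative) confirms that the rows are normalized and that the Boltzmann distribution is stationary, as detailed balance requires:

```python
import numpy as np

def ladder_T(E, beta):
    """Metropolis-like ladder dynamics on levels E[0] < ... < E[n-1]:
    attempt up/down with probability 1/2 each; down always accepted,
    up accepted with probability exp(-beta * (E[k+1] - E[k]))."""
    n = len(E)
    T = np.zeros((n, n))
    for k in range(n):
        if k + 1 < n:                      # excitation attempt (top state: none)
            T[k, k + 1] = 0.5 * np.exp(-beta * (E[k + 1] - E[k]))
        if k > 0:                          # decay attempt (ground state: none)
            T[k, k - 1] = 0.5
        T[k, k] = 1.0 - T[k].sum()         # rejected moves stay put
    return T

beta = 1.0
E = np.arange(6) + 0.5                     # harmonic-oscillator-like ladder
T = ladder_T(E, beta)

# Rows are normalized, and the Boltzmann distribution is stationary.
pi = np.exp(-beta * E)
pi /= pi.sum()
print(np.allclose(T.sum(axis=1), 1.0), np.allclose(pi @ T, pi))
```

Detailed balance holds because π_k T_{k,k+1} = π_{k+1} T_{k+1,k} for every rung of the ladder.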
For the regime of validity of Markovian descriptions of thermalized quantum systems, we refer to [39,40].
We will now consider the effects of a small perturbing potential v(x). The perturbation will shift the energy levels, which changes the transition matrix, which in turn changes the average prediction and retrodiction entropies of the system. Our goal is to identify what perturbing potential would maximally change these entropies. Since we are concerned with the first order variation in entropy, it will suffice to also use first order perturbation theory to calculate energy shifts.
The perturbed k-th energy level is E'_k = E^(0)_k + ε δE_k. When the perturbation is applied, the exponential factors in T change as

exp[−β(E'_{k+1} − E'_k)] = exp[−β(E_{k+1} − E_k)] [1 − εβ(δE_{k+1} − δE_k)] + O(ε²).

From this, we can find our first-order change, T'_ji = T_ji + ε q_ji, in terms of the changes in the energy levels, δE_k. Now we will write the prediction and retrodiction entropy variations δS_{T,R} as functionals of a perturbing potential, and then use the calculus of variations to obtain the extremizing potential. For clarity, we will derive our equations in one dimension; the generalization to higher dimensions is straightforward.
We partition the spatial domain, Ω, into N intervals, [x_i, x_{i+1}), of width ∆x, and let our potential be a piecewise constant function of the form v(x) = Σ_i v_i 1_{[x_i, x_{i+1})}(x). As N → ∞, the first-order change in the k-th energy level is

δE_k = ∫_Ω |ψ_k(x)|² v(x) dx.   [14]

We substitute the δE_k, [14], into [13] to get the q matrix, and substitute this into [5] to write the entropy variation as a linear functional of the potential,

δS_{T,R} = ∫_Ω (δ²S_{T,R} / δx δv) v(x) dx,

where the kernel δ²S_{T,R}/δx δv is [9] evaluated with the η^(t) induced by [13] and [14]. Lastly, we ensure the smallness of the perturbation by introducing a penalty functional, C[v] = (γ/2) ∫ v(x)² dx, and ask what potential extremizes δS_{T,R} − C[v]. We take a variational derivative with respect to v(x) and set it to zero to obtain the extremizing potential,

v*(x) = (1/γ) δ²S_{T,R} / δx δv.   [17]

Improving the predictability and retrodictability of a thermalizing quantum harmonic oscillator

Since evaluating [17] with [9] requires the spectrum and eigenstates of the system, we illustrate our formalism on a particular physical setup. We ask what perturbing external field should be applied to a quantum harmonic oscillator that is in the process of warming up or cooling down, in order to improve its predictability or retrodictability. For this system U(x) = ½ mω²x², E_k = (k + ½)ħω, and the stationary eigenfunctions are

ψ_k(x) = (mω/πħ)^{1/4} (2^k k!)^{−1/2} e^{−mωx²/2ħ} H_k(√(mω/ħ) x),

where H_k are the Hermite polynomials.

FIG. 4. The external field that optimally increases the retrodictability of a thermalizing quantum harmonic oscillator. We take ħ = ω = m = 1 and plot perturbations that minimize S_R for processes taking t = 1, 3, 7 time steps [17]. Top: A high-temperature (T = 10) equilibrium system is quenched to a low temperature (T = 1). Middle: A low-temperature (T = 1) system is quenched to a high temperature (T = 10). Note that the large-scale shape of the potential is similar to that of the top panel; this potential extremizes both S_R and S_T. Bottom: A high-temperature (T = 10) equilibrium system is quenched to a low temperature (T = 1). This is the potential that extremizes S_T.

For concreteness, we also have to choose a prior distribution
on states. We choose the prior distribution to be an equilibrium distribution at a (possibly different) temperature, P_k ∝ e^{−β_2 E_k}. We truncate the transition matrix at an energy E_n ≫ 1/β_1, 1/β_2, so that edge effects are negligible. We take m = ħ = ω = 1. Fig. 4 shows some extremizing potentials for a system that was at one temperature and is then suddenly quenched to a different temperature. Potentials that extremize the retrodiction entropy and the prediction entropy at different times are shown. For a high initial temperature and low final temperature, the extremal perturbation has small-scale oscillations (Fig. 4A, C). In Fig. 4B, the system is initially at a lower temperature, and is then quenched to a high temperature. This perturbation potential happens to extremize both S_R and S_T, and exhibits only large-scale oscillations.
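Equation [14] is straightforward to evaluate for the oscillator. The sketch below (an independent numerical check of ours, with the illustrative choice v(x) = x²) computes the first-order shifts δE_k from the eigenfunctions and compares them with the exact expectation values ⟨x²⟩_k = k + ½ in our units:

```python
import numpy as np
from numpy.polynomial.hermite import hermval
from math import factorial, pi, sqrt

# Harmonic-oscillator eigenfunctions with m = hbar = omega = 1.
def psi(k, x):
    c = np.zeros(k + 1)
    c[k] = 1.0                                  # select H_k in the Hermite series
    norm = 1.0 / sqrt(2.0**k * factorial(k) * sqrt(pi))
    return norm * np.exp(-x**2 / 2) * hermval(x, c)

# First-order energy shift [14]: dE_k = integral of |psi_k|^2 v(x) dx,
# evaluated by a simple Riemann sum on a wide, fine grid.
x = np.linspace(-12, 12, 4001)
def dE(k, v):
    dx = x[1] - x[0]
    return float(np.sum(psi(k, x)**2 * v(x)) * dx)

# Sanity check against the exact result <x^2>_k = k + 1/2 for v(x) = x^2.
shifts = [dE(k, lambda x: x**2) for k in range(5)]
print(shifts)   # close to [0.5, 1.5, 2.5, 3.5, 4.5]
```

The same routine, fed into [13] and [9], yields the kernel that [17] turns into the extremizing potentials of Fig. 4.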
These two types of oscillations, large scale and small scale, seem to exist regardless of the initial and final temperatures, but change in scale relative to one another depending on whether the system transitions from a high to a low temperature or from a low to a high temperature. Only when they are comparable in size are both visible, as in Fig. 4A, C.
We have illustrated how to combine real, physical, continuous quantities, like perturbing potentials, with the more abstract formalism of evaluating the entropy of Markov transition matrices with discrete states. The general procedure we outlined in this section can also be applied to other thermal systems, quantum or otherwise. To do so, one would first want to acquire a more detailed description of the actual system dynamics.

DISCUSSION

We developed a formalism to describe exactly how predictability and retrodictability change in response to small changes in a transition matrix, and used it to climb entropy gradients to optimally increase the accuracy with which the past or future of a stochastic system can be inferred. Our main results are the equations relating perturbations of Markov processes to the changes in the average prediction and retrodiction entropies of the system, [4,7,9].
Our formulas lead us to intuitive results, such as the divergence of entropy generation when a path between two otherwise isolated states is enabled. However, they also lead to less obvious conclusions, such as how predictability changes when retrodictability is optimized (and vice versa), or the shape of the optimal potentials perturbing a thermalizing quantum system.
Our basic equations, [4,7,9], are applicable to any discrete-time Markov process. The types of transition matrix perturbations we chose to study, namely [10] and [13], are natural and practical choices, but of course they are not the only two possibilities. For example, an operator that takes two matrix elements 0 < A_ja, A_jb < 1 and "transfers" probability between them, changing them to A_ja + ε and A_jb − ε, would make an interesting future study.
In our work we stumbled upon an interesting asymmetry between prediction and retrodiction. In particular, we observe that predictability is more easily improved than retrodictability. This is a byproduct of how we set up our problem: we took the initial distribution, P^(0), and the forward dynamics, T, as givens, and found the final probability, P^(t), by propagating P^(0) with T. Had we instead picked the final distribution P^(t) and the backwards dynamics, and found P^(0) as the back-evolved distribution, our results would reverse.
An experimenter only has control over the prior distribution at the current time, P^(0); she cannot in general decide what distribution she wants at an arbitrary future time, P^(t), and pick a P^(0) that results in that specified P^(t). The fact that we set up the problem so that t = 0 is the "controlled" time, while the state at the final time is the result of the choices made at t = 0, ultimately leads to the seeming emergence of an "arrow of time" [41].
For example, consider a system consisting of smokers and non-smokers. A state in this system is a pair, N = {N_s, N_n}, and state transitions occur when people die, smokers at a higher rate than non-smokers. It would be very hard, though perhaps possible, to pick a distribution over different states N so that the probability of each state occurring is the same after a period of five years. It would, however, be impossible to pick a distribution over N so that the probability of each state occurring is the same after 100 years: the state {0, 0} is overwhelmingly probable no matter what the initial state of the system is.
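This can be caricatured with a hypothetical pure-death chain (a toy model of ours, not from the original): whatever prior we place on the population count, after enough steps essentially all probability mass sits in the absorbing state.

```python
import numpy as np

# Pure-death dynamics on population counts 0..n: each step the count
# decreases by one with probability p, else stays put; 0 is absorbing.
n, p, t = 10, 0.5, 100
T = np.zeros((n + 1, n + 1))
T[0, 0] = 1.0
for k in range(1, n + 1):
    T[k, k - 1] = p
    T[k, k] = 1 - p

P0 = np.full(n + 1, 1.0 / (n + 1))   # any prior would do; uniform for concreteness
Tt = np.linalg.matrix_power(T, t)
Pt = P0 @ Tt
print(Pt[0])   # at late times virtually all mass sits in the absorbed state
```

Prediction is then trivial, while the final observation carries almost no information about the initial state.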

CONCLUSION
Here we demonstrated the principle of active inference by subjecting systems to small perturbations so that the accuracy in inferring their past or future changes maximally. We were not interested in heuristic strategies applicable to specific systems, but in determining how changes to a system affect the fundamental limits of that system's predictability and retrodictability.
We specifically focused on Markov processes, not just because they are a manageable starting point for analysis, but because many important processes in the physical, biological and social sciences are Markovian. That being said, the principle of active inference can also be used in systems with memory, or in other inference problems, such as the determination of unknown boundary conditions, system parameters, or driving forces.
As examples of using active inference, we defined two specific types of perturbations, [10,13], and used these to study how certain types of transition matrices evolve as they flow along the trajectory of maximal increase in retrodictability and predictability. We found that the transition networks tend to cull their connections and split into cycles and chains when we try to minimize the retrodiction entropy. Conversely, the transition networks become fully connected when we attempt to maximize either entropy. If one does not have full control over the transition rates, one can still steer the system towards either extreme by a small amount. Finally, as a physical example, we studied how to find the perturbing potential that extremally changes the predictability and retrodictability of a thermalizing quantum system, taking a thermalizing quantum harmonic oscillator as a prototype.