Introduction

Much of science revolves around inference, reconstructing the unknown from what is known1,2,3. Observable patterns here and now inform us of inaccessible patterns out and away. For example, using inferential techniques, one can reconstruct the history of life from available fossils4,5,6, or predict the fate of the universe by observing the present night sky7,8,9; one can infer hidden states and transition probabilities10,11,12, connections and weights of neural networks13,14,15 or parameters, initial states and interaction structures of complex systems16,17,18,19,20,21,22,23,24,25.

Ordinarily, inference is a passive, non-disruptive process. Unlike engineering, natural sciences are motivated by knowing nature, rather than changing it. However, knowing and changing are not necessarily mutually exclusive. Earlier, it was established that attempting to describe and predict a system can inadvertently influence it, potentially even rendering it indescribable and unpredictable26,27. Here we study the converse case of how the intrinsic properties of a system can be purposefully modified so that its past or future is more inferable.

A number of authors addressed the problem of predicting the future and retrodicting the past of a stochastic process28,29,30,31. In this study, we are concerned not with finding strategies or algorithms to predict or retrodict stochastic systems, but rather with optimally modifying systems so that their predictability or retrodictability increases.

An engineer might use control theory to balance a bipedal robot, stabilize the turbulent flows surrounding a wing, or maximize the signal-to-noise ratio in an electric circuit32. Here we do the same, but optimize the susceptibility of a system to inquiries into its past and future.

Forward in time, the entropy associated with the probability distribution of the system state will increase monotonically, as per H-theorems33,34,35. A similar trend also holds backwards in time31. Here we determine how transition rates should be perturbed infinitesimally so as to minimize the generation of inferential entropy in either temporal direction. After establishing a general theoretical framework, we apply these ideas to two specific example systems. The first is a diffusion process taking place on a spatial random network. The second is a quantum harmonic oscillator with a time-dependent temperature.

Results

Quantifying predictability and retrodictability

The past and future of a stochastic system with a concentrated and sharply peaked probability distribution can be inferred with high certainty. Accordingly, we use the Gibbs–Shannon entropy to quantify the inferability of a system36, and later on show that this indeed is a good measure. Given a stochastic process, Xt, characterized by its transition matrix Tα(ω) = Pr(Xt = ω|X0 = α), and initial state α, the entropy of the process at a final time t is

$$S_{\mathrm{T}}(\alpha ) = - \mathop {\sum}\limits_\omega {T_\alpha } (\omega )\log T_\alpha (\omega ).$$
(1)

When Xt is the state of a thermodynamic system, this is the standard thermodynamic entropy. In the present information-theoretical context, we refer to ST as the “prediction entropy”.

Naturally, the average entropy generated by a process depends on how it is initialized—the prior distribution P(0). To characterize the process itself, we marginalize over the initial state, α, \(\langle S_{\mathrm{T}}\rangle = \mathop {\sum}\limits_\alpha {P^{(0)}} (\alpha )S_{\mathrm{T}}(\alpha ),\) where P(0)(α) is the probability of starting at α. Likewise, we quantify the retrodictability of a process by a “retrodiction entropy”, \(\langle S_{\mathrm{R}}\rangle = \mathop {\sum}\limits_\omega {P^{(t)}} (\omega )S_{\mathrm{R}}(\omega )\). Here, Rω(α) is the probability that the system started in state α given that the observed final state was ω, SR is its entropy, defined analogously to Eq. (1), and P(t)(ω) is the probability that the process is in state ω at time t unconditioned on its initial state.

Interestingly, the predictability and retrodictability of a system are tightly connected: since ST and SR are related by Bayes’ theorem, \(R_\omega (\alpha ) = T_\alpha (\omega )P^{(0)}(\alpha )/\mathop {\sum}\limits_{\alpha \prime } {T_{\alpha \prime }} (\omega )P^{(0)}(\alpha \prime )\), it follows that 〈ST〉 and 〈SR〉 are also related31,

$$\langle S_{\mathrm{R}}\rangle = \langle S_{\mathrm{T}}\rangle - (S_t - S_0)$$
(2)

where S0 is the entropy of the prior probability distribution P(0), and St is the entropy of \(P^{(t)}(\omega ) = \mathop {\sum}\limits_\alpha {P^{(0)}} (\alpha ){\mkern 1mu} T_\alpha (\omega )\).

We use 〈ST〉 and 〈SR〉 to measure how well we can predict the future and retrodict the past of a stochastic process. The higher the entropies, the less certain the inference will be.

Variations of Markov processes

In a Markov process, the state of a system fully determines its transition probability to other states. Markov processes accurately describe a number of phenomena ranging from molecular collisions through migrating species to epidemic spreads37,38,39,40,41.

Consider such a Markov process defined by a transition matrix T, with elements Tji, which we will visualize as a weighted network. We assume that we know the transition rates and prior distribution over states at t = 0 with perfect accuracy, but do not know what state the system is in, except at the initial (final) time. From this data we will predict (retrodict) the final (initial) state of the system.

A system initialized in state i with probability \(P_i^{(0)}\), upon evolving for t steps, will follow a new distribution \(P_i^{(t)} = \mathop {\sum}\limits_j {P_j^{(0)}} (T^t)_{ji}\). Accordingly,

$$\begin{array}{l}\langle S_{\mathrm{R}}\rangle = - \mathop {\sum}\limits_{i,j} {P_j^{(t)}} (R^t)_{ji}\log (R^t)_{ji}\\ \langle S_{\mathrm{T}}\rangle = - \mathop {\sum}\limits_{i,j} {P_j^{(0)}} (T^t)_{ji}\log (T^t)_{ji}.\end{array}$$
(3)

Thus both entropies depend on the duration of the process t. Note that probability is normalized \(\mathop {\sum}\limits_i {(T^t)_{ji}} = 1\) for all j, t.
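These quantities are straightforward to compute numerically. Below is a minimal Python/NumPy sketch (the function names are ours, not from any accompanying code) that evaluates 〈ST〉 and 〈SR〉 of Eq. (3) for a row-stochastic transition matrix and a prior, and verifies the identity of Eq. (2):

```python
import numpy as np

def average_entropies(T, p0, t):
    """Average prediction and retrodiction entropies of a t-step process.
    T : (n, n) row-stochastic matrix, T[j, i] = Pr(X_1 = i | X_0 = j)
    p0: (n,) prior distribution over initial states."""
    Tt = np.linalg.matrix_power(T, t)              # t-step transition matrix
    pt = p0 @ Tt                                   # final distribution P^(t)
    with np.errstate(divide="ignore", invalid="ignore"):
        S_T = -np.nansum(p0[:, None] * Tt * np.log(Tt))   # Eq. (3); 0 log 0 = 0
        R = (Tt * p0[:, None]).T / pt[:, None]            # Bayes' theorem
        S_R = -np.nansum(pt[:, None] * R * np.log(R))
    return S_T, S_R, pt

def shannon(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# consistency check of Eq. (2): <S_R> = <S_T> - (S_t - S_0)
rng = np.random.default_rng(0)
T = rng.random((5, 5)); T /= T.sum(axis=1, keepdims=True)
p0 = np.full(5, 1 / 5)
S_T, S_R, pt = average_entropies(T, p0, t=3)
assert np.isclose(S_R, S_T - (shannon(pt) - shannon(p0)))
```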

Suppose that it is somehow possible to change the physical parameters of a system slightly, so that the probability of transitions are perturbed, \(T_{ji} \to T_{ji}^\prime = T_{ji} + \epsilon {\mkern 1mu} q_{ji}\), where \(\epsilon\) is a small parameter. For now, we do not assume any structure on q, other than implicitly demanding that it retains probabilities within [0, 1] and preserves the normalization of rows. This variation leads to a change in the t-step transition matrix,

$$\begin{array}{c}({\mathbf{T}} + \epsilon {\mkern 1mu} {\mathbf{q}})^t = {\mathbf{T}}^t + \mathop {\sum}\limits_{p = 1}^t {\epsilon ^p} {\boldsymbol{\eta }}^{(t,p)}\\ {\boldsymbol{\eta }}^{(t,p)} = \mathop {\sum}\limits_{1 \le k_1 < ... < k_p \le t} {{\mathbf{T}}^{k_1 - 1}} {\boldsymbol{\xi }}_{k_2 - k_1 - 1}{\boldsymbol{\xi }}_{k_3 - k_2 - 1} \ldots {\boldsymbol{\xi }}_{t - k_p}\end{array}$$
(4)

where ξk = qTk. The superscripts of η(t,p) refer to the power of the transition matrix, t, and the order of the contribution, p, which is analogous to the order of a derivative: η(t,p) is the p-th order contribution to the varied t-step transition matrix. In the sequel we will study first variations, so we will only need

$${\boldsymbol{\eta }}^{(t,1)} \equiv {\boldsymbol{\eta }}^{(t)} = {\mathbf{qT}}^{t - 1} + {\mathbf{TqT}}^{t - 2} + \ldots + {\mathbf{T}}^{t - 1}{\mathbf{q}}.$$
(5)

The difference between the entropies of the perturbed and the original systems is Δ〈ST,R〉 ≡ 〈ST,R(T′)〉 − 〈ST,R(T)〉. Whenever Δ〈ST,R〉 is of order \(\epsilon\) and higher, we can evaluate the variation

$$\delta \langle S_{{\mathrm{T,R}}}\rangle = \mathop {{\lim }}\limits_{\epsilon \to 0} \Delta \langle S_{{\mathrm{T,R}}}\rangle /\epsilon ,$$
(6)

which in essence is the derivative of 〈SR〉 or 〈ST〉 in the q “direction”.

With a little algebra, we can show that the first order perturbations of the t-step entropies 〈SR〉 and 〈ST〉 are

$$\begin{array}{c}\Delta \langle S_{\mathrm{T}}\rangle = - \epsilon \log \epsilon \mathop {\sum}\limits_{i,j} {{\mathbb{1}}_{\mathrm{T}}^c} P_j^{(0)}\eta _{ji}^{(t)}\\ - \epsilon \mathop {\sum}\limits_{i,j} {P_j^{(0)}} \eta _{ji}^{(t)}\left[ {{\mathbb{1}}_{\mathrm{T}}(1 + \log (T^t)_{ji}) + {\mathbb{1}}_{\mathrm{T}}^c\log \eta _{ji}^{(t)}} \right]\end{array}$$
(7)
$$\begin{array}{c}\Delta \langle S_{\mathrm{R}}\rangle = - \epsilon \log \epsilon \mathop {\sum}\limits_{i,j} {{\mathbb{1}}_{\mathrm{T}}^c} P_j^{(0)}\eta _{ji}^{(t)}\\ - \epsilon \mathop {\sum}\limits_{i,j} {P_j^{(0)}} \eta _{ji}^{(t)}\left[ {\log [(T^t)_{ji}/P_i^{(t)}] + {\mathbb{1}}_{\mathrm{T}}^c\left( {\log [\eta _{ji}^{(t)}/P_i^{(t)}] - 1} \right)} \right]\end{array}$$
(8)

The Kronecker functions \({\mathbb{1}}_{{\rm{T}}}\) and \({\mathbb{1}}_{\mathrm{T}}^c\), which implicitly depend on the indices i, j and the time t, are defined by \({\mathbb{1}}_{{\rm{T}}} = 0\) if (Tt)ji = 0 and \({\mathbb{1}}_{{\rm{T}}} = 1\) otherwise, with \({\mathbb{1}}_{\mathrm{T}}^c = 1 - {\mathbb{1}}_{\mathrm{T}}\).

While these equations ostensibly require a summation over all paths of the system, path integration in this setting is simply matrix multiplication, e.g., Tt. This allows the calculation to be easily defined, and accomplished in polynomial time, \({\cal{O}}(n^3\log t)\). The sums over states in Eqs. (7) and (8) are also of polynomial complexity, \({\cal{O}}(n^2)\).
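Concretely, Eq. (5) can be accumulated with t matrix products. A sketch, continuing the NumPy setting above (note that this naive accumulation costs \({\cal{O}}(tn^3)\); the \({\cal{O}}(n^3\log t)\) figure applies to computing Tt alone by repeated squaring):

```python
def eta_first_order(T, q, t):
    """First-order variation of T^t, Eq. (5):
    eta^(t) = q T^(t-1) + T q T^(t-2) + ... + T^(t-1) q."""
    eta = np.zeros_like(T)
    left = np.eye(T.shape[0])                  # holds T^k as k advances
    for k in range(t):
        eta += left @ q @ np.linalg.matrix_power(T, t - k - 1)
        left = left @ T
    return eta

# finite-difference sanity check, for any row-sum-zero q and small eps:
#   (T + eps * q)^t - T^t  ~  eps * eta_first_order(T, q, t)
```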

As we see, the \(\epsilon {\mathrm{log}}\epsilon\) terms can cause the limit, Eq. (6), to diverge, producing a sharp, singular change in entropy generation. This is expected: the divergence happens only when the perturbation enables a path between two states where there was none. This is because (Tt)ji = 0 only if i could not be reached from j in t steps, and if this remains true after the perturbation, the corresponding ηji term is zero.

On the other hand, if the perturbation does not enable a path between two isolated states, but preserves the topology of the transition matrix, then Eq. (7) simplifies considerably; the divergent \({\mathbb{1}}^{c}\) terms vanish, and we take the limit,

$$\begin{array}{l}\delta \langle S_{\mathrm{T}}\rangle = - \mathop {\sum}\limits_{i,j} {P_j^{(0)}} \eta _{ji}^{(t)}[1 + \log (T^t)_{ji}]\\ \delta \langle S_{\mathrm{R}}\rangle = - \mathop {\sum}\limits_{i,j} {P_j^{(0)}} \eta _{ji}^{(t)}\log [(T^t)_{ji}/P_i^{(t)}]\end{array}$$
(9)
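In the topology-preserving case, Eq. (9) translates directly into code. A sketch, reusing eta_first_order from above and masking the structurally zero entries of Tt, where the perturbation must also vanish:

```python
def entropy_gradients(T, p0, q, t):
    """First variations of <S_T> and <S_R>, Eq. (9), assuming q preserves
    the sparsity pattern of T (so the indicator terms of Eqs. (7)-(8) vanish)."""
    Tt = np.linalg.matrix_power(T, t)
    pt = p0 @ Tt
    eta = eta_first_order(T, q, t)
    mask = Tt > 0
    logTt = np.log(np.where(mask, Tt, 1.0))    # placeholder 1: log never used off-mask
    dS_T = -np.sum(p0[:, None] * eta * (1.0 + logTt), where=mask)
    dS_R = -np.sum(p0[:, None] * eta * (logTt - np.log(pt)[None, :]), where=mask)
    return dS_T, dS_R
```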

Having established a very general theoretical framework, we now implement these ideas on two broad classes of stochastic systems for which the structure of the perturbing matrix q is specified further. We first consider random transition matrices drawn from a matrix ensemble. Second, we study a physical application—we enhance the predictability and retrodictability of thermalizing quantum mechanical systems by means of an external potential.

Improving the inferability of Markov processes

We start by studying a general class of perturbations that can be applied to an arbitrary Markov process, and evaluate the associated entropy gradient, which can be thought of as the direction in matrix space that locally changes 〈SR〉 or 〈ST〉 the most (Fig. 1). As we climb up or down the entropy gradient, we show how the transition matrix evolves (Fig. 2).

Fig. 1

Ascending the space of transition matrices to maximize predictability and retrodictability. Each point in the space of Markov transition matrices, represented by the x, y plane, has an associated predictive and retrodictive entropy. Equation (11) allows us to find the direction in network space—parameterized by the transition rates Tji—in which entropy locally increases (or decreases) the most. Perturbations can then be applied to move the network in that direction, leading to a system that is more susceptible to inference. Red dots represent different starting networks which climb along the black paths, via gradient ascent, to an entropy maximum, represented by a green dot

Fig. 2

Entropy extremization of a Markov process. Entropy during the evolution of a Markov network under the extremization procedure, Eq. (11). The parameter λ corresponds to “how many times” the perturbation operator has been applied—it is the integrated L2 distance of how far along the entropy gradient we have pushed the transition network. The graphs in panels (a)–(h) are pictorial representations of the Markov transition matrices; the points in the evolution from which the graphs are sampled are marked with lines and the corresponding panel letters. The entropy curves, 〈ST〉 and 〈SR〉, correspond to how easy it is to predict the final state or retrodict the initial state of the Markov process. The network is a random geometric network, with adjacency matrix, S, picked from an ensemble \(P(S_{ji} = 1) = {\mathrm{e}}^{ - \beta {\kern 1pt} d_{ij}}\), where dij is the distance between i and j in a circular metric (where node n is adjacent to node 1). This is turned into a discrete diffusion matrix, T, by normalizing the rows of S. Here, β = 0.5 and n = 30 states. We optimize our entropies for a t = 3 step process. The purple line along the top marks the maximum possible entropy. a–d The graphs corresponding to 20%, 10%, 0% (original network), and −10% inference improvement, respectively, when extremizing 〈ST〉. e–h The graphs corresponding to 20%, 10%, 0% (original network), and −9% inference improvement, respectively, when extremizing 〈SR〉. i How the entropies change as we extremize 〈ST〉, with four samples of transition probability networks. j How the entropies change as we extremize 〈SR〉

We consider a family of perturbations that vary the relative strength of any transition rate. This involves changing one element in the transition matrix while reallocating the difference to the remaining nonzero rates so that the total probability remains normalized. In other words,

$$\Delta _{\beta \alpha }^{(\epsilon )}T_{ji} = T_{ji} + \epsilon \cdot {\mathbb{1}}_{j\beta }({\mathbb{1}}_{i\alpha } - T_{\beta i}).$$
(10)

To first order in \(\epsilon\), this is the same as adding \(\epsilon\) to the (i, j) element, and then dividing the row by \(1 + \epsilon\) to normalize, so it is a natural choice for a perturbation operator. It also obeys \(\Delta ^{(\epsilon )}\Delta ^{( - \epsilon )}{\mathbf{T}} = {\mathbb{1}} + {\cal{O}}(\epsilon ^2)\). We define the perturbation acting on a zero element to be zero if \(\epsilon < 0\) since elements of the transition matrix must be non-negative. From Eqs. (10) and (4), we obtain the perturbed matrices and perturbed 〈SR〉, 〈ST〉.
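In code, the operator of Eq. (10) is a single row update; note that it preserves the row normalization exactly, not merely to first order (a minimal sketch, using the row-to-column convention Tji = Pr(i|j)):

```python
def perturb(T, beta, alpha, eps):
    """Eq. (10): nudge T[beta, alpha] upward by eps while draining the
    rest of row beta, so the row sum stays exactly 1."""
    Tp = T.copy()
    e = np.zeros(T.shape[1]); e[alpha] = 1.0
    Tp[beta] += eps * (e - T[beta])
    return Tp
```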

To study the effect of successive perturbations of the form Eq. (10), we carry out a gradient ascent algorithm in matrix space. At each iteration, we change the transition rates infinitesimally to maximally increase or decrease retrodiction or prediction entropy. We parameterize the gradient ascent by L2 distance in matrix space, i.e., \({\mathrm{d}}(A,B) = \left\| {A - B} \right\|_2 = \left[ {\mathop {\sum}\limits_{i,j} {(A_{ji} - B_{ji})^2} } \right]^{1/2}\).

In a gradient ascent algorithm, one ascends a function f(r) over a parameter t (time) by solving \({\dot{\mathbf{r}}} = \nabla f({\mathbf{r}})/\left\| {\nabla f({\mathbf{r}})} \right\|\), where the normalization ensures that \(\|{\dot{\mathbf{r}}}(t)\| = 1\), so the total distance traveled along the path r(t) is just t.

Similarly, we define our gradients to be either ΔjiSR(T)〉 or ΔjiST(T)〉, depending on whether we are optimizing retrodiction or prediction. We parameterize our path, T(λ), so that the total distance of the path (in L2 matrix space) is λ,

$$\dot T_{ji}(\lambda ) = \mathop {{\lim }}\limits_{\epsilon \to 0} \Delta _{ji}^{(\epsilon )}\langle S_{{\mathrm{T,R}}}(T_{ji})\rangle /\left\| {\Delta _{ji}^{(\epsilon )}\langle S_{{\mathrm{T,R}}}(T_{ji})\rangle } \right\|.$$
(11)

Since we carry out this scheme numerically, using a finite-difference method, it does not matter whether the limit in Eq. (11) exists; where it does not, our numerical scheme simply returns large, finite jumps.

To illustrate our formalism in action, we solve Eq. (11) for a particular example system: diffusion taking place on a directed spatial random network. We build a spatial network in which nodes are placed at regular intervals on a circle and are cross connected with probability \(P(S_{ji} = 1) = {\mathrm{e}}^{ - \beta {\kern 1pt} d_{ij}}\), which decays with the distance dij42. The transition matrix T is obtained by normalizing the rows of S. For the prior, we use a uniform distribution over all states.
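A sketch of this construction (the handling of the ring geometry is our reading of the ensemble and may differ in detail from the original implementation):

```python
def spatial_random_network(n=30, beta=0.5, rng=None):
    """Directed spatial random network on a ring: edge j -> i appears with
    probability exp(-beta * d_ij) in the circular metric; rows of the
    adjacency matrix S are then normalized into a diffusion matrix T."""
    rng = rng or np.random.default_rng()
    idx = np.arange(n)
    d = np.abs(idx[:, None] - idx[None, :])
    d = np.minimum(d, n - d)        # circular distance (node n adjacent to node 1)
    S = (rng.random((n, n)) < np.exp(-beta * d)).astype(float)
    # d = 0 on the diagonal gives probability 1, so no row is empty
    return S / S.sum(axis=1, keepdims=True)
```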

An example is shown in Fig. 2, where the 3-step (t = 3) predictability and retrodictability change as the transition matrix is perturbed iteratively.

Snapshots of the perturbed matrix, for different λ values, when predictability is extremized can be seen in Fig. 2a–d, represented as networks with edge thicknesses proportional to the transition rates. Similarly, sample transition networks when retrodictability is optimized can be seen in Fig. 2e–h. The behavior of the prediction and retrodiction entropies as λ is varied can be seen in Fig. 2i, j, where dotted lines mark the λ values corresponding to the networks in Fig. 2a–h. We observe that inferential success can be improved by 10–20% with only minor changes in the network structure. This is quantified in further detail below.

We now interpret our results to ensure that our theoretical framework makes qualitative sense and works as expected. First, we observe that perturbations that maximize both 〈ST〉 and 〈SR〉 displace the transition matrix toward the same point: in both cases T evolves to a point where (Tt)ji = pi, a probability vector. In other words, the probability of transition does not depend on what state the system is currently in. Taking the third power of the T matrix for large values of λ reveals that this is indeed the case, although, of course, T itself can retain some complex structure. As expected, when a system moves from any state to any other state with equal likelihood, it is most difficult to infer its past or future.

In contrast, minimizing entropy produces two very different transition matrices, for λ ≪ 0, depending on the type of entropy we minimize. The global minima of the prediction entropy are transition matrices in which all probability in each connected component flows toward a single node, reachable in t steps. A process where the initial state uniquely determines the final state is indeed trivial to predict. As we increase the number of time steps over which we optimize predictability, the number of “layers” of the transition network increases (Fig. 3a, b).

Fig. 3

Extremal networks. We start with a random transition network and show the structure of the transition matrix as it undergoes large amounts of optimization. We optimize predictability to the extent that the final state, given the initial state, could be correctly guessed on the first try 99% of the time (panels (a), (b)). We optimize retrodictability to the extent that, given the final state, the initial state could be correctly guessed on the first try 75% of the time (panels (c), (d)). Panels (a, c) optimize entropy for t = 3 step processes, whereas panels (b, d) do so for t = 6 step processes. Panels (a, b) optimize 〈ST〉, while panels (c, d) optimize 〈SR〉

On the other hand, minimizing retrodiction entropy tends to eliminate branches and fragment the network into linear chains (including isolated nodes). The start of this process can be seen in Fig. 2e. When the splitting is complete, probability flows through these fragments unidirectionally, so retrodiction involves nothing more than tracing back a linear path (Fig. 3c, d).

This also explains why 〈SR〉 tends to stay the same in the λ < 0 direction when minimizing 〈ST〉. If St = 〈ST〉 = 0, then Eq. (2) implies 〈SR〉 = S0, which is the maximum possible value for 〈SR〉. This can also be understood intuitively—if, when a final measurement is made, the system is always found to be in a unique accumulating state, the measurement yields no information about what state the system started in. If, however, the minimal 〈ST〉 network instead has multiple connected components and collector nodes, {kj}, then there can be a decrease in 〈SR〉, since \(R_{k_0}\), \(R_{k_1}\), … are different distributions.

So far, we have only extremized entropy; we have not yet shown that this leads to a significant difference in our ability to infer the past or future. We do so by reporting how often, on average, we can identify the correct initial (final) state of the system, given the final (initial) state. Note that this metric is not extensive: with an increasing number of states, the probability mass of even the “best guess” approaches zero. Nevertheless, we adopt this demanding metric. For both predicting the final state and retrodicting the initial state, we perform maximum likelihood inference: we pick the state with the highest probability, conditioned on the observed final or initial state, as our guess. From the transition probability, Tji, and retrodiction probability, Rji, we can calculate the probability that our guess at the initial or final state will be correct (cf. “inference performance” in the Methods section).

We plot how inference performance changes as we manipulate transition rates in Fig. 4. The transition matrices used in this test are the same as those shown in Fig. 2. The success rates of predicting final states and retrodicting initial states while optimizing either 〈SR〉 or 〈ST〉 are plotted. Since there are 30 states in our network, the baseline accuracy is 1/30 = 3.3%, which is marked with a dashed gray line. Our success rate aligns well with the entropy in Fig. 2i. If we were to continue to larger values of λ, we would reach almost 100% accuracy when minimizing 〈ST〉 or 〈SR〉.

Fig. 4

Performance in predicting initial or final states. The performance of prediction and retrodiction on evolving random Markov transition networks. The four cases plotted are either correct inferences of the initial state (retrodiction) or correct inferences of the final state (prediction), while either optimizing 〈SR〉 or optimizing 〈ST〉. As a baseline, a strategy of random guessing would identify the initial or final state correctly 3.3% of the time (since there are 30 states). This baseline is depicted as a dashed gray line

The improvement in retrodictability always lags behind that of predictability. This is because 〈SR〉 must be greater than 〈ST〉: by Eq. (2), 〈SR〉 − 〈ST〉 = S0 − St, and since our prior is uniform, S0 is the maximum possible entropy, so S0 ≥ St.

Naturally, descending an entropy landscape all the way returns transition matrices with trivial structure and dynamics. In our diffusion example, one could have guessed from the beginning that a network with only inward branches, or one with disconnected linear chains, would be much more predictable than an all-to-all network with equally distributed weights. However, our formulation is useful not because it eventually transforms every network into a trivial network, but because it provides the steepest direction toward a trivial network, and because, among the many trivial networks, it moves us toward the closest one. Thus, we must determine the effectiveness of small perturbations, far before the system turns into a trivial one.

We find, indeed, that significant differences in inferential success can be achieved with relatively small changes to the transition matrix. Table 1 quantifies how much the transition matrix has been modified, versus how much our retrodictive (top three rows) and predictive (bottom three rows) success have improved. For example, the fifth row shows that if we would like to be spot-on correct in predicting the final state of a stochastic process with 30 states and 900 transitions, our success rate can be improved by ~5% by modifying only 8 out of 900 transition rates by more than 0.1, with none changing by more than 0.2. The cumulative change in all transition rates for this perturbation totals 4.34, the equivalent of adding four edges. The changes required to improve our success rate by 10% are not much larger (Table 1).

Table 1 Matrix retrodictability and structure

As a final point of interest, we see that for all the ±5% entries in Table 1, λ and the L2 distance are almost identical. This means that to get from the initial matrix to the perturbed matrices, one could follow the gradient calculated at the initial matrix in a straight line—the path is roughly straight in matrix space for at least that distance.

Improving the inferability of quantum systems via external fields

In a physically realistic scenario, one is unlikely to have full control over individual transitions. An experimentalist can only tune physical parameters, such as external fields or temperature, which influence the transition matrix indirectly. Furthermore, it is often not practical to vary physical parameters by arbitrarily large amounts. Ideally, then, we should improve predictability and retrodictability optimally while applying only small fields.

To meet these goals, we consider a class of quantum systems in or out of equilibrium with a thermal bath. These systems are fully characterized by eigenstates ψ1, …, ψn with energies E1, …, En undergoing Metropolis–Hastings dynamics43, in which the system attempts to transition to the energy level above or below with equal probability; an attempted decay always succeeds, while an attempted excitation succeeds with probability exp[−β(Ek+1 − Ek)]:

$$T_{k,j} = \left( {\begin{array}{*{20}{l}} {\frac{1}{2}{\mathrm{exp}}[ - \beta (E_{k + 1} - E_k)]} \hfill & {j = k + 1} \hfill \\ {\frac{1}{2}(1 - {\mathrm{exp}}[ - \beta (E_{k + 1} - E_k)])} \hfill & {j = k} \hfill \\ {\frac{1}{2}} \hfill & {j = k - 1} \hfill \\ 0 \hfill & {|j - k| > 1} \hfill \end{array}} \right.$$
(12)

Furthermore, we assume that the ground state E0 cannot decay, and that the highest state En is unexcitable. For the regime of validity of Markovian descriptions of thermalized quantum systems, we refer to refs. 44,45.
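The transition matrix of Eq. (12), with these boundary conditions, can be assembled as follows (a sketch; we read the boundary rows as keeping the failed-attempt probability on the diagonal so that every row remains normalized):

```python
def metropolis_matrix(E, beta):
    """Tridiagonal Metropolis-Hastings dynamics of Eq. (12): attempt up or
    down with probability 1/2; decays always succeed, excitations succeed
    with the Boltzmann factor. The ground state cannot decay and the top
    state cannot be excited."""
    n = len(E)
    T = np.zeros((n, n))
    for k in range(n):
        up = 0.5 * np.exp(-beta * (E[k + 1] - E[k])) if k < n - 1 else 0.0
        down = 0.5 if k > 0 else 0.0
        if k < n - 1:
            T[k, k + 1] = up
        if k > 0:
            T[k, k - 1] = down
        T[k, k] = 1.0 - up - down          # failed attempts: stay put
    return T
```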

We now determine the effects of a small perturbing potential v(x). The perturbation will shift the energy levels, which changes the transition matrix, which in turn changes the average prediction and retrodiction entropies of the system. Our goal is to identify what perturbing potential would maximally change these entropies. Since we are concerned with the first order variation in entropy, it will suffice to also use first order perturbation theory to calculate energy shifts.

The perturbed k-th energy level is \(E_k = E_k^{(0)} + \epsilon \cdot \delta E_k\). When the perturbation is applied, the exponential terms in T change as

$$\begin{array}{l}{\mathrm{e}}^{ - \beta (E_{k + 1} - E_k)} \to {\mathrm{e}}^{ - \beta (E_{k + 1} - E_k) - \epsilon {\kern 1pt} \beta (\delta E_{k + 1} - \delta E_k)}\\ = \left[ {1 - \epsilon {\mkern 1mu} \beta (\delta E_{k + 1} - \delta E_k)} \right]{\mathrm{e}}^{ - \beta (E_{k + 1} - E_k)} + {\cal{O}}(\epsilon ^2).\end{array}$$

From this, we can find our first order change \(T_{ji}^\prime = T_{ji} + \epsilon {\mkern 1mu} q_{ji}\) in terms of the change in energy levels, δEk,

$$\begin{array}{l}q_{kj} = - \beta (\delta E_{k + 1} - \delta E_k)\exp [ - \beta (E_{k + 1} - E_k)]\cdot S_{kj}\\ S_{kj} = {\mathbb{1}}_{j,k + 1} - {\mathbb{1}}_{j,k} = \left\{ {\begin{array}{*{20}{l}} { + 1} \hfill & {j = k + 1} \hfill \\ { - 1} \hfill & {j = k} \hfill \\ 0 \hfill & {j \ne k,j \ne k + 1} \hfill \end{array}} \right..\end{array}$$
(13)

We now write the variations of the prediction and retrodiction entropies, δ〈ST,R〉, as functionals of a perturbing potential, and then use the calculus of variations to obtain the extremizing potential. For clarity, we derive our equations in one dimension; the generalization to higher dimensions is straightforward.

We partition the spatial domain, Ω, into N intervals, [xi, xi+1), of width Δx and let our potential be a piecewise constant function of the form \(v(x) = \mathop {\sum}\limits_{i = 0}^{N - 1} {v_i} {\mkern 1mu} {\mathbb{1}}_{x \in [x_i,x_{i + 1})}.\) As N → ∞, the first order change in the k-th energy level is

$$\delta E_k = \langle \psi _k|v|\psi _k\rangle \sim \mathop {\sum}\limits_{i = 0}^{N - 1} {v_i} |\psi _k(x_i)|^2\Delta x$$
(14)

since \({\int}_{x_i}^{x_{i + 1}} {v_i} |\psi _k(x)|^2dx\sim v_i|\psi _k(x_i)|^2\Delta x\). We substitute the δEk, Eq. (14), into Eq. (13) to get the q matrix,

$$\begin{array}{c}q_{kj} = \mathop {\sum}\limits_{i = 0}^{N - 1} {v_i} {\mkern 1mu} \beta {\mkern 1mu} \left[ {|\psi _k(x_i)|^2 - |\psi _{k + 1}(x_i)|^2} \right]{\mathrm{e}}^{ - \beta (E_{k + 1} - E_k)}S_{kj}\Delta x\\ \equiv \mathop {\sum}\limits_{i = 0}^{N - 1} {v_i} {\mkern 1mu} q_{kj}(x_i){\mkern 1mu} \Delta x \to {\int}_\Omega v (x)\tilde q_{kj}(x)dx\end{array}$$
(15)
$$\tilde q_{kj}(x) \equiv \beta \left( {|\psi _k(x)|^2 - |\psi _{k + 1}(x)|^2} \right){\mkern 1mu} {\mathrm{e}}^{ - \beta (E_{k + 1} - E_k)}\cdot S_{kj}$$
(16)

We substitute this into Eq. (5) to get

$$\eta _{ji}^{(t)} = {\int}_\Omega d x{\mkern 1mu} v(x)\mathop {\sum}\limits_{k = 0}^{t - 1} {({\mathbf{T}}^k{\tilde{\mathbf{q}}}(x){\mathbf{T}}^{t - k - 1})_{ji}} \equiv {\int}_\Omega d x{\mkern 1mu} v(x){\mkern 1mu} \tilde \eta _{ji}^{(t)}(x)\\ \tilde \eta _{ji}^{(t)}(x) \equiv \mathop {\sum}\limits_{k = 0}^{t - 1} {({\mathbf{T}}^k{\tilde{\mathbf{q}}}(x){\mkern 1mu} {\mathbf{T}}^{t - k - 1})_{ji}}$$

and therefore,

$$\delta \langle S_{{\mathrm{T,R}}}\rangle [v] = - {\int}_\Omega d x{\mkern 1mu} v(x)\frac{{\delta ^2\langle S_{{\mathrm{T,R}}}\rangle }}{{\delta x{\mkern 1mu} \delta v}}$$
(17)

where δ2ST,R〉/δxδv is Eq. (9) with \(\tilde \eta _{ji}^{(t)}(x)\) substituted in for \(\eta _{ji}^{(t)}\).

Last, we ensure the smallness of the perturbation by introducing a penalty functional, \(C[v] = \frac{1}{2}\gamma {\int} v (x)^2dx\), and ask what potential v(x) extremizes

$$F_{{\mathrm{T,R}}} = \delta \langle S_{{\mathrm{T,R}}}\rangle - C = - {\int}_\Omega {\left( {v(x)\frac{{\delta ^2\langle S_{{\mathrm{T,R}}}\rangle }}{{\delta x{\mkern 1mu} \delta v}} + \frac{1}{2}\gamma {\mkern 1mu} v(x)^2} \right)} dx.$$

We take a variational derivative with respect to v(x) and set it to zero to obtain the extremizing potential,

$$v_{{\mathrm{T,R}}}(x) = - \frac{1}{\gamma }\frac{{\delta ^2\langle S_{{\mathrm{T,R}}}\rangle }}{{\delta x{\mkern 1mu} \delta v}}.$$
(18)

This vT,R is the external potential that extremizes the gradient of entropy minus the penalty functional.
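For a concrete system (anticipating the harmonic oscillator below), the chain Eq. (16) → η̃ → Eq. (9) → Eq. (18) can be evaluated on a spatial grid. The sketch reuses metropolis_matrix and eta_first_order from above and treats the 〈ST〉 case; since sign conventions are easy to scramble along this chain, one should confirm numerically that applying the returned potential shifts the entropy in the intended direction, as is done via the Δ% curves below.

```python
from math import factorial
from numpy.polynomial.hermite import hermval

def psi_sq(k, x):
    """|psi_k(x)|^2 for the harmonic oscillator (m = hbar = omega = 1)."""
    c = np.zeros(k + 1); c[k] = 1.0
    norm = np.pi**-0.25 / np.sqrt(2.0**k * factorial(k))
    return (norm * np.exp(-x**2 / 2) * hermval(x, c))**2

def extremal_potential(x, n, beta1, beta2, t, gamma=1.0):
    """Potential of Eq. (18) for <S_T> on the grid x; beta1 sets the bath
    (transition matrix), beta2 the prior; n energy levels kept."""
    E = np.arange(n) + 0.5
    T = metropolis_matrix(E, beta1)
    p0 = np.exp(-beta2 * E); p0 /= p0.sum()
    Tt = np.linalg.matrix_power(T, t)
    mask = Tt > 0
    logTt = np.log(np.where(mask, Tt, 1.0))
    probs = np.array([psi_sq(k, x) for k in range(n)])   # shape (n, len(x))
    v = np.zeros_like(x)
    for m in range(len(x)):
        qt = np.zeros((n, n))                # q-tilde of Eq. (16) at x[m]
        for k in range(n - 1):
            amp = beta1 * (probs[k, m] - probs[k + 1, m]) \
                  * np.exp(-beta1 * (E[k + 1] - E[k]))
            qt[k, k + 1] += amp              # S_{k,k+1} = +1
            qt[k, k] -= amp                  # S_{k,k}   = -1
        eta = eta_first_order(T, qt, t)      # eta-tilde at x[m]
        grad = -np.sum(p0[:, None] * eta * (1.0 + logTt), where=mask)
        v[m] = -grad / gamma                 # Eq. (18)
    return v
```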

Improving inferability for a thermalizing quantum oscillator

We can now ask what perturbing external field should be applied to a quantum harmonic oscillator that is in the process of warming up or cooling down, in order to improve its predictability or retrodictability. For this system \(V(x) = \frac{1}{2}m\omega ^2x^2\), and \(E_k = (k + \frac{1}{2})\hbar \omega\). The stationary eigenfunctions are \(\psi _k(x) = \frac{1}{{\sqrt {2^kk!} }}\pi ^{ - 1/4}\exp \left( { - \frac{{x^2}}{2}} \right)H_k(x)\) where Hk is the k-th Hermite polynomial, \(H_k(x) = ( - 1)^k{\mathrm{e}}^{x^2}\frac{{d^k}}{{dx^k}}{\mathrm{e}}^{ - x^2}\). For concreteness, we also have to choose a prior distribution on states. We choose the prior to be an equilibrium distribution at a (possibly different) temperature, \(P_k \propto {\mathrm{e}}^{ - \beta _2E_k}\). We truncate the transition matrix at an energy \(E_n \gg 1/\beta_1, 1/\beta_2\) so that edge effects are negligible. We take m = ħ = ω = 1, and choose U to be the negative of Eq. (18) so that adding it to V(x) decreases the corresponding entropy, and increases inference performance.

The initial and final temperatures determine the flow of probability. The equilibrium distribution at a high temperature has much more probability mass at higher energy states than an equilibrium distribution at a low temperature, so if we start with a high temperature and quench to a low temperature, there will tend to be a flow of probability from high states to low states. The opposite will happen when we quench from a low to a high temperature. We use T = 1 as the low temperature, and T = 10 as the high temperature.

To ensure that each perturbing potential, U(x), actually increases or decreases predictability/retrodictability (depending on whether we add or subtract it from V(x)), we calculate the “Δ%” for retrodiction and prediction: the percent difference in how often we can correctly guess the initial or final state, upon perturbing the system. The performance is obtained similarly to that in Fig. 4 (cf. Methods section). The perturbation potential is normalized to up(x) = U(x)/‖U‖, so that the L2 norm of up(x) is 1, and the strength, λ, with which up is applied is varied so that the total potential is V(x) + λup(x).

Figure 5 shows some extremizing potentials for a system that was at one temperature, and is then suddenly quenched to a different temperature. Figure 5a shows optimal potentials for a system quenched from a high temperature to a low temperature while optimizing 〈SR〉 for t = 1, 3, and 7 time steps. Figure 5b, c shows the change in inference success as the potential is applied at varying strengths, λ. Figure 5d–f shows the same quantities (extremizing potential and change in inference) for a system at a low temperature quenched to a high temperature, while optimizing 〈SR〉. This potential is also optimal for 〈ST〉. Finally, Fig. 5g–i shows a high temperature system quenched to a low temperature while optimizing 〈ST〉.

Fig. 5

The external fields and performance checks. We take ħ = ω = m = 1 and plot perturbations that minimize 〈SR〉 or 〈ST〉 for the quantum harmonic oscillator for processes taking t = 1, 3, 7 time steps, Eq. (18). The potentials, U(x), which are (negatives of) the solutions to Eq. (18), are normalized by their L2 norm. We plot this normalized potential, up(x) = U(x)/‖U‖. The normalized potential can be added to the system as a perturbation with different strengths, λ, i.e., \(\hat H_{{\mathrm{osc}}} \to \hat H_{{\mathrm{osc}}} + \lambda {\mkern 1mu} u_p(x)\). Along with the normalized perturbing potential, we plot the change in inference success, Δ% (what percentage of the time the initial or final state of the system can be predicted or retrodicted), vs. the strength of the applied perturbation. Locally, this curve should have positive slope. a–c A high temperature (T = 10) equilibrium system is quenched to a low temperature (T = 1) system. These potentials extremize 〈SR〉. d–f A low temperature (T = 1) system quenched to a high temperature (T = 10) system. Note that the large scale shape of the potential, panel (d), is similar to that of panel (a). These potentials extremize both 〈SR〉 and 〈ST〉. g–i A high temperature (T = 10) equilibrium system is quenched to a low temperature (T = 1) system. These potentials extremize 〈ST〉

To quantify how significantly the perturbations change the quantum system, we keep track of the L1 difference in eigenvalue spacing, i.e., \({\cal{S}} \equiv \mathop {\sum}\limits_{k = 1}^{30} {|E_k^\prime - E_{k - 1}^\prime - \hbar \omega |}\). The largest value \({\cal{S}}\) achieves for any potential and applied strength shown in Fig. 5 is ~1.2. In other words, we can get a few percent change in success rate by introducing a change to all energy levels that amounts, in total, to about one level spacing. Note that this is a single-step perturbation along a single direction, rather than an iterated one.

This example illustrates how to combine real, physical, continuous quantities, such as perturbation potentials, with the more abstract formalism of evaluating the entropy of Markov transition matrices with discrete states. The general procedure we outlined in this section can also be applied to other thermal systems, quantum or otherwise.

Discussion

We developed a formalism describing exactly how predictability and retrodictability change in response to small changes in a transition matrix, and used it to descend entropy landscapes to optimally improve the accuracy with which the past or future of a stochastic system can be inferred. Our main results are the equations relating perturbations of Markov processes to the changes in the average prediction and retrodiction entropies of the system, Eqs. (4), (7), and (9).

We specifically focused on Markov processes, not only because they yield to mathematical analysis, but also because many important processes in the physical, biological, and social sciences are Markovian. That being said, the general principle outlined here can also be used in systems with memory, or in other inference problems such as the determination of unknown boundary conditions, system parameters, or driving forces.

As examples of manipulating predictability and retrodictability, we studied two specific types of perturbations, Eqs. (10) and (13), and used these to study how certain types of transition matrices evolve as they flow along the trajectory of maximal increase in retrodictability and predictability. We found that the transition networks tend to cull their connections and split into cycles and chains when we try to minimize retrodiction entropy. Conversely, the transition networks become fully connected when we attempt to maximize either inferential entropy. If one does not have full control over transition rates, one can steer a system toward the direction of either extreme by a small amount. Finally, as a physical example, we studied how to find the perturbing potential that extremally changes the predictability and retrodictability of a thermalizing quantum system.

Our formulas lead us to intuitive results, such as the divergence of entropy generation when a path between two otherwise isolated states is enabled. However, they also lead us to less obvious conclusions, such as how predictability changes when retrodictability is optimized (and vice versa), or the shape of the optimal potentials perturbing a thermalizing quantum system.

Our basic equations, Eqs. (4), (7), and (9), are very generally applicable to any discrete-time Markov process. The type of transition matrix perturbations we chose to study, namely those in Eqs. (10) and (13) are natural and practical choices, but of course, they are not the only two possibilities. For example, an operator that takes two matrix elements 0 < Tja, Tjb < 1 and “transfers” probability between them, changing them to \(T_{ja} + \epsilon ,T_{jb} - \epsilon\) would make an interesting future study.

In our work we observed an intriguing asymmetry between prediction and retrodiction. In particular, we observe that predictability is more easily improved than retrodictability. This is a byproduct of how we set up our problem: we took the initial distribution, the probability vector P(0), and the forward dynamics, T, as givens, and found the final probability, P(t), by propagating P(0) with T. Had we instead fixed the final distribution P(t) and the backwards dynamics, \(\widetilde {\mathbf{T}}\), and taken P(0) to be the back-evolved distribution, our results would reverse.

An experimenter only has control over the prior distribution at the current time, P(0); she cannot in general decide what distribution she wants at an arbitrary future time, P(t), and pick a P(0) that results in that specified P(t). The fact that we set up the problem so that t = 0 is the “controlled” time, with the state at the final time being the result of the choices made at t = 0, ultimately leads to the seeming emergence of an “arrow of time”46.

Since our method makes changes to a system to extremize the average of a function over a set of trajectories, it could well be considered within the domain of stochastic control theory47,48. However, there are various elements in our approach that depart from classical stochastic control, which typically deals with problems of the form

$$\begin{array}{c}dX_t = f\left( {X_t,v(t);t} \right) + \hat \xi _t\\ C(X_0,v) = \left\langle {\phi (X_T) + {\int}_0^T R (X_t,v;t){\mkern 1mu} dt} \right\rangle _{P(\xi )}\end{array}$$

where Xt is the system trajectory, \(\hat \xi\) is a Wiener process, v is a control parameter, C is a cost function, and ϕ and R are the target cost and some function that quantifies cost-of-control, cost-of-space, cost-of-dynamics, etc.49. The goal is to find the \(\tilde v\) that minimizes C.

One difference is that we do not restrict ourselves to a Wiener process, but allow any valid transition matrix. The control parameter, v, could be the perturbation to the original transition matrix, or it could be some other external parameter which indirectly results in a change in the transition matrix, as in the thermalizing quantum oscillator example.

The second difference is the structure of our cost function. In our case, the cost is an average weighted over priors. For prediction entropy,

$$\langle S_{\mathrm{T}}\rangle = \langle C(X_0)\rangle _{P^{(0)}} = - \langle \langle \log P(X_T|X_0)\rangle _{P(\cdot |X_0)}\rangle _{P^{(0)}}.$$

For a delta function prior, this reduces to the standard control theory cost function, which depends on the initial condition of the system. For retrodiction entropy SR(XT) the cost depends on the final state, and is then averaged over the posterior distribution of XT,

$$\langle S_{\mathrm{R}}\rangle = \langle C(X_T)\rangle _{P^{(T)}} = - \langle \langle \log R(X_0|X_T)\rangle _{R(\cdot |X_T)}\rangle _{P^{(T)}}.$$

The third difference is a philosophical one. Standard stochastic control aims to find a control protocol that is a global minimum of the cost function—one obtains the field v such that δC[v] = 0 for all variations δv. In contrast, we look for the variation δv that changes C the most, where C is 〈SR〉 or 〈ST〉. Our method descends entropy gradients in a space of system parameters, and is only guaranteed to be optimal locally. It could be paired with a stochastic gradient descent algorithm or simulated annealing to find optima in a larger neighborhood. In passing, we note that for systems with a very large number of states, it would probably be computationally advantageous to use a stochastic algorithm even to compute the local gradient.

There is still plenty of room to make our framework more useful and general. Currently, we assume constant transition rates, and perturb the transition matrix at a single instant. However, transition rates can be time-dependent, in which case we would have to perturb the transition rates differently at different times. A second interesting avenue would be to further explore the costs associated with changing the transition probabilities. Another natural generalization is to extend the problem to continuous time.

Methods

Extremization of entropy

We started with a random geometric graph, T(λ = 0), from the ensemble described in the text, where nodes i and j are connected with probability \({\mathrm{e}}^{ - \beta {\kern 1pt} d(i,j)}\). We used n = 30 node graphs, with β = 0.5. The extremization is done numerically and iteratively, as outlined in Eq. (11). The entropy was that of a t = 3 step process, and we use a perturbation size \(\epsilon = 0.05\) and step size dλ = 0.05.

At each step, the matrix of change in entropy (per \(\epsilon\)) due to perturbation of an element is calculated, \(S_{ji} = \frac{1}{\epsilon }\Delta _{ji}^{(\epsilon )}\langle S[{\mathbf{T}}(\lambda )]\rangle\), where the S in the angled brackets is whichever entropy we seek to extremize T over—either 〈SR〉 or 〈ST〉. To get the updated transition matrix, the (j, i) element of T is perturbed using the standard perturbation operator, Eq. (10), with strength \(\epsilon \prime = d\lambda /\left\| {\mathbf{S}} \right\|\). The order in which we apply these operators is irrelevant up to order \((\epsilon \prime )^2\). The updated transition matrix is then the result of applying all the perturbation operators, one for each element of T. At each step, the prediction and retrodiction entropies of the Markov process were calculated and saved, along with the actual matrix Tji(λ), for plotting purposes. The change in λ at each step is just the L2 distance between the previous matrix and the new, perturbed matrix.

Inference performance

The inference performance can also be calculated analytically as long as we have the transition matrix, Tji, and the prior, P(0), which we do to generate Figs. 4 and 5. Since we guess that the maximally likely state is the correct one, the formulas are

$$\begin{array}{l}C_{\mathrm{T}} = \mathop {\sum}\limits_j {P_j^{(0)}} {\mathrm{max}}_i(T^t)_{ji}\\ C_{\mathrm{R}} = \mathop {\sum}\limits_j {P_j^{(t)}} {\mathrm{max}}_i(R^t)_{ji}.\end{array}$$
(19)

These formulas give us the expected fraction of times we correctly guess the final state given the initial state (CT), or the initial state given the final state (CR). The expression maxi(Tt)ji is the probability that we guess the final state correctly given that the initial state is j, and the prior-weighted sum averages this performance across all possible initial states. The CR equation is analogous, simply substituting the retrodiction probability matrix for the transition matrix.
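A sketch of Eq. (19) (assuming every final state has nonzero probability, so that the Bayesian inversion is well defined):

```python
def inference_performance(T, p0, t):
    """Expected success rates of maximum-likelihood guesses, Eq. (19)."""
    Tt = np.linalg.matrix_power(T, t)
    pt = p0 @ Tt
    C_T = np.sum(p0 * Tt.max(axis=1))         # predict final given initial
    R = (Tt * p0[:, None]).T / pt[:, None]    # retrodiction matrix via Bayes
    C_R = np.sum(pt * R.max(axis=1))          # retrodict initial given final
    return C_T, C_R
```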

As expected, the performance obtained via random trials fits CT, CR almost exactly since we are using a large number of trials.

Thermalizing quantum harmonic oscillator

While it would be difficult to solve Eqs. (18) and (9) analytically to find the extremal change in potential, it is a simple matter to calculate it numerically.

For the harmonic oscillator \(V(x) = \frac{1}{2}m\omega ^2x^2\), and \(E_k = (k + \frac{1}{2})\hbar \omega\). The stationary eigenfunctions are \(\psi _k(x) = \frac{1}{{\sqrt {2^kk!} }}\pi ^{ - 1/4}\exp \left( { - \frac{{x^2}}{2}} \right)H_k(x)\) where Hk is the k-th Hermite polynomial, \(H_k(x) = ( - 1)^ke^{x^2}\frac{{d^k}}{{dx^k}}e^{ - x^2}\). As mentioned in the text, we choose the prior distribution to be an equilibrium distribution at a given temperature, \(P_k \propto e^{ - \beta _2E_k}\). Since we can only store finite vectors on a computer, we only track the first n = 30 energy eigenstates, which is enough that the total probability mass (sum of Gibbs factors) the prior misses by truncation is <5% of the total. We take m = ħ = ω = 1 for simplicity. The perturbation matrix η(x) can be calculated numerically—it is a high order (order 60) polynomial in x times \(e^{ - x^2}\)—and substituted into Eq. (9) to get the (negative) extremal potential, U(x). The potential is then normalized by the L2 norm of U, up(x) = U(x)/‖U‖, where \(\left\| U \right\| = \left( {{\int}_{ - \infty }^\infty U (x)^2{\mkern 1mu} dx} \right)^{1/2}\).

Inference performance for quantum harmonic

We solve for the energy eigenvalues of the harmonic oscillator potential plus the perturbation potentials using the shooting method. For each up, and for each strength, γ, we numerically solve Schrödinger’s equation for the potential \(\frac{1}{2}m\omega ^2x^2 + \gamma {\mkern 1mu} u_p(x)\) at different energies, Etrial. We pick our shooting point to be far outside our region of interest, at x = 15, and evaluate whether the value of the numerical solution is positive or negative at the shooting point. Near an energy eigenvalue, the sign of ψtrial will be (without loss of generality) negative for energies a little below the true eigenvalue, and positive for energies a little above it. We use the bisection method of root finding to approximate the energy eigenvalue with as much precision as we want; our energy eigenvalues are correct to within \(10^{-6}\).
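A minimal sketch of this shooting-and-bisection procedure (ħ = m = 1; a simple second-order finite-difference march rather than a production integrator, and the bracket [E_lo, E_hi] is assumed to contain exactly one sign change of the residual):

```python
def shoot(E, potential, x_max=15.0, n_steps=6000):
    """March psi'' = 2 (V - E) psi from the far left and return psi(x_max);
    sign changes of this residual bracket the true eigenvalues."""
    x = np.linspace(-x_max, x_max, n_steps)
    h = x[1] - x[0]
    f = 2.0 * (potential(x) - E)
    psi = np.zeros_like(x)
    psi[1] = 1e-8                              # decaying-tail boundary condition
    for i in range(1, n_steps - 1):
        psi[i + 1] = 2.0 * psi[i] - psi[i - 1] + h * h * f[i] * psi[i]
    return psi[-1]

def eigenvalue(E_lo, E_hi, potential, tol=1e-6):
    """Bisection on the shooting residual, as described in the text."""
    f_lo = shoot(E_lo, potential)
    while E_hi - E_lo > tol:
        E_mid = 0.5 * (E_lo + E_hi)
        f_mid = shoot(E_mid, potential)
        if f_lo * f_mid < 0:
            E_hi = E_mid
        else:
            E_lo, f_lo = E_mid, f_mid
    return 0.5 * (E_lo + E_hi)

# e.g. potential = lambda x: 0.5 * x**2 + lam * u_p(x) for a (hypothetical)
# normalized perturbation u_p applied at strength lam
```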

Once we have found the first n eigenvalues, we compute the transition matrix using Eq. (12), which is determined by the final temperature, and the prior distribution on states, \(P_j^{(0)} = {\mathrm{e}}^{ - \beta _iE_j}/Z\), which is determined by the initial temperature (\(Z = \mathop {\sum}\limits_{j = 1}^n {{\mathrm{e}}^{ - \beta _iE_j}}\)). We then calculate the average percentage of times the final state can be inferred given the initial state after t = 1, 3, 7 steps. We use Eq. (19) to do this.