Learning non-stationary Langevin dynamics from stochastic observations of latent trajectories

Many complex systems operating far from equilibrium exhibit stochastic dynamics that can be described by a Langevin equation. Inferring Langevin equations from data can reveal how the transient dynamics of such systems give rise to their function. However, dynamics are often inaccessible directly and can only be gleaned through a stochastic observation process, which makes the inference challenging. Here we present a non-parametric framework for inferring the Langevin equation, which explicitly models the stochastic observation process and non-stationary latent dynamics. The framework accounts for the non-equilibrium initial and final states of the observed system and for the possibility that the system's dynamics define the duration of observations. Omitting any of these non-stationary components results in incorrect inference, in which erroneous features arise in the dynamics due to the non-stationary data distribution. We illustrate the framework using models of neural dynamics underlying decision making in the brain.


Supplementary Note 1: The likelihood calculation
The model likelihood is given by Eq. (3), in which the transition probability densities p(x_{t_i}|x_{t_{i-1}}) are obtained from the solution of Eq. (6), and the probability densities of spike observations are p(y_{t_i}|x_{t_i}) = f(x_{t_i}). The absorption term p(A|x_{t_E}) applies only in the case of absorbing boundaries and is given by Eq. (9).
The scaled probability density satisfies the backward Fokker-Planck equation with zero-derivative (Neumann) boundary conditions [1]. Thus the eigenfunctions φ_0(x) are obtained by solving Supplementary Eq. (4) with the boundary conditions ∂φ_0/∂x|_{x=±1} = 0. To find the eigenfunctions of the operator H, we solve another eigenvalue-eigenvector problem (H_0 + H_I)Ψ(x) = λΨ(x) in the basis of the operator H_0. As a result, we obtain an eigenbasis {Ψ_j(x)} and eigenvalues {λ_j} of the operator H, where j = 1, 2, ..., N_v, and the eigenbasis is truncated to the N_v eigenvectors with the smallest eigenvalues. We refer to the operator H as H_abs or H_ref for absorbing and reflecting boundary conditions, respectively.
We solve the eigenvector-eigenvalue problem Supplementary Eq. (4) using the Spectral Element Method (SEM) [2,3,4]. We use the Gauss-Legendre-Lobatto (GLL) grid with Lagrange interpolation polynomials as the basis functions (SEM basis). In the SEM basis, each function is represented by a vector that consists of the function values evaluated at the SEM grid points {x_i}, where i = 1, 2, ..., N. For example, the firing-rate function f(x) is represented by a vector f with components f_i = f(x_i). The transformation between the SEM basis and the eigenbasis of H is performed by the generalized rules of basis transformations f̃ = Q^T W f and f = Q f̃, where f̃ is the representation of f in the eigenbasis of H. Here W = diag(w) is the SEM weight matrix that normalizes the matrix of eigenvectors: Q^T W Q = I. Integration of functions in the SEM basis is performed with Gaussian quadrature by taking an inner product with the weight vector w: ∫ f(x) dx ≈ f^T w, where the weights for the GLL grid are calculated analytically [4].
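As a minimal numerical sketch of these rules, the snippet below checks the quadrature rule ∫ f(x) dx ≈ f^T w and the weighted basis transformations f̃ = Q^T W f, f = Q f̃ on a toy grid. NumPy's Gauss-Legendre nodes stand in for the GLL grid (the grid construction differs, but the rules are the same), and the matrix Q here is a synthetic weight-orthonormal basis, not the eigenbasis of H.

```python
import numpy as np

# Quadrature and basis-transformation rules on a toy spectral grid.
# Gauss-Legendre nodes/weights stand in for the GLL grid of the text.
N = 16
x, w = np.polynomial.legendre.leggauss(N)  # nodes and weights on [-1, 1]

f = x**4                       # function represented by its grid values
integral = f @ w               # ∫_{-1}^{1} x^4 dx = 2/5 via f^T w

# Basis transformations f~ = Q^T W f, f = Q f~ with Q^T W Q = I.
# Build a weight-orthonormal Q from an ordinary orthonormal matrix U:
W = np.diag(w)
rng = np.random.default_rng(0)
A = rng.standard_normal((N, N))
A = A + A.T                    # symmetric toy operator
U = np.linalg.eigh(A)[1]       # orthonormal eigenvector columns
Q = np.diag(1.0 / np.sqrt(w)) @ U  # now Q^T W Q = I
f_hat = Q.T @ W @ f            # transform to the (toy) eigenbasis
f_back = Q @ f_hat             # and back to the grid representation
```

The round trip f → f̃ → f is exact because Q is square and weight-orthonormal; in the actual method the eigenbasis is truncated, so the transformation becomes a projection.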
We compute the likelihood in the eigenbasis of H. The formal solution of Supplementary Eq. (1) is given by the exponential propagator exp(−H∆t); in the eigenbasis of H, the corresponding matrix is diagonal. In the H-basis, the spike emission operator matrix takes the form E = Q^T W diag(f) Q. The absorption operator takes the form A = Q_{0H}^T diag(λ_0) Q_{0H}, where λ_0 are the eigenvalues of H_0 obtained from Supplementary Eq. (3), and Q_{0H} is the transformation matrix from the H_0-basis to the H-basis. The integration over the terminal state x_{t_E} in Eq. (3) is realized by performing the inner product with a column vector β_{N+2} = Q^T W ρ_eq, where ρ_eq is the representation of the function exp(−Φ(x)/2), so that β_{N+2} is expressed in the H-basis. Since the operator H propagates the scaled probability density ρ(x, t), first scaling ρ(x, t) with exp(−Φ(x)/2) and then taking the inner product with the weight vector corresponds to the integration of the original probability density p(x, t).
Thus, in the H-basis, the likelihood Eq. (3) is represented by a chain of matrix-vector multiplications (Supplementary Eq. (9)), where the absorption operator A is applied only in the case of absorbing boundaries. Here ρ_0^T is a row vector that corresponds to the scaled initial probability density of latent states p_0(x) exp(Φ(x)/2) in the eigenbasis of the operator H. The chain of matrix-vector multiplications in Supplementary Eq. (9) realizes the integration over x_{t_0}, x_{t_1}, ..., x_{t_N} in Eq. (3). The last inner product with the vector β_{N+2} realizes the integration over the final x_{t_E} in Eq. (3). On each iteration of the gradient descent, we evaluate the likelihood numerically by calculating the eigenvalues and eigenfunctions of the operator H, computing the matrices T_k, E, A, and performing the chain of matrix-vector multiplications in Supplementary Eq. (9) from left to right (forward pass). To calculate the likelihood derivatives, we also perform a backward pass, in which the chain in Supplementary Eq. (9) is evaluated in the reverse order.
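To make the structure of this forward pass concrete, here is a toy sketch in which the propagators are diagonal, T_k = diag(exp(−λ ∆t_k)), and the emission matrix E and the vectors ρ_0, β_{N+2} are random positive placeholders rather than the inferred operators. Per-step rescaling keeps the computation in log space, and the unscaled product is recomputed as a cross-check (safe at this toy size):

```python
import numpy as np

# Toy forward pass for the likelihood chain with diagonal propagators
#   T_k = diag(exp(-lambda * dt_k)).
# All quantities are random positive placeholders, NOT the inferred operators.
rng = np.random.default_rng(1)
Nv = 8                                    # truncated eigenbasis size
lam = np.sort(rng.uniform(0.1, 5.0, Nv))  # toy eigenvalues of H
E = rng.uniform(0.0, 0.2, (Nv, Nv))       # toy spike-emission matrix
rho0 = rng.uniform(0.1, 1.0, Nv)          # toy scaled initial density
beta = rng.uniform(0.1, 1.0, Nv)          # toy terminal vector
dts = rng.uniform(0.01, 0.1, 20)          # toy interspike intervals

# scaled forward pass: accumulate the log-likelihood to avoid underflow
rho, log_like = rho0.copy(), 0.0
for dt in dts:
    rho = (rho * np.exp(-lam * dt)) @ E   # rho^T T_k E for one interval
    s = rho.sum()
    log_like += np.log(s)                 # absorb the scale into the log
    rho /= s
log_like += np.log(rho @ beta)            # inner product with beta_{N+2}

# cross-check against the unscaled product (fine at this small size)
rho_u = rho0.copy()
for dt in dts:
    rho_u = (rho_u * np.exp(-lam * dt)) @ E
log_unscaled = np.log(rho_u @ beta)
```

For long spike trains the unscaled product underflows double precision, which is why the scaled version is the one used in practice.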

Supplementary Note 2: Variational derivatives of the likelihood
The variational derivatives of the model likelihood are obtained using calculus of variations as described previously [2]; the final analytical expressions are given in Supplementary Eq. (10). Here F_0(x) = p_0(x)/p(x) is an auxiliary function used to perform unconstrained optimization of p_0(x) (Methods), and φ_i(x) = Ψ_i(x) exp(Φ(x)/2) are the scaled eigenfunctions of H. The functions α_i(x) and β_i(x) are calculated with, respectively, the forward and backward passes through the chain Supplementary Eq. (9) in the eigenbasis of the operator H (Supplementary Eqs. (11), (12)). The absorption matrix A in Supplementary Eqs. (11), (12) is applied only in the case of absorbing boundary conditions. The matrix G is defined by Supplementary Eq. (13), where i, j index the components of the vectors α_τ and β_{τ+1}. The matrix Γ_τ is defined for τ = N + 2 as Γ_τ = −I, and for τ < N + 2 it is defined by Supplementary Eq. (14), where ∆t_τ = t_τ − t_{τ−1} are the interspike intervals and λ_i are the eigenvalues of H. The vectors α_i are calculated with a forward pass (Supplementary Eq. (11)), and the vectors β_i and the matrix G are subsequently calculated with a backward pass (Supplementary Eqs. (12)-(14)). The implementation of the forward and backward passes is based on the sum-product and scaling algorithms, similar to those for Hidden Markov Models [5]. The algorithm scales all α_i and β_j so that α_i^T β_i = 1 for all i, which prevents numerical underflow. The antiderivatives (indefinite integrals) in Supplementary Eq. (10) are evaluated by multiplication with the integration matrix that is constructed by analytical integration of the Lagrange interpolation polynomials [2].
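The scaling convention α_i^T β_i = 1 can be illustrated with a toy forward-backward pass. In this sketch M_k stands for the combined propagate-and-emit matrix of one interspike interval (random positive placeholders, not the inferred operators); the recursions are the standard HMM-style scaled sum-product updates, with the backward pass reusing the forward scale factors:

```python
import numpy as np

# Toy scaled forward-backward pass that keeps alpha_k^T beta_k = 1.
# M_k stands in for the combined propagate-and-emit matrix of one
# interspike interval; all quantities are random placeholders.
rng = np.random.default_rng(2)
Nv, K = 6, 15
Ms = [rng.uniform(0.0, 0.3, (Nv, Nv)) for _ in range(K)]
alpha0 = rng.uniform(0.1, 1.0, Nv)        # toy initial row vector
beta_T = rng.uniform(0.1, 1.0, Nv)        # toy terminal vector

# forward pass with per-step scaling factors c_k
alphas, scales = [alpha0], []
for M in Ms:
    a = alphas[-1] @ M
    c = a.sum()
    alphas.append(a / c)
    scales.append(c)

# backward pass reusing the same scales, so that alpha_k^T beta_k = 1
betas = [None] * (K + 1)
betas[K] = beta_T / (alphas[K] @ beta_T)
for k in range(K - 1, -1, -1):
    betas[k] = Ms[k] @ betas[k + 1] / scales[k]

norms = [alphas[k] @ betas[k] for k in range(K + 1)]  # all equal to 1
```

Dividing each backward step by the stored forward scale keeps both passes in a bounded numerical range while preserving the products needed for the derivative expressions.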
While we perform computations in the SEM basis, the likelihood derivatives Supplementary Eq. (10) are expressed in terms of continuous functions, which can be discretized in any appropriate basis with arbitrary precision specified by the basis dimensionality. A detailed derivation of Supplementary Eq. (10) is provided in Supplementary Note 3.

Supplementary Note 3: Analytical derivation of the likelihood variational derivatives
Here we derive the analytical expressions Supplementary Eq. (10). For convenience, we represent the likelihood using the Dirac bra-ket notation (Supplementary Eq. (15)) [2,3]. In this notation, the scaled latent probability density ρ(x, t) = p(x, t) exp(Φ(x)/2) is represented by bra vectors, with the initial state ⟨ρ_0|. The time propagation of the latent probability density is carried out by the operator exp(−H∆t_τ), where ∆t_τ = t_τ − t_{τ−1}. The operators y_{t_τ} account for the observed spikes. The ket |β_{N+2}⟩ accounts for the marginalization over the terminal latent state x_{t_E}.
To find the variational derivative of the likelihood with respect to the force F(x), we differentiate Supplementary Eq. (15) applying the product rule. The likelihood dependence on F(x) is hidden inside the terms ⟨ρ_0| and |β_{N+2}⟩, and in each of the operators exp(−H∆t_τ) and A. Thus, the derivative is given by the sum in Supplementary Eq. (16), where i and j index the eigenfunctions of the operator H. In Supplementary Eq. (16), we introduced the quantities a_i(τ) and b_i(τ) (Supplementary Eq. (17)). In a finite basis, such as the truncated eigenbasis of the operator H, each bra ⟨α_τ| corresponds to a vector α_τ, and each a_i(τ) is the i-th entry of this vector; compare Supplementary Eq. (17) with Supplementary Eqs. (11), (12). Thus, a_i(τ) and b_i(τ) are calculated with the forward-backward pass. We first derive the expression for the first term in Supplementary Eq. (16), denoted L_1. Using the formula for the derivative of an exponential operator [6], we obtain Supplementary Eq. (18) [2,3], where the matrix Γ_τ is defined in Supplementary Eq. (14), and here τ = 1, 2, ..., N + 1. The operator H can be written as in Supplementary Eq. (19) [3]. Using the Euler-Lagrange equation for a variational derivative, we obtain the corresponding expression. Here we normalize Φ(x) such that ∫ exp(−Φ(x)) dx = 1, and φ_i(x) = Ψ_i(x) exp(Φ(x)/2) are the scaled eigenfunctions. The absorption operator is A = H_0, where H_0 is the part of H that accounts for the drift and diffusion in the latent space. Therefore, the derivative of the operator A with respect to F(x) is the same as the derivative of the operator H_0. Here the matrix G, defined in Supplementary Eq. (13), gathers the contributions from all terms exp(−H∆t_i) and A (where for τ = N + 2 we define Γ_τ = −I, so that Supplementary Eq. (18) also holds for the operator A). Next, we compute the second term in Supplementary Eq. (16), denoted L_2, which requires calculating the derivatives of a_i(0) and b_i(N + 2) with respect to F(x).
For convenience, we write this term in the function representation. Here the functions α_i(x) and β_i(x) correspond to ⟨α_i| and |β_i⟩ in Supplementary Eq. (17), and the derivative does not apply to the functions β_0(x) and α_{N+2}(x), since their dependence on F(x) is already accounted for in the term L_1 in Supplementary Eq. (16). The derivatives are evaluated using the chain rule for functional derivatives. Using the Euler-Lagrange equation, we obtain an expression in which H is the Heaviside step function. The variational derivative of the second term l_2 in Supplementary Eq. (23) is calculated similarly. As a result, we obtain the expression for L_2 (cf. the second term in Supplementary Eq. (10)). To derive the expression for δL/δF_0(x), we write the likelihood in the function representation. The likelihood depends on F_0(x) only through the term p_0(x). We compute the derivative δL/δp_0[s] using the Euler-Lagrange equation. To compute the derivative δp_0[s]/δF_0(x), we express p_0(x) through F_0(x) (Methods); here H is again the Heaviside step function. Using the chain rule and the Euler-Lagrange equation, we arrive at the expression for δL/δF_0(x). Finally, we derive the expression for the derivative dL/dD. The likelihood Supplementary Eq. (15) depends on D through the operators exp(−H∆t_τ) and A. Applying the product rule for derivatives, we obtain a sum in which a_i(τ) and b_j(τ) are defined in Supplementary Eq. (17). Similarly to Supplementary Eq. (18), the derivative of each exponential operator involves the matrix Γ_τ defined in Supplementary Eq. (14). Using Supplementary Eq. (19) and the Euler-Lagrange equation, we obtain the expression for dL/dD, where the matrix G is defined by Supplementary Eq. (13).

Supplementary Note 4: A relationship to the likelihood of an inhomogeneous Poisson process
In this section we derive the likelihood of a spike sequence Y(t) from an inhomogeneous Poisson process when the latent trajectory X(t) is fixed and fully observed. The data Y(t) = {t_0, t_1, ..., t_N, t_E} consist of spike times t_1, t_2, ..., t_N, where t_0 and t_E are the trial start and end times, respectively. Since the trajectory is fully observed, the time-dependent firing rate of the Poisson process λ(t) is known: λ(t) = f(X(t)), where f(x) is the firing-rate function. By definition of the Poisson firing rate, the probability density p(y_{t_i}) of a spike occurring at time t_i equals λ(t_i), and the probability of not observing a spike in a time interval [t_{i−1}, t_i] is given by Supplementary Eq. (40) [7]. Since the Poisson process has no memory, the probability density of observing the entire spike sequence Y(t) is a product of the probability densities of all observed spikes and the probabilities of not observing a spike during all interspike intervals (cf. Eq. (4), Supplementary Eq. (41)). Substituting Supplementary Eq. (40) and p(y_{t_i}) = λ(t_i) into Supplementary Eq. (41), we obtain Supplementary Eq. (42). Rearranging the terms in Supplementary Eq. (42) to combine all exponents into a single exponent, we obtain the familiar expression for the likelihood of a spike train from an inhomogeneous Poisson process with a known time-dependent firing rate λ(t) (Supplementary Eq. (43)) [7]. Thus, for a single fully observed trajectory X(t), the likelihood of spike data Y(t) is simply given by the likelihood of an inhomogeneous Poisson process. Supplementary Eq. (43) is widely used for maximum likelihood estimation of parameters in neural encoding models, in which the relationship between a stimulus and the firing rate is implemented by a known parametric function [8].
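The inhomogeneous-Poisson log-likelihood, log L = Σ_i log λ(t_i) − ∫ λ(t) dt over [t_0, t_E], is straightforward to evaluate numerically. The sketch below uses trapezoidal quadrature for the integral; the function name and the example rate are illustrative choices, not part of the text:

```python
import numpy as np

# Log-likelihood of a spike train under an inhomogeneous Poisson
# process with a known rate lambda(t):
#   log L = sum_i log(lambda(t_i)) - integral of lambda(t) over [t0, tE].
def poisson_log_likelihood(spike_times, rate_fn, t0, tE, n_quad=10_000):
    t = np.linspace(t0, tE, n_quad)
    y = rate_fn(t)
    integral = np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t))  # trapezoid rule
    return np.sum(np.log(rate_fn(np.asarray(spike_times)))) - integral

# Example: constant rate lambda(t) = 2 on [0, 1] with spikes at 0.25, 0.75;
# the analytic answer is 2*log(2) - 2.
ll = poisson_log_likelihood([0.25, 0.75], lambda t: 2.0 + 0.0 * t, 0.0, 1.0)
```

For a constant rate the quadrature is exact, so the numerical value matches the analytic one to machine precision.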

Supplementary Note 5: Derivation of the modified Fokker-Planck equation
Our goal is to infer latent Langevin dynamics from spike data. In this case, the specific trajectory that produced the observed spike sequence is unknown, and therefore we cannot directly use the likelihood of an inhomogeneous Poisson process, Supplementary Eq. (43). Instead, we have to consider all possible latent trajectories that may have produced the data Y(t) and weight each trajectory not only according to how consistent it is with the data (Supplementary Eq. (43)) but also according to how likely it is to arise from the Langevin dynamics.
To perform this computation, we again use the fact that the Poisson process has no memory and factorize the probability density of the entire spike sequence into the probability densities of observing each spike and the probabilities of not observing a spike during interspike intervals, as in Supplementary Eq. (41). We denote the latent states at the observation times {t_0, t_1, ..., t_N, t_E} by X(t) = {x_{t_0}, x_{t_1}, ..., x_{t_N}, x_{t_E}}, which is a discrete set of points along the continuous path. We fix the set of points X(t) and later marginalize over it in Eq. (3).
Fixing the latent state x_{t_i} at the spike time t_i allows us to find the probability density of this spike occurring, p(y_{t_i}|x_{t_i}) = f(x_{t_i}), in the same way as in Supplementary Eq. (42).
To compute the probability of no spikes occurring during the interspike interval [t_{i−1}, t_i], we need to average the probability in Supplementary Eq. (40) over all possible latent paths that connect x_{t_{i−1}} and x_{t_i}. This averaging corresponds to the expectation of a stochastic path integral (Supplementary Eq. (44)) [3]. Here a stochastic trajectory X(t) follows the Langevin dynamics Eq. (1), with fixed start and end points: the state x_{t_{i−1}} at time t_{i−1} and the state x_{t_i} at time t_i. Using the Feynman-Kac formula, the path integral Supplementary Eq. (44) can be transformed into the backward Kolmogorov equation (Supplementary Eq. (45)) [9]. To find the corresponding forward equation, we derive the adjoint operator (Supplementary Eq. (46)). Here Q(h, q) is a boundary term that arises from integration by parts. For both the reflecting and absorbing boundary conditions for the function h(x) (in the forward equation), the boundary conditions for the function q(x) (in the backward equation) are chosen so that the boundary term is zero [1]. The operator Ĥ in Supplementary Eq. (46) contains the usual drift and diffusion terms that are the same as in the Fokker-Planck operator for the Langevin dynamics Eq. (1). In addition, it contains the term f(x) that accounts for the decay of the probability density due to spike emissions through the inhomogeneous Poisson process. We refer to the forward equation with the operator −Ĥ as a modified Fokker-Planck equation (Supplementary Eq. (47)). We use this equation to propagate the latent probability density between adjacent spikes, which is the same as computing the probability of not observing spikes during interspike intervals, averaged over all possible latent paths connecting the states x_{t_{i−1}} and x_{t_i}. In the main text Eq. (6), we use p(x, t) instead of p(x, t|x′, t′) to keep the notation simple.
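A minimal numerical illustration of the modified Fokker-Planck propagation, assuming a finite-difference discretization with the drift omitted purely for brevity and a toy firing-rate function: the −f(x)p term makes the total probability decay, and the surviving mass is the path-averaged probability of emitting no spike over the interval.

```python
import numpy as np
from scipy.linalg import expm

# Propagate dp/dt = D p_xx - f(x) p on a grid (modified Fokker-Planck
# with drift omitted for brevity), reflecting (zero-flux) boundaries.
n, D, dt = 101, 0.1, 0.1
x = np.linspace(-1.0, 1.0, n)
dx = x[1] - x[0]
f = 1.0 + x**2                            # toy firing-rate function

# conservative finite-difference Laplacian with zero-flux boundaries
Lap = -2.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
Lap[0, 0] = -1.0
Lap[-1, -1] = -1.0
A = D * Lap / dx**2 - np.diag(f)          # modified Fokker-Planck operator

p0 = np.exp(-x**2 / 0.02)
p0 /= p0.sum() * dx                       # normalized initial density
p1 = expm(A * dt) @ p0                    # propagate one interspike interval
survival = p1.sum() * dx                  # probability of no spike in [0, dt]
```

With f between 1 and 2 on this grid and the density concentrated near x = 0, the surviving mass after dt = 0.1 is close to exp(−0.1) ≈ 0.9; without the −f(x)p term the zero-flux Laplacian conserves total probability.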
Supplementary Note 6: Feature complexity and model selection

6.1 Feature complexity for non-stationary dynamics
Feature complexity is the negative entropy of latent trajectories. We derive an analytical expression, Eq. (14), for the trajectory entropy Eq. (13) for non-equilibrium dynamics, following steps similar to the derivation of the equilibrium trajectory entropy [10]. First, we discretize time with a uniform step ∆t and represent a continuous latent trajectory as a sequence of states X = [x_0, x_1, x_2, ..., x_N], so that Eq. (13) turns into Supplementary Eq. (48). Using the Markov property of Langevin dynamics, we substitute the factorizations P(X) = p(x_0) ∏_τ p(x_{τ+1}|x_τ) and Q(X) = q(x_0) ∏_τ q(x_{τ+1}|x_τ) into Supplementary Eq. (48). The resulting expression can be reorganized by interchanging the integration order, and the terms in square brackets can be simplified using the normalization property of probability densities (Supplementary Eqs. (51), (52)). Substituting Supplementary Eqs. (51), (52) into Supplementary Eq. (50), we obtain Supplementary Eq. (53), which, after introducing a shorthand notation, we rewrite in the form of Supplementary Eq. (56). Following Ref.
[10], we evaluate the limit in the second term with L'Hôpital's rule (for notational convenience, we now use the subscript in x_τ to represent the continuous time variable). Interchanging integration with the time derivative reduces the problem to evaluating the expression in square brackets in Supplementary Eq. (58). For this, we use the asymptotic solution for p(x_{τ+t}|x_τ), valid for small times t [11] (Supplementary Eq. (59)), where F(x) is the driving force in the Langevin equation. For the reference model (free diffusion with zero driving force), the exact analytical solution for q(x_{τ+t}|x_τ) (Supplementary Eq. (60)) is valid for all times; here we also took into account that the reference model has the same diffusion coefficient D. Taking the time derivatives in Supplementary Eqs. (59), (60) and evaluating each term in the square brackets in Supplementary Eq. (58) separately, we find that Supplementary Eq. (58) evaluates to a finite limit. With this result, Supplementary Eq. (56) transforms into an expression equivalent to Eq. (14). Since Supplementary Eq. (3) is an eigenvalue-eigenvector problem, the solution for ρ(x, t) = p(x, t) exp(Φ(x)/2) can be written as an eigenfunction expansion.
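As a consistency check on this limit (not the derivation used above, which proceeds via the time derivatives), one can expand the log-ratio of the two short-time propagators directly, assuming the standard Gaussian forms p(x_{τ+t}|x_τ) ≈ N(x_τ + F(x_τ)t, 2Dt) and q(x_{τ+t}|x_τ) = N(x_τ, 2Dt):

```latex
% Log-ratio of the short-time propagators, with \Delta \equiv x_{\tau+t} - x_\tau:
\log\frac{p(x_{\tau+t}\,|\,x_\tau)}{q(x_{\tau+t}\,|\,x_\tau)}
  = \frac{\Delta^2 - \bigl(\Delta - F(x_\tau)\,t\bigr)^2}{4Dt}
  = \frac{\Delta\,F(x_\tau)}{2D} - \frac{F(x_\tau)^2\,t}{4D}.
% Averaging over p, using \mathbb{E}_p[\Delta] = F(x_\tau)\,t + O(t^2):
\mathbb{E}_p\!\left[\log\frac{p}{q}\right]
  = \frac{F(x_\tau)^2\,t}{4D} + O(t^2).
```

The O(t) coefficient, F(x)²/(4D), averaged over the latent density, is the rate at which the driven dynamics diverge from free diffusion in the Kullback-Leibler sense.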

6.2 Numerical calculation of the Jensen-Shannon divergence
We approximate Eq. (15) by the midpoint rule, where t_i = i∆t, and we choose the terminal time t_max = n∆t to be sufficiently large that nearly all of the probability density p(x, t) decays to zero due to absorption at the boundaries. Here p(x, t) is the time-dependent solution of the Fokker-Planck equation Eq. (5). Since the JSD is defined for normalized distributions p and q, we take into account the probability loss through the absorbing boundaries and define a normalized distribution p̄(x, t) = p(x, t) + I_p δ(x − x_b), where I_p = 1 − ∫ p(x, t) dx is the total probability loss through the absorbing boundaries up to time t. Here we do not distinguish between the probability loss through the left and right boundaries, and collapse the total probability loss into a single term. The JSD is then calculated using this normalized distribution, where the delta-function terms are evaluated with the sifting property of the delta function.
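The sketch below illustrates this construction: two toy densities on a grid, each carrying a point mass of absorbed probability at the same boundary location, are compared with the Jensen-Shannon divergence; by the sifting property, the delta-function terms reduce to discrete contributions of the form I log(I/Ī). All densities and absorbed masses here are illustrative placeholders:

```python
import numpy as np

# Jensen-Shannon divergence between two normalized distributions, each
# consisting of a density on a grid plus a point mass (the probability
# absorbed at the boundaries up to time t, collapsed into one term).
def jsd(p, q, dx, Ip, Iq, eps=1e-300):
    def kl_terms(a, b):
        return a * np.log((a + eps) / (b + eps))
    m, Im = 0.5 * (p + q), 0.5 * (Ip + Iq)
    cont = 0.5 * np.sum(kl_terms(p, m) + kl_terms(q, m)) * dx
    point = 0.5 * (kl_terms(Ip, Im) + kl_terms(Iq, Im))  # sifted delta terms
    return cont + point

x = np.linspace(-1.0, 1.0, 201)
dx = x[1] - x[0]
p = np.exp(-(x - 0.2) ** 2 / 0.1)          # toy density, model 1
q = np.exp(-(x + 0.2) ** 2 / 0.1)          # toy density, model 2
Ip, Iq = 0.3, 0.1                          # toy absorbed probabilities
p *= (1 - Ip) / (p.sum() * dx)             # density + point mass sums to 1
q *= (1 - Iq) / (q.sum() * dx)
d = jsd(p, q, dx, Ip, Iq)
```

By construction 0 ≤ d ≤ log 2 (natural log), and the divergence vanishes when the two models have identical densities and absorbed masses.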
To calculate D_JS between two models, we first solve the eigenvector-eigenvalue problem Supplementary Eq. (3) and use this solution to find the probability densities p_1(x, t) and p_2(x, t).

[Table of simulation parameters: Φ(x) = 2(x/0.6)^4 − 4(x/0.6)^2, double-well potential in Fig. S1; initial probability density in Figs. 3, 5, S2(a); p_0(x) ∝ exp(−100x^2), initial probability density in Fig. 4.]

Supplementary Figure 1. The accuracy of inferring the initial state distribution depends on trial durations. (a) The initial state influences the trajectory only at short times, before the dynamics equilibrate. When the data consist of many short trials, such that the system has no time to equilibrate, the trajectories contain information about the initial state and p_0(x) can be accurately inferred (lower panel). The data were generated from dynamics in a two-well potential (upper panel) with a narrow p_0(x) distribution and absorbing boundary conditions. The low potential at the boundaries results in short trial durations. The data contained 100 trials with the average duration ∼1 second. (b) When the data consist of a few long trials, such that the dynamics are largely at equilibrium, the trajectories contain little information about the initial state and the inference of p_0(x) is less accurate (lower panel). The data were generated as in a, but the potential has large barriers near the boundaries (upper panel), which results in long trial durations. The data contained 4 trials with the average duration ∼25 seconds, i.e. the total amount of data is similar to a. (c) For long trials, the potential and noise magnitude can be inferred accurately even when the inference of p_0(x) is inaccurate. The potential Φ(x), initial state distribution p_0(x), and noise magnitude D are inferred simultaneously from the same spike data as in b.
For long trials, the dynamics are largely in equilibrium and independent of p_0(x), and therefore accurate inference of p_0(x) is not important for accurate inference of the potential function. For each of the ten datasets, the selected models (lower row, pairs of potentials shown with colored lines) agree well with the ground truth (black line). For some of the simulations in c, the D_JS threshold was not reached after the maximum number of GD iterations (set to 20,000). In this case, we plotted the pair of models with the maximum achieved feature complexity.