Solving the inverse problem of time independent Fokker–Planck equation with a self supervised neural network method

The Fokker–Planck equation (FPE) has been used in many important applications to study stochastic processes with the evolution of the probability density function (pdf). Previous studies on FPE mainly focus on solving the forward problem which is to predict the time-evolution of the pdf from the underlying FPE terms. However, in many applications the FPE terms are usually unknown and roughly estimated, and solving the forward problem becomes more challenging. In this work, we take a different approach of starting with the observed pdfs to recover the FPE terms using a self-supervised machine learning method. This approach, known as the inverse problem, has the advantage of requiring minimal assumptions on the FPE terms and allows data-driven scientific discovery of unknown FPE mechanisms. Specifically, we propose an FPE-based neural network (FPE-NN) which directly incorporates the FPE terms as neural network weights. By training the network on observed pdfs, we recover the FPE terms. Additionally, to account for noise in real-world observations, FPE-NN is able to denoise the observed pdfs by training the pdfs alongside the network weights. Our experimental results on various forms of FPE show that FPE-NN can accurately recover FPE terms and denoising the pdf plays an essential role.


Methods
In this work, we focus on the one-dimensional and time-independent FPE, whose form is shown below: where P : R × R → R + ∪ {0} is the pdf over the observable x and the time t, g : R → R is the drift term and h : R → R is the diffusion term. In actual implementation, P(x, t), ∂P(x, t)/∂t , g(x), and h(x) are discretized over variable x and denoted as P(t) , ∂ t P(t) , g , and h , respectively. More specifically, we discretize variable x into a number of bins on the support, (x 1 , x 2 , . . .) . Then P(x,t) becomes the vector P(t) = (P(x 1 , t), P(x 2 , t), . . .) , and g(x) becomes the vector g = (g(x 1 ), g(x 2 ), . . .) . The same transform applies to all the other functions.
In order to solve the inverse problem, we focus on the scenario in which a pdf is assumed to satisfy FPE and the data within some time period is measured. We also assume P(x, t) is differentialble and has a finite support, which is satisfied by most real-world pdfs. However, the measured pdf, denoted as P noisy (t) , is corrupted with multiplicative noise, and the exact forms of g and h are unknown. Hence, we propose to train a FPE-based neural network model FPE-NN in a self-supervised way, based on P noisy (t) only, to find g , h and the true pdf which is denoted as P clean (t) . We currently focus on the one-dimensional FPE and will expand to multi-dimensional FPE in future work.
In the following sections, we first give an overview of how to find g and h and to recover P clean (t) in an alternating training process. Next, we introduce the architecture of the network FPE-NN. This is followed by training Overview of the FPE-NN training process. Generally, FPE-NN takes P(t) at a time point as the input, g and h as trainable weights, to predict P(t) at neighboring time points. The input and the weights are trained in an alternating way. To distinguish them from the true values, the training input and weights are denoted as P (t) , ĝ , and ĥ , respectively. As shown in Fig. 1, each iteration of the training consists of two steps. The first step is the gh-training step, where we keep the updated P (t) constant to train ĝ and ĥ . The second step is the P-training step, where we keep the updated ĝ and ĥ constant to train P (t) . The alternating training process is repeated until the stopping criterion is met. The reason that the input and weights are not trained simultaneously is that we have to use different ways to feed in the data samples when training them. Specifically, ĝ and ĥ are time independent so the samples can be fed in batches, whereas P (t) is time dependent hence the samples have to be fed one-by-one.

Architecture of FPE-NN.
Here we introduce our FPE-based neural network model (FPE-NN). It consists of two major parts and each part consists of several layers (Fig. 2a). The central part, named FPE Core, is designed based on FPE which takes the input P (t) at a certain time point t to calculate the corresponding derivative ∂ tP (t) with weights ĝ and ĥ . The remaining part is designed based on the Euler method which uses the input P (t) and the derived ∂ tP (t) to predict the corresponding distributions at multiple neighboring time points. Please take note that we are not using the neural network to approximate FPE or the Euler method, instead, we use the neural network to carry out the computation steps in FPE and the Euler method. The network architecture of FPE Core is designed based on the discrete form of FPE over variable x for fixed-time t: where ⊙ is the element-wise multiplication; ∂ x and ∂ xx are matrices to compute derivatives following a new and accurate finite difference method 21 . The element-wise multiplication of two vectors is implemented by locally connecting two layers (Fig. 2b) and the multiplication of a vector by a matrix is achieved by fully connecting two layers (Fig. 2c). The weights of the locally connecting layers are the FPE terms ĝ and ĥ , which are trainable. The weights of the fully connecting layers are non-trainable and have fixed values that perform derivative calculations ∂ x and ∂ xx (details are provided in the Supplementary). There are no biases or activation functions in these two kinds of connecting layers. Overall, the FPE Core computes the corresponding ∂ tP (t) of the input P (t) at a certain time point t based on the given weights ĝ and ĥ .
In the following part, we use the input P (t) and the derived ∂ tP (t) from FPE Core to predict the distribution at time point t + i t using the Euler method, where i ∈ {−n, . . . , n} and n is a hyperparameter to be tuned (2) ∂ t P(t) = ∂ x (g ⊙ P(t)) + ∂ xx (h ⊙ P(t)) Figure 1. Flowchart of using FPE-NN to train the input P (t) and the weights ĝ , ĥ in an alternating way. Each iteration consists of two steps, first the gh-training step then the P-training step. The counting of iteration starts from 1, as P 0 (t) , ĝ 0 and ĥ 0 are the initialized values. In the k-th iteration, the updated input and weights in the previous iteration are P k−1 (t) , ĝ k−1 , and ĥ k−1 , respectively. We first keep P k−1 (t) constant to improve the accuracy of the weights in the gh-training step, and save the trained weights as ĝ k and ĥ k . Then we keep ĝ k and ĥ k constant to improve the accuracy of the input in the P-training step, and save the trained input as P k (t) . Subsequently, P k (t) , ĝ k , and ĥ k are used for the next iteration. www.nature.com/scientificreports/ case-by-case. t is the time gap between adjacent time points. As shown in Fig. 2a, a few layers are added to perform the following calculation: where P pred (t + i�t) is the predicted distribution at time point t + i t . There is no need to add an extra constraint to ensure that the integral of P pred (t + i�t) equals to 1, because the integral of a pdf is invariant when it evolves according to FPE (Proof is provided in the Supplementary). Furthermore, the integrals of the target distributions all equal to 1, and P pred (t + i�t) is forced to be very close to the target distributions by the loss functions. The details about the target distributions and the loss functions are provided later. In summary, the inputs for the whole network FPE-NN is P (t) , the trainable weights are ĝ and ĥ , and the output is the predicted distributions P pred (t + i�t) , i ∈ {−n, . . . , n} . In order to generate a loss function for training, P pred has to be compared with target distributions. However, as the true pdf P clean is unknown, we can only compare P pred with some selected noisy target distributions. We will describe how we formulate the loss functions using the noisy target distributions when the details of the alternating training steps are provided. Furthermore, in order to alleviate the noise effect of the target distributions, FPE-NN is designed to make predictions at 2n + 1 time points and usually n > 1 . Detailed explanations are provided in the Supplementary.
Initialization. The initialized input P (t) , denoted as P 0 (t) , is derived by smoothing P noisy (t) with the Savitzky-Golay (SG) filter 23 , which helps to partially remove the noise. The initialized weights, denoted as ĝ 0 and ĥ 0 , are calculated based on P 0 (t) using a linear least square method (details are provided in the Supplementary). In the following two subsections, we will explain how to train and update the input and weights separately in the k-th iteration.
The gh-training step in the k-th iteration. In the beginning of the k-th training iteration ( k = 1, 2, . . . ), P (t) , ĝ and ĥ have been updated in the previous iteration. The updated values are denoted as P k−1 (t) , ĝ k−1 and ĥ k−1 , respectively. For the first iteration ( k = 1 ), we use the initialized input P 0 (t) and weights ĝ 0 and ĥ 0 . Then we first run the gh-training step to train ĝ and ĥ (yellow blocks in Fig. 1). Specifically, the input of FPE-NN would be P k−1 (t) which is kept constant in this step, and the weights are ĝ and ĥ which are trainable with initial values ĝ k−1 and ĥ k−1 , respectively. The output P pred is compared with P k−1 , hence one data sample consists of a single distribution P k−1 (t) as input and multiple distributions at neighboring time points P k−1 (t + i�t) as target output, where i ∈ {−n, . . . , n} . The loss function is: www.nature.com/scientificreports/ where the superscript k indicates the k-th iteration. Because ĝ and ĥ are time-independent and shared by all the data samples, we can train FPE-NN in batch training by randomly picking a certain number of samples, which is the first summation in Eq. (4). The interpretation of L k gh is that we assume P k−1 (t) satisfies FPE and search for the optimal terms. The optimized ĝ and ĥ are saved as the updated weights of the k-th iteration.
where P k−1 (t) is kept constant.
The P-training step in the k-th iteration. After the gh-training step of the k-th iteration, the latest updated input and weights are P k−1 (t) , ĝ k and ĥ k , respectively. Now we run the P-training step to train P (t) (green blocks in Fig. 1). Specifically, FPE-NN would have a trainable input P (t) with initial value P k−1 (t) , and the weights ĝ k and ĥ k which are kept constant in this step. The output P pred is compared with P 0 which is the smoothed P noisy . Therefore, one data sample has a single distribution P k−1 (t) as input and multiple distributions P 0 (t + i�t) as target output, where i ∈ {−n, . . . , n} . The corresponding loss function is: where the superscript k indicates the k-th iteration. Because the input P (t) is time-dependent and specific to each sample, we could only train the samples one-by-one. The interpretation of L k P is that we search for the optimal P (t) which has the minimum difference to P 0 (t) , and naturally to P noisy (t) , when it evolves according to FPE with terms ĝ k and ĥ k . The optimized P (t) is saved as the updated input of the k-th iteration: where ĝ k and ĥ k are kept constant.
Training Stopping Criterion. As described previously, the alternating FPE-NN training process uses the updated P (t) to improve the accuracy of ĝ and ĥ , and uses the updated ĝ and ĥ to improve the accuracy of P (t) . Eventually, FPE-NN aims to recover the true values of P (t) , ĝ and ĥ . However, because the true values are unknown, an indirect metric must be set to evaluate the training process and used as a criterion to stop the training iteration. In this study we use the sum of the two loss functions L gh + L P as the stopping criterion, to stop the whole training process when it converges or fails to decrease within a certain number of successive iterations.
Predict future distributions. Once the whole training process is finished, the well-trained weights ĝ and ĥ could be used to predict the future evolution of a different pdf, denoted as P ′ noisy (t) , which satisfies the same FPE but is not used in FPE-NN training. Specifically, the task would be to use the distributions of P ′ noisy (t) at time points {t − r�t, . . . , t} to predict the distributions at time points {t + �t, . . . , t + f �t} , where r and f are user-defined parameters. In this study, we choose r = n and f = 5.
The prediction process is similar to the P-training step which keeps the FPE-NN weights constant. First, the given distributions P ′ noisy (·) are smoothed with the SG filter, and the smoothed distributions are denoted by P ′′ noisy (·) . We then use P ′′ noisy (·) as the target output to train the input P (t) using FPE-NN. The corresponding time distance set is {−r�t, . . . , 0} which is different from the one used in the FPE-training process (Fig. 1). Hence the corresponding loss function becomes: where ĝ and ĥ are well-trained weights and kept constant. By minimizing the loss value we derive the optimal P (t): Finally, the optimized P (t) is fed into FPE-NN again with another time distance set {�t, . . . , f �t} , to predict distributions at time points {t + �t, . . . , t + f �t}: Numerical studies Evaluation metrics. The training process can be evaluated with up to six metrics. The first two metrics are the loss values, which can be used to quantify the training result based on the observation P noisy (t) . In particular L gh measures how well P (t) satisfies FPE, while L P measures how close P (t) is to the measured pdf P noisy (t).
In the cases of validating our methods with simulated data where the corresponding true values are known, three more metrics can be used to quantify the normalized errors of P (t) , ĝ and ĥ : where P(t) , g and h are the true values. According to the form of the loss function L gh (Eq. (4)), the gradients of ĝ and ĥ both are proportional to the corresponding P (t) at the same point of x. Usually the values of P (t) at the boundary points of x are very small, hence the corresponding ĝ and ĥ at these points are difficult to be well trained in backpropagation. On the other hand, the boundary points are less important due to their small P(t) values. Hence, we also measure the error of ĝ and ĥ excluding the boundary points, denoted as Ẽ g and Ẽ h , respectively. Please take note that the boundary area is not excluded during the training process, they are ignored only when calculating Ẽ g and Ẽ h . The boundary area is determined by using the student T-test 24 at each point of x. The x points whose density values within all the training data are statistically lower than 0.01 × max(P noisy (t)) are classified to the boundary area.
The last metric is the ability of using the trained ĝ and ĥ to predict future distributions, the detailed process of which has been described previously. The normalized error of the predicted future distribution is denoted as E test : where i ∈ {1, . . . , f } indicates the number of time gaps ahead of the last known distribution, P pred (t + i�t) is the predict distribution provided in Eq. (10), and P target (t + i�t) is the ground truth.
Simulated data. We validate our method with simulated data. First, We solve the forward problem of FPE with chosen FPE terms and initial conditions, to generate several distribution sequences at multiple time points. Then, we assume the FPE terms are unknown, and we train FPE-NN to solve the inverse problem of FPE which is to find the FPE terms based on the generated distribution sequences.
The simulated data is generated based on three selected FPE examples which have been used to model realworld phenomena in different areas. Another FPE example, the Brownian motion in a periodical potential 25 , is also used to validate our method and the result is provided in the Supplementary. For each FPE example, we generate 120 distribution sequences and each sequence consists of distributions at 50 time points with a uniform time gap t . 100 distribution sequences are used to train FPE-NN and the remaining 20 sequences are used to test the ability of predicting future distributions. We also tested training with less or more than 100 distribution sequences, and the result is provided in the Supplementary.
Example 1: Heat flux (OU process). Our first FPE example comes from meteorology, where Lin and Koshyk 26 used it in climate dynamics to describe how the heat flux at the sea surface evolves. The observable x is the deviation of the heat flux from its steady state in units of watts per square meter. Each unit of time t corresponds to 55 days. The FPE form is: The process described by Eq. (13) is called the Ornstein-Uhlenbeck (OU) process 17 , where D and θ are the diffusion and drift coefficients respectively. It was originally used in the Brownian motion of molecular dynamics and has since been applied widely in various fields [27][28][29] . The forward problem of Eq. (13) has an analytical solution which is a time-varying Gaussian distribution: where the mean µ and deviation σ are: where x 0 is the origin state of x when t = 0 . We set D = 0.0013 and θ = 2.86 , which are the empirical parameters from the sea measurements 30 . x is discretized into 110 bins with support [−0.01, 0.1] . The initial conditions are uniformly sampled from x 0 ∈ [0.04, 0.08] and t init ∈ [0.03, 0.05] . We obtain simulations with a random pair of (x 0 , t init ) for each distribution sequence. We adjust the time gap t to be 0.001 in order to maximize the (10) P pred (t + i�t) =P(t) + i�t(∂ x (ĝP(t)) + ∂ xx (ĥP(t))) www.nature.com/scientificreports/ distribution differences at different time points while keeping P(x, t) values at boundary points close to zero. The time gap of the next two examples are also tuned based on the same criteria.
Example 2: DNA bubble (Bessel process). The next FPE example is from the biology area, where FPE has been used to model the dynamics of DNA bubbles formed due to thermal fluctuations 16 . The observable x is the bubble length in units of base pair and the time variable t is in units of microsecond. The corresponding pdf satisfies: Equation (16) describes a generalized Bessel process which also has many other applications 31 . In the DNA bubble study, µ is estimated as 1 and ǫ is approximated as 2(T/T m − 1) , where T is the environment temperature and T m is the melting temperature which is determined by the DNA sequence. In this work, we set µ = 1 to follow the DNA bubble study and arbitrarily choose ǫ = 0.2 . Equation (16) has a very complicated analytical solution 31 .
Hence for convenience, we generate data by solving the forward problem with the 4th-order Runge-Kutta method for temporal discretization. x is discretized into 100 bins with support [0.1, 1.1]. The initial condition is a normal distribution whose mean µ and deviation σ are randomly sampled from [0.4, 0.8] and [0.05, 0.1] respectively. We adjust the time gap t to be 0.001, based on the same criteria as mentioned before.
Example 3: Agent's wealth. The final FPE example comes from economics, where Cordier et al. 7 uses FPE to describe the wealth distribution of agents. The agent is a terminology in economics referring to an individual, company, or organization who has an influence on the economy through producing, buying, or selling. The observable x is the agent's wealth and its pdf satisfies: Equation (17) has no analytical solution and we solve the forward problem by using the 4th-order Runge-Kutta method. We set = 0.4 and m = 1 following Cordier's report 7 .
x is discretized into 100 bins with support [0, 1.0]. Similar to the DNA bubble example, the initial condition is a normal distribution whose µ and σ are randomly sampled from [0.3, 0.7] and [0.05, 0.1], respectively. The time gap t is adjusted to be 5e-4 , based on the same criteria mentioned in the heat flux example.
Noise addition. After we generate the distribution sequences (denoted as P clean ) by solving the forward problem of FPE as described previously, noise is added to the distributions to model real-world scenarios: where ξ is a random number sampled from the standard normal distribution N(0, 1), and W is a constant which is adjusted to ensure that the corresponding E P of P noisy (x, t) is approximately 0.01. Simulated data with higher corresponding E P is also tested using our method, and the results are provided in the Supplementary.

Results.
As described previously, we generate the simulated data by solving the forward problem of FPE with given FPE terms and initial conditions. After that, we assume the FPE terms are completely unknown, and train FPE-NN to solve the inverse problem which is to find the FPE terms based on the simulated data. A typical training process with the data from the agents' wealth example is shown in Fig. 3. L P decreases rapidly in the first few iterations while L gh decreases at a slower rate. Both of their curves become almost flat after the 25-th iteration and decrease very slowly. The curve of E P generally coincides with that of L P with a turning point at the 25-th iteration. In contrast, E g and E h are continuously decreasing during the iterations. The result demonstrates that with the alternating training process, the input and weights help each other gradually recover the true values over the iterations. We also found that further improvements on the accuracy of ĝ and ĥ , after reaching a certain point, have minor effect on the accuracy improvement of P (t) . On the other hand, ĝ and ĥ are more sensitive to the minor improvements in P (t) . We also test E test at several selected iterations as shown in Fig. 3c. The result suggests that a larger time gap is more sensitive to the accuracy of ĝ and ĥ . However, further improvement in ĝ and ĥ after a certain point has little effect on future distribution prediction. A typical plot of the predicted distributions in future time is provided in Supplementary. Overall, the learning curves demonstrate our selfsupervised method can train P (t) , ĝ and ĥ in an alternating way and let them gradually recover the true values.
The training results of the three different FPE examples are summarized in Table 1. Generally, P (t) , ĝ and ĥ have been well trained using FPE-NN in all the three simulated datasets. The noise of the trained P (t) is significantly suppressed, as the final noise ratio E P could reach to less than 0.002. In contrast, the corresponding E P for P noisy (t) is around 0.01. The trained ĝ and ĥ also fit well with their true values. Figure 4 shows the plots of the final values of ĝ and ĥ (red diamond), excluding the boundary points, are quite close to the true values (black line). In contrast, if we do not train the input but just smooth P noisy (t) with SG-filter, the optimized weights ĝ 1 and ĥ 1 (blue cross) have much higher error. Hence, further reduction of the noise in P noisy (t) is essential to improving the accuracy of trained ĝ and ĥ .
We also examined how the training result is affected by the number of output distributions, which is the hyperparameter n in Eqs. (4) and (6). As demonstrated in the Supplementary, n > 1 can help to suppress the noise effect of P (t) . However, it is not always better to have more neighboring time points, because the local truncation error from the Euler method may become too large to be ignored. Therefore n is a hyperparameter which needs  are the trained weights in the first iteration. They are the optimal FPE terms can be found based on the smoothed P noisy (t) , P 0 (t) . The figure shows that ĝ 1 and ĥ 1 are significantly different from their true values, suggesting that merely smoothing P noisy (t) is insufficient. To further reduce the noise by training P (t) with FPE-NN is essential for the ĝ and ĥ optimisation. www.nature.com/scientificreports/ to be tuned case-by-case. In this study, we tested four different values of n for each FPE example. As shown in Table 1, L gh and L P always increase when n increases, which is because of the larger truncation error caused by the larger time distance. However, the errors of P (t) , ĝ and ĥ are not monotonically increasing or decreasing. Different datasets have different best n within the four options. The effect of the training data size is also studied, by training FPE-NN with 200, 75, or 50 distribution sequences. The detailed result is provided in the Supplementary. Generally, using less data (75, or 50 sequences) causes the calculated ĝ and ĥ to be less accurate. More distribution sequences could provide more information about how the PDF evolves, enabling better fit of g(x) and h(x). On the other hand, the accuracy of the training result can be further improved by using more training data (200 sequences), but in real applications it could be too expensive to acquire extra distributions. Hence, the result shows that 100 distribution sequences is an appropriate data size to achieve accurate enough training results.

Discussion
In this study, we proposed a new method to find the FPE terms based on the measured noisy pdfs, without any prior knowledge except assuming the pdfs satisfy FPE. Such an inverse problem of FPE could help us to study the stochastic processes in a more efficient way. We designed our network FPE-NN based on FPE and the Euler method. The input of FPE-NN is the distribution at a time point, the trainable weights are ĝ and ĥ representing the FPE terms, and the output is the distributions at several neighboring time points. In an alternating way, we use FPE-NN to train the input and the weights separately, forcing them to recover their true values. By training the input we successfully reduced the noise in the measured pdfs, which is a key factor for finding the accurate FPE terms. We validated our method with simulated data from three typical FPE examples. The result shows that the final trained input and weights all can agree well with the true values.
Although we are using a uniform grid to generate the simulated data in this paper, FPE-NN also can be used with minor changes in cases where the spatial gap x and the time gap t are not uniform. As shown in the Supplementary, the matrices which perform the derivative calculations of ∂ x and ∂ xx can be easily modified to accommodate a non-uniform x . For the non-uniform t , we could treat the time distance set (Fig. 1) as part of the input, to make it more flexible and specific for each data sample.
Currently, we are focusing on one-dimensional and time-independent FPE, but our method can be easily extended to higher spatial dimensions. In future work, we will try to solve the inverse problem of time-dependent FPE. Furthermore, our method can perform well when the initial noise ratio ( E P ) is around 0.01. However, the recovered FPE terms become less accurate with data of higher noise ratio. Hence, another possible future direction is to improve the noise robustness of the proposed method. www.nature.com/scientificreports/