Neural superstatistics for Bayesian estimation of dynamic cognitive models

Mathematical models of cognition are often memoryless and ignore potential fluctuations of their parameters. However, human cognition is inherently dynamic. Thus, we propose to augment mechanistic cognitive models with a temporal dimension and estimate the resulting dynamics from a superstatistics perspective. Such a model entails a hierarchy between a low-level observation model and a high-level transition model. The observation model describes the local behavior of a system, and the transition model specifies how the parameters of the observation model evolve over time. To overcome the estimation challenges resulting from the complexity of superstatistical models, we develop and validate a simulation-based deep learning method for Bayesian inference, which can recover both time-varying and time-invariant parameters. We first benchmark our method against two existing frameworks capable of estimating time-varying parameters. We then apply our method to fit a dynamic version of the diffusion decision model to long time series of human response times data. Our results show that the deep learning approach is very efficient in capturing the temporal dynamics of the model. Furthermore, we show that the erroneous assumption of static or homogeneous parameters will hide important temporal information.


Introduction
Mathematical models are important tools for conceptualizing human cognition and predicting observable behavior. Such models aim to provide a mathematical formalization of cognitive processes by mapping latent cognitive constructs to model parameters and specifying how these generate manifest data [1]. The surge of cognitive model applications has made it possible to test precise mechanistic hypotheses and to predict performance in various domains, such as decision-making [2,3], learning [4,5], or memory [6,7].
The majority of cognitive models treat human data as independent and identically distributed (IID) observations. The IID assumption implies that these models largely ignore the temporal changes of latent cognitive constructs. However, such constructs are inherently dynamic, regardless of a particular time scale [8,9,10,11]. For instance, there is little dispute that constructs, such as working memory capacity [12] or mental speed [13], change over the human life span. These constructs also vary on much shorter time scales, for example, within experimental sessions [14,15].
In psychological experiments, cognitive affordances are influenced not only by external task demands but also by internal mental processes and brain states that change over time. There are many possible explanations for the resulting systematic and unsystematic fluctuations, for instance, fatigue [16,17], practice [18,19], mind-wandering [20,21], or motivational factors [22,23]. In this article, we argue that cognitive mechanisms should be treated as complex dynamic systems and that cognitive models should account for the dynamics of their components to fully understand and capture the rich structure of empirical human data.
Ignoring temporal fluctuations and changes in cognitive parameters can have drastic consequences for the descriptive, explanatory, and predictive merits of cognitive models. Consider a simple inverted U-shape hypothetical tra-arXiv:2211.13165v4 [stat.ME] 20 Sep 2023 Time T Parameter θ Figure 1: Conceptual illustration of a hypothetical parameter θ varying over time (solid black line). The solid blue line and shaded blue region depict the posterior mean and the 95%-CI of a static model, respectively. The solid red line and shaded red region depict the posterior mean and 95%-CI of a dynamic model, respectively. Treating the parameter as static (i.e., stationary) by marginalizing out the effects of time leads to inflated uncertainty estimates (matching the width of the marginal distribution, depicted in gray) and obscures the underlying change.
jectory of a single parameter, as depicted in Figure 1. Typical cognitive models assuming IID observations [2,6] would estimate a flat trajectory (depicted in blue) whose uncertainty would match the width of the marginal parameter distribution (depicted in gray). Differently, dynamic models would account for temporal change and achieve a much greater information gain (depicted in red). Indeed, this is not just a hypothetical scenario, and we subsequently demonstrate its consequences in a real data application (cf. Figure 8).
One way to mathematically formalize dynamic systems is by treating them as stochastic generative processes that produce data with temporal dependencies (i.e., time series data). As most complex systems are inherently non-linear, these time series often do not exhibit simple fluctuations around a stable mean with a fixed variance, but resemble a heterogeneous random walk [24]. Beck and Cohen [25] coined the term superstatistics, which refers to a superposition of multiple stochastic processes on different temporal scales that can describe heterogeneous temporal dynamics [26]. Thus, instead of assuming static model parameters, a superstatistics modeling approach introduces a hierarchy of at least two models: A low-level (i.e., observation or microscopic) model that formalizes the local behavior of a system and a high-level (i.e., transition or macroscopic) model that describes the parameter dynamics of the low-level model. Note that there is no absolute time scale for low-and high-level processes. The meaning of these terms is relative and always depends on the scale relevant to the research question.
A viable approach for modeling parameter transitions is offered by hidden Markov models (HMMs). For in-stance, Kucharský et al. [27] accounted for different response states during a decision-making task by combining a HMM with an evidence accumulation model of decision-making. This model combination allows for discontinuous changes on longer time scales and continuous changes on shorter time scales. Similarly, Gunawan et al. [28] extended a hierarchical version of the same decisionmaking model with a HMM and applied it to three existing long time series of response time and choice data. Both studies demonstrate that the HMM approach can reveal plausible fluctuations of decision model parameters in cognitive tasks.
However, the superstatistics framework is far more general and flexible in representing macroscopic fluctuations. First, it does not require modelers to pre-define a small set of possible modes (i.e., distinct system behaviors). Further, models within the superstatistics framework can be agnostic about the concrete dynamics of the model parameters -the most plausible dynamic can be directly estimated in a data-driven fashion. For example, using a superstatistics framework, Metzner et al. [29] demonstrated that the transition between different sleep stages is less abrupt than previously suggested.
The superstatistics framework has been utilized in physics [30,31,32], the life-sciences [33] and economics [34,35], but it has not yet been disseminated in the cognitive sciences. Under the assumption that cognitive processes are dynamic and complex, it seems natural to equip existing cognitive models with superstatistical aspects. However, to our knowledge, no previous study besides Metzner et al. [29] has employed superstatistical methods for studying the dynamic aspects of cognitive parameters. Existing dynamic models of cognition fit stationary time series models (e.g., autoregressive models) to the observed behavior [9] but do not incorporate a low-level mechanistic model that formalizes the underlying cognitive process(es). Thus, these time series models describe how behavior changes over time but do not explain how behavior occurs at a specific point in time. On the other hand, popular mechanistic models tailored to describe local behavior, such as diffusion decision models [DDM, 36,2,37], either ignore the dynamic aspects of their parameters entirely or represent parameters as deterministic functions of time [38,39,40,41,42].
In this work, we argue that the superstatistics framework can reveal a more nuanced picture of cognitive dynamics and behavioral fluctuations. This is possible because we formalize the dynamic aspect of the low-level parameters as a higher-order stochastic process. Consequently, we estimate the low-level parameters at each time step directly from the data. Thus, their temporal evolution is only constrained by the modeler's choice of prior distri-butions and by the high-level transition model. Nevertheless, superstatistical models can be rigorously validated in the same way as their static counterparts, using standard model criticism methods, such as simulation-based calibration (SBC) to assess computational faithfulness, parameter recovery for inferential calibration, posterior resimulation checks for assessing model adequacy, as well as cross-validation for assessing predictive performance [43,44]. Superstatistical models allow us to address questions about how cognitive systems undergo distinct transitions in various settings [27]. Further, one can examine which model parameters explain behavioral fluctuations without predefined equations that fix the hypothesized temporal evolution of specific parameters.
Superstatistical models can be quite challenging to estimate and compare for a number of reasons, especially in a Bayesian framework for principled uncertainty quantification [24]. First, both the high-level and low-level models are stochastic, so there is considerable uncertainty about the values of all model parameters (i.e., static and dynamic) given a finite number of observations. Second, the low-level models might be complex and non-linear so that there is not always a closed-form analytic expression relating model parameters to data (i.e., the likelihood function is intractable), or the likelihood might be computationally very expensive to evaluate. Finally, even for stationary low-level models, the computational cost might become insurmountable when these models are applied to multiple data sets, since standard Bayesian methods are not amortized and thus need to be re-run sequentially (unless massively parallelized) and from scratch for each data set [45,46]. Indeed, estimation challenges may be the main reason for the underrepresentation of superstatistical models in psychology and the cognitive sciences. However, we argue that recent advances in (amortized) simulation-based inference [SBI,47,45,48] render estimation challenges secondary and allow researchers to create and test highfidelity models of cognition without worrying about analytic tractability. SBI encompasses methods that use synthetic data to approximate intractable posterior distributions of unknown parameters. Moreover, amortized SBI with neural networks represents a particularly efficient way to perform posterior estimation on multiple data sets by investing the primary computational effort in a relatively expensive training phase [47,48]. Once simulationbased training has converged, the trained networks can be applied to any number of observations or set of observations consistent with the model's structure.
The main purpose of this article is two-fold. First, we demonstrate and validate the use of superstatistics in cognitive modeling via an out-of-the-box extension of a popu- Figure 2: Coal mining disasters in the United Kingdoms between 1852 and 1962. The annual reported accident counts are depicted using gray bars. The mean posterior of the rate parameter λ of a Poisson process with Gaussian fluctuation is shown with solid lines for both estimation methods separately. The shaded area represents ±1 posterior standard deviation. lar mechanistic cognitive model, namely, the DDM. Second, we develop and validate a novel Bayesian estimation method grounded in the BayesFlow framework for amortized neural SBI [45]. To this end, we first perform benchmark comparisons with existing frameworks on simulated data. We then specify a non-stationary DDM and fit it to long time series of response times obtained from human participants. Moreover, with this application, we empirically demonstrate how stationary models assuming IID observations can hide a number of interesting dynamic patterns and fluctuations present in behavioral data.

Benchmark Studies
To ensure the trustworthiness of our method, we first benchmark its performance against two existing Bayesian frameworks which use different estimation algorithms: bayesloop [24] and Stan [49]. The former employs grid approximation for low-dimensional problems, whereas the latter relies on Hamiltonian Monte Carlo [HMC,50] sampling. Both frameworks operate in a nonamortized way and can only estimate superstatistical models with closed-form likelihoods.
Coal Mining Accidents Currently, bayesloop cannot fit low-level models as complex as the DDM, nor highlevel models such as the Gaussian process. Therefore, we compare the estimation performance of our method on a simpler example based on the coal mining accident data (freely available from [24]). These data comprise counts of coal mining accidents in the United Kingdom between 1852 and 1962. The low-level model is a simple Poisson distribution with a parameter λ that corresponds to the accident rate. One can assume that the accident rate in coal mines was not constant during this more than a centurylong period. Therefore, the accident rate λ is allowed to fluctuate over time according to the Gaussian random walk transition model (cf. equation (3)). Both estimation methods use the same informative prior distribution for the low-level parameter λ 0 ∼ Exp(0.5) and high-level parameter σ ∼ Beta (1,25).
Using the bayesloop software, we approximated a grid with 4000 equally spaced points ranging from 0 to 15 for λ and from 0 to 1 for σ, respectively. This calculation lasted approximately 38 minutes on a standard desktop computer. Training the neural network for 20 epochs took approx. 18 minutes, and obtaining 4000 posterior samples took less than a second. Thus, in this case, the training effort amortizes even after a single data set. Figure 2 shows the annual count of coal mining accidents overlaid with the estimated dynamic accident rate λ (posterior mean and ±1 standard deviation). Both methods estimate an almost identical latent trajectory for the lowlevel model parameter λ. Between the years 1880 and 1900, we observe a decrease in coal mining accidents followed by two temporary increases around the years 1905 and 1930. The estimated parameter dynamic closely follows these data patterns. Thus, we conclude that our neural method can estimate a plausible parameter dynamic for a simple low-level model and performs equally well compared to bayesloop.

Static Diffusion Decision Model
As a second benchmark, we compare our neural method to Stan in terms of Bayesian updating, assuming a "true" DDM with timeinvariant parameters. This benchmark serves two goals. Firstly, it aims to compare the estimation performance of our method with that of Stan, which is regarded as the gold standard for sampling-based Bayesian inference. Secondly, it aims to assure that our method can correctly identify stationary parameters when fitting a dynamic model on data generated from a stationary process (i.e., it does not estimate "pseudo-dynamics").
To this end, we simulated 100 data sets with 100 observations, each using a static DDM with 3 free parameters (see A Non-Stationary Diffusion Decision Model section) without parameter fluctuation over time. Then, we fit a non-stationary DDM with a Gaussian random walk transition model (cf. equation (3)) to all 100 data sets using both estimation methods. Again, we use the same prior distributions (see Appendix) to ensure comparability. We compared the two methods based on the following two performance metrics: i) the median absolute error (MAE) between the estimated posterior means and the data generating stationary parameters averaged across all 100 simulations, and ii) the average posterior standard deviation over time. These two metrics are common indicators for inferential model calibration, which aims to analyze the global behavior of the posterior distribution given possible observations from the prior predictive distribution [51]. The former metric informs us how well the posterior recovers the true model configurations (analogous to posterior z-scores). The latter metric indicates how much the posterior is informed by the data beyond the prior knowledge that was encoded in the prior distribution (analogous to posterior contraction) [51].
The upper panel of Figure 3 depicts the absolute difference between the true data generating parameters and the dynamically estimated posterior means over time, averaged over all 100 simulations for both methods separately. On average, the posterior means show a relatively large deviation from the true data generating parameters on early trials of the data. This difference then quickly decreases and flattens after approximately 25 trials. The performance of both methods concerning this metric is almost indistinguishable.
The lower panel of Figure 3 displays the median posterior contraction measured as posterior standard deviation over time for all 3 parameters separately. We observe considerable posterior contraction within the first 25 time points. Again, the performance of both methods is nearly identical. However, there is a large difference in estimation time between the two methods. As we are interested in the filtering posterior distributions, the Stan model has to be refit with every additional observation of a time series. Hence, we fit the Stan model to each simulated x 1:t , t = 1, . . . , T , which amounted to T = 100 re-fits per simulated data set. Fitting the model to all 100 synthetic data sets resulted in 100 x 100 model fits. This procedure took over 1 week of non-stop computing on a standard desktop computer -whereas training the neural network lasted approximately 8 hours with almost instantaneous fit to the 100 data sets thereafter. This is a non-negligible difference that will grow with longer time series, more data sets, or increased complexity until reaching a point where models can no longer be estimated with Stan due to limited processing resources or time constraints (see next section).
In summary, our method closely approximates the results obtained from bayesloop and Stan on the considered benchmark examples. Note, however, that our method is primarily designed for models where the above frameworks cannot be applied -higher dimensional models, possibly lacking a closed-form likelihood (i.e., available only as stochastic simulators), or many data sets consisting of long time series. The next application we present could be tackled with our neural approximators, but not with the above two frameworks.

Simulation Study
Next, we probe the parameter recoverability of a nonstationary DDM under different induced misspecifications (i.e., models that differ from the one used for training the network). To this end, we performed an extensive study for which we simulated data sets consisting of T = 400 time points in four different scenarios: i) A static DDM with constant parameters; ii) a DDM with stationary variability (commonly referred to as "inter-trial variability") where the 3 DDM parameter fluctuate randomly around a constant value; iii) a non-stationary DDM with a Gaussian random walk transition model; iv) and a DDM with constant parameters that jump abruptly and uniformly at three predefined time points (i.e., a regime switching model). Crucially, we trained the neural approximator only with simulations from the non-stationary model. However, during amortized inference, we applied the network to 200 data sets from each of the four scenarios. Thus, we could investigate the network's response in the open world setting where the true data generator may differ from the reference model used during the training phase. Figure 4 shows an exemplar fit of the non-stationary DDM with a random walk transition model to data sets from each of the four simulation scenarios. In the top row, we see that the estimated parameter trajectories converge to the constant ground-truth parameters. A similar pat- Figure 4: Example time-varying parameters estimated by our neural method in each scenario of the simulation study. Each row depicts the posterior estimates obtained from a single simulated person. The third row corresponds to the dynamic model used for training the network (i.e., well-specified case). The first, second, and fourth rows correspond to model variants not seen during training (i.e., misspecified cases).
tern emerges when the ground-truth parameters randomly fluctuate around a constant value (second row), yet we observe less uncertainty reduction. The third row depicts the posterior estimates based on a data set simulated from the reference non-stationary DDM (i.e., the well-specified case). Besides some local deviations from the groundtruth parameter trajectory, the model is able recover the overall trend of the dynamics. In the fourth row, we can inspect the posterior estimates from a data set simulated from the regime switching DDM which allows the parameters to "jump" uniformly at three time points to any value within the parameter bounds. Despite the severe misspecification, the random walk DDM is able to recover the discontinuous trajectories surprisingly well; still, the gradual change implied by the random walk transition does not allow for a rapid adaptation and exhibits a notable lag after each switch. ters in all 4 simulation scenarios at the selected time point. The recovery performance at other time points as well as further details and analyses (i.e., MAE over time) can be found in the Appendix.

Human Data Applications
Following our benchmarking and simulation studies, we applied non-stationary versions of the DDM to two separate data sets collected from response time (RT) experiments: i) A standard random-dot motion task (a maximum of T = 1320 trials per participant), and ii) very long time series (a maximum of T = 3200 trials per participant) from a lexical decision task. The first application serves as a starting point with data stemming from a popular task in experimental psychology. The second application showcases the utility of our method to estimate a complex nonstationary DDM with a Gaussian process (GP) transition model and multiple drift rate parameters for different diffi-culty conditions. Before fitting a model to empirical data, it is imperative to assess the faithfulness of the approximation method [43,52]. To this end, we perform simulationbased calibration [SBC; 53,54]. These analyses suggest that our neural Bayesian method exhibits reasonable calibration, with slightly miscalibrated posteriors for the nondecision time parameter (see Appendix for more details on calibration).

Random-Dot Motion Task
First, we fit a non-stationary DDM with a Gaussian random walk transition model to a data set retrieved from the experimental study of Evans and Brown [55]. We chose this data set because the purpose of the original study was to investigate the decline of the threshold parameter over time. The experiment had a 3 (Low, Medium, and High feedback) by 2 (Time and Trial condition) factorial between-subject design. Differently from our approach, Evans and Brown [55] subdivided the time series into trial bins and fitted a stationary hierarchical Bayesian DDM to each bin separately. Therefore, we can compare the parameter trajectories recovered by our neural superstatistics method with the estimates obtained by the original authors using Markov chain Monte Carlo (MCMC). Figure 6 depicts the trajectory of the threshold parameter aggregated across all individuals in a separate panel for each experimental condition. Note, that in the Time condition participants had a fixed amount of time they could spend on the task resulting in different time intervals. When we compare our estimates to those obtained by Evans and Brown [55], it becomes evident that both approaches yield similar qualitative and quantitative patterns. This result complements our promising results "in silico" and points to the convergent validity of our superstatistics approach in applications with real data.
Lexical Decision Task We fit the non-stationary DDM with a GP transition model (cf. equation (5)) to human behavioral data originating from a lexical decision-making task. The data consist of long RT and choice time series from four experimental conditions. For this application, we use four separate drift rates -one for each experimental condition. The length of these time series made it impossible to estimate the model with Stan (due to memory limitations and infeasible compute time). Thus, to increase the trustworthiness of the results obtained with our neural method, we resort to the established fast-dm software [36] as a benchmark, which is capable of estimating homogeneous (block) trial-by-trial fluctuations (i.e., inter-trial variabilities). We then compare the goodness of absolute fit in terms of re-simulation accuracy between both estimation methods and investigate the multi-horizon predictive performance of our method. Fur- Figure 6: Estimated trajectories of the DDM threshold parameter aggregated across all individuals for each between-subject experimental condition. The first column corresponds to the Time and the second to the Trial condition. The rows correspond to the three feedback conditions, Low, Medium, and High, respectively. The red solid lines depict the median of the individual posterior means and the red shaded area the 95% credibility interval of these posterior means.
ther, we analyze the main advantage of the non-stationary DDM, that is, the inferred trial-by-trial parameter dynamics, and compare those to the static fast-dm parameter estimates. Note that fast-dm is not a Bayesian method and is thus not included in our previous benchmark studies.
The left panel of Figure 7 depicts the empirical RT time series data of an individual participant in black (Figures for the remaining participants are available in the Appendix). To evaluate whether the non-stationary DDM is capable of capturing the empirical data, we perform posterior re-simulations on the first 3 blocks of the experiment (trials 1 − 2500). To this end, we draw 100 samples from the posterior distributions over θ 0:2499 to simulate 100 posterior-re-simulated data sets. The resulting RT time series are then summarized with the median and the 95% credibility interval (CI) across simulations and depicted in red color. We smooth the trial-by-trial empirical data and model outputs via a simple moving average (SMA) with a period of 5 to ease visual inspection of potential trends. Note, that the re-simulation from the fast-dm model is only shown in the marginal RT distribution on the right panel to avoid visual clutter.
The overall time series show that the individual's RTs decrease over time. Furthermore, the variability of the RTs, which is most pronounced in the first session, decreases considerably over time. The non-stationary DDM not only captures both of these overall trends, but also represents the shorter time oscillations within the empirical RT time series. The data also exhibits various sudden "jumps" in RTs, probably due to fluctuations in non-decisional processes, such as inattention. Unsurprisingly, these jumps are not fully accounted for by our non-stationary DDM since the high-level model (GP with squared exponential kernel) does not allow for sudden large changes in the low-level parameters.
We purposefully leave out the remaining 700 trials from the posterior re-simulation analysis to also test the predictive capabilities of the non-stationary DDM against heldout empirical data [56,57]. To this end, we generate 100 new parameter dynamics according to equation (5) with randomly drawn posterior samples of θ 2499 as initial parameter values and posterior samples of the highlevel Gaussian process parameters η. Then, we simulate 100 novel RT time series for the remaining 700 trials using the simulated parameter trajectories. The resulting RT time series are summarized in the same manner as before (median, 95% CI) and again smoothed with an SMA. The corresponding multi-horizon posterior predictions are depicted in Figure 7 with an orange color. The dynamic model yields accurate predictions on the held-out data and thus does not overfit the training data. Moreover, the heldout time series remain in the 95% CI of the multi-horizon prediction, which is the case for all individual data sets (see Appendix).
The right panel of Figure 7 depicts the empirical RT distributions (black) along with the data generated by the nonstationary DDM (red) and the static DDM (blue). Note that the three empirical RT distributions show a substantial overlap. Since the fast-dm re-simulations serve as a benchmark for the non-stationary DDM, it is essential to quantify if there are pronounced deviations between the re-simulated and the empirical RT distributions. To this end, we estimate the pairwise maximum mean discrepancy (MMD) between the three distributions for each individual separately and then average the resulting values across participants. MMD is a kernel-based statistical metric of equality between distributions [58].
Accordingly, our analysis reveals no pronounced differences between the three distributions. The average MMD between the empirical RT distributions and the ones predicted by the non-stationary DDM (M M D = 0.026, SD = 0.008) is lower than between the em-  The SDs of the average MMD values indicate that data generated with the non-stationary DDM are not only slightly more accurate on average but also more consistent compared to data generated from the standard DDM. For the sake of completeness, we also compare both re-simulated RT distributions (M M D = 0.035, SD = 0.019). This comparison reveals that the re-simulated RT distributions of the static DDM are more similar to the one obtained by the nonstationary DDM than to the empirical RT distribution. Altogether, both models can reproduce the empirical RT distributions with high fidelity, but the non-stationary DDM fits the data slightly better than the static DDM estimated with fast-dm.
In summary, our non-stationary DDM can closely resimulate and predict the temporal trajectory of empirical RT time series as well as corresponding raw RT distributions from all individuals (see Appendix). Even though the standard DDM also accounts for the marginal RT distribution, it cannot generate the observed heterogeneous RT time series data (cf. Figure 7).
However, the most decisive advantage of our nonstationary DDM over its stationary counterpart is that it can recover parameter dynamics directly from the empirical data. As the static parameters of fast-dm can only vary homogeneously around their mean, we cannot detect any systematic changes in the parameters over time. However, the dynamic parameters estimated with our neural method strongly suggest such systematic changes. All parameters of the non-stationary DDM seem to exhibit considerable fluctuations and notable oscillations throughout the experiment. Due to the assumption of homogeneous variation, the inter-trial variabilities inferred with fast-dm vastly overestimate the uncertainty in parameter estimates (cf. Figure 8). The dynamic drift rates fluctuate roughly within the uncertainty corridors spanned by the homogeneous inter-trial variabilities, but exhibit  much tighter error bars. As a consequence, local drift rates are much less uncertain than the homogeneous variability parameters indicate. On the other hand, the dynamic non-decision time τ fluctuates more than the corresponding flat inter-trial variability. Note that fast-dm does not support estimating inter-trial variability of the threshold a, so we only report the estimates of our neural method, suggesting a substantial decrease of the threshold parameter throughout the experiment. Notably, we observe a considerable mismatch between heterogeneous and homogeneous dynamics in almost all individuals (see Appendix).

Discussion
In this work, we explored the merits of superstatistics for representing non-stationary dynamics in cognitive processes, along with the utility of a neural Bayesian method for estimating superstatistical models. We verified the computational faithfulness and adequacy of our method using simulations and two benchmark studies. We then applied our method to a dynamic, non-stationary diffusion decision model and estimated the temporal trajectories of its key parameters, namely, drift rates, decision threshold, and non-decision time from the data of two experiments.
We showed that such a non-stationary model i) can indeed be fit to long time series of human data with high fidelity and ii) that the inferred heterogeneous dynamics reveal patterns that would have remained hidden by tradi-tional stationary models [2,6]. To our knowledge, this is the first attempt to augment a stationary cognitive model by employing a superstatistics framework.
Previous research has suggested that response times often exhibit heterogeneous dynamics [10,9]. It has also been shown that even the history of past choices can influence specific parameters of the DDM [40]. Hence, it seems plausible that the cognitive processes represented by the DDM parameters vary over time even within an experimental session due to internal psychological factors. This is exactly what was implied by the individual parameter dynamics inferred from the lexical decision task data set. However, as the data originates from an experiment that was not designed explicitly to test dynamic modeling, we need to be wary of any ad hoc interpretations concerning the estimated parameter dynamics.
Nevertheless, some of the recovered patterns may suggest interpretable underlying changes. For instance, the threshold parameter seemed to decrease within an experimental session for many individuals. This indicates that participants generally responded less cautiously toward the end of an experimental session. A plausible explanation for this change in response caution might be that participants became increasingly bored during a session and started to decrease their ambitions. Note that current DDM modeling approaches rarely account for such variation in the threshold parameter. Further, the drift rates generally tended to increase over time, suggesting that participants' increased their information processing speed over time. A change in the average rate of information uptake typically results in shorter RTs, which is precisely what we observed in most individual data sets (cf. Appendix). These increases in drift rates over time could imply the occurrence of learning effects. An important next step will be to tailor experiments with systematic manipulations from which we expect specific changes in some cognitive process and test whether the estimated parameter dynamics exhibit these changes.
Notwithstanding, our neural method has certain limitations. As can be seen in Figure 3, the values for most parameters change strongly at the beginning of the time series. One could be tempted to (falsely) claim that the psychological constructs mapped to these parameters drastically change at the beginning of the first session of the experiment. However, these early parameter trajectories should be interpreted with great caution as they can be quite dependent on the initial prior. As a result, we cannot easily differentiate between initially large Bayesian updates to move away from the prior or actual changes in the underlying process. As is the case for any dynamic process, our modeling approach may also not be sensible for data sets with few observations. In the context of psychological experiments, a possible remedy could be to use burn-in trials at the beginning of an experiment that only serves the purpose of having some data points to inform the plausible parameter values. At the same time, these could serve as practice trials during which participants get accustomed to the task.
Furthermore, the simulation study has demonstrated that the non-stationary DDM exhibits a good performance in recovering parameters across various scenarios. However, it is essential to acknowledge that there still exists an error between the true and estimated parameters. Especially for the drift rate parameter errors around 0.25 have been observed frequently. Consequently, interpreting small local changes in parameter values requires caution. Despite this limitation, we firmly believe that the proposed method excels particularly in scenarios where moderately large changes in parameters are expected to occur over the course of a couple of time steps.
Another limitation concerns the implementation of the low-level mechanistic model, that is, the DDM itself. We assumed four different drift rates -one for each stimulus type -which is the standard procedure used in the application of stationary DDMs [2]. This parameter is usually regarded as a proxy for average information uptake speed. However, in theory, there should only be one drift rate per participant [3] that changes over time, for instance, due to experimental manipulation. Thus, a non-stationary DDM could also incorporate only one drift rate parameter. In our experiment, the manipulation (i.e., four conditions) was randomized throughout the experiment. This implies that besides fluctuation stemming from other sources, the drift rate would "jump" from trial to trial based on this change in task difficulty. To account for these jumps, we would need a different high-level transition model whose changes can be bigger than what a smooth Gaussian process or Gaussian random walk allows. In order to keep the content of this article manageable, we decided against proposing a novel transition model.
Finally, there are numerous degrees of freedom when implementing a computational model -not only with respect to the low-level observation model, but also regarding the high-level transition model. Exploring different model specifications and then deciding which is the most sensible for the type of task and data at hand requires Bayesian model comparison. Concerning dynamic cognitive models, it would be of particular interest to test which highlevel transition model specification is most plausible for a given setting [24]. Since Bayesian model comparison is a topic in its own right, future studies should investigate the utility of simulation-based methods [59,60] for comparing competing superstatistical models.
We acknowledge that our study may not provide a definitive argument for when and why a non-stationary DDM is superior to a static DDM. The primary objective of this article is to showcase the implementation of nonstationary parameters within a superstatistics framework. However, we believe that the superstatistics framework, coupled with powerful neural approximators, gives rise to many new modeling opportunities and makes it possible to augment virtually any computational model with timevarying parameters. We think that there are many interesting research questions out there that could be investigated with the approach we propose in this work. Future studies can use our method to estimate even more challenging cognitive models than the DDM explored in this work and further extend its scope beyond cognitive science and psychology.

Methods Experimental Tasks
Random-Dot Motion Task The data set used in this study was adopted from Evans and Brown [55]. It includes data from 58 individuals, after excluding participants with a response accuracy below 70%. Each individual was randomly assigned to one of six groups, which were formed by two factors: the time vs. trial condition and three levels of feedback details. During the experiment, participants solved a total of 24 blocks of the task. In the trial condition, each block comprised 40 trials, whereas in the time condition, each block lasted for 1 minute. In each trial, participants were presented with a random dot kinematogram and were required to determine if some of the dots coherently moved to the top-left or topright direction. For more in-depth information about the experimental setup and methodology, refer to the comprehensive details provided in Evans and Brown [55].
Lexical Decision Task A total of 11 students from Heidelberg University participated in the experiment. Their average age was 23.81 (SD = 3.30) and 10 of the participants were female. All individuals gave written informed consent to the study, which was approved by the local ethics committee. The study was conducted according to the ethical declarations of Helsinki.
The participants performed a lexical decision-making task. On each trial, they had to assess if a presented letter string was a German word. As stimuli, we used high and low-frequency words, pseudo words that were generated by replacing vowels of existing words, and random letter strings. These four experimental conditions were pseudo-randomly presented throughout 3200 trials. All participants solved their task on 4 separate days (sessions) consisting of 800 trials each. The sessions were further split into 8 blocks of 100 trials with short breaks between them. On each trial, participants' choice (German word; non-German word) and response time was recorded.

Model Family
Following Mark et al. [24], we consider dynamic models that entail a low-level model with time-dependent parameters θ t , which vary according to a high-level model with static parameters η. The low-level model is defined by a likelihood function L, and the high-level model consists of a transition function T .
In this work, we aim to tackle general superstatistical models for which the low-level model likelihood L may not be available in closed-form. Such models are implemented as randomized stateful simulators that generate observable trajectories {x t } T t=1 via the following (very general) recurrent system: In the above equation, T is an arbitrary high-level transition function parameterized by η, G stands for an arbitrary (non-linear) transformation which encodes the functional assumptions of the low-level model. ξ t ∼ p(ξ) and z t ∼ p(z) are sources of random noise. The initial parameter configuration θ 0 follows a prior distribu-tion θ 0 ∼ p(θ) which encodes available information about plausible parameter values.
One example of a transition model T is a convolution with a Gaussian distribution, which implies a gradual change in the low-level model's parameters resembling a random walk: Another similar example would be a convolution with a fat-tailed distribution, allowing for abrupt changes in the parameter space. Furthermore, since our simulationbased setting is not limited to transition models with a Markov property, we can also test more complex transitions, such as a vector autoregression [VAR,61]: where p is the order of the VAR model (i.e., its look-back period), ξ t ∼ N (0, σ), and η = {c, A 1 , . . . , A p , σ} are the high-level parameters of the model.
We can even test transition models which depend on the entire history of the process, such as a Gaussian process [GP, 62] with mean function µ θ and covariance function K θ defined through the vector of time indices. The high-level parameters η in this case would be the free kernel parameters, such as the amplitude σ or the length-scale l of a Gaussian kernel A typical task in Bayesian analysis of dynamic systems is to recover both the entire trajectory of dynamic parameters {θ t } T t=1 as well as the vector of static parameters η. Since for many discrete dynamic systems, the current data point x t depends on the current parameter configuration θ t as well as on the observable history of the system x 1:t−1 , we can write the (implicit) point-wise likelihood as The point-wise likelihood describes the probability of each data point, given the parameter values of the same time step and all past data points [24]. Notably, we do not require this likelihood to be available in closed-form; we only need the ability to generate random draws through the forward-time generative process specified by equation (1).
Assuming the above factorization of the likelihood is possible, we aim to estimate the joint filtering posterior distribution of θ t and η up to each discrete time-step t p(θ t , η | x 1:t ) ∝ L t p(θ t | x 1:t−1 , η) p(η | x 1:t−1 ). (8) This posterior encodes the reduction in uncertainty regarding the dynamic states evolving over time and the static parameter values being increasingly constrained by the data. From this joint distribution, we can derive the corresponding marginal posteriors as follows: These distributions describe the average parameter dynamics over all possible high-level parameters and the best estimate for the high-level parameters up to discrete time-step t, respectively. Thus, learning both distributions amounts to standard Bayesian updating with an additional uncertainty factor due to the high-level transition model T . Thus, posterior contraction over time will strongly depend on the form of the transition model and may even increase in some cases, such as models allowing for sudden "jumps" in their parameters (i.e., regime switching behavior).

Neural Bayesian Estimation
Various methods for estimating dynamic models have been proposed in the literature. Markov chain Monte Carlo (MCMC) methods offer a viable but computationally demanding approach based on random draws from the posterior [63]. Variational inference (VI) methods approximate the true target posterior with simple, tractable densities and thus are a faster alternative to MCMC at the cost of a potential loss of posterior accuracy [63]. A recent promising approach for low-dimensional problems is the grid-based method of Mark et al. [24], which represents parameter distribution on discrete lattices and enables efficient approximation of model evidence.
However, the above methods all depend on the ability to evaluate the likelihood function L t at each time point explicitly. This restriction makes it impossible to efficiently test the growing number of simulator-based or non-analytic models of cognition to observed data [45,64]. Furthermore, MCMC and standard variational methods cannot leverage experience and require the same repeated computational effort for every new data set. For instance, when multiple participants complete a cognitive task, the same estimation procedures need to be repeated for each participant from scratch. Differently, hierarchical Bayesian models can be employed to jointly estimate group-and participant-level parameters, but they come with high computational costs and also rely on a closedform likelihood function.
In contrast, amortized inference refers to methods with a "pre-paid" computational cost -after an expensive op- Figure 9: A graphical illustration of our neural inference method. A recurrent neural approximator updates the posterior of the low-level model parameters θ t each time step t and yields the posterior over the high-level model parameters η considering all available data. The low-level prior constrains the initial dynamic parameter values θ 0 , which then get passed to the high-level transition model. Together, the two priors and the two models comprise a stochastic simulator that trains the neural approximator to perform amortized Bayesian updating.
timization or training phase, the same procedure can be instantly applied to any data set whose structure is compatible with the model [45,46]. As a useful "side effect", amortization allows us to easily perform extensive checks of computational faithfulness and parameter recoverability "in silico", since we can obtain posterior samples from hundreds or even thousands of simulated data sets by applying the same pre-trained network. Amortized Bayesian inference is typically realized by specialized neural networks, which are trained to become estimation experts from repeated model simulations [65,45]. The architecture of these networks can easily encode the probabilistic symmetry of the data, for instance, recurrent networks for temporal data [66] or permutation-invariant networks for IID data [67].
Crucially, dynamic models with time-varying parameters present a challenge to existing neural architectures since they induce a new joint posterior at each time-step p(θ t , η | x 1:t ). However, most previous architectures can only estimate a single set of parameters with no temporal information [45,65,47]. Thus, we propose to use a recurrent probabilistic neural architecture that estimates the joint posterior over all static and dynamic parameters for all discrete time points in a single forward pass.

Recurrent Estimation Method
Our proposed architecture consists of several neural components. First, a recurrent neural network (RNN) with learnable parameters ψ (r) embodying long short-term memory (LSTM) consumes the observed data sequentially: where the hidden state h t at each time point t represents the internal memory of the network over arbitrary temporal intervals. Thus, we can treat h t as a compact representation of the observable history up to time point t.
We employ a standard LSTM network, which consists of three gates: an input gate, an output gate, and a forget gate. These gates are responsible for weighing and integrating old and new information. Importantly, LSTM networks can naturally deal with sequences of varying length, which enables them to process streams of "online" data [66].
In order to recover the time-varying parameters θ t of the low-level model as well as the static high-level parameters η, we use the hidden state h t as a conditioning vector for a generative neural network with trainable weights ψ (g) . This network can be implemented as a conditional variant of any popular generative architecture for inference, such as coupling networks [68], autoregressive flows [69], or standard neural networks with probabilistic outputs [70]. The generative network is responsible for approximating the current joint posterior up to time step t given the outputs of the recurrent summary network: To reduce notational clutter, we set ψ = (ψ (r) , ψ (g) ) and assume that h t is expressive enough to encode all information contained in the data for correctly updating the prior (i.e., h t is a maximally informative summary statistic of x 1:t ).
While being mathematically equal, these factorizations imply different neural architectures and correspond-ing ancestral sampling schemes. The former factorization (equation (12)) requires a generative network for first sampling the high-level parameters from p(η | x 1:t ) and then sampling the low-level parameters from p(θ t | x 1:t , η), conditional on the sampled highlevel parameters. On the other hand, the latter factorization (equation (13)) requires a generative network for first sampling the low-level parameters from p(θ t | x 1:t ) and then sampling the high-level parameters from p(η | x 1:t , θ t ), conditional on the sampled low-level parameters.
In the current work, we consistently target the factorization in equation (12), but we were able to obtain comparable filtering results with either ancestral sampling strategy. In practice, we can either assume a multivariate Gaussian posterior for q(θ t | x 1:t , η, ψ) and q(η | x 1:t , ψ) as a dynamic extension of the basic method in [71] or estimate free-form posteriors as a dynamic extension of the BayesFlow method [45]. We use the former approach for the toy Coal Mining benchmark and the latter approach for all other experiments in this work.

Simulation-Based Training
where E [·] denotes an expectation over the dynamic generative model and ψ = (ψ (r) , ψ (g) ) denotes the collection of all trainable neural network parameters. This criterion ensures that the approximate posteriors match the analytic posteriors induced by the dynamic model and can be minimized either via online (i.e., generating dynamic simulations on the fly) or via offline training (i.e., using a set of pre-computed dynamic simulations).

A Non-Stationary Diffusion Decision Model
To illustrate the potential of our approach, we will reformulate in superstatistical terms a popular cognitive model for analyzing human response times (RTs) in binary decision tasks, namely the DDM. The standard DDM describes the microscopic dynamics of perceptual evidence accumulations via a simple stochastic ordinary differential equation (SDE). Accordingly, the accumulated evidence x j in experimental task j follows a random walk with drift and Gaussian noise: where t s represents time on a continuous microscopic scale (i.e., during forced-choice decision making). A core assumption of the DDM is that task-relevant information (i.e., perceptual evidence) accumulates at a constant rate (v). This process runs in a corridor with two absorbing boundaries, which represent two decision alternatives. As soon as the accumulated evidence x j reaches either a predefined threshold (a) or 0, the model makes a categorical decision D j for the alternative favored by the collected evidence: Further, the model assumes a constant additive factor (τ ) accounting for non-decision processes, such as encoding or motor responses. Thus, the standard (static) DDM has three key parameters θ = (v, a, τ ). The starting point of the decision process is either estimated as an additional parameter or fixed at a/2.
The typical assumption of the standard DDM is that the parameters θ remain stationary for the duration of a given cognitive task. In order to relax this restrictive assumption, the standard DDM has been extended to incorporate so-called inter-trial-variability for the drift rate and non-decision time parameters [72,73]. In this way, the extended DDM concedes that these cognitive parameters are not static but vary over time. However, the assumed variation is homogeneous and memoryless, and the generative model still yields IID data, that is, the transition model coincides with independent sampling and reduces to θ t = T (η, ξ t ).
In contrast, our superstatistical model assumes a stateful Gaussian process (GP) high-level model, which describes the trial-by-trial dynamics of the DDM parameters according to equation (5) and (6) (see Appendix for detail).
Thereby, we want to demonstrate that our estimation method can tackle very flexible transition models T , as long as we can simulate data from the low-level model. However, we also fit a DDM with a simpler Gaussian random walk transition model to the data described in the Human Data Application section. This simpler model corroborates our findings by suggesting qualitatively similar parameter dynamics, but yields less sharp predictions on unseen data than its GP counterpart (see Appendix for more details).

Acknowledgments
We thank Daniel Durstewitz and Lasse Elsemüller for their helpful feedback on this project. We also thank Marie Wieschen for her efforts in data collection.

Author contributions
L.S., S.T.R., designed the research, created and applied the models, and wrote the initial draft of the manuscript. P.C.B. significantly contributed with his statistical and scientific expertise to the methods and the written content. A.V. and U.K. supervised the project and contributed to all stages.

Data and Code Availability
All models, data, and scripts for reproducing the results of this work are publicly available in the project's repository https://github.com/bayesflow-org/ Neural-Superstatistics. Our methods are implemented in the BayesFlow Python library for amortized Bayesian workflows [74].

Competing interests
The authors declare no competing interest.

Appendix Implementation Details
All experiments, neural networks, and simulation models are implemented using the BayesFlow library https: //github.com/stefanradev93/BayesFlow built on top of TensorFlow [75]. Code and further instructions for reproducing the results from all experiments and applications in the current manuscript is available at https: //github.com/bayesflow-org/Neural-Superstatistics.

Stan Benchmark Study Data Simulation
To simulate the 100 data sets each consisting of T = 100 trials, we used the standard DDM implementation (cf. equation (15) in the main text). The diffusion constant was fixed to 1 and the starting point parameter to 0.5 (i.e., symmetric starting point between the two decision boundaries). For data simulation, we randomly sampled parameter sets from the following prior distributions: where Γ(a, b) denotes a Gamma distributions with shape a as the first and scale b as the second argument.

Non-Stationary DDM fitting
We fitted a separate non-stationary DDM with a Gaussian random walk transition model to all 100 simulated data sets. The same implementation and likelihood was used for Stan and our neural estimation method. However, all 3 parameters were allowed to vary according to a Gaussian random walk (cf. equation (3) in the main text). The starting values were sampled from the same prior distributions as in simulation. The hyperparameters of the random walk transition model were sampled from the following distribution: where B(α, β) denotes a Beta distribution with α and β parameters. In order to avoid implausible parameter values, the time-varying parameters v t , a t , τ t were clipped to lower bounds [0, 0, 0] and upper bounds [6,4,2], respectively.
We trained the neural approximator via online learning (i.e., simulations on the fly) for 50 epochs with 1000 iterations each and a batch size of 8. We use an Adam optimizer with an initial learning rate of 5 × 10 −4 and a cosine learning rate decay schedule. After training the network, we draw 4000 posterior samples (the same as with Stan) for each of the 100 data sets.

Simulation Study
In what follows, we describe the settings for the four different simulation scenarios, namely, the static DDM, the DDM with stationary variability, the DDM with non-stationary variability, and the static DDM with random uniform jumps at pre-defined time steps (i.e., regime switching DDM). For each scenario, we simulated 200 data sets, each consisting of T = 400 time steps.
We trained the neural approximator via online learning (i.e., simulations on the fly) for 75 epochs with 1000 iterations each and a batch size of 8. We use an Adam optimizer with an initial learning rate of 5 × 10 −4 and a cosine learning rate decay schedule. After training the network, we draw 4000 posterior samples (the same as with Stan) for each of the 100 data sets.

Static DDM
To simulate the 200 data sets for the static DDM scenario, we used the same prior and likelihood as in the Stan Benchmark Study.

Stationary Variability DDM
For the stationary variability DDM, we used the same DDM implementation as in the static DDM scenario except that we used the following variability statements: where N (µ, σ) denotes a Normal distribution with location µ and standard deviation σ and U(lower, upper) denotes an Uniform distribution with a lower and a upper bound.
The newly introduced variability parameters (v s , a s , τ s ) were sampled from the following prior distributions: To avoid implausible values the per trial parameters v t , a t , τ t were bounded with lower bounds [0, 0, 0] and upper bounds [6,4,2] respectively.

Non-Stationary DDM
We used the same non-stationary DDM implementation as described in Stan Benchmark Study.

Regime Switching DDM
The regime switching DDM is basically the same implementation as the static DDM, but the parameter jumped uniformly at 3 specific time steps (T = 100; T = 200; T = 300) and stayed again constant after the jump: where the lower and upper bounds of the Uniform distributions are [0, 0, 0] and [6,4,2], respectively. The starting values of the parameters were once again sampled form the same prior distributions as in the static DDM.

Amortized inference
We fitted the same non-stationary DDM with a Gaussian transition model

Mean Absolute Error
As an additional analysis of the overall parameter recovery performance of the non-stationary DDM, we computed the median absolute error (MAE) between the true data generating parameter and the posterior mean for all DDM parameters and simulation scenarios separately. In the top row of Figure A. 15 we can see that the posterior estimates of the non-stationary DDM quickly approach the true data-generating parameter when the true parameter was constant over time. That said, there remains some error between the true and estimated parameter even after 400 time steps. This error is the largest in the drift rate parameter (≈ 0.15). We see similar recovery performance in the scenario, where the parameters were allowed to randomly fluctuate around a constant value (second row in Figure A.15). However, we observe a larger variability in the MAE. The third row depicts the MAE when data was simulated with the same model as we fitted to the data (i.e., the well-specified case). Once again, the MAE quickly decreases in the beginning and then flattens out. However, in this scenario, the MAE remains on a larger level than in the previous two scenarios. Also, the variability of the MAE between data fits is larger. This is not surprising because the estimation of non-stationary model parameters is more difficult than static or stationary variable parameters. The last row in Figure A.15 shows how the parameter estimates of the non-stationary DDM react to sudden jumps in otherwise constant parameters. We observe that the MAE significantly increases when a jump occurred and then decreases again.

Human Data Application: Random-Dot Motion
We fitted a non-stationary DDM with a Gaussian random walk transition model to each individual in the data set separately. The model implementation was the same as in the Stan Benchmark and the Simulation Study. For the training of our neural estimation method we used 75 epoch each consisting of 1000 iteration with a batch size of 16.
After training we obtained 2000 posterior samples.

Human Data Application: Lexical Decision Gaussian Process DDM
For the Gaussian Process transition model, we first create a T x T squared distance matrix with T = 3200. Based on this distance matrix we calculate the radial basis function kernel (cf. equation (6) in the main text) given the two parameters, amplitude σ and length-scale l, resulting in the covariance k for the multivariate normal distribution of the Gaussian Process: where µ θ is the mean parameter value. For these means we used the same priors we otherwise used for the starting values of the DDM parameters (v 0,i , a 0 , τ 0 ). In the following we present a list of the priors used by the Gaussian Process DDM simulator to generate data for the simulation study and for training the neural networks. Γ(a, b) refers to a Gamma distribution parameterized with shape a and scale b. The same prior distribution was used for all i = 4 drift rates v 0,i . U(a, b) stands for a continuous uniform distribution with a lower limit a and an upper limit b. l j denotes the length-scale parameters of the GP transition model. The same prior distribution was used for all j = 1, . . . , 6 lengthscale parameters governing the transitions of the DDM parameters. The amplitude parameter σ of the Gaussian kernel is usually highly correlates with the length-scale l. Thus, we fixed σ to sensible values for all low-level parameter transitions.

Simulation-Based Calibration
We validate the computational faithfulness of our Bayesian inference algorithm using simulation-based calibration, a robust method for ensuring unbiased posterior distributions. The underlying principle is that an ensemble of posterior distributions should be indistinguishable from the prior distribution. To accomplish this, we carry out 2000 simulations with the dynamic DDM, each generating a separate data set. For each simulated data set, we fit the model and obtain 250 posterior samples. These posterior distributions collectively form an ensemble.
When we calculate rank statistics for the ensemble relative to the prior distribution then these should be uniformly distributed. To assess the uniformity at predefined time points, we utilize the empirical cumulative distribution function (ECDF) for each marginal rank distribution. Comparing it with a uniform ECDF allows us to gauge how the data is distributed. We further draw ECDF simultaneous bands using simulations from the uniform, providing an intuitive graphical test for uniformity. For clarity, Figure A.16 presents the ECDF difference, providing a more dynamic range for the visualization. The red line (ECDF difference) should consistently fall within the gray shaded area (confidence band) across the entire range of fractional rank statistic values. In the majority of cases, this criterion is met for most parameters at all selected time points. Some slight deviations are observed for the threshold and non-decision parameters; however, these are typically small and not a major cause for concern.

Individual Model Fits and Predictions
In the following, we show the fit and multi-horizon predictions of the dynamic DDM on the individual data of the remaining 10 participants not shown in the main text.

Individual Parameter Dynamics
In the following, we show the inferred parameter dynamics of the remaining 10 participants not shown in the main text.

Gaussian Random Walk Transition Model
We wanted to test if our neural estimation method can also estimate dynamic models with a simpler high-level transition model than a Gaussian process (GP). To this end, we fit a dynamic DDM with a Gaussian random walk as a transition model to the empirical data set described in the Human data application section: θ t = T (θ t−1 , η, z t ) = θ t−1 + η z t with z t ∼ N (0, 1) We use a Beta prior distribution parameterized with α and β for the standard deviations η j of the Gaussian random walk transition model. The same prior distribution is used for all j = 6 low-level parameter transitions: We trained the same neural network architecture as described in the main text for 50 epochs, 1000 batches per epoch, and a batch size of 8. The following figures show the results from simulation-based calibration (SBC), the model fit and inferred parameter dynamics for the same exemplar participant shown in the main text. Additionally, we depict the estimated parameter dynamics averaged across all individuals for comparison. These results are very similar to those obtained with the GP-DDM, which uses a Gaussian process as a transition model. However, the model with the Gaussian process transition model produces sharper predictions on unseen data. Note, that the dynamics implied by the random walk transition model are less sharper (i.e., contain more uncertainty) than those implied by the GP transition model.