Introduction

Sensorimotor control depends on accurate estimation of internal state variables1,2,3,4,5. Numerous experiments have used Bayesian estimation theory to demonstrate that humans estimate internal states by integrating multiple sources of information including prior beliefs and sensory cues from various modalities6,7,8,9,10,11,12,13,14,15,16. Bayesian estimation is typically formulated in terms of three components: prior distributions representing a priori beliefs about state variables, likelihood functions derived from noisy sensory measurements, and cost functions that characterize reward contingencies17. In this formulation, the likelihood function and prior distribution are combined to compute a posterior distribution and the cost function is used to extract an estimate that maximizes expected reward. This formulation is the basis of most psychophysical studies of Bayesian integration9,10,11,12,13,14,15,18,19,20.

Implicit in this formulation is the assumption that the brain has access to priors, likelihoods, and cost functions. Access to these quantities is appealing as it could support rapid and optimal state estimation without the need to learn new policies for novel behavioral contexts21,22. However, in most experiments, Bayes-optimal behavior can also be achieved by simpler algorithms that do not depend on direct access to likelihoods, priors and cost functions21,22,23. For example, optimal cue combination in the presence of Gaussian noise may be implemented by a weighted sum of measurements6. Similarly, integration of noisy evidence with prior beliefs may be implemented by a suitable functional mapping between measurements and estimates24,25. Finally, online estimation of a variable from sequential measurements that are subject to Gaussian noise can be achieved by a Kalman filter that only keeps track of the mean and variance26 without representing and updating the full posterior distribution.

In contrast to simple laboratory tasks, optimal inference in natural settings is often intractable and involves approximations that may deviate from optimality27. Therefore, it is critical to go beyond statements of optimality and suboptimality, and assess the inference algorithms humans use during sensorimotor and cognitive tasks28,29. Further, as articulated by Marr30, characterization of the underlying algorithms could establish a bridge between behaviorally relevant computations and neurobiological mechanisms.

We devised an experiment in which the computational demands for optimal Bayesian estimation were incompatible with simple algorithmic solutions. Subjects had to reproduce an interval by integrating their prior belief with one or two measurements of the interval. Several previous experiments have reported a decrease in perceptual or motor variability when subjects are given multiple intervals to measure31,32,33,34,35,36,37,38,39,40, but the underlying algorithms have not been characterized. An important constraint for developing a suitable algorithmic model is that noise associated with measurement and production of time intervals scales with duration41. A consequence of this so-called scalar property of noise is that simple algorithms that only update certain parameters of the posterior (e.g., mean and/or variance) cannot emulate Bayes-optimal behavior. Therefore, optimal behavior in this paradigm would provide strong evidence that the underlying inference algorithm involves updating probability distributions. Conversely, suboptimal behavior would suggest that subjects rely on a simpler algorithm. We found that when subjects made two measurements their performance was suboptimal. Furthermore, comparison of behavior with various models indicated that subjects relied on an inference algorithm that used measurements to update point estimates through point nonlinearities.

Results

Subjects integrate interval measurements with prior knowledge

Subjects performed an interval reproduction task consisting of two randomly interleaved trial types (Fig. 1A,B). In “1-2-Go” trials, two flashes (S1 followed by S2) demarcated a sample interval (ts). Subjects had to reproduce ts immediately after S2. The interval between the onset of S2 and the key press was designated the production interval (tp). In “1-2-3-Go” trials, ts was presented twice, demarcated once by S1 and S2 and once by S2 and S3, providing the opportunity to make two measurements (Fig. 1B). As in 1-2-Go trials, subjects had to match tp (the interval between S3 and the key press) to ts. Across trials, ts was drawn from a discrete uniform distribution ranging between 600 and 1000 ms (Fig. 1C). Subjects received two forms of trial-by-trial feedback based on the magnitude and sign of the error. First, a feedback stimulus was presented whose location relative to the warning stimulus reflected the magnitude and sign of the error (Fig. 1A,B; see Methods). Second, if the error exceeded a threshold window (Fig. 1D), the stimuli remained white and a tone denoting an incorrect response was presented. Otherwise, the stimuli turned green and a tone denoting a correct response was presented. The threshold window for correct performance was proportionally larger for longer ts to accommodate the scalar variability of timing due to signal-dependent noise34,42,43,44,45,46. The threshold was adjusted adaptively and on a trial-by-trial basis according to performance (see Methods).

Figure 1
figure 1

The 1-2-Go and 1-2-3-Go interval reproduction task. (A,B) Task design. Each trial began with the appearance of a fixation spot (Fix on). The color of the fixation spot informed the subject of the trial type: blue for 1-2-Go, and red for 1-2-3-Go. After a random delay, a warning stimulus (large white circle) appeared. Additionally, two or three small white rectangles were presented above the fixation spot. The number of rectangles was associated with the number of upcoming flashes. After another random delay, two (S1 and S2 for 1-2-Go) or three (S1, S2 and S3 for 1-2-3-Go) white annuli were flashed for 100 ms in a sequence around the fixation spot. Consecutive flashes were separated by the duration of the sample interval (ts). With the disappearance of each flash, one of the small rectangles also disappeared (rightmost first and leftmost last). The white rectangles were provided to help subjects keep track of events during the trial. Subjects had to press a button after the last flash to produce an interval (tp) that matched ts. Immediately after button press, subjects received feedback. The feedback was a small circle that was presented to the left or right of the warning stimulus depending on whether tp was larger or smaller than ts, respectively. The distance of the feedback circle to the center of the warning stimulus was proportional to the magnitude of the error (tp − ts). (C) Experimental distribution of sample intervals. (D) Feedback. Subjects received positive feedback if production times fell within the green region. The width of the positive feedback window was scaled with ts.

Subjects’ timing behavior exhibited three characteristic features (Fig. 2). First, tp increased monotonically with ts. Second, tp was systematically biased toward the mean of the prior, as evident from the tendency of responses to deviate from ts (diagonal) and gravitate toward the mean of ts. As proposed previously24,47,48,49, this so-called regression to the mean indicated that subjects relied on their knowledge of the prior distribution of ts. Third, performance was better in the 1-2-3-Go condition, in which subjects made two measurements, as evidenced by a lower root-mean-square error (RMSE) in the 1-2-3-Go compared to the 1-2-Go condition (Fig. 2C; permutation test; p-value < 0.01 for all subjects; see Supplementary Table 1 for a summary of RMSE data by subject and condition). This observation indicates that subjects combined the two measurements to improve their estimates by decreasing variability and systematic biases, corroborating reports from other behavioral paradigms31,32,33,34,35,36,37,38,39,40,50,51. Combined with the systematic bias toward the mean of ts, these results indicated that subjects integrated prior information with one or two measurements to improve their performance.

Figure 2
figure 2

Performance in the interval reproduction task. (A) Production interval (tp) as a function of sample interval (ts) for a low sensitivity subject (GB). Filled circles and error bars show the mean and standard deviation of tp for each ts in the 1-2-Go (blue) and 1-2-3-Go (red) conditions. The dotted unity line represents perfect performance and the colored lines show the expected tp from a Bayes Least-Squares (BLS) model fit to the data. Inset: root-mean-square error (RMSE) in the 1-2-Go (blue) and 1-2-3-Go (red) conditions differed significantly (asterisk, p-value < 0.01; permutation test). (B) Same as (A) for a high sensitivity subject (LB). (C) The histogram of changes in RMSE across conditions for all subjects. See also Supplementary Table 1.

A Bayesian model of behavior

Building on previous work11,18,22,52, we asked whether subjects’ behavior could be accounted for by a Bayesian observer model based on the Bayes Least-Squares (BLS) estimator. For the 1-2-Go trials, the observer model (1) makes a noisy measurement of ts, which we denote by \({t}_{{m}_{1}}\), (2) combines the likelihood function associated with \({t}_{{m}_{1}}\), \(p({t}_{{m}_{1}}|{t}_{s})\), with the prior distribution of ts, p(ts), to compute the posterior, \(p({t}_{s}|{t}_{{m}_{1}})\), and (3) uses the mean of the posterior as the optimal estimate, \({t}_{{e}_{1}}\). We modeled \(p({t}_{{m}_{1}}|{t}_{s})\) as a Gaussian distribution centered at ts with standard deviation, σm, proportional to ts with constant of proportionality, wm; i.e., σm = wmts (Fig. 3A, left box). We assumed that the production process was also perturbed by noise and modeled tp as a sample from a Gaussian distribution centered at \({t}_{{e}_{1}}\) with standard deviation, σp, proportional to \({t}_{{e}_{1}}\) with constant of proportionality, wp; i.e., \({\sigma }_{p}={w}_{p}{t}_{{e}_{1}}\) (Fig. 3A, right box). Note that the entire operation of the BLS estimator can be described in terms of a deterministic mapping of \({t}_{{m}_{1}}\) to \({t}_{{e}_{1}}\) using a nonlinear function, which we denote as \({f}_{BL{S}_{1}}({t}_{{m}_{1}})\) (Fig. 3B)24.

Figure 3
figure 3

BLS model of interval integration. (A) BLS model for 1-2-Go trials. The left panel illustrates the measurement process. The measured interval, \({t}_{{m}_{1}}\), is perturbed by zero-mean Gaussian noise whose standard deviation is proportional to the sample interval, ts, with constant of proportionality wm (σm = wmts). The middle panel illustrates the estimation process. The model multiplies the likelihood function associated with \({t}_{{m}_{1}}\) (middle panel, green) with the prior (bottom), and uses the mean of the posterior (top) to derive an interval estimate (\({t}_{{e}_{1}}\), black vertical line on the posterior). The right panel illustrates the production process. The produced interval, tp, is perturbed by zero-mean Gaussian noise with standard deviation proportional to \({t}_{{e}_{1}}\), with constant of proportionality wp (\({\sigma }_{p}={w}_{p}{t}_{{e}_{1}}\)). (B) The effective mapping function (\({f}_{BL{S}_{1}}\), black curve) from the first measurement, \({t}_{{m}_{1}}\), to the optimal estimate, \({t}_{{e}_{1}}\). The dashed line indicates unity. (C) BLS model for 1-2-3-Go trials. The model uses the posterior after the first measurement, \(p({t}_{s}|{t}_{{m}_{1}})\), as the prior and combines it with the likelihood of the second measurement (\({t}_{{m}_{2}}\), orange) to compute an updated posterior, \(p({t}_{s}|{t}_{{m}_{1}},{t}_{{m}_{2}})\). The mean of the updated posterior is taken as the interval estimate (\({t}_{{e}_{2}}\)). (D) The effective mapping function (\({f}_{BL{S}_{2}}\), grayscale) from each combination of measurements, \({t}_{{m}_{1}}\) and \({t}_{{m}_{2}}\), to the optimal estimate, \({t}_{{e}_{2}}\). Red lines indicate combinations of measurements that lead to identical estimates (shown for \({t}_{{e}_{2}}=\) 700, 750, 800, 850, and 900 ms).

For the 1-2-3-Go trials, the observer model (1) makes two measurements, \({t}_{{m}_{1}}\) and \({t}_{{m}_{2}}\), (2) combines the likelihood, \(p({t}_{{m}_{1}},\,{t}_{{m}_{2}}|{t}_{s})\), with the prior, p(ts), to compute the posterior, \(p({t}_{s}|{t}_{{m}_{1}},\,{t}_{{m}_{2}})\), and (3) uses the mean of the posterior to derive an optimal estimate, \({t}_{{e}_{2}}\). When the measurements are conditionally independent, the posterior is proportional to \(p({t}_{{m}_{2}}|{t}_{s})p({t}_{{m}_{1}}|{t}_{s})p({t}_{s})\) which can be rewritten as \(p({t}_{{m}_{2}}|{t}_{s})p({t}_{s}|{t}_{{m}_{1}})\). This revised formulation can be interpreted in terms of an updating strategy in which the observer uses the posterior after one measurement, \(p({t}_{s}|{t}_{{m}_{1}})\), as the prior for the second measurement (Fig. 3C; see Methods). In these trials, the mapping from \({t}_{{m}_{1}}\) and \({t}_{{m}_{2}}\) to \({t}_{{e}_{2}}\) can be described in terms of a two-dimensional nonlinear function, denoted by \({f}_{BL{S}_{2}}({t}_{{m}_{1}},\,{t}_{{m}_{2}})\) (Fig. 3D). Note that the iso-estimate contours of \({f}_{BL{S}_{2}}({t}_{{m}_{1}},\,{t}_{{m}_{2}})\) are nonlinear and convex (Fig. 3D, red). The nonlinearity indicates that the effect of \({t}_{{m}_{1}}\) and \({t}_{{m}_{2}}\) on \({t}_{{e}_{2}}\) is non-separable, and the convexity indicates that \({t}_{{e}_{2}}\) is more strongly influenced by the larger of the two measurements. These features are direct consequences of scalar noise and are not present when measurements are perturbed by Gaussian noise (see Appendix). Note that the BLS model for the 1-2-3-Go task reduces RMSE by reducing both prior-induced biases and variability, as previous studies have reported31,32,33,34,35,36,37,38,39,40,50,51.

We fit the model to each subject’s data assuming that responses in both 1-2-Go and 1-2-3-Go conditions were associated with the same wm and wp (Methods). The model was augmented in two ways to ensure that estimates of wm and wp were accurate. First, we included an offset parameter to absorb interval-independent biases (e.g., consistently pressing the button too early or too late). Second, trials in which tp grossly deviated from ts were designated as “lapse” trials (see Methods).

Model fits captured subjects’ behavior for both conditions as shown by a few representative subjects (Figs 2A,B, 4A; see Supplementary Fig. 1 for fits to all the subjects). Following previous work24, we evaluated model fits using two statistics, an overall bias, BIAS, and an overall variability, \(\sqrt{{\rm{VAR}}}\) (see Methods). As shown in Fig. 4B, the model broadly captured the bias and variance for all subjects in both 1-2-Go and 1-2-3-Go conditions. We did not find any systematic difference in \(\sqrt{{\rm{VAR}}}\) between the model and data (see Supplementary Fig. 2). In contrast, the observed BIAS was significantly larger than predicted by the model fits in the 1-2-3-Go condition (Fig. 4B, inset; two tailed t-test, t(8) = 4.6982, p-value = 0.0015), but not in the 1-2-Go condition (two tailed t-test, t(8) = −0.3236, p-value = 0.7546).
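For reference, a minimal sketch of how these summary statistics can be computed from paired (ts, tp) data, assuming the decomposition \({{\rm{RMSE}}}^{2}={{\rm{BIAS}}}^{2}+{\rm{VAR}}\) as in previous work24 (see Methods for the exact definitions; the code below is only an illustrative sketch):

```python
import numpy as np

def summary_stats(ts, tp):
    """RMSE, BIAS, and sqrt(VAR) from paired sample (ts) and production (tp) intervals.

    Assumes BIAS^2 is the mean squared bias across sample intervals and VAR the
    mean variance across sample intervals, so that RMSE^2 ~= BIAS^2 + VAR.
    """
    ts, tp = np.asarray(ts, float), np.asarray(tp, float)
    bias_sq, var = [], []
    for t in np.unique(ts):
        resp = tp[ts == t]
        bias_sq.append((resp.mean() - t) ** 2)   # squared bias at this ts
        var.append(resp.var())                   # variance of tp at this ts
    rmse = np.sqrt(np.mean((tp - ts) ** 2))
    return rmse, np.sqrt(np.mean(bias_sq)), np.sqrt(np.mean(var))
```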

Figure 4
figure 4

BLS model fits to data. (A) Behavior of two subjects and the corresponding BLS model fits with the same format as in Fig. 2A,B. (B) BIAS (circles) and \(\sqrt{{\rm{VAR}}}\) (squares) of each subject (abscissa) and the corresponding values computed from simulations of the fitted BLS model (ordinate). Blue and red points correspond to the 1-2-Go and 1-2-3-Go conditions, respectively. The dotted line plots unity. Data points corresponding to subjects SM and CV are marked by light green and light blue, respectively. Inset: difference between the BIAS observed from data and that expected by the BLS model fit for 1-2-Go (blue) and 1-2-3-Go (red) conditions. (C) Comparison of behavioral performance to model predictions. Each line connects the RMSE for 1-2-Go (left) and 1-2-3-Go (right) conditions for one subject. To facilitate comparison across subjects, RMSE values for each subject were normalized by the RMSE of the BLS model in the 1-2-3-Go condition. The black circles and error bars correspond to the mean and standard error of the normalized RMSE across subjects. See also Supplementary Figs 1, 2, and 3.

We quantified this observation across subjects by normalizing each subject’s RMSE in the 1-2-Go and 1-2-3-Go conditions to the RMSE expected from the BLS model in the 1-2-3-Go condition (Fig. 4C). We found that the observed RMSE in the 1-2-3-Go condition was significantly larger than expected (two-tailed t-test, t(8) = 3.5484, p-value = 0.007). Further, the drop in observed RMSE in the 1-2-3-Go condition was significantly smaller than expected from the BLS model (see Supplementary Fig. 3). These analyses indicate that subjects were able to integrate the two measurements but failed to optimally update the posterior by the likelihood information associated with the second measurement.

Our original BLS model assumed that the noise statistics for \({t}_{{m}_{1}}\) and \({t}_{{m}_{2}}\) were identical. However, the noise statistics of the two measurements may differ for two reasons. First, the internal representation of the first measurement may be subject to additional noise since it has to be held longer in working memory. Second, after the first measurement is made, subjects may be able to benefit from anticipatory and attentional mechanisms to make a more accurate second measurement. To accommodate both possibilities, we also formulated an optimal estimator, BLSmem, in which the two measurements were associated with different noise statistics (see Methods, Supplementary Fig. 4). Despite this added flexibility, BLSmem failed to capture the BIAS observed from data in 1-2-3-Go trials (two-tailed t-test, t(8) = 4.9690, p-value = 0.0011; Supplementary Fig. 4).

An algorithmic view of Bayesian integration

The success of the BLS model in capturing behavior in the 1-2-Go condition24,47,49 and its failure in the 1-2-3-Go condition suggests that subjects were unable to update the posterior by the second measurement. We examined a number of simple inference algorithms that could account for this limitation. One of the simplest algorithms proposed for integrating sequential measurements is the Kalman filter. The Kalman filter only updates the mean and variance of the posterior26. This strategy is optimal when measurement noise is Gaussian because a Gaussian distribution is fully determined by its mean and variance. More generally, when integrating the likelihood function leaves the parametric form of the posterior distribution unchanged, a simple inference algorithm that updates those parameters can implement optimal integration.

First, we asked whether a similarly simple and optimal updating algorithm exists when the noise is signal-dependent (i.e., scalar noise). For the posterior to have the same parametric form after one and two measurements, the product of two likelihood functions must have the same parametric form as a single likelihood function. We tested this property analytically and verified that the parametric form of the likelihood function associated with scalar noise is not invariant under multiplication (see Appendix). As a result, the updating algorithm requires adjustment within each trial depending on \({t}_{{m}_{1}}\) (Supplementary Fig. 5). In other words, any inference algorithm that only updates certain statistics of the posterior (e.g., mean and variance) is expected to behave suboptimally when multiple time intervals have to be integrated. Since subjects’ behavior in the 1-2-3-Go condition was indeed suboptimal, we hypothesized that they might have used a simple updating algorithm analogous to the Kalman filter to integrate multiple measurements.
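One way to see this (a sketch; the full treatment is in the Appendix) is to write out the product of two scalar-noise likelihoods (Eq (2) in Methods); the hatted symbols below are introduced only for this illustration:

$$\lambda ({t}_{{m}_{1}}|{t}_{s})\,\lambda ({t}_{{m}_{2}}|{t}_{s})\propto \frac{1}{{t}_{s}^{2}}\,{e}^{-\frac{{({t}_{{m}_{1}}-{t}_{s})}^{2}+{({t}_{{m}_{2}}-{t}_{s})}^{2}}{2{w}_{m}^{2}{t}_{s}^{2}}},$$

whereas any single likelihood of the same parametric form is proportional to \(\frac{1}{{t}_{s}}{e}^{-{(\hat{t}-{t}_{s})}^{2}/2{\hat{w}}^{2}{t}_{s}^{2}}\) for some effective measurement \(\hat{t}\) and Weber fraction \(\hat{w}\). Both exponents are quadratic in \(1/{t}_{s}\), but the prefactors (\(1/{t}_{s}^{2}\) versus \(1/{t}_{s}\)) differ, so no choice of \(\hat{t}\) and \(\hat{w}\) can make the two expressions proportional as functions of ts; the scalar-noise likelihood family is therefore not closed under multiplication.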

A linear-nonlinear estimator (LNE) model for approximate Bayesian inference

The first algorithm we tested was one in which the observer combines the last estimate, \({\bar{t}}_{n-1}\), with the current measurement, \({t}_{{m}_{n}}\), using a linear updating strategy. If we denote the corresponding weights by 1−kn and kn and set kn = 1/n, this algorithm tracks the running average of the measurements, \({\bar{t}}_{n}\) (k1 and k2 are 1 and 0.5, respectively). This and similar models with linear updating schemes32,53,54,55,56,57,58,59,60 would certainly fail to account for the observed nonlinearities in subjects’ behavior (Supplementary Fig. 6). Therefore, we constructed a linear-nonlinear estimator (LNE) that augmented the linear updating with a point nonlinearity that could account for the observed prior-dependent biases in tp (Fig. 5A). The nonlinear function, \({f}_{BL{S}_{1}}({\bar{t}}_{n})\), was chosen to match the BLS estimator for a single measurement (n = 1), which is determined by wm.

Figure 5
figure 5

A linear-nonlinear estimator (LNE) model and its fits to the data. (A) LNE algorithm. LNE derives an estimate by applying a nonlinear function, \({f}_{BL{S}_{1}}\), to the average of the measurements. In the 1-2-Go trials (top), the average, \({\bar{t}}_{1}\), is the same as the first measurement, \({t}_{{m}_{1}}\), and the estimate, \({t}_{{e}_{1}}\), is \({f}_{BL{S}_{1}}({\bar{t}}_{1})\). In 1-2-3-Go trials (bottom), the average, \({\bar{t}}_{2}\), is updated by the second measurement, \({t}_{{m}_{2}}\) (\({\bar{t}}_{2}\,=\,0.5[{\bar{t}}_{1}+{t}_{{m}_{2}}]\)), and the estimate, \({t}_{{e}_{2}}\), is \({f}_{BL{S}_{1}}({\bar{t}}_{2})\). In both conditions, the produced interval, tp, is perturbed by zero-mean Gaussian noise with standard deviation proportional to the final estimate (\({t}_{{e}_{1}}\) for 1-2-Go and \({t}_{{e}_{2}}\) for 1-2-3-Go) with the constant of proportionality wp, as in the BLS model. (B) The mapping from measurements to estimates (grayscale) for the LNE estimator in the 1-2-3-Go trials. Red lines indicate combinations of measurements that lead to identical estimates (shown for \({t}_{{e}_{2}}=\) 700, 750, 800, 850, and 900 ms). (C) Mean and standard deviation of tp as a function of ts for two example subjects (circles and error bars) along with the corresponding fits of the LNE model (lines). (D) BIAS (circles) and \(\sqrt{{\rm{V}}{\rm{A}}{\rm{R}}}\) (squares) of each subject (abscissa) and the corresponding values computed from simulations of the fitted LNE model (ordinate). Conventions match Fig. 4B. (E) The RMSE in the 1-2-Go and 1-2-3-Go conditions relative to the corresponding predictions from the LNE model (conventions as in Fig. 4C). See also Supplementary Figs 2, 3, 7, and 8.

Simulation of LNE verified that it could indeed integrate multiple measurements and exhibit prior-dependent biases (see Supplementary Fig. 7). However, the behavior of LNE was qualitatively different from BLS. The contrast between the two models was evident from a comparison of the relationship between measurements and estimates. Unlike BLS (Fig. 3D), estimates derived from LNE are linear with respect to \({t}_{{m}_{1}}\) and \({t}_{{m}_{2}}\), a feature that can be visualized by the linear iso-estimate contours of the LNE model (Fig. 5B).

We fitted LNE to each subject independently and asked how well it accounted for the observed statistics. The LNE model broadly captured the observed regression to the mean (Fig. 5C,D; see Supplementary Fig. 8 for fits to all subjects), but had a qualitative failure: subjects’ behavior exhibited significantly more BIAS in the 1-2-Go condition (Fig. 5D, inset; two-tailed t-test, t(8) = 4.9304, p-value = 0.001) and significantly less BIAS in the 1-2-3-Go condition (Fig. 5D, inset; two-tailed t-test, t(8) = −2.3782, p-value = 0.045) than the biases predicted by the model. This failure can be readily explained in terms of how LNE functions. Since the static nonlinearity in LNE is the same for one and two measurements, the bias LNE generates is the same for the 1-2-Go and 1-2-3-Go conditions, and improvements in estimation are achieved mainly through a reduction in VAR. Therefore, when we fitted LNE to data from both conditions, the model consistently underestimated BIAS for the 1-2-Go condition and overestimated BIAS for the 1-2-3-Go condition (Fig. 5C, red and blue lines nearly overlap). Further, the LNE model made systematic errors in predicting subjects’ VAR in 1-2-Go trials (Supplementary Fig. 2).

We further evaluated LNE by asking how it accounted for the observed performance improvement in the 1-2-3-Go condition compared to the 1-2-Go condition. We normalized each subject’s RMSE from the 1-2-Go and 1-2-3-Go conditions to the RMSE expected from the behavior of the fitted LNE model in the 1-2-3-Go condition (Fig. 5E). Most subjects surpassed the predictions of the LNE model (horizontal line) for the 1-2-3-Go condition, and the average normalized RMSE was significantly lower than expected (0.990; two-tailed t-test, t(8) = −2.463, p-value = 0.039). Based on these results, we concluded that LNE fails to capture subjects’ behavior both qualitatively and quantitatively.

An extended Kalman filter (EKF) model for approximate Bayesian inference

We considered a moderately more sophisticated algorithm inspired by the extended Kalman filter (EKF)61. This algorithm is shown in Fig. 6A. Upon each new measurement, EKF uses the error between the previous estimate and the current measurement to generate a new estimate. The difference between EKF and the Kalman filter is that the error is subjected to a nonlinear function before being used to update the previous estimate. This nonlinearity is necessary for the algorithm to be able to account for the nonlinear prior-dependent biases observed in behavior.

Figure 6
figure 6

An extended Kalman filter (EKF) model and its fits to the data. (A) EKF algorithm. EKF is a real-time inference algorithm that uses each measurement to update the estimate. After the first flash, EKF uses the mean of the prior as its initial estimate, \({t}_{{e}_{0}}\). The second flash furnishes the first measurement, \({t}_{{m}_{1}}\). EKF computes a new estimate, \({t}_{{e}_{1}}\), using the following procedure: (1) it measures the difference between \({t}_{{m}_{1}}\) and \({t}_{{e}_{0}}\) to compute an error, x1, (2) it applies a nonlinear function, \({f}^{\star }(x)\), to x1, (3) it scales \({f}^{\star }({x}_{1})\) by a gain factor, k1, whose magnitude depends on the relative reliability of \({t}_{{m}_{1}}\) and \({t}_{{e}_{0}}\), and (4) it adds \({k}_{1}\,{f}^{\star }({x}_{1})\) to \({t}_{{e}_{0}}\) to compute \({t}_{{e}_{1}}\). In the 1-2-Go condition (top), \({t}_{{e}_{1}}\) is the final estimate used for the production of tp. In the 1-2-3-Go condition (bottom), the updating procedure is repeated to compute a new estimate \({t}_{{e}_{2}}\) by adding \({t}_{{e}_{1}}\) to \({k}_{2}\,{f}^{\star }({x}_{2})\) where x2 is the difference between the second measurement, \({t}_{{m}_{2}}\), and \({t}_{{e}_{1}}\), and k2 is the scale factor determined by the relative reliability of \({t}_{{e}_{1}}\) and \({t}_{{m}_{2}}\). \({t}_{{e}_{2}}\) is then used as the final estimate for the production of tp. We assumed that the produced interval, tp, is perturbed by zero-mean Gaussian noise with standard deviation proportional to the final estimate (\({t}_{{e}_{1}}\) for 1-2-Go and \({t}_{{e}_{2}}\) for 1-2-3-Go) with the constant of proportionality wp, as in the BLS model. (B) The mapping from measurements to estimates (grayscale) for the EKF estimator in the 1-2-3-Go condition. Red lines indicate combinations of measurements that lead to identical estimates (shown for \({t}_{{e}_{2}}=\) 700, 750, 800, 850, and 900 ms). (C) Mean and standard deviation of tp as a function of ts for two example subjects (circles and error bars) along with the corresponding fits of the EKF model (lines). (D) BIAS (circles) and \(\sqrt{{\rm{VAR}}}\) (squares) of each subject (abscissa) and the corresponding values computed from simulations of the fitted EKF model (ordinate). Conventions match Fig. 4B. (E) The RMSE in the 1-2-Go and 1-2-3-Go conditions relative to the corresponding predictions from the EKF model (conventions as in Fig. 4C). See also Supplementary Figs 2, 3, 9, and 10.

In our experiment, immediately after the first flash, the only information about the sample interval, ts, comes from the prior distribution. Accordingly, we set the initial estimate, \({t}_{{e}_{0}}\), to the mean of the prior distribution. After the first measurement, EKF computes an “innovation” term by applying a static nonlinearity, \({f}^{\star }(x)\) to the error, x1, between \({t}_{{m}_{1}}\) and \({t}_{{e}_{0}}\). This innovation is multiplied by a gain, k1, and added to \({t}_{{e}_{0}}\) to compute the new estimate, \({t}_{{e}_{1}}\). In the 1-2-Go condition in which only one measurement is available, \({t}_{{e}_{1}}\) serves as the final estimate that the model aims to reproduce.

For the 1-2-3-Go condition, EKF repeats the updating procedure after the second measurement, \({t}_{{m}_{2}}\). It computes the difference between \({t}_{{m}_{2}}\) and \({t}_{{e}_{1}}\) to derive a prediction error, x2, which is subjected to the same nonlinear function, \({f}^{\star }(x)\), to yield a second innovation. This innovation is then scaled by an appropriate gain, k2, and added to \({t}_{{e}_{1}}\) to generate an updated estimate, \({t}_{{e}_{2}}\), which the model aims to reproduce.

The two important elements that determine the overall behavior of EKF are the nonlinear function \({f}^{\star }(x)\) and the gain factor(s) applied to the innovation(s) (k1 and k2) to update the estimate(s). We set the form of the nonlinear function \({f}^{\star }(x)\) such that biases in \({t}_{{e}_{1}}\) after one measurement are the same between EKF and BLS models. This ensures that EKF and BLS behave identically in the 1-2-Go condition. Note that our implementation of EKF assumes that the same nonlinear function is applied after every measurement. If one allows this nonlinear function to be optimized separately for each measurement, EKF would be able to replicate the behavior of BLS exactly (Supplementary Fig. 5).

For the gain factors, we reasoned that the most rational choice is to set the weight of each innovation based on the expected reliability of the corresponding estimate, \({t}_{{e}_{n-1}}\), relative to the new measurement, \({t}_{{m}_{n}}\), as in the Kalman filter (see Methods). This causes the gain factor to decrease with the number of measurements, and ensures that the influence of each new measurement is appropriately titrated. With these assumptions, EKF remains suboptimal for the 1-2-3-Go condition. However, it captures certain aspects of the nonlinearities associated with the optimal BLS estimator as shown by Fig. 6B (compare to Fig. 3D).

The algorithm implemented by EKF is appealing as it uses a simple updating strategy that can be straightforwardly extended to multiple sequential measurements and is a nonlinear version of error-correcting mechanisms proposed for related synchronization tasks54,56,58. Furthermore, EKF captures important features of human behavior. First, integration of each new measurement causes a reduction in RMSE, as seen in the 1-2-3-Go compared to the 1-2-Go condition. Second, the nonlinear function applied to innovations allows EKF to incorporate prior information and capture prior-dependent biases. Third, since the nonlinearity is applied to each innovation (as opposed to the final estimate), EKF, unlike LNE, is able to capture the reduction in BIAS in the 1-2-3-Go compared to the 1-2-Go condition.

After simulating the model to ensure that it integrates measurements and exhibits prior-dependent biases (Supplementary Fig. 9), we fitted EKF to each subject independently and asked how well it accounted for the observed statistics. Similar to BLS and LNE, EKF broadly captured the observed regression to the mean in the 1-2-Go trials (Fig. 6C,D, blue). This is not surprising since the EKF algorithm is identical to BLS when the prior is integrated with a single measurement. EKF was also able to capture the mean tp as a function of ts in the 1-2-3-Go trials (Fig. 6C,D, red). Across subjects, EKF provided a better match to the data than BLS and LNE, although it modestly underestimated the BIAS in the 1-2-3-Go condition (Fig. 6D, inset; two-tailed t-test, t(8) = 4.6055, p-value = 0.02639). See Supplementary Fig. 10 for fits of the EKF model to all subjects.

We also asked if EKF could account for the observed RMSEs. To do so, we performed the same analysis we used to evaluate the BLS and LNE models. We normalized each subject’s RMSE from the 1-2-Go and 1-2-3-Go conditions to the RMSE expected from the EKF model for 1-2-3-Go (Fig. 6E). We found no significant difference between observed and predicted RMSEs for the 1-2-3-Go condition (two-tailed t-test, t(8) = 1.5506, p-value = 0.160), and no significant difference between the observed and predicted change in RMSE from the 1-2-Go to the 1-2-3-Go condition (Supplementary Fig. 3). These results indicate that subjects’ suboptimal behavior is consistent with the approximate Bayesian integration implemented by the EKF algorithm.

To further validate the superiority of the EKF model, we directly compared the various models to BLS using log likelihood ratios. Specifically, for each subject, we computed the ratio of the log likelihood of the data given each model and its maximum likelihood parameters (\({\mathrm{log}}_{e} {\mathcal L} ({t}_{s},\,{t}_{p}|{ {\mathcal M} }_{i},\,{{\rm{\Theta }}}^{{\rm{ML}}})\); see Methods) to the log likelihood of the BLS model, \({\mathrm{log}}_{e} {\mathcal L} ({t}_{s},\,{t}_{p}|{ {\mathcal M} }_{BLS},\,{{\rm{\Theta }}}^{{\rm{ML}}})\). We found that EKF provided the best fit for 8 out of 9 subjects (Table 1). For the remaining subject, the fits were poor for all models, but LNE provided the best fit.

Table 1 Predictive log likelihood ratio for each model and subject.

Discussion

The neural systems implementing sensorimotor transformations must rapidly compute state estimates to effectively implement online control of behavior. Behavioral studies indicate that, at a computational level, state estimation may be described in terms of Bayesian integration6,7,8,9,10,11,12,13,14,15. However, describing behavior with a Bayesian model does not necessarily indicate that the brain implements these computations by representing probability distributions21,22,62. Here, we focused on integration of multiple time intervals and found evidence that the brain relies on simpler algorithms that approximate optimal Bayesian inference.

We demonstrated that humans integrate prior knowledge with one or two measurements to improve their performance. A key observation was that the integration was nearly optimal for one measurement but not for two measurements. In particular, when two measurements were provided, subjects systematically exhibited more BIAS toward the mean of the prior than expected from an optimal Bayesian model. This observation motivated us to investigate various algorithms that could lead to similar patterns of behavior.

Analytical and numerical analyses suggested that simple inference algorithms that update certain parameters of the posterior instead of the full distribution cannot integrate multiple measurements optimally when the noise is signal-dependent. We then systematically explored simple inference algorithms that could perform sequential updating and account for the behavioral observations. One of the simplest updating algorithms is the Kalman filter63. However, this algorithm updates estimates linearly and thus cannot account for the nonlinearities in subjects’ behavior, even for a single measurement (Supplementary Fig. 6). The LNE model augmented the Kalman filter such that the final estimate was subjected to a point nonlinearity. This allowed LNE to generate nonlinear biases, but since the nonlinearity was applied to the final estimate, LNE failed to capture the decrease in bias observed in the 1-2-3-Go compared to the 1-2-Go condition. Finally, we adapted the EKF, a more sophisticated variant of the Kalman filter that applies a static nonlinearity to the errors in estimation at every stage of updating. This algorithm accounted for optimal behavior in the 1-2-Go condition and exhibited the same patterns of suboptimality observed in humans in the 1-2-3-Go condition. Therefore, EKF provides a good characterization of the algorithm the brain uses when multiple pieces of information presented sequentially have to be integrated. This finding implies that subjects may only rely on the first few moments of a distribution and use a nonlinear updating strategy to track those moments instead of updating the entire posterior. This strategy is simple and in many scenarios could lead to optimal behavior with little computational cost. Moreover, the recursive nature of EKF’s updating strategy allows it to readily generalize to scenarios in which estimates must be updated in real time, even when the number of available samples is not known a priori, which could be tested by extensions of our experiment. Finally, the error-correcting nature of the EKF updates is consistent with the correlations between response intervals observed in synchronization and continuation tasks53,57,64,65,66, corrections to timing perturbations during synchronization53,58,67, and the influence of recent temporal inputs on the perception of interval duration68,69,70. Therefore, EKF may provide an algorithmic understanding across a range of timing tasks including interval estimation, synchronization, discrimination, and reproduction.

Maintaining and updating probability distributions is computationally expensive. Moreover, it is not currently known how neural networks might implement such operations22. In contrast, EKF is relatively simple to implement. The only requirement is to use the current estimate to predict the next sample, and to use a nonlinear function of the error in prediction to update estimates sequentially. The predictive mechanisms that EKF relies on are thought to be an integral part of how brain circuits support perception and sensorimotor function1,5,8,71,72,73,74. As such, the relative success of EKF may be in part due to its compatibility with predictive mechanisms that the brain uses to perform sequential updating. This observation makes the following intriguing prediction: when performing the 1-2-3-Go task, subjects do not make two measurements; instead, they use the prior to predict the time of the second flash, use the prediction error to update their estimate, and use the new estimate to predict the third flash. Two lines of evidence from recent physiological experiments support this prediction. First, individual neurons in the primate medial frontal cortex encode interval duration similarly for single and multiple interval tasks75, providing a basis for maintaining the current interval estimate during a given task. Second, neural signals in several regions of the brain encode intervals prospectively76,77,78,79,80, providing a basis for predicting the timing of upcoming sensory inputs. Future modeling efforts and electrophysiological experiments will be required to link neural signals to the implementation of the EKF algorithm.

While EKF provides a better account of the observed data in our experiments, it may be that our specific formulation of the Bayesian model did not capture the underlying process. Our BLS model was based on three assumptions: (1) that the likelihood function is characterized by signal-dependent noise, (2) that the subjective prior matches the experimentally imposed uniform prior distribution, and (3) that the final estimates are derived from the mean of the posterior, which implicitly assumes that subjects rely on a quadratic cost function, as was previously demonstrated24. Our formulation of the likelihood function is particularly important, as it is the key factor that prevents simple algorithms such as EKF from optimally integrating multiple measurements. The inherent signal-dependent noise in timing causes the likelihood function to be skewed toward longer intervals (see Appendix). This characteristic feature was particularly important for explaining human behavior in a task requiring interval estimation following several measurements40. Moreover, it has been shown that subjects exhibit larger biases for longer intervals within the domain of the prior, indicating that the brain has an internal model of this signal-dependent noise24,49,81,82. These results support our formulation of the likelihood function. However, one aspect of our formulation that deserves further scrutiny is the assumption that the noise perturbing the two measurements was independent. This may need revision given the long-range positive autocorrelations in behavioral variability83,84,85, and because S2 is shared between the two measurements, which may lead to correlations between \({t}_{{m}_{1}}\) and \({t}_{{m}_{2}}\).

Our formulation of the prior and cost function should also be further evaluated. For example, humans may not be able to correctly encode a uniform prior probability distribution for interval estimation47,49. Similarly, the cost function may not be quadratic86. However, since the prior and cost function impact both the 1-2-Go and 1-2-3-Go conditions, moderate inaccuracies in modeling these components may not be able to explain optimal behavior in the 1-2-Go and suboptimal behavior in the 1-2-3-Go condition simultaneously. Finally, recent results suggest that performance may be limited by imperfect integration87,88 and imperfect memory38,89, which future models of sequential updating should incorporate.

Methods

Subjects and apparatus

All experiments were performed in accordance with relevant regulations and guidelines for the ethical treatment of subjects, as approved by the Committee on the Use of Humans as Experimental Subjects at MIT, after receiving informed consent. Eleven human subjects (6 male and 5 female) between 18 and 33 years of age participated in the interval reproduction experiment. Of the 11 subjects, 10 were naive to the purpose of the study.

Subjects sat in a dark, quiet room at a distance of approximately 50 cm from a display monitor. The display monitor had a refresh rate of 60 Hz and a resolution of 1920 by 1200, and was controlled by custom software (MWorks; http://mworks-project.org/) on an Apple Macintosh platform.

Interval reproduction task

The experiment consisted of several 1-hour sessions in which subjects performed an interval reproduction task (Fig. 1). The task consisted of two randomly interleaved trial types referred to as “1-2-Go” and “1-2-3-Go”. On 1-2-Go trials, two flashes (S1 followed by S2) demarcated a sample interval (ts) that subjects had to measure24. On 1-2-3-Go trials, ts was presented twice, once demarcated by S1 and S2 flashes, and once by S2 and S3 flashes. For both trial types, subjects had to reproduce ts immediately after the last flash (S2 for 1-2-Go and S3 for 1-2-3-Go) by pressing a button on a standard Apple keyboard. On all trials, subjects had to initiate their response proactively and without any additional cue (no explicit Go cue was presented). Subjects received graded feedback on their accuracy.

Each trial began with the presentation of a 0.5 deg circular fixation point at the center of the display monitor. The color of the fixation point was blue or red for the 1-2-Go and 1-2-3-Go trials, respectively. Subjects were asked to shift their gaze to the fixation point and maintain fixation throughout the trial. Eye movements were not monitored. After a random delay with a uniform hazard (100 ms minimum plus an interval drawn from an exponential distribution with a mean of 300 ms), a warning stimulus and a trial cue were presented. The warning stimulus was a white circle that subtended 1.5 deg and was presented 10 deg to the left of the fixation point. The trial cue consisted of 2 or 3 small rectangles 0.6 deg above the fixation point (subtending 0.2 × 0.4 deg, 0.5 deg apart) for the 1-2-Go and 1-2-3-Go trials, respectively. After a random delay with a uniform hazard (250 ms minimum plus an interval drawn from an exponential distribution with a mean of 500 ms), the flashes demarcating ts were presented. Each flash (S1 and S2 for 1-2-Go and S1, S2, and S3 for 1-2-3-Go) lasted for 6 frames (100 ms) and was presented as an annulus around the fixation point with an inside and outside diameter of 2.5 and 3 deg, respectively (Fig. 1A,B). The time between consecutive flashes, which determined ts, was sampled from a discrete uniform distribution ranging between 600 and 1000 ms with 5 samples (Fig. 1C). To help subjects track the progression of events throughout the trial, after each flash, one rectangle from the trial cue disappeared (starting from the rightmost).

The produced interval (tp) was measured as the interval between the time of the last flash and the time when the subject pressed a designated key on the keyboard (Fig. 1A,B). Subjects received trial-by-trial visual feedback based on the magnitude and sign of the relative error, (tp − ts)/ts. A 0.5 deg circle (“analog feedback”) was presented to the right (for error < 0) or left (for error > 0) of the warning stimulus at a distance that scaled with the magnitude of the error. Additionally, when the error was smaller than a threshold, both the warning stimulus and the analog feedback turned green and a tone denoting “correct” was presented. If the production error was larger than the threshold, the warning stimulus and analog feedback remained white and a tone denoting “incorrect” was presented. The threshold was constant as a function of the relative error and therefore scaled with the sample interval (Fig. 1D). This accommodated the scalar variability of timing that leads to more variable production intervals for longer sample intervals. The scaling factor was initialized at 0.15 at the start of every session and adjusted adaptively using a one-up, one-down scheme that added 0.001 to the scaling factor after incorrect responses and subtracted 0.001 after correct responses. These manipulations ensured that performance across conditions, subjects, and trials remained approximately at a steady state of 50% correct trials.
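As an illustration, a minimal sketch of this adaptive feedback rule in Python (the numerical values follow the text; the function and variable names are only illustrative):

```python
def update_threshold(scale, correct, step=0.001):
    """One-up, one-down update of the feedback threshold scaling factor:
    subtract `step` after a correct response, add `step` after an incorrect one."""
    return scale - step if correct else scale + step

# Example for a single trial
scale = 0.15                          # scaling factor at the start of a session
ts, tp = 800.0, 880.0                 # sample and produced intervals (ms)
correct = abs(tp - ts) / ts < scale   # threshold is constant in relative error
scale = update_threshold(scale, correct)
```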

To ensure that subjects understood the task design, the first session included a number of training blocks. Training blocks were conducted under the supervision of an experimenter. Training trials were arranged in blocks of 25 trials. In the first block, subjects performed the 1-2-Go condition with the sample interval fixed at 600 ms. In the second block, the interval was fixed at 1000 ms. In the third block, subjects performed the 1-2-3-Go task with the interval fixed at 1000 ms. In the fourth block, subjects continued to perform the 1-2-3-Go task, but with the intervals chosen at random from the experimental distribution. In the final training block, the task condition and sample intervals were fully randomized, as in the main experiment. Subjects then performed 400 trials of the main experiment. Subjects completed 10 sessions in total, performing 800 trials in each of the remaining 9 experimental sessions. To ensure that subjects were adapted to the statistics of the prior24, we discarded the first 99 trials of each session. We also discarded any trial in which the subject responded before S2 (for 1-2-Go) or S3 (for 1-2-3-Go), or more than 1000 ms after the veridical ts. Supplementary Table 2 summarizes the number of completed trials for each subject. Data from two subjects were not included in the analyses because they were not sensitive to the range of sample intervals we tested and their production interval distributions were not significantly different for the longest and shortest sample intervals.

Models

We considered several models of interval estimation: (1) an optimal Bayes Least-Squares model (BLS), (2) an optimal Bayes Least-Squares model that allowed different noise levels for the two measurements in 1-2-3-Go trials (BLSmem), (3) an extended Kalman filter model (EKF), and (4) a linear-nonlinear estimator model (LNE). All models were designed to be identical for the 1-2-Go task, where only one measurement was available, but differed in their predictions for the 1-2-3-Go trials.

BLS model

We used the Bayesian integration model that was previously shown to capture behavior in the 1-2-Go task24. This model assumes that subjects combine the measurements and the prior distribution probabilistically according to Bayes’ rule:

$$p({t}_{s}|{t}_{m})=\frac{\lambda ({t}_{m}|{t}_{s})p({t}_{s})}{p({t}_{m})},$$
(1)
$$\lambda ({t}_{m}|{t}_{s})=\frac{1}{\sqrt{2\pi {w}_{m}^{2}{t}_{s}^{2}}}{e}^{-\frac{{({t}_{m}-{t}_{s})}^{2}}{2{w}_{m}^{2}{t}_{s}^{2}}},$$
(2)

where p(ts) represents the prior distribution of the sample intervals and p(tm) the probability distribution of the measurements. The likelihood function, λ(tm|ts), was formulated based on the assumption that measurement noise was Gaussian and had zero mean. To incorporate scalar variability into our model, we further assumed that the standard deviation of the noise scales with ts, with the constant of proportionality wm representing the Weber fraction for measurement.

Following previous work24, we further assumed that subjects’ behavior can be described by a BLS estimator that minimizes the expected squared error, and uses the expected value of the posterior distribution as the optimal estimate:

$${t}_{{e}_{1}}={f}_{BL{S}_{1}}({t}_{{m}_{1}})=E[{t}_{s}|{t}_{{m}_{1}}],$$
(3)

where \({f}_{BL{S}_{1}}\) denotes the BLS function that maps the measurement (\({t}_{{m}_{1}}\)) to the Bayesian estimate after one measurement (\({t}_{{e}_{1}}\)). The subscript 1 is added to clarify that this equation corresponds to the condition with a single measurement (i.e., 1-2-Go). The notation E[•] denotes expected value. Given a uniform prior distribution with a range from \({t}_{s}^{min}\) to \({t}_{s}^{max}\), the BLS estimator can be written as:

$${f}_{BL{S}_{1}}({t}_{{m}_{1}})=\frac{{\int }_{{t}_{s}^{min}}^{{t}_{s}^{max}}\,{t}_{s}\lambda ({t}_{{m}_{1}}|{t}_{s})d{t}_{s}}{{\int }_{{t}_{s}^{min}}^{{t}_{s}^{max}}\,\lambda ({t}_{{m}_{1}}|{t}_{s})d{t}_{s}},$$
(4)

We assumed that \({t}_{s}^{min}\) and \({t}_{s}^{max}\) match the minimum and maximum of the experimentally imposed sample interval distribution. We extended this model to two measurements for the 1-2-3-Go task. To do so, we incorporated two likelihood functions in the derivation of the posterior. Assuming that the two measurements are conditionally independent, the posterior can be written as:

$$p(\,{t}_{s}|{t}_{{m}_{1}},\,{t}_{{m}_{2}})=\frac{\lambda ({t}_{{m}_{1}}|{t}_{s})\lambda ({t}_{{m}_{2}}|{t}_{s})p({t}_{s})}{\int \,\lambda ({t}_{{m}_{1}}|{t}_{s})\lambda ({t}_{{m}_{2}}|{t}_{s})p({t}_{s})d{t}_{s}},$$
(5)

where \({t}_{{m}_{1}}\) and \({t}_{{m}_{2}}\) denote the first and second measurements, respectively, and the likelihood function, λ, is from Eq (2). Because measurements are taken in a sequence, it is intuitive to rewrite Eq (5) in a recursive form:

$$p({t}_{s}|{t}_{{m}_{1}},{t}_{{m}_{2}})=\frac{1}{Z}\lambda ({t}_{{m}_{2}}|{t}_{s})p({t}_{s}|{t}_{{m}_{1}}),$$
(6)

where \(p({t}_{s}|{t}_{{m}_{1}})\) is the posterior as specified in Eq (1) and Z is a normalization factor that ensures the density integrates to one. Note that Z includes terms related to \(p({t}_{{m}_{1}})\), allowing it to appropriately normalize the density after propagating the posterior associated with the first measurement forward. Therefore, although the posteriors in Eqs (5) and (6) are identical, specifying the posterior in this way allows it to be updated sequentially as measurements arrive.

The corresponding BLS estimator can again be written as the expected value of the posterior:

$${t}_{{e}_{2}}={f}_{BL{S}_{2}}({t}_{{m}_{1}},\,{t}_{{m}_{2}})=E[{t}_{s}|{t}_{{m}_{1}},{t}_{{m}_{2}}],$$
(7)
$${f}_{BL{S}_{2}}({t}_{{m}_{1}},\,{t}_{{m}_{2}})=\frac{{\int }_{{t}_{s}^{min}}^{{t}_{s}^{max}}\,{t}_{s}\lambda ({t}_{{m}_{1}}|{t}_{s})\lambda ({t}_{{m}_{2}}|{t}_{s})d{t}_{s}}{{\int }_{{t}_{s}^{min}}^{{t}_{s}^{max}}\,\lambda ({t}_{{m}_{1}}|{t}_{s})\lambda ({t}_{{m}_{2}}|{t}_{s})d{t}_{s}},$$
(8)

where \({f}_{BL{S}_{2}}\) denotes the BLS function that uses two measurements (\({t}_{{m}_{1}}\) and \({t}_{{m}_{2}}\)) to compute \({t}_{{e}_{2}}\). The subscript 2 indicates the mapping function is for two measurements (i.e., 1-2-3-Go). We performed the integrations for the BLS model numerically using Simpson’s quadrature.
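For concreteness, a minimal sketch of this numerical procedure (Eqs (2), (4), and (8)) in Python, assuming SciPy's Simpson integration; the grid resolution and parameter values below are illustrative, not those used for the fits:

```python
import numpy as np
from scipy.integrate import simpson

T_MIN, T_MAX = 600.0, 1000.0              # support of the uniform prior (ms)
ts_grid = np.linspace(T_MIN, T_MAX, 501)  # integration grid over ts

def likelihood(tm, ts, wm):
    """Scalar-noise likelihood of Eq (2): Gaussian in tm with sd = wm * ts."""
    return np.exp(-(tm - ts) ** 2 / (2 * wm ** 2 * ts ** 2)) / (np.sqrt(2 * np.pi) * wm * ts)

def f_bls1(tm1, wm):
    """Single-measurement BLS mapping, Eq (4): posterior mean under the uniform prior."""
    lam = likelihood(tm1, ts_grid, wm)
    return simpson(ts_grid * lam, x=ts_grid) / simpson(lam, x=ts_grid)

def f_bls2(tm1, tm2, wm):
    """Two-measurement BLS mapping, Eq (8), for conditionally independent measurements."""
    lam = likelihood(tm1, ts_grid, wm) * likelihood(tm2, ts_grid, wm)
    return simpson(ts_grid * lam, x=ts_grid) / simpson(lam, x=ts_grid)

# Example: with wm = 0.1, estimates for a single measurement and for a pair of measurements
print(f_bls1(650.0, 0.1), f_bls2(650.0, 950.0, 0.1))
```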

BLSmem model

We also considered the possibility that the brain may not be able to hold representations of the first measurement or the associated posterior perfectly over time until the time of integration. To model this, we assumed two Weber fractions: wm, as formulated in the BLS model, and wmem, which adjusts the Weber fraction of the first measurement in 1-2-3-Go trials to account for noisy memory or inference processes. In 1-2-Go trials, the posterior was set according to Eq (1) with wm controlling the signal-dependent noise. In 1-2-3-Go trials, the posterior was set according to:

$$p({t}_{s}|{t}_{{m}_{1}},\,{t}_{{m}_{2}})=\frac{{\lambda }_{mem}({t}_{{m}_{1}}|{t}_{s})\lambda ({t}_{{m}_{2}}|{t}_{s})p({t}_{s})}{\int \,{\lambda }_{mem}({t}_{{m}_{1}}|{t}_{s})\lambda ({t}_{{m}_{2}}|{t}_{s})p({t}_{s})d{t}_{s}},$$
(9)

with the likelihood function associated with the first measurement defined as:

$${\lambda }_{mem}({t}_{m}|{t}_{s})=\frac{1}{\sqrt{2\pi {w}_{mem}^{2}{t}_{s}^{2}}}{e}^{-\frac{{({t}_{m}-{t}_{s})}^{2}}{2{w}_{mem}^{2}{t}_{s}^{2}}},$$
(10)

This formulation allows the measurement noise to be different for the two measurements. The optimal estimator was then calculated as:

$${f}_{BL{S}_{2}}({t}_{{m}_{1}},{t}_{{m}_{2}})=\frac{{\int }_{{t}_{s}^{min}}^{{t}_{s}^{max}}{t}_{s}{\lambda }_{mem}({t}_{{m}_{1}}|{t}_{s})\lambda ({t}_{{m}_{2}}|{t}_{s})d{t}_{s}}{{\int }_{{t}_{s}^{min}}^{{t}_{s}^{max}}{\lambda }_{mem}({t}_{{m}_{1}}|{t}_{s})\lambda ({t}_{{m}_{2}}|{t}_{s})d{t}_{s}},$$
(11)

EKF model

EKF implements an updating algorithm in which, after each flash, the observer updates the estimate, \({t}_{{e}_{n}}\), based on the previous estimate, \({t}_{{e}_{n-1}}\), and the current measurement, \({t}_{{m}_{n}}\). The updating rule changes \({t}_{{e}_{n-1}}\) by a nonlinear function of the error between \({t}_{{e}_{n-1}}\) and \({t}_{{m}_{n}}\), which we denote by xn.

$${t}_{{e}_{n}}={t}_{{e}_{n-1}}+{k}_{n}\,{f}^{\star }({x}_{n}),$$
(12)
$${x}_{n}={t}_{{m}_{n}}-{t}_{{e}_{n-1}},$$
(13)

\({f}^{\star }\) is a nonlinear function based on the BLS estimator, \({f}_{BL{S}_{1}}\):

$${f}^{\star }({x}_{n})={f}_{BL{S}_{1}}({x}_{n}+{t}_{{e}_{0}})-{t}_{{e}_{0}},$$
(14)

kn is a gain factor that controls the magnitude of the update and is set by the relative reliability of \({t}_{{e}_{n-1}}\) and \({t}_{{m}_{n}}\), which were formulated in terms of two Weber fractions, wn−1 and wm, respectively:

$${k}_{n}=\frac{{w}_{n-1}^{2}}{{w}_{n-1}^{2}+{w}_{m}^{2}},$$
(15)

To track the reliability of \({t}_{{e}_{n}}\) we used a formulation based on optimal cue combination under Gaussian noise. For Gaussian likelihoods, the reliability of the estimate is related to the inverse of the variance of the posterior. Similarly, the reliability of the interval estimate is related to the inverse of the Weber fraction. Therefore, we used the following algorithm to track the Weber fraction of the estimate, wn:

$${w}_{n}=\frac{{w}_{n-1}{w}_{m}}{\sqrt{{w}_{n-1}^{2}+{w}_{m}^{2}}},$$
(16)

As in the case of Gaussians, this algorithm ensures that wn decreases with each additional measurement, reflecting the increased reliability of the estimate relative to the measurement. This ensures that the weight of the innovation respects information already integrated into the estimate by previous iterations of the EKF algorithm.

At S1, no measurements are available. Therefore, we set the initial estimate, \({t}_{{e}_{0}}\), to the mean of the prior, and its associated Weber fraction, w0, to ∞ (i.e., zero reliability). After S2 (one measurement), the EKF estimate is identical to the BLS estimate. For two measurements, the process is repeated to compute \({t}_{{e}_{2}}\), but the resulting estimate is suboptimal. This formulation can be readily extended to more than two measurements.
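A minimal sketch of these updates (Eqs (12)–(16)) in Python; the single-measurement BLS mapping \({f}_{BL{S}_{1}}\) (e.g., the `f_bls1` sketched above) and the prior mean are passed in, and all names are illustrative:

```python
import numpy as np

def ekf_estimate(measurements, wm, f_bls1, prior_mean=800.0):
    """Sequentially update an interval estimate following Eqs (12)-(16).

    measurements : sequence of measured intervals (t_m1, t_m2, ...)
    wm           : Weber fraction of the measurement noise
    f_bls1       : single-measurement BLS mapping (one-argument callable), used in Eq (14)
    prior_mean   : initial estimate t_e0 (mean of the prior)
    """
    t_e0 = prior_mean

    def f_star(x):                                        # Eq (14)
        return f_bls1(x + t_e0) - t_e0

    t_e, w_e = t_e0, np.inf                               # initial estimate and its Weber fraction
    for t_m in measurements:
        x = t_m - t_e                                     # error, Eq (13)
        if np.isinf(w_e):                                 # first measurement: w0 = inf gives k1 = 1
            k, w_e = 1.0, wm
        else:
            k = w_e ** 2 / (w_e ** 2 + wm ** 2)           # gain, Eq (15)
            w_e = w_e * wm / np.sqrt(w_e ** 2 + wm ** 2)  # reliability update, Eq (16)
        t_e = t_e + k * f_star(x)                         # update, Eq (12)
    return t_e

# Example usage with the f_bls1 sketched earlier (wm bound via a lambda):
# ekf_estimate([650.0, 950.0], 0.1, lambda t: f_bls1(t, 0.1))
```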

LNE model

LNE uses a linear updating strategy, similar to a Kalman filter, to update its estimate with each measurement as follows:

$${\bar{t}}_{n}=(1-{k}_{n}){\bar{t}}_{n-1}+{k}_{n}{t}_{{m}_{n}},$$
(17)

The algorithm is initialized such that \({\bar{t}}_{1}={t}_{{m}_{1}}\), and we chose the weighting to be kn = 1/n. This choice minimizes the squared errors in 1-2-3-Go trials. Note that any other choice for kn would degrade LNE’s performance. Following this sequential and linear updating scheme, LNE passes the final estimate through a nonlinear transfer function specified by the BLS model for one measurement (\({f}_{BL{S}_{1}}\)):

$${f}_{LN{E}_{n}}({t}_{{m}_{1}},\,\mathrm{...,}\,{t}_{{m}_{n}})={f}_{BL{S}_{1}}({\bar{t}}_{n}),$$
(18)

where \({f}_{LN{E}_{n}}\) denotes the linear-nonlinear estimator after n measurements. This formulation ensures that LNE is identical to the BLS in 1-2-Go trials.
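A minimal Python sketch of the LNE estimator (Equations 17 and 18) is shown below. It reuses the same illustrative single-measurement BLS mapping and placeholder parameter values as the EKF sketch above.

```python
import numpy as np

def f_bls1(t_m, w_m=0.1, t_min=0.6, t_max=1.0, n_grid=1000):
    """Single-measurement BLS mapping (same illustrative assumptions as above)."""
    t_s = np.linspace(t_min, t_max, n_grid)
    sd = w_m * t_s
    like = np.exp(-0.5 * ((t_m - t_s) / sd) ** 2) / sd
    return np.trapz(t_s * like, t_s) / np.trapz(like, t_s)

def lne_estimate(measurements, w_m=0.1):
    """Running average of the measurements followed by the BLS_1 nonlinearity."""
    t_bar = measurements[0]                    # initialization: t_bar_1 = t_m1
    for n, t_m in enumerate(measurements[1:], start=2):
        k = 1.0 / n                            # k_n = 1/n
        t_bar = (1.0 - k) * t_bar + k * t_m    # linear update (Eq. 17)
    return f_bls1(t_bar, w_m)                  # nonlinear transfer (Eq. 18)

# With one measurement, LNE reduces to the BLS model; with two, it applies the
# same nonlinearity to the average of the two measurements.
print(lne_estimate([0.78]), lne_estimate([0.78, 0.84]))
```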

Interval production model

In all models, the final estimate is used for the production phase. Following previous work24, we assumed that the production of an interval is perturbed by Gaussian noise whose standard deviation scales with the estimated interval. The model was additionally augmented by an offset term to account for stimulus-independent biases observed in responses:

$$p({t}_{p}|{t}_{e})=\frac{1}{\sqrt{2\pi {w}_{p}^{2}{t}_{e}^{2}}}{e}^{-\frac{{({t}_{p}-{t}_{e}-b)}^{2}}{2{w}_{p}^{2}{t}_{e}^{2}}},$$
(19)

where wp is the Weber fraction for production, b is the offset term, and te refers to the final estimate in either 1-2-Go or 1-2-3-Go trials.

All models accommodated “lapse trials” in which the produced interval was outside the mass of the production interval distribution. The lapse trials were modeled as trials in which the production interval was sampled from a fixed uniform distribution, p(tp|lapse), independent of ts. With this modification, the production interval distribution can be written as:

$$p({t}_{p}|{t}_{e},\,\gamma )=(1-\gamma )p({t}_{p}|{t}_{e})+\gamma p({t}_{p}|{\rm{lapse}}),$$
(20)

where γ represents the lapse rate. With this formulation, we could identify lapse trials as those for which the likelihood of a lapse exceeded the likelihood of a non-lapse. To limit falsely identified lapse trials, we conservatively set the width of the uniform distribution to the full range of possible production intervals (0 to 2000 ms).
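The production-stage mixture (Equations 19 and 20) and the lapse criterion described above can be sketched as follows. The parameter values (wp, b, γ) are placeholders rather than fitted values, and the sketch compares the raw likelihoods of the two components, as stated above.

```python
import numpy as np

def p_tp_given_te(t_p, t_e, w_p=0.08, b=0.0):
    """Eq. 19: Gaussian production noise whose SD scales with t_e, plus offset b."""
    sd = w_p * t_e
    return np.exp(-0.5 * ((t_p - t_e - b) / sd) ** 2) / (np.sqrt(2.0 * np.pi) * sd)

def p_tp_with_lapse(t_p, t_e, gamma=0.02, w_p=0.08, b=0.0, t_range=(0.0, 2.0)):
    """Eq. 20: mixture of the production density and a uniform lapse density."""
    p_lapse = 1.0 / (t_range[1] - t_range[0])          # uniform over 0-2000 ms
    return (1.0 - gamma) * p_tp_given_te(t_p, t_e, w_p, b) + gamma * p_lapse

def is_lapse(t_p, t_e, w_p=0.08, b=0.0, t_range=(0.0, 2.0)):
    """Flag a trial as a lapse when the lapse component is more likely than
    the non-lapse component."""
    p_lapse = 1.0 / (t_range[1] - t_range[0])
    return p_lapse > p_tp_given_te(t_p, t_e, w_p, b)

# A production interval near the estimate is attributed to the non-lapse
# component; one far from it is flagged as a lapse.
print(p_tp_with_lapse(0.82, 0.80), is_lapse(0.82, 0.80), is_lapse(1.60, 0.80))
```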

Using simulations, we verified that our model was able to detect lapses for the range of wm, wp, and γ values inferred from the behavior of our subject pool. Most subjects had a small lapse probability, consistent with previous reports24. Two subjects performed relatively unreliably, with a larger number of lapse trials; however, our conclusions do not depend on the inclusion of these two subjects.

Analysis and model fitting

All analyses were performed using MATLAB R2014b or R2017a (The MathWorks, Inc., Natick, MA, USA). We used a predictive maximum likelihood procedure to fit each model to the data. The likelihood of tp given ts and a set of parameters Θ (specific to each model) was defined as:

$$p({t}_{p}|{t}_{s},\,{\rm{\Theta }})=\int \,p({t}_{p},\,{t}_{m}|{t}_{s},\,{\rm{\Theta }})d{t}_{m},$$
(21)

for 1-2-Go trials, and

$$p({t}_{p}|{t}_{s},\,{\rm{\Theta }})=\int \int \,p({t}_{p},\,{t}_{{m}_{1}},\,{t}_{{m}_{2}}|{t}_{s},\,{\rm{\Theta }})d{t}_{{m}_{1}}d{t}_{{m}_{2}},$$
(22)

for 1-2-3-Go trials. The integrand of Equation 22 is:

$$p({t}_{p},\,{t}_{{m}_{1}},\,{t}_{{m}_{2}}|{t}_{s},\,{\rm{\Theta }})=\lambda ({t}_{{m}_{1}}|{t}_{s},\,{\rm{\Theta }})\lambda ({t}_{{m}_{2}}|{t}_{s},\,{\rm{\Theta }})p({t}_{p}|{f}_{X}[{t}_{{m}_{1}},\,{t}_{{m}_{2}},\,{\rm{\Theta }}],\,{\rm{\Theta }}),$$
(23)

where the output of \({f}_{X}[{t}_{{m}_{1}},\,{t}_{{m}_{2}},\,{\rm{\Theta }}]\) corresponds to the mapping function of the relevant model (fBLS, fLNE, fEKF, or \({f}_{BL{S}_{{\rm{mem}}}}\)), and \(\lambda ({t}_{{m}_{i}}|{t}_{s},\,{\rm{\Theta }})\) is the likelihood function defined above, with its dependence on the parameters Θ made explicit.
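The predictive likelihood for a 1-2-3-Go trial (Equations 22 and 23) can be approximated by numerical marginalization over the two unobserved measurements, as in the sketch below. The grid limits, Weber fractions, and the stand-in mapping in the example are illustrative assumptions; the offset and lapse terms are omitted for brevity.

```python
import numpy as np

def scalar_gaussian(x, mu, w):
    """Gaussian density with mean mu and standard deviation w * mu."""
    sd = w * mu
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (np.sqrt(2.0 * np.pi) * sd)

def p_tp_given_ts(t_p, t_s, f_x, w_m=0.1, w_p=0.08, n_grid=100, n_sd=4.0):
    """Eq. 22: marginalize the joint density of Eq. 23 over t_m1 and t_m2."""
    # Grid covering +/- n_sd standard deviations of measurement noise around t_s.
    t_m = np.linspace(t_s * (1.0 - n_sd * w_m), t_s * (1.0 + n_sd * w_m), n_grid)
    tm1, tm2 = np.meshgrid(t_m, t_m, indexing="ij")
    t_e = np.vectorize(f_x)(tm1, tm2)       # deterministic estimate f_X(t_m1, t_m2)
    integrand = (scalar_gaussian(tm1, t_s, w_m) *
                 scalar_gaussian(tm2, t_s, w_m) *
                 scalar_gaussian(t_p, t_e, w_p))   # Eq. 23
    return np.trapz(np.trapz(integrand, t_m, axis=1), t_m, axis=0)

# Example with a stand-in mapping (the average of the two measurements); in the
# actual fits, f_x would be the BLS, LNE, EKF, or BLS_mem mapping.
print(p_tp_given_ts(0.82, 0.80, lambda a, b: 0.5 * (a + b)))
```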

Assuming that production intervals are conditionally independent across trials, the log likelihood of model parameters can be formulated as:

$${\mathrm{log}}_{e}p({t}_{p}^{1},\,{t}_{p}^{2},\,\mathrm{...,}\,{t}_{p}^{N}|{t}_{s},\,{\rm{\Theta }})=\sum _{i=1}^{N}\,{\mathrm{log}}_{e}p({t}_{p}^{i}|{t}_{s},{\rm{\Theta }}),$$
(24)

where the superscripts denote trial number. Maximum likelihood fits were derived from N − 100 trials and cross-validated on the remaining 100 trials (leave-N-out cross-validation, LNOCV). This process was repeated iteratively until all of the data had been fit. The final model parameters were taken as the average of the parameter values across all fits. Fits were robust to changes in the amount of left-out data. See Supplementary Figs 1, 8, and 10 for a summary of the maximum likelihood parameters and predictions of each model fit to our subjects. We also computed the maximum likelihood parameters using the full data set for each subject. Parameters found using either the full data set or LNOCV were nearly identical (see Supplementary Table 3).
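The fold structure of this fitting procedure is sketched below. The negative log likelihood passed to the routine is a placeholder for the model-specific predictive likelihood (Equations 21–24), and the Nelder–Mead optimizer, seed, and toy model are assumptions made only for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def fit_with_lnocv(neg_log_likelihood, t_s, t_p, theta0, n_test=100, seed=0):
    """Fit on all-but-100 trials, evaluate on the held-out 100, repeat over folds,
    and average the per-fold maximum likelihood parameters."""
    order = np.random.default_rng(seed).permutation(len(t_p))
    thetas, test_loglik = [], []
    for start in range(0, len(order), n_test):
        test = order[start:start + n_test]
        train = np.setdiff1d(order, test)
        fit = minimize(lambda th: neg_log_likelihood(th, t_s[train], t_p[train]),
                       theta0, method="Nelder-Mead")
        thetas.append(fit.x)
        test_loglik.append(-neg_log_likelihood(fit.x, t_s[test], t_p[test]))
    return np.mean(thetas, axis=0), np.array(test_loglik)

# Toy usage with a one-parameter scalar-noise model, only to exercise the folds.
def toy_nll(theta, t_s, t_p):
    w_p = abs(theta[0]) + 1e-6
    return np.sum(0.5 * ((t_p - t_s) / (w_p * t_s)) ** 2 + np.log(w_p * t_s))

rng = np.random.default_rng(1)
t_s = rng.choice([0.6, 0.7, 0.8, 0.9, 1.0], size=500)
t_p = t_s * (1.0 + 0.08 * rng.standard_normal(500))
print(fit_with_lnocv(toy_nll, t_s, t_p, theta0=np.array([0.2])))
```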

We evaluated model fits by generating simulated data from each fitted model and comparing summary statistics (BIAS, \(\sqrt{{\rm{VAR}}}\), and RMSE) observed for each subject to those generated by the model simulations. For the observed data, summary statistics were computed from non-lapse trials after removing the offset (b). Model simulations were performed without the lapse term and with the offset set to zero. The summary statistics were computed as follows:

$${{\rm{BIAS}}}^{2}=\frac{1}{N}\sum _{i=1}^{N}\,{({\bar{t}}_{{p}_{i}}-{t}_{{s}_{i}})}^{2},$$
(25)
$${\rm{VAR}}=\frac{1}{N}\sum _{i=1}^{N}{\sigma }_{i}^{2},$$
(26)
$${\rm{RMSE}}=\sqrt{{{\rm{BIAS}}}^{2}+{\rm{VAR}}},$$
(27)

\({{\rm{BIAS}}}^{2}\) and VAR represent the average squared bias and the average variance over the N distinct ts’s of the prior distribution. The terms \({\bar{t}}_{{p}_{i}}\) and \({\sigma }_{i}^{2}\) represent the mean and variance of the production intervals for the i-th sample interval (\({t}_{{s}_{i}}\)). The overall RMSE was computed as the square root of the sum of \({{\rm{BIAS}}}^{2}\) and VAR. To obtain the \({{\rm{BIAS}}}^{2}\) and VAR of each model, we averaged these quantities over 1000 simulations of the model, with the number of trials matched to each subject. This ensured an accurate estimate of these quantities that incorporates the systematic deviations from true model behavior expected with a finite number of trials.
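A minimal sketch of these summary statistics (Equations 25–27) is given below; the toy sample intervals and the regression-to-the-mean pattern in the example data are illustrative only.

```python
import numpy as np

def summary_stats(t_s, t_p):
    """Return BIAS^2, VAR, and RMSE (Eqs. 25-27), averaged over distinct t_s values."""
    bias2, var = [], []
    for ts in np.unique(t_s):
        tp_i = t_p[t_s == ts]
        bias2.append((tp_i.mean() - ts) ** 2)   # squared bias for this sample interval
        var.append(tp_i.var())                  # variance for this sample interval
    bias2, var = float(np.mean(bias2)), float(np.mean(var))
    return bias2, var, np.sqrt(bias2 + var)     # RMSE (Eq. 27)

# Toy data in which responses regress toward the mean of the sample intervals.
rng = np.random.default_rng(0)
t_s = rng.choice([0.6, 0.7, 0.8, 0.9, 1.0], size=1000)
t_p = 0.8 * t_s + 0.16 + 0.05 * rng.standard_normal(1000)
print(summary_stats(t_s, t_p))
```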

To perform model comparison, we measured the likelihood of the data given each model and its maximum likelihood parameters fit to the training data, \( {\mathcal L} ({t}_{s},\,{t}_{p}|{ {\mathcal M} }_{i},\,{{\rm{\Theta }}}^{{\rm{ML}}})\). We then computed the ratio of \( {\mathcal L} ({t}_{s},\,{t}_{p}|{ {\mathcal M} }_{i},\,{{\rm{\Theta }}}^{{\rm{ML}}})\) to \( {\mathcal L} ({t}_{s},\,{t}_{p}|{ {\mathcal M} }_{BLS},\,{{\rm{\Theta }}}^{{\rm{ML}}})\), the likelihood of the BLS model, and took its logarithm to obtain the log likelihood ratio. To generate confidence intervals, we evaluated the likelihoods on the 100 test trials that were left out of model fitting and repeated this process across all folds, allowing us to measure the variability of the log likelihood ratio for each subject.
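The comparison metric itself reduces to a difference of summed per-trial log likelihoods, as in the short sketch below; the per-trial values in the example are made up, and in practice they would come from the predictive likelihoods (Equations 21–23) evaluated at the training-data parameters.

```python
import numpy as np

def log_likelihood_ratio(loglik_model, loglik_bls):
    """Summed per-trial log likelihood of a candidate model minus that of BLS;
    positive values indicate the candidate model explains the test data better."""
    return float(np.sum(loglik_model) - np.sum(loglik_bls))

# Example with made-up per-trial log likelihoods for one fold of 100 test trials.
rng = np.random.default_rng(0)
print(log_likelihood_ratio(rng.normal(-1.0, 0.3, 100), rng.normal(-1.1, 0.3, 100)))
```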

Finally, we further validated our fitting procedure, analyses, and model comparison on data simulated from a generative process that emulates each model. Using these simulations as ground truth, we confirmed that our analyses recover the correct model and parameters with numbers of trials and subjects similar to those in our data set (Supplementary Figs 7, 9, and 11).