Confidence controls perceptual evidence accumulation

Perceptual decisions are accompanied by feelings of confidence that reflect the likelihood that the decision was correct. Here we aim to clarify the relationship between perception and confidence by studying the same perceptual task across three different confidence contexts. Human observers were asked to categorize the source of sequentially presented visual stimuli. Each additional stimulus provided evidence for making more accurate perceptual decisions, and better confidence judgements. We show that observers’ ability to set appropriate evidence accumulation bounds for perceptual decisions is strongly predictive of their ability to make accurate confidence judgements. When observers were not permitted to control their exposure to evidence, they imposed covert bounds on their perceptual decisions but not on their confidence decisions. This partial dissociation between decision processes is reflected in behaviour and pupil dilation. Together, these findings suggest a confidence-regulated accumulation-to-bound process that controls perceptual decision-making even in the absence of explicit speed-accuracy trade-offs.


Supplementary Note 1
Our model assumed that, in the Stopping task, observers entered their response when they thought they had accumulated enough evidence to meet the target performance on each trial, as they were instructed to do. We gave observers feedback on their average performance over sub-blocks of 20 trials. Observers could have tried to meet the target performance using a different strategy: for example, aiming for 100% accuracy on the first 14 trials and then 0% accuracy on the next six trials (to achieve 70% correct), or aiming for 100% accuracy on the first 10 trials and then guessing (50% accuracy) on the next 10 (to achieve 75% correct). More simply, observers who felt as though they had performed too well throughout the majority of the sub-block could start making errors, or responding after very few samples, in order to lower their accuracy. If this were the case, there would be a difference in accuracy across the block, and possibly a difference in the number of samples.

Supplementary Note 2
In the Replay task, we asked observers to rate their confidence that they made a correct decision in order to measure their ability to estimate the validity of their own perceptual judgements (Type-II efficiency). The accuracy of these ratings depends both on the observer's insight into their Type-I accuracy and on how well they set criteria for making the confidence ratings. Whilst the former reflects Type-II sensitivity, the latter can be corrupted by bias: an observer may be more or less willing to give a high confidence rating. Therefore, care needs to be taken when estimating Type-II efficiency from confidence ratings. Type-II efficiency was calculated using the confidence ratings in the Replay task by calculating meta-d' as described in1, and dividing by d'. We had originally planned to compare Type-II efficiencies between conditions. However, there were no significant between-condition differences (Kruskal-Wallis χ2 = 1.83, p = 0.40), as shown in Supplementary Figure 2.
This was likely due to the high uncertainty in the obtained estimates, as there were only 100 trials from which to calculate each observer's meta-d' in each condition. In subsequent analyses, we therefore used efficiency estimates computed using data pooled across all conditions. As a result, we could only analyse between-subject differences, as opposed to within-subject comparisons between conditions.
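The efficiency measure above is the ratio meta-d'/d'. The meta-d' fit itself (ref. 1) requires a maximum-likelihood procedure over the confidence-rating distributions; the minimal sketch below assumes a meta-d' estimate is already available and illustrates only the d' computation and the efficiency ratio. The function names are illustrative, not the authors' code.

```python
# Sketch of the Type-II efficiency computation (meta-d'/d').
# Assumes the meta-d' value has already been fitted as in ref. 1.
from statistics import NormalDist

def d_prime(hit_rate: float, fa_rate: float) -> float:
    """Type-I sensitivity: z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

def type2_efficiency(meta_d: float, hit_rate: float, fa_rate: float) -> float:
    """meta-d'/d': a value of 1 indicates metacognitively ideal ratings."""
    return meta_d / d_prime(hit_rate, fa_rate)

# Example: hit rate 0.80 and false-alarm rate 0.30 give d' of about 1.37,
# so a fitted meta-d' of 1.0 yields an efficiency below 1.
dp = d_prime(0.80, 0.30)
eff = type2_efficiency(1.0, 0.80, 0.30)
```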

Supplementary Figure 2. Type-II efficiency for the Same (left), Less (middle) and More (right) conditions. Thick horizontal bars show the mean, with each subject (n=20) marked by three connected dots across conditions. Type-II efficiency estimates showed no significant differences between conditions.

Supplementary Note 3
A core aspect of making the Type-I decision is the setting and maintenance of the decision bound.
Often it is sufficient to assume a flat bound over accumulation time, where the observer makes their decision after a certain quantity of evidence has accumulated, irrespective of the time it takes to accumulate this evidence. However, in our experiment we sought to understand how human observers were setting their bounds relative to the optimal observer, in order to measure their efficiency in setting and maintaining Type-I decision bounds. The optimal observer sets a flat bound on the likelihood of a correct response, which is not necessarily a flat bound on the total accumulated evidence. Thus, to understand how human observers should set their decision bounds to meet performance targets in the Stopping task, we performed simulations of the optimal observer. The optimal observer is an observer who experiences suboptimal evidence accumulation but who has sufficient insight into their suboptimalities to appropriately set their decision bound to meet the performance targets. Taking an example set of samples (corresponding to the materials used for one of the actual participants), we simulated suboptimal evidence accumulation by corrupting the ideal evidence provided by each sample with 10,000 samples of noise (drawn from a zero-mean Gaussian with standard deviation σ) and performing leaky (controlled by parameter α) evidence accumulation across successive samples, resulting in 10,000 simulations for each of the 600 trials. For every trial, the optimal observer commits to a decision (stops accumulating evidence) when the proportion of simulations where the predicted response (corresponding to the sign of the accumulated log-odds evidence) is correct reaches or exceeds the performance target. In this way, the optimal observer is assumed to have access to the probability of making a correct response based on the accumulated evidence with each new sample (something human observers did not have direct access to).
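The stopping rule described above can be sketched as follows. This is a minimal, pure-Python illustration with assumed parameter values (σ, α, target) and a reduced number of simulated accumulators; it is not the authors' simulation code.

```python
# Sketch of the optimal observer's stopping rule: stop accumulating when
# the estimated probability of a correct response reaches the target.
# Positive log-odds samples are coded as favouring the correct category.
import random
import statistics

def optimal_stop(llr_samples, sigma=0.5, alpha=0.1, target=0.70,
                 n_sims=2000, seed=0):
    """Return (number of samples used, mean accumulated evidence at stop).

    sigma : std of the inference noise corrupting each sample.
    alpha : leak; each step the accumulator retains (1 - alpha) of its value.
    """
    rng = random.Random(seed)
    acc = [0.0] * n_sims  # one noisy leaky accumulator per simulation
    for n, llr in enumerate(llr_samples, start=1):
        for i in range(n_sims):
            acc[i] = (1.0 - alpha) * acc[i] + llr + rng.gauss(0.0, sigma)
        # proportion of simulations whose predicted response is correct
        p_correct = sum(a > 0 for a in acc) / n_sims
        if p_correct >= target:
            return n, statistics.mean(acc)
    # forced response at the end of the sequence
    return len(llr_samples), statistics.mean(acc)

# Example: weak evidence consistently favouring the correct category.
n_used, evidence = optimal_stop([0.25] * 40)
```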
Plotting the average accumulated evidence at decision time against the number of samples the optimal observer chose to respond to on each trial revealed a non-monotonic function whose shape was not obvious to model. Expressed as proportional evidence, however, the optimal bound on Type-I evidence accumulation is well approximated by an exponentially decreasing function of the number of samples (Supplementary Figure 3b), allowing us to describe the bound by a function with just three parameters (λ, a and b; Equation 9 in Methods), rather than by the height of the bound at each sample (up to 40 parameters). Our modelling approach consisted of comparing human observers' behaviour to this optimal behaviour, where, again, describing the bound as an exponential function meant that differences from the optimal bound could be summarised by specific parameters of the function. Regarding Type-II decisions, additional criteria were necessary to describe how observers rated their confidence. These criteria are usually modelled as additional bound-like divisions (confidence criteria) on the accumulated evidence. Before assuming that confidence criteria followed the same function as the exponentially decreasing bound on Type-I evidence, we took a normative approach to examine whether human observers' behaviour could reasonably be described in this way.
Supplementary Figure 3c shows the average proportional evidence (its ideal estimate, neither disrupted by noise nor leak) as a function of the number of samples, binned by confidence rating, for three example observers. Though there was large between-observer variability in the setting of confidence criteria, the functions appeared to decrease monotonically with increasing number of samples and could still be reasonably approximated by an exponential function. The exponential function has three parameters: the scale, b, the rate of decline, λ, and the minimum, a, as described in Equation 9 in the Methods. Although performance is moderated by the combination of these parameters, the observer would need to primarily adjust b to account for inference noise (increasing inference noise requires an increased b to maintain the same accuracy), whilst stronger temporal biases require the adjustment of λ, as illustrated in Supplementary Figure 4. The optimal observer primarily adapts the scale, b, and the rate of decline, λ, according to the target performance level. Modelling suggested that observers were adjusting only the rate of decline of their bound, not the scale, in order to adjust their performance according to the target performance in the Stopping task. This selective adjustment of the rate of decline of decision bounds suggests a link between the adaptability of Type-I decision bounds and Type-II sensitivity; perhaps the Type-II system is more capable of accounting for temporal biases than for internal noise. Further investigation will be necessary to assess whether this effect is adaptive, or the result of the temporal bias being more accessible to estimate than the noise. Previous evidence has suggested that observers show systematic biases in accounting for noise in their confidence judgements2,3,4, or are even blind to the noise affecting their perceptual decisions4.
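The three-parameter bound can be sketched as below. The exact functional form is given by Equation 9 in the Methods; here we assume the common parameterisation a + b·exp(−λn), which is consistent with the parameter roles described above (b scales the height, λ sets the rate of decline, a is the minimum).

```python
# Sketch of the exponentially decreasing decision bound.
# Parameterisation assumed from the parameter descriptions in the text;
# see Equation 9 in the Methods for the exact form.
import math

def bound(n, a, b, lam):
    """Bound height after n samples.

    a   : minimum (asymptote as the number of samples grows)
    b   : scale (height above the minimum near n = 0)
    lam : rate of decline
    """
    return a + b * math.exp(-lam * n)

# The bound declines from roughly a + b towards a with increasing n,
# so less evidence is required the longer accumulation has gone on.
heights = [bound(n, a=0.5, b=2.0, lam=0.3) for n in range(1, 41)]
```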

Supplementary Figure 4. Optimal decision bounds. As performance is moderated both by inference noise and by temporal biases, the optimal observer adjusts the parameters of the bound differently for different suboptimalities. a) As the target performance level increases, the optimal observer increases both b and λ. b) An increase in inference noise requires an increase in b (as seen from the height of the lines on the y-axis). c) The optimal observer alters λ to cope with changes in the temporal biases (increasing λ for increasing primacy), as seen from the rate of decline of the lines.
We suggested that observers were using their confidence evidence to set and maintain their Type-I decision bounds. In this respect, the suboptimalities in confidence evidence are responsible for the suboptimalities in bound efficiency. By simulating optimal bounds on the confidence evidence (for example, stopping the Type-I accumulation when there was a 70% chance of a correct response based on the Type-II evidence), and then re-calculating the bound efficiency based on the accumulated Type-I evidence, we were able to reproduce the relationship between bound efficiency and Type-II efficiency shown in Figure 3d of the manuscript.

b) Relationship between simulated bound efficiency (based on parameters fit to the Replay task) and bound efficiency calculated based on performance in the Stopping task (completed in a different experimental session).

Supplementary Note 4
Parameter recovery was used to ensure the model fitting procedure was able to accurately recover the underlying parameters describing observers' behaviour. We simulated 500 data sets of 300 trials using parameters sampled from normal distributions centred on the mean parameter values fitted to human observers (Supplementary Table 1). These simulated responses were then fit using the same code as for fitting participants' data, and the input (ground-truth) and output (best-fitting) parameters compared. We were particularly concerned about the model's ability to accurately recover the parameters analysed in the Results, given that the likelihood function was numerically estimated based on only 1,000 particles using a noisy objective function5. Supplementary Figure 6 shows the simulated and recovered parameters for inference noise (σ), leak (α), and the rate of decline of the bound (λ). We observed limited evidence for systematic biases in the recovered parameters and few large deviations from equality. All parameters showed strong correlations between simulated and recovered values (all p < 10^-5), indicating that the parameters could be accurately estimated by the model.
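The general recipe above (sample ground-truth parameters, simulate data, re-fit with the same pipeline, correlate true with recovered values) can be sketched with a deliberately simple stand-in model. The toy Gaussian "model" below replaces the full accumulation model purely to keep the sketch self-contained; it is not the authors' fitting code.

```python
# Generic parameter-recovery check with a toy model: the "fit" here is
# just the sample standard deviation standing in for a real optimiser.
import random
import statistics

rng = random.Random(1)

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x)
           * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

true_sigma, recovered_sigma = [], []
for _ in range(200):                      # one simulated "observer" each
    sigma = rng.uniform(0.5, 2.0)         # ground-truth parameter
    data = [rng.gauss(0.0, sigma) for _ in range(300)]  # simulated trials
    recovered = statistics.pstdev(data)   # re-fit the parameter
    true_sigma.append(sigma)
    recovered_sigma.append(recovered)

# A recoverable parameter shows a strong true-vs-recovered correlation.
r = pearson(true_sigma, recovered_sigma)
```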

Supplementary Note 5
When the observer cannot control the amount of evidence they accumulate, they may either:
1. accumulate all the evidence they are provided and then make their response (no bound);
2. establish a covert bound, accumulate evidence to this bound, and then ignore any additional evidence (absorbing bound); or
3. establish a covert bound, accumulate evidence to this bound, and then continue to monitor any additional evidence in case there is evidence against their decision (reflexive bound6).
In the Results we presented the data comparing the absorbing bound to the no-bound model, but we first compared the three different possible bounds. Supplementary Figure 7 shows the relative fit of the different bounds. There was no evidence that the reflexive bound was a better description of behaviour than no bound (p = 0.13), and this bound was a significantly worse fit than the absorbing bound (pbonf*3 < 0.006).

Supplementary Note 6
A difference in pupil size could be due to a shift in the timing of the pupil response, or driven by a difference in the rate of pupil change. In the Results section, we present a difference in pupil size following the response for high-confidence compared to low-confidence trials. This effect was driven by a more rapid pupil constriction in high-confidence trials, based on the difference in the derivatives of the pupil traces. Finally, we demonstrate that these effects are still visible when dividing trials by both confidence and bound crossing. As shown in Supplementary Figure 9c, there was still an effect of confidence within not-crossed trials (pbonf*4 = 0.024) and an effect of boundary crossing in high-confidence trials (pbonf*4 < 0.004 and pbonf*4 = 0.024 for the differences before and after the response, respectively). This again supports the finding that two distinct effects in pupil dilation are visible: 1. dilation relative to committing to a Type-I decision (which is temporally offset according to the different conditions in the Replay task), and 2. more rapid constriction with greater Type-II confidence, following the response.
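The three candidate stopping rules compared in Supplementary Note 5 (no bound, absorbing bound, reflexive bound) can be illustrated schematically as follows. This is a noise-free, leak-free sketch meant only to show how the rules differ when strong counter-evidence arrives after a bound crossing; it is not the fitted model.

```python
# Schematic comparison of the three covert-bound rules on one evidence
# sequence. Decisions are coded as +1 / -1 (the sign of the evidence).
def decide(samples, bound=None, rule="none"):
    """Final decision under one of three stopping rules.

    "none"      : accumulate everything, decide at sequence end.
    "absorbing" : commit at the first bound crossing; ignore the rest.
    "reflexive" : commit at the first crossing but keep monitoring;
                  crossing the opposite bound reverses the decision.
    """
    acc = 0.0
    decision = None
    for s in samples:
        acc += s
        if bound is not None and decision is None and abs(acc) >= bound:
            decision = 1 if acc > 0 else -1
            if rule == "absorbing":
                return decision  # remaining evidence is ignored
        elif (rule == "reflexive" and decision is not None
              and abs(acc) >= bound and (acc > 0) != (decision > 0)):
            decision = 1 if acc > 0 else -1  # reversed by counter-evidence
    return decision if decision is not None else (1 if acc > 0 else -1)

samples = [1.0, 1.0, 1.0, -5.0]  # late, strong counter-evidence
d_none = decide(samples)                              # -1: all evidence counted
d_abs = decide(samples, bound=2.0, rule="absorbing")  # +1: locked in early
d_ref = decide(samples, bound=2.0, rule="reflexive")  # -1: reversed late
```

The absorbing and reflexive rules diverge exactly when post-crossing evidence opposes the committed decision, which is what the model comparison in Supplementary Figure 7 exploits.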