Abstract
Perceptual decisions are accompanied by feelings of confidence that reflect the likelihood that the decision was correct. Here we aim to clarify the relationship between perception and confidence by studying the same perceptual task across three different confidence contexts. Human observers were asked to categorise the source of sequentially presented visual stimuli. Each additional stimulus provided evidence for making more accurate perceptual decisions and better confidence judgements. We show that observers’ ability to set appropriate evidence accumulation bounds for perceptual decisions is strongly predictive of their ability to make accurate confidence judgements. When observers were not permitted to control their exposure to evidence, they imposed covert bounds on their perceptual decisions but not on their confidence decisions. This partial dissociation between decision processes is reflected in behaviour and pupil dilation. Together, these findings suggest a confidence-regulated accumulation-to-bound process that controls perceptual decision-making even in the absence of explicit speed-accuracy trade-offs.
Introduction
Sensory perception results from inferring the causes of uncertain sensory evidence^{1}. The perceived objects are under-constrained by sensory evidence, so these inferences are fundamentally probabilistic. In recent years, there has been a growing interest in the ability of human observers to estimate the validity of their perceptual decisions. This form of metacognitive judgement can be obtained by asking an observer to rate how confident she is that one of her perceptual decisions is correct. While the perceptual decision is used to quantify the percept itself, each confidence rating quantifies the certainty the observer has about her own percept. These two types of judgements are known as Type-I and Type-II decisions, respectively^{2}.
By definition, an ideal Type-II decision-maker uses the exact same evidence as for the Type-I decision^{3}. The evidence for the Type-I decision can be computationally described by sequential sampling processes, or diffusion models^{4,5,6,7} (for a recent review, see ref. ^{8}), wherein samples of noisy evidence are accumulated over time and a decision is made once the evidence reaches a bound. The Type-II decision estimates the likelihood that the Type-I decision is correct, given the accumulated evidence. The likelihood of the Type-I decision is moderated by both the quantity of evidence (determined by the relative placement of the decision bound) and the quality of evidence (which is marred by suboptimal accumulation, such as noise and leak in the accumulation process). The ideal Type-II decision therefore requires the estimation of both the quantity and the quality of Type-I evidence.
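The accumulation-to-bound process described above can be sketched in a few lines. The drift, noise, and bound values below are arbitrary illustrations, not fitted parameters from this study; the sketch simply shows how noisy samples are summed until a symmetric bound is hit, yielding both a choice and a decision time.

```python
import numpy as np

rng = np.random.default_rng(0)

def accumulate_to_bound(drift=0.1, noise=1.0, bound=2.0, max_steps=1000):
    """Accumulate noisy evidence samples until |evidence| reaches the bound.

    Returns (choice, n_samples): choice is +1/-1 for the bound that was hit,
    or 0 if no bound was reached within max_steps.
    """
    evidence = 0.0
    for t in range(1, max_steps + 1):
        evidence += drift + noise * rng.standard_normal()
        if abs(evidence) >= bound:
            return (1 if evidence > 0 else -1), t
    return 0, max_steps

# With a positive drift, the upper (correct) bound is hit on most trials,
# and raising the bound trades speed for accuracy.
results = [accumulate_to_bound() for _ in range(500)]
accuracy = np.mean([choice == 1 for choice, _ in results])
```

Raising `bound` in this sketch increases both accuracy and the mean number of samples, which is the speed-accuracy trade-off that the bound placement controls.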
However, a large body of evidence demonstrates that human observers do not make their Type-II decisions in accordance with this ideal Type-II decision-maker. For example, observers poorly incorporate estimates of sensory noise into their Type-II decisions (resulting in over- or under-confidence^{9,10,11}), and they may ignore evidence in favour of other decision alternatives^{12,13,14} (although not always^{15}). Furthermore, observers may integrate additional evidence into their Type-II decision that was not used for making their Type-I decision^{8,16}, allowing them to report errors in the absence of feedback^{17,18,19} and to change their mind after the initiation of a response^{20}. These departures from the ideal are yet to be fully characterised by a unifying framework of Type-II decision-making.
The computational description of Type-II decision-making is significantly constrained by the tight concomitance of Type-I and Type-II decisions. In the context of sequential sampling processes, the suboptimalities affecting Type-II decisions could be merely inherited from the suboptimalities in Type-I evidence accumulation, or there could be additional suboptimalities in the computation of uncertainty over the Type-I variables. Moreover, there is the possibility that the processes of accumulating Type-I and Type-II evidence are neither identical nor functionally independent; that is, confidence may interact with the very process of accumulating evidence for perceptual decisions. Exploring this possibility is essential for understanding confidence and perceptual decision-making.
To clarify this relationship, we asked observers to make the same Type-I judgement in three distinct Type-II contexts. In all three contexts, the Type-I judgement required observers to make a two-alternative categorisation decision after viewing a series of visual stimuli (a variation of the weather prediction task^{21,22}). Following the work of Drugowitsch et al.^{22}, each orientation offered a specific amount of evidence in favour of each category, such that the quality and quantity of Type-I evidence could be carefully monitored over the course of each trial. This allowed for the disambiguation of different suboptimalities affecting observers’ Type-I decision-making using computational modelling. Given the suggestion that Type-II evidence accumulation may continue after the Type-I decision bound has been reached, we were especially motivated to understand the relationship between Type-I and Type-II decision-making relative to the point at which the Type-I evidence crosses the decision bound. In the first Type-II context, we therefore measured observers’ ability to set and maintain appropriate bounds on their Type-I evidence accumulation by asking them to make their Type-I judgement when they thought they had reached an instructed target performance level. We used the second Type-II context to measure observers’ default bound, i.e., how much evidence each observer feels they need to accumulate to commit to a perceptual decision. Last, in the third Type-II context, we tested whether observers implement a covert bound when they are presented with more evidence than needed to reach their default bound, as measured in the second context. In this third Type-II context, observers also rated their confidence that their Type-I decision was correct on each trial, i.e., an explicit Type-II decision.
In addition to behavioural responses, pupil dilation was monitored throughout the experiment, as the literature has suggested strong links between pupil dilation and both Type-I^{23} and Type-II^{24,25} decision-making via pupil-linked dynamics of the noradrenergic system^{26,27}. Both behaviour and pupillometry reflect a partial dissociation between Type-I and Type-II evidence, which could allow for a confidence-regulated accumulation-to-bound process that controls perceptual decision-making even in the absence of an explicit speed-accuracy trade-off.
Results
Preliminary analyses
Across all three Type-II contexts, observers (N = 20) made the same Type-I decision: whether the orientations of the sequence of Gabor patches presented on each trial were drawn from the orange or the blue category. These categories were defined by circular Gaussian probability distributions over the orientations of the Gabor patches, as shown in Fig. 1a (see Methods). The three Type-II contexts were presented across two sessions and are depicted in Fig. 1b. For each observer, 100 trials of up to 40 stimuli were predefined and repeated six times in the Stopping task context and three times each in the Free task and Replay task contexts.
In the first session, observers were placed in a Stopping task context in which they were continually shown samples until they entered their Type-I response. Importantly, they were asked to enter their response when they felt they had accumulated enough evidence to reach a certain probability of being correct (target performance). There were three target performance conditions (70%, 85%, and 90% correct), in which observers scored an average proportion correct [95% between-subjects CI] of 0.72 [±0.018], 0.80 [±0.021], and 0.82 [±0.023]. This corresponded to a Type-I sensitivity (d’) of 1.2 [±0.17], 1.7 [±0.14], and 1.9 [±0.19] in each target performance condition, which significantly increased across target performance conditions (Wilcoxon signed-rank test; Z (70% vs. 85%) = 3.78, p_{bonf*2} < 0.001; Z (85% vs. 90%) = 2.35, p_{bonf*2} = 0.037; p-values Bonferroni-corrected for two comparisons). Observers also chose to enter their response later in the higher target performance conditions (average median sequence length = 6.5 [±1.13], 11.5 [±2.09], and 15.0 [±3.02]; Z (70% vs. 85%) = 3.71, p_{bonf*2} < 0.001; Z (85% vs. 90%) = 2.80, p_{bonf*2} = 0.010; additional analyses are provided in Supplementary Note 1 and Supplementary Fig. 1).
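The d’ values above can be recovered from response proportions with standard equal-variance signal detection theory; the sketch below assumes unbiased responding (hit rate equal to one minus the false-alarm rate), which is an illustration rather than the paper’s exact computation.

```python
from statistics import NormalDist

def dprime(hit_rate, fa_rate):
    """Type-I sensitivity under equal-variance signal detection theory:
    d' = z(hit rate) - z(false-alarm rate).
    Rates are clipped away from 0 and 1 to keep the z-transform finite."""
    z = NormalDist().inv_cdf
    clip = lambda p: min(max(p, 1e-4), 1 - 1e-4)
    return z(clip(hit_rate)) - z(clip(fa_rate))

# Unbiased responding at 72% correct (the 70%-target condition) gives a
# d' close to the ~1.2 reported above.
d_70 = dprime(0.72, 1 - 0.72)
```

Under this unbiased assumption, d’ is simply twice the z-transform of proportion correct, which is why the three reported accuracies map monotonically onto the three reported sensitivities.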
In a second session, completed on a separate day, observers were placed in a Free task context, followed by a Replay task context. In the Free task context, observers were again continually presented with samples until they entered their response. Unlike in the Stopping task, observers were not given specific performance targets but were instead asked to enter their response as soon as they ‘felt ready’. Observers scored an average proportion correct [95% between-subjects CI] of 0.80 [±0.025], choosing to respond after 10.5 [±1.51] samples on average. The same trials were then repeated in the Replay task, completed immediately after the Free task, but in the Replay task observers were presented with a fixed number of samples and could only respond after the response cue. The number of samples presented on each trial was determined relative to how many samples the observer had chosen to respond to across the three repetitions of each predefined trial in the Free task. There were three intermixed conditions: Less (−2 samples from the minimum), Same (same number of samples as the median), and More (+4 samples from the maximum). This resulted in a median number of samples of 5.4, 10.4, and 18.5 in the Less, Same, and More conditions, respectively. We compared performance in the 100 trials of the Less, Same, and More conditions to the corresponding sets of 100 trials in the Free task exhibiting the minimum, median, and maximum number of observed samples across the three predefined trial repetitions. Performance in the Same condition was on par with performance on those same trials (the trials with the median number of samples) in the Free task (mean proportion correct = 0.80 [±0.022]; Z (Same vs. Free d’) = 0.82, p = 0.41, uncorrected).
In the Less condition, two fewer samples corresponded to a substantial within-subjects decrease in performance (Less d’ = 1.01; Free d’ on Less trials = 1.50; mean within-subjects difference = 0.49 [±0.13], Z = 3.51, p_{bonf*3} < 0.001), but in the More condition, four additional samples did not significantly improve performance within subjects (More d’ = 1.85; Free d’ on More trials = 1.77; mean within-subjects difference = −0.08 [±0.13]; Z = 0.04, p = 0.68).
In the Replay task, observers also gave a confidence rating after each Type-I response. The rating ranged from 1 to 4 and reflected observers’ confidence that they had made a correct response (1 corresponding to low confidence/guessing, and 4 to high confidence/certain correct). These ratings were used to compute observers’ Type-II sensitivity using meta-d’, as has become common practice in metacognitive research^{28}. To account for differences in Type-I sensitivity, meta-d’ is divided by d’ to give metacognitive efficiency, or Type-II efficiency as we will call it here. The average Type-II efficiency was 0.75 [±0.096]; more details on this analysis are available in Supplementary Note 2 and Supplementary Fig. 2.
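Meta-d’ itself requires a dedicated model fit^{28}, but the intuition behind Type-II sensitivity can be illustrated with a simpler non-parametric stand-in: the area under the type 2 ROC, i.e., the probability that a randomly chosen correct trial received a higher confidence rating than a randomly chosen error trial. This sketch is an illustration only, not the meta-d’/d’ computation used in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def auroc2(confidence, correct):
    """Type 2 ROC area: P(confidence on a correct trial > confidence on an
    error trial), counting ties as 0.5. Chance level is 0.5; higher values
    mean confidence better discriminates correct from incorrect responses."""
    conf = np.asarray(confidence, dtype=float)
    corr = np.asarray(correct, dtype=bool)
    hits, errs = conf[corr], conf[~corr]
    if len(hits) == 0 or len(errs) == 0:
        return float("nan")  # undefined without both outcome types
    greater = (hits[:, None] > errs[None, :]).mean()
    ties = (hits[:, None] == errs[None, :]).mean()
    return greater + 0.5 * ties

# A simulated observer whose 1-4 ratings loosely track accuracy scores
# well above the 0.5 chance level.
correct = rng.random(2000) < 0.75
confidence = np.where(correct, rng.integers(2, 5, 2000), rng.integers(1, 4, 2000))
score = auroc2(confidence, correct)
```

Unlike this raw measure, meta-d’ expresses Type-II sensitivity in the same units as d’, which is what makes the meta-d’/d’ efficiency ratio interpretable across observers with different Type-I sensitivity.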
Observers’ behavioural responses were determined by the orientations they were presented with on each trial, and by their internal processing of these samples, which is affected by several sources of suboptimality. These suboptimalities were quantified using computational modelling, based on the work of Drugowitsch et al.^{22}. The model describes the observer’s choice on each trial as an accumulation of evidence with each additional sample, as shown in Fig. 2a. The evidence accumulated with each sample was calculated as the difference in the log probabilities of the orientation given each category (blue vs. orange), which is the optimal evidence given by a Bayesian observer. This optimal evidence was disrupted by two sources of suboptimality that impair the observer’s Type-I sensitivity: inference noise (disrupting the accurate representation of decision evidence) and a temporal bias (weighting early and late evidence differently), parameterised by σ and α, respectively (see Methods for details). These parameters were adjusted for each observer to best describe their behavioural choices. In order to achieve the performance targets, the observer imposes a bound on the accumulated evidence. To maintain a constant probability of a correct response, the ideal bound decreases with the number of samples (a collapsing bound; see Methods for more details). That is, as additional samples are accumulated, the observer requires less evidence per sample on average for the same probability of a correct response. We found that observers adjusted the rate of decline of the bound function over the number of samples (parameterised by λ) to accommodate the different target performance conditions. On average, λ increased from 2.17 [±0.38] to 3.55 [±0.53] to 4.38 [±0.69], and this alone was sufficient to explain the differences between conditions in Fig. 2b, c.
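The per-sample evidence in this model has a convenient closed form: for two von Mises category distributions with equal concentration, the normalising constants cancel and the log-likelihood ratio reduces to a difference of cosines. The sketch below assumes the κ = 0.5 and ±45° category means from the Methods, and the common convention of doubling angles so that the 180°-periodic orientation space maps onto the circle; the parameter names `sigma` (inference noise) and `alpha` (leak/temporal weighting) mirror the paper’s σ and α but the update rule shown is a simplified illustration, not the fitted model.

```python
import numpy as np

rng = np.random.default_rng(1)
KAPPA = 0.5
MU_BLUE = np.deg2rad(2 * 45)    # doubled angles: orientation is 180-deg periodic
MU_ORANGE = np.deg2rad(2 * -45)

def sample_llr(theta_deg):
    """Log-likelihood ratio (blue vs. orange) for one Gabor orientation.

    log f(theta; mu_b, kappa) - log f(theta; mu_o, kappa)
      = kappa * (cos(phi - mu_b) - cos(phi - mu_o)),  phi = 2 * theta,
    since the von Mises normalising constants cancel."""
    phi = np.deg2rad(2 * theta_deg)
    return KAPPA * (np.cos(phi - MU_BLUE) - np.cos(phi - MU_ORANGE))

def accumulate(thetas_deg, sigma=0.5, alpha=1.0):
    """Leaky, noisy accumulation of per-sample evidence (illustrative):
    sigma adds inference noise per sample, alpha < 1 down-weights earlier
    samples. The sign of the total decides the categorisation."""
    total = 0.0
    for theta in thetas_deg:
        total = alpha * total + sample_llr(theta) + sigma * rng.standard_normal()
    return total

llr_blue = sample_llr(45.0)     # orientation at the blue mean favours blue
llr_orange = sample_llr(-45.0)  # and symmetrically for orange
```

With κ = 0.5, a sample exactly at a category mean contributes ±1 unit of log evidence, and noiseless, leak-free accumulation is simply the running sum of these units.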
The relative placement of this bound, given the suboptimalities in their evidence accumulation, controlled how well observers were able to meet the target performance levels. The model was fit to both what and when observers responded. The ability of the model to describe observers’ behaviour in each target performance condition can be appreciated by simulating behaviour using each participant’s fitted parameters, as shown in Fig. 2b, c. Further details on the model parameters and fits can be found in the Methods. An indication of the model fit to behavioural performance is overlaid on each behavioural data figure, with open red markers corresponding to performance in simulations using the parameters fit to each observer.
This computational model was then used to examine the process of accumulating evidence for Type-I and Type-II decisions in the context of our three main questions: the relationship between efficient bound-setting for Type-I decision-making and Type-II sensitivity, the implementation of covert bounds on Type-I and Type-II evidence accumulation, and the relationship between suboptimalities in evidence accumulation for Type-I and Type-II decisions.
The relationship between Type-I and Type-II efficiency
Observers’ ability to set and maintain appropriate bounds on Type-I evidence accumulation was measured in the Stopping task. As shown in Fig. 3a, observers were able to adjust their performance according to the target performance condition, but did not do so optimally: there was substantial overperformance in the 70% correct condition and underperformance in the 85% and 90% correct conditions. The computational model explained this change in behaviour by a change in the rate of decline of the decision bound, which also explained the increase in the number of samples observers chose to respond to (Fig. 3b). Bound efficiency was then calculated as the change in bound across conditions that the observer actually implemented, divided by the change in bound they should have implemented to reach the target performance levels. This expected behaviour was obtained by simulations given the other suboptimalities affecting observers’ performance (see Methods for more details). In this way, an observer with high bound efficiency could over- or underestimate their performance (poor accuracy relative to the target performance levels) while still adjusting their bounds appropriately in the face of different performance targets (good precision). Type-II efficiency was then calculated using the confidence ratings in the Replay task (completed in a separate session from the Stopping task). As a check, Fig. 3c shows that observers used their ratings appropriately: proportion correct increased with increasing confidence, and reflected increased decision evidence rather than the number of samples per se (which could have been used as a proxy for confidence).
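The bound efficiency ratio defined above can be written out explicitly. The sketch below uses the group-average λ values reported in the Results (2.17, 3.55, 4.38) against a purely hypothetical ideal schedule; in the paper the ideal adjustment is derived per observer by simulation given their fitted suboptimalities, so the numbers and the function name here are illustrative only.

```python
def bound_efficiency(fitted_bounds, ideal_bounds):
    """Ratio of the bound adjustment the observer actually implemented
    across target-performance conditions (70%, 85%, 90%) to the adjustment
    an ideal observer with the same accumulation suboptimalities would have
    needed. Values near 1 indicate precise bound-setting even if overall
    accuracy is biased relative to the targets."""
    implemented = fitted_bounds[-1] - fitted_bounds[0]
    required = ideal_bounds[-1] - ideal_bounds[0]
    return implemented / required

# Group-average lambdas from the Results against a hypothetical ideal
# schedule (the [2.0, 4.0, 5.0] values are made up for illustration).
eff = bound_efficiency([2.17, 3.55, 4.38], [2.0, 4.0, 5.0])
```

Note that this measure deliberately ignores the absolute placement of the bounds: an observer who is systematically overconfident can still score near 1 if they scale their bounds by the right amount between conditions.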
There was a strong, significant relationship between bound efficiency and Type-II efficiency, shown in Fig. 3d, assessed via the slope of the line that minimises the perpendicular distance from the data (y = 1.41x + 0.25; p = 0.004, using non-parametric bootstrapping; y = 1.5x + 0.2, p = 0.026 with two outliers removed). Importantly, bound efficiency was not related to observers’ Type-I sensitivity in the Replay task (p = 0.64), nor was Type-II efficiency related to the parameters contributing to Type-I sensitivity in the Stopping task (inference noise (σ), p = 0.51; temporal bias (α), p = 0.17). This indicates that the relationship between bound efficiency and Type-II efficiency did not arise from other, potentially confounding computational parameters underlying Type-I and Type-II sensitivity.
Covert bounds on Type-I evidence accumulation
In the Replay task, the additional samples in the More condition should have driven increased performance compared to the Same condition, but as reported in the preliminary analyses, this was not the case (Fig. 4a). This lack of improvement has two possible explanations. First, there could be a performance limit due to the suboptimalities in evidence accumulation parameterised by σ (inference noise) and α (inference leak). Second, and more interestingly, observers could be employing a covert bound on evidence accumulation, beyond which they do not accumulate evidence even when additional evidence is available. To compare these hypotheses, we fit two models, one with and one without a covert bound, fixing the suboptimalities to those fit to the Same and Less conditions. There was a significant improvement in model fit with a covert bound relative to the model without, assessed using five-fold cross-validation (mean relative increase = 0.047, Z = 2.46, p_{bonf*3} = 0.041; leftmost bar of Fig. 4b), suggesting that a covert bound explains the behavioural responses better than the suboptimalities in evidence accumulation alone. This model comparison included two other models in which a covert bound was fit to Type-II evidence accumulation (more details below), neither of which showed significant evidence of an improvement in fit (for a hard bound, the same as Type-I, Z = 1.79, p = 0.073; for an independent bound, Z = 0.11, p = 0.91). Further details on reflexive vs. absorbing^{29} covert bound comparisons can be found in the Supplementary Information (Note 5 and Fig. 7).
There was a clear effect of crossing the decision bound in observers’ pupil dilation, which provided direct support for the modelling results. In all three conditions of the Stopping task, there was a phasic increase in pupil size beginning immediately prior to the response and peaking immediately after the response. Based on a cluster-level analysis^{30} of Wilcoxon signed-rank tests against baseline, this increase became significantly greater than baseline from 0.28 s prior to the response (p_{bonf*3} < 0.002 in each condition). In contrast, in the Replay task there were substantial between-condition differences. In the Same and Less conditions the increase relative to baseline did not survive cluster-level comparisons (p = 0.064 in the Same condition and p = 0.281 in the Less condition; uncorrected), but in the More condition pupil size was significantly greater than baseline from −3.92 to −1.86 s relative to the response (p_{bonf*3} = 0.018). Pupil size also peaked earlier in the More condition (M = −2.46 s) than in the Less condition (M = −2.8 s) (jackknife resampling, t(17) = −13.06, p_{bonf*3} < 0.001), though neither peak was significantly different from the Same condition (p > 0.1). The difference in the median number of samples shown in the More and Same conditions was about 8. If the difference in the peaks were due to the different number of samples, there should also have been a difference between conditions in the Stopping task, where there was likewise a difference of about 8 samples between the 70% and 90% target performance conditions (Fig. 4c).
Suboptimalities in Type-I and Type-II evidence accumulation
Confidence ratings were modelled by placing criteria on the accumulated evidence, such that the confidence rating was given based on the position of the evidence relative to these criteria (further details in the Methods). The ideal Type-II observer is defined as one that uses the exact same evidence as for the Type-I decision. If this were the case, there would be no systematic difference in the parameters fit to describe both Type-I and Type-II responses, compared to when only Type-I responses were fit. On the contrary, we found a significant increase in inference noise (σ; Wilcoxon signed-rank test, Z = −3.81, p_{bonf*2} < 0.001) and a significant decrease in temporal bias (increase in α toward 1; Z = −3.62, p_{bonf*2} < 0.001). If the Type-I and Type-II evidence accumulation processes were entirely independent, there should be no significant correlation between the suboptimalities affecting each separate accumulation process. On the contrary, we found both σ and α to correlate when allowing model parameters to vary independently across the Type-I and Type-II evidence accumulation processes (σ, τ = 0.66, p_{bonf*4} < 0.001; α, τ = 0.57, p_{bonf*4} = 0.001), while the Type-II parameters remained significantly increased relative to the Type-I parameters (σ, Z = −3.92, p_{bonf*4} < 0.001; and an increase in α, Z = −3.73, p_{bonf*4} = 0.001). The results are therefore consistent with a partially dissociable model, in which the Type-I and Type-II accumulators receive the same noisy decision evidence, but the Type-II accumulator thereafter incurs additional noise. This additional noise could occur either at the accumulation stage or at the decision output stage; however, these partially correlated models could not be distinguished (the difference in the log-likelihoods of the models is 0.67). Figure 5 depicts the additional Type-II noise as occurring at the accumulation stage.
By simulating confidence ratings using the fitted parameters, we were able to closely estimate observers’ Type-II sensitivity (rho = 0.59, p = 0.007); furthermore, estimating the placement of Type-I bounds based on the confidence evidence provided a good estimate of bound efficiency (see Supplementary Information, Note 3 and Fig. 5, for details).
A partial dissociation between Type-I and Type-II decision processes was also evident in observers’ pupil dilation. There was a significant difference in pupil size between trials rated with high confidence (ratings of 3 and 4) and trials rated with low confidence (ratings of 1 and 2) from 1.2 s following the response (Fig. 6a, cluster-level p = 0.004). Comparing trials predicted (based on fitted model parameters) to have crossed observers’ covert bounds with those that did not also revealed a significant difference in pupil size following the response (Fig. 6b, from −0.2 s relative to the response, p < 0.002). There was also a significant difference prior to the response, reflecting the earlier peak in pupil size when the bound was crossed (from −3 to −1.62 s relative to the response, p = 0.022). It should be noted that these trial divisions were not completely independent: 61% of the trials in which the bound was crossed were high-confidence trials, and 67% of the low-confidence trials were trials in which the bound had not been crossed. Figure 6c, d shows clearly that the difference in constriction after the response is attributable to differences in confidence, while the temporal difference in when peak pupil size occurs is attributable to bound crossing. The differences in pupil size after the responses corresponded to differences in the speed of pupil constriction, based on an analysis of the derivative (see Supplementary Information, Note 6 and Fig. 9, for details).
Discussion
The experiment presented here was designed to clarify the relationship between the accumulation of evidence for perceptual and confidence decisions. The results suggest a far more intricate relationship than has previously been assumed. While the confidence decision relies on the same sensory evidence as the perceptual decision, this evidence undergoes additional noise and a distinct temporal bias. Furthermore, the sensory evidence accumulation process is terminated by a stricter bound, such that observers may make their perceptual decision prior to accumulating all available evidence, while the confidence accumulation continues. The relationship between observers’ confidence efficiency and their ability to set and maintain appropriate bounds on sensory evidence accumulation suggests a common mechanism behind the two: the Type-II system imposes bounds on Type-I evidence accumulation. Put together, this evidence indicates that confidence decisions are not the result of some inert post-decisional process, but reflect an online control process that moderates sensory evidence accumulation.
By implementing a covert bound, observers were able to terminate the Type-I evidence accumulation process prior to the end of the trial, making their decision despite there being more evidence available that could have increased their accuracy. This was evident from the behavioural results, where performance in the More condition did not significantly increase relative to the Same condition, even though there were an additional four samples on which to base the decision. Our computational modelling showed that a covert bound significantly improved the fit of the model to behavioural responses. Observers showed a phasic increase in pupil dilation as they made their decisions (as the evidence reached the decision bound), which was temporally aligned with the response in the Stopping task but occurred much earlier in the More condition of the Replay task, supporting the behavioural and modelling results suggesting that observers were covertly committing to their Type-I decisions early. This implementation of covert bounds could help optimise perception for efficient Type-I decision-making. However, in the Replay task, the bounds did not improve efficiency in terms of overall time to complete the task (observers were forced to wait until the end of the trial to enter their response), but rather in terms of cognitive resources (by terminating the effortful evidence accumulation process early). The Type-I evidence accumulation system is thus not only hasty, but also indolent.
In comparison, there was no evidence that the Type-II evidence accumulation system was subject to the same premature termination: the model with covert bounds did not improve the description of observers’ confidence judgements, and the confidence of the observer was not discernible from their pupil dilation until after the Type-I response. This is consistent with other findings in the literature suggesting that observers’ Type-II responses inherit noise from the Type-I process^{31}, and that observers may continue to accumulate Type-II evidence even after their Type-I response, with additional Type-II noise^{8,16}; it also explains how some observers are able to show superior performance in their Type-II decisions compared to what would be expected from their Type-I decisions^{32}. The additional accumulation of Type-II evidence, in combination with the additional inference noise affecting Type-II evidence accumulation, can account for all levels of metacognitive performance described in the literature thus far. Moreover, the particular relationship between the suboptimalities in Type-I and Type-II evidence accumulation, with more noise but less temporal bias in Type-II accumulation, supports the characterisation of these processes as a hasty but efficient (Type-I) process moderated by a cautious but inefficient Type-II process^{3}. Here, these effects were found without any explicit experimental manipulation: there was no speed-accuracy trade-off to induce the early bound on Type-I accumulation in the Replay task, and there was no manipulation of the timings of Type-I and Type-II responses to induce additional Type-II accumulation. In other words, the observed relationship between Type-I and Type-II evidence accumulation arose from observers’ intrinsic predispositions.
This partial dissociation between Type-I and Type-II evidence accumulation processes was also evident in observers’ pupil dilation responses, suggesting differential activations of the noradrenergic system. The pupil response to reaching the decision bound may be associated with the phasic activity of locus coeruleus (LC) neurons to task-relevant stimuli, the timing of which is more tightly related to the response time (though occurring prior to the response) than to the timing of the stimulus^{33}. This pupil response to covert decisions has been demonstrated in previous experiments^{34}, as has the distinct profile we saw related to confidence^{35,36}. While phasic pupil dilation was associated with crossing the decision bound (temporally dissociated from the time of the response in the Replay task), a distinct effect was seen related to confidence: faster pupil constriction following the response (possibly due to an indirect effect of confidence on task disengagement). These two distinct responses visible in the pupil dilation therefore likely correspond to distinct processes that can be measured simultaneously, perhaps reflecting the functional differences between tonic and phasic activity of noradrenergic neurons^{26}.
Despite all these differences between Type-I and Type-II evidence accumulation, there was a surprisingly strong correlation between observers’ Type-II efficiency and their ability to efficiently set and maintain Type-I decision bounds. It is unlikely that this correlation emerges indirectly from some common underlying variable, such as observers’ general motivation, because of the lack of a relationship between observers’ Type-I sensitivity (which should also vary with task commitment) in the Replay task and bound efficiency in the Stopping task. Likewise, there was no relationship between Type-II efficiency in the Replay task and the magnitude of the suboptimalities in evidence accumulation in the Stopping task. The strongest interpretation of the correlation is that there is a causal relationship between Type-II efficiency and bound efficiency: Type-II evidence is being used to set and maintain bounds on Type-I evidence accumulation (though a test of causality is left for future research). This ‘metacognitive control’ has previously been suggested to operate in a post-decisional manner: current confidence influences future Type-I evidence accumulation^{37,38,39}. Our interpretation goes further by postulating that the Type-II process acts on Type-I evidence accumulation online, as accumulation occurs. Indeed, we were able to reproduce observers’ bound efficiency by simulating bounds based on the Type-II evidence estimate of the probability of a correct response (Supplementary Fig. 5). This kind of interaction would readily explain how observers actively seek more information when they are uncertain^{40}, and how observers can integrate loss functions into their Type-I decisions^{41}.
In addition, this kind of relationship offers a mechanism by which decision bounds could be implemented at the neural level, supported by evidence showing that activity in the dlPFC is modulated according to changes in the speed-accuracy trade-off^{42,43,44} and according to metacognitive confidence^{45,46}. However, it remains to be tested whether observers utilise their Type-II evidence for bound-setting in other contexts, in particular, in contexts where the tight relationship between Type-II evidence and bound-setting is not suggested by the task instructions.
In summary, perceptual and confidence decisions were best explained by partially dissociable, yet causally related, evidence accumulation processes. This intricate relationship allows human observers to simultaneously represent sensory evidence as a categorical variable while maintaining a graded representation of the uncertainty in that variable, which can then signal the commitment to a perceptual decision. This allows for fast and efficient perceptual processing under the control of a more cautious confidence evaluation. Rather than adding complexity, this characterisation of perceptual processing as concomitant with the processing of uncertainty provides a general computational framework for describing several features of human decision-making. That confidence controls perceptual decision-making explains how observers can adjust decision-making across speed-accuracy trade-offs, learn to make decisions in volatile environments, incorporate priors and loss functions into perceptual decisions, and optimise perceptual processing flexibly across instances requiring detailed scrutiny and instances requiring the integration of global cues.
Methods
Participants
A total of 22 participants with normal or corrected-to-normal vision were recruited via the French Relais d’Information en Sciences de la Cognition (RISC) mailing list. The experimental protocol was approved by the Conseil en éthique pour les recherches en santé and preregistered on the Open Science Framework platform (https://osf.io/gy2t3/), in adherence with the Declaration of Helsinki, and participants gave written informed consent prior to completing the experiment. Two participants were removed from the analysis based on preregistered criteria (performance not significantly above chance), leaving 20 participants in the full analysis, as planned. One further participant was removed from the pupillometry analyses due to a technical problem with the recording.
Apparatus and stimuli
Stimuli were presented on a 24-inch LCD monitor (BenQ) running at 60 Hz with a resolution of 1920 × 1080 pixels and a mean luminance of 45 cd/m^{2}. Stimulus generation and presentation were controlled by MATLAB (Mathworks) and the Psychophysics Toolbox^{47,48,49}, run on a Mac Mini (Apple Inc.). An EyeLink 1000 infrared monocular eye-tracker (SR Research Ltd., Ontario, Canada), running at 500 Hz on a dedicated PC, was used to monitor blinks and pupil dilation in the observer’s dominant eye. Observers viewed the monitor from a distance of 60 cm, with their head supported by a chin rest. Stimuli were oriented Gabor patches subtending 4° of visual angle, with a Michelson contrast of 0.2 and a spatial frequency of 2 cycles/degree. Gabors were embedded in spatially filtered noise with an amplitude spectrum of 1/f^{1.25} and a contrast of 0.15. The orientation of each Gabor was drawn from one of two von Mises distributions with concentration parameter κ = 0.5 and means of −45° and +45° from vertical (0°). An annular colour guide was drawn around each Gabor to aid participants in visualising these distributions: the red and blue RGB channels reflected the probability density of each angle in the two distributions, respectively^{22}. This colour guide was present throughout the trial, in addition to a black, circular fixation point subtending 0.3° at the centre of the screen. These distributions and example stimuli are shown in Fig. 1.
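For concreteness, stimulus orientations can be simulated by sampling the doubled angle from a von Mises distribution, since orientation is circular with a 180° period. A minimal sketch in Python (the original experiment used MATLAB and the Psychophysics Toolbox; the function name and seed are illustrative):

```python
import numpy as np

def sample_orientations(category, n, kappa=0.5, rng=None):
    """Draw n orientations (degrees from vertical) for category 1 or 2.
    Orientation has a 180-degree period, so we sample a von Mises
    variable on the doubled angle and halve it."""
    rng = np.random.default_rng() if rng is None else rng
    mu = -np.pi / 4 if category == 1 else np.pi / 4   # -45 or +45 deg, in radians
    doubled = rng.vonmises(2 * mu, kappa, size=n)     # von Mises on [-pi, pi)
    return np.degrees(doubled / 2)                    # orientations in (-90, 90]

thetas = sample_orientations(2, 10000, rng=np.random.default_rng(1))
# the circular mean on the doubled angle recovers the category mean (~ +45 deg)
d2 = np.radians(2 * thetas)
mean_deg = np.degrees(0.5 * np.arctan2(np.sin(d2).mean(), np.cos(d2).mean()))
```

With κ = 0.5 the distributions are broad, so single samples are only weakly diagnostic of the category, which is what makes evidence accumulation over samples necessary.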
Procedure
The task was a modified version of the weather prediction task^{21,22}. On each trial the observer was presented with a sequence of stimuli and was asked to guess which of the two categories (defined by the distributions of the orientations) the stimuli were drawn from. Observers were instructed to press the left arrow key for the category with mean of –45° and the right arrow key for the category with mean of +45°, which was described to the participants using the colour guide. The stimuli were presented at a rate of 4 Hz, with Gabors presented at maximum contrast for 150 ms, temporally bordered by a 25 ms cosine ramp, with a 50 ms interstimulus interval. This same basic procedure was used across three variants of the task (visually depicted in Fig. 1b), completed over two sessions of approximately one hour each. For each observer, 100 trials of 40 samples were predefined by randomly sampling from the orientation distributions (50 trials from each distribution) and saving the random number generator seeds for recreating the spatially filtered noise added to each Gabor. These 100 predefined trials were repeated over the experiment.
In the first session, observers completed the Stopping task. In this task, samples were presented to the observer until they entered their response (or until all 40 samples were shown). There were three response conditions, where observers were instructed to enter their response as soon as they thought they had reached a certain performance target (a 70%, 85%, or 90% chance of being correct). These conditions were completed in a random order over six blocks of 100 trials (two blocks of each condition). Before starting this task, observers completed a practice block, where they were first shown 20 trials of 4, 8, 12, or 16 samples with immediate feedback as to which distribution the orientations were actually drawn from. They were then asked to practise responding at each of the performance targets over 10 trials for each target. During this part of the practice, observers were given feedback as to their average percent correct performance over the 10 trials. In the Stopping task, participants were informed of their average performance over miniblocks of 20 trials, and were symbolically awarded 10 points for achieving the performance target over the 20 trials, or 5 points for coming within 5% of the target (for achieving 5% more or less than the target), but 0 points otherwise.
In the second session, participants completed two tasks: the Free task and the Replay task, in that order. The Free task was the same as the Stopping task except that participants were not given performance targets, but were instead asked to respond when they ‘felt ready’. There were three blocks of 100 trials in this task, and participants were not given any feedback as to their performance. In the Replay task, observers were shown a specific number of samples and could only respond after the sequence finished, which was signalled by the fixation point changing to red. After entering their response, observers were also required to give a confidence rating of how certain they were that they were correct on that trial, on a scale of 1 to 4. Unbeknownst to observers, the trials in the Replay task were actually designed based on the observers’ responses in the Free task. There were three conditions; in the Same condition, observers were shown the median number of samples that they chose to respond to over the three repeats of that exact same trial in the Free task. In the Less condition, observers were shown two fewer than the minimum number of samples they had chosen to respond to on that trial in the Free task, and in the More condition they were shown four more than the maximum number of samples. For example, if the observer chose to respond after 4, 5, and 10 samples on the three repeats of one predefined trial, they would be presented with 2, 5, and 14 samples in the Less, Same, and More conditions of the Replay task, respectively. These conditions were randomly intermixed over three blocks of 100 trials.
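The Less/Same/More sequence lengths follow deterministically from the three Free-task response counts for each predefined trial. A small sketch of that mapping (Python rather than the original MATLAB; clipping to the 1–40 sample range is our assumption, as the text does not state how out-of-range values were handled):

```python
import statistics

def replay_lengths(free_counts, max_samples=40):
    """Sequence lengths for the Less/Same/More Replay conditions from the
    three Free-task response counts on one predefined trial.
    Clipping to [1, max_samples] is an assumption."""
    return {
        "Less": max(1, min(free_counts) - 2),           # minimum minus two
        "Same": int(statistics.median(free_counts)),    # median of the three repeats
        "More": min(max_samples, max(free_counts) + 4), # maximum plus four
    }

lengths = replay_lengths([4, 5, 10])   # worked example from the text
```

For the worked example in the text (responses after 4, 5, and 10 samples), this yields 2, 5, and 14 samples for the Less, Same, and More conditions.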
Statistics and reproducibility
This manuscript presents a single experiment with 20 participants. This sample size was preregistered and allows the detection of a moderate effect size of 0.68 with a power of 0.8 at an alpha level of 0.05 for standard two-sided t-tests. The majority of statistical comparisons were performed within-subjects using non-parametric tests, making these analyses more conservative, but robust to deviations from normality, which cannot be reliably tested in small samples. All measurements were taken from distinct samples unless otherwise specified.
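As a rough check of the stated sensitivity, the power of a two-sided paired t-test can be approximated with a normal approximation (slightly optimistic relative to the exact noncentral t; this calculation is illustrative, not the authors' preregistered computation):

```python
import math

def approx_power_paired_t(d, n):
    """Normal approximation to the power of a two-sided paired t-test
    at alpha = 0.05; slightly optimistic vs. the exact noncentral t."""
    z_crit = 1.959964            # Phi^-1(0.975)
    nc = d * math.sqrt(n)        # noncentrality parameter
    phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    # probability of landing beyond either critical value under the alternative
    return phi(nc - z_crit) + phi(-nc - z_crit)

power = approx_power_paired_t(0.68, 20)
```

Under this approximation the power for d = 0.68 with n = 20 comes out slightly above 0.8, consistent with the preregistered claim.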
Behavioural analysis
Raw behavioural data were used to calculate proportion correct, sensitivity (d′), and the median sequence length (the number of samples observers saw) for each participant, for each experimental condition, and across experimental blocks. We present the average proportion correct across conditions in the Results. Non-parametric within-subject statistics were applied to sensitivity (d′) and to the median sequence length to examine differences in performance across conditions. We also present parametric confidence intervals on the proportion correct data. Throughout the analysis, Bonferroni corrections were applied to p-values less than 0.05 when more than one statistical test was carried out per hypothesis (indicated by bonf*[number of tests corrected for] in subscript), while non-significant p-values are reported uncorrected.
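Sensitivity (d′) for the two-category judgement can be computed from response counts in the usual signal-detection way. A hedged sketch in Python (the log-linear correction for extreme proportions is our assumption; the text does not state how ceiling rates were handled):

```python
from statistics import NormalDist

def dprime(hits, misses, fas, crs):
    """Type I sensitivity from a 2x2 response table; a log-linear
    correction (add 0.5 to each cell) avoids infinite z-scores when a
    rate is 0 or 1."""
    z = NormalDist().inv_cdf
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (fas + 0.5) / (fas + crs + 1.0)
    return z(hit_rate) - z(fa_rate)

d = dprime(40, 10, 15, 35)   # illustrative counts: moderate sensitivity
```

The criterion reported later (mean criterion = 0.03) is the companion statistic, −(z(hit rate) + z(false-alarm rate))/2, under the same convention.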
Pupillometry
Blinks were identified using the EyeLink automatic blink-detection algorithm, and pupil dilation was linearly interpolated using a 100 ms window before and after each blink. Additional outlying pupil dilation data points (greater than 8 SD from the mean) were also interpolated. Data were downsampled to 50 Hz by averaging over consecutive windows of 20 ms. Timepoints where the observer was not fixating within 200 pixels of fixation were tagged for exclusion, and trials were excluded from the analysis if they contained more than 100 ms of exclusory data, or if any data were more than 3 SD from the mean (on average, 2.6% of trials in the Stopping task and 5.3% of trials in the Replay task). Each participant’s pupil size data were z-scored, and epochs were taken relative to the start of each trial (0 to 3 s) and relative to the response (−6 to 2 s). Epochs were baselined by subtracting the group mean for each condition at the start of the epoch from each participant. Pupil change was calculated from the derivative of the pupil size, with some smoothing: taking the difference between the sums of five timepoints before and after each timepoint and dividing by five times the sample rate (50 Hz). Statistical inferences were performed using a cluster-based procedure^{30}. Significant clusters were found using Wilcoxon signed-rank tests at each timepoint (at a statistical threshold of p < 0.05) and comparing the sum of the z-scores in these clusters to those obtained over 3000 permutations. For tests against baseline in the response-aligned epochs, data were permuted by shuffling the response times within each condition. For tests between conditions, data were permuted by shuffling the condition labels of the trials.
Cluster-based corrections accounted for tests over multiple timepoints; additional Bonferroni corrections were applied to significant p-values when more than one set of tests was performed (e.g., totalling three cluster-level analyses, one for each condition, separately in the Stopping and Replay tasks). As a secondary check that the results were not influenced by the pupil response to stimulus onset, the analyses were also performed on data where the pupil impulse response function to stimulus onset was removed using an autoregressive model with exogenous inputs^{50}; these results can be found in Supplementary Note 6 and Supplementary Fig. 8.
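The smoothed derivative described above can be transcribed directly. A sketch in Python (the exact window alignment, i.e. whether the ‘after’ window starts at the current timepoint or the next one, is our assumption):

```python
import numpy as np

def smoothed_derivative(x, sr=50, k=5):
    """Smoothed pupil derivative: difference between the sums of the k
    timepoints after and before each timepoint, divided by k times the
    sample rate (here 50 Hz). Edge timepoints are left undefined."""
    x = np.asarray(x, float)
    d = np.full(x.shape, np.nan)
    for t in range(k, len(x) - k):
        after = x[t + 1 : t + 1 + k].sum()
        before = x[t - k : t].sum()
        d[t] = (after - before) / (k * sr)
    return d

ramp = 2.0 * np.arange(50)           # linear trend, 2 a.u. per sample
deriv = smoothed_derivative(ramp)    # constant in the interior for a ramp
```

On a linear ramp the estimator returns a constant in the interior, which is the sanity check that the windowed sums behave as a (scaled) first derivative.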
Computational modelling
A computational model was defined based on ref. ^{22}. The model takes the Bayesian optimal accumulation of sensory evidence in this task and disrupts this process with several sources of parametrically defined suboptimality. The Bayesian optimal observer is assumed to know the category means, \(\mu _1 = - \frac{\pi }{4},\mu _2 = \frac{\pi }{4}\), and the concentration, κ = 0.5, and takes the evidence in favour of category ψ (ψ = 1 or ψ = 2), based on the orientation presented on a specific trial for a specific sample n, as the probability of the orientation θ_{n} given each category:
where I_{0}(·) is the modified Bessel function of order 0. The optimal observer then chooses the category ψ with the greatest posterior probability over all samples in the trial, T (T varies from trial to trial), given a uniform category prior, \(p\left( \psi \right) \propto \frac{1}{2}\):
This is achieved by accumulating the log probabilities of each category, given each orientation presented in the sequence:
Given that the evidence for each category is perfectly anti-correlated over the stimulus orientations, the evidence from each sample can be summarised as:
and the optimal observer sums this evidence over all samples in the trial (T):
Such that the Bayesian optimal decision is 1 if z > 0 and 2 if z ≤ 0.
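Although the display equations themselves were lost in extraction, the optimal observer described above can be sketched from the stated ingredients: a von Mises likelihood on the doubled orientation angle (means ±π/4, κ = 0.5), per-sample log-likelihood ratios ℓ_n, and a sign rule on their sum z. A hedged Python sketch (the original model code was MATLAB):

```python
import numpy as np

KAPPA = 0.5
MU = {1: -np.pi / 4, 2: np.pi / 4}   # category means, radians from vertical

def log_vonmises(theta, psi, kappa=KAPPA):
    """Log-likelihood of orientation theta (radians) under category psi;
    the doubled angle handles the 180-degree periodicity of orientation."""
    return kappa * np.cos(2 * (theta - MU[psi])) - np.log(np.pi * np.i0(kappa))

def optimal_decision(thetas):
    """Accumulate the per-sample log-likelihood ratios l_n and decide by
    the sign of their sum z (category 1 if z > 0, else category 2)."""
    l_n = log_vonmises(thetas, 1) - log_vonmises(thetas, 2)
    z = l_n.sum()                    # accumulated relative evidence
    return (1 if z > 0 else 2), z

choice, z = optimal_decision(np.radians([-50.0, -40.0, 10.0]))
```

With these particular means, the log-likelihood ratio simplifies analytically to −2κ sin(2θ_n), so orientations tilted toward −45° contribute positive evidence for category 1.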
This optimal decision-making was disrupted by several sources of suboptimality in order to account for each observer’s behaviour. First, variability is added to the evidence accumulation process, such that independent and identically distributed (i.i.d.) noise, ε_{n}, is added to each evidence sample. The noise is Gaussian distributed with zero mean, with the degree of variability parameterised by σ, the standard deviation:
This noise represents inference noise, as it is added to the decision update rather than to the representation of the stimulus orientation. Some contribution of sensory noise, whereby the representation of the stimulus orientation does not veridically match the sensory input, cannot be excluded. However, previous evidence^{22} suggests that the contribution of sensory noise in this task is minimal (only very large values of sensory noise would contribute significantly to decision variability), so no sensory noise parameter was explicitly fitted in the reported analyses.
Second, the suboptimal observer does not accumulate evidence perfectly. Functionally, during accumulation, the current accumulated evidence is weighted by α before the next sample is accumulated, so that when α > 1 this creates a primacy effect, and later evidence affects the decision less than the initial evidence. In contrast, when α < 1 this creates a recency effect, and the observer’s decision places greater weight on more recent evidence than on the initial samples. Thus, by the end of the sequence, the weight on each sample n is equal to:
where T is the total number of samples in that trial and n ∈ [1, T]. Altogether, the suboptimal accumulation of decision evidence takes the following form:
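The two suboptimalities combine into a leaky, noisy accumulator in which each sample n carries an end-of-sequence weight α^{T−n}. A minimal sketch (Python; parameter values are illustrative):

```python
import numpy as np

def suboptimal_z(l, sigma, alpha, rng):
    """Noisy, leaky accumulation: i.i.d. Gaussian inference noise is added
    to each update, and the running total is weighted by alpha before each
    new sample, so sample n ends the sequence with weight alpha**(T - n)."""
    z = 0.0
    for l_n in l:
        z = alpha * z + l_n + rng.normal(0.0, sigma)
    return z

# with alpha < 1 (recency), early samples are down-weighted by the end:
alpha, T = 0.8, 4
weights = alpha ** (T - np.arange(1, T + 1))   # weight on samples n = 1..T
```

Setting σ = 0 and feeding a unit impulse at the first sample recovers the weight α^{T−1} exactly, which is a convenient unit check on the recursion.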
Several additional parameters were necessary to describe when observers would respond. The optimal observer makes a decision as soon as the relative decision evidence, given the sequence length (n), has crossed a decision boundary, Λ. In order to maintain a constant likelihood of a correct response (as required by the task), this bound was found to decrease as sequence length increases (such that the bound represents a constant bound on proportion correct over sequence length; further details in Supplementary Note 3 and Supplementary Fig. 4):
for the positive decision bound (the negative bound is Λ_{n−} = −Λ_{n+}). The likelihood f(n) of responding at sample n was estimated by computing the frequencies, over 1000 samples of ε_{n} (Monte Carlo simulation), of the first times at which the following inequality is verified:
As we do not have access to when the decision is made, only to when the response is entered, two additional parameters are used to describe the mean, μ_{U}, and variance, \(\sigma _U^2\), of the non-decision time, which is assumed to be i.i.d. across trials and Gaussian distributed. Thus, the likelihood of responding over all samples n is calculated as:
where f(n) is the likelihood of responding at sample n, as in Eq. 10, f′(n) is the modified likelihood that takes into account a smoothing of the choices in time, and \(g\left( {n;\mu _U,\sigma _U^2} \right)\) is the Gaussian kernel with mean μ_{U} and variance \(\sigma _U^2\) applied at sample n.
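The Monte Carlo estimation of f(n) and the Gaussian non-decision smoothing can be sketched as follows. Since the bound equation is not reproduced here, the decreasing bound is passed in as an arbitrary vector, and runs that never cross are counted at the final sample; both are simplifying assumptions of this sketch:

```python
import numpy as np

def response_likelihood(l, bound, sigma, mu_u, sigma_u, n_mc, rng):
    """Monte Carlo estimate of f(n), the frequency with which noisy
    accumulated evidence first satisfies |z_n| >= bound[n], followed by
    discrete Gaussian smoothing over samples for the non-decision time."""
    T = len(l)
    z = np.cumsum(l + rng.normal(0.0, sigma, size=(n_mc, T)), axis=1)
    crossed = np.abs(z) >= bound                 # bound: length-T vector
    # index of the first crossing; never-crossing runs counted at sample T-1
    first = np.where(crossed.any(axis=1), crossed.argmax(axis=1), T - 1)
    f = np.bincount(first, minlength=T) / n_mc   # first-crossing frequencies
    # f'(n) = sum_m f(m) g(n - m; mu_u, sigma_u), with columns normalised
    n = np.arange(T)
    g = np.exp(-0.5 * ((n[:, None] - n[None, :] - mu_u) / sigma_u) ** 2)
    g /= g.sum(axis=0, keepdims=True)
    return g @ f

rng = np.random.default_rng(0)
fp = response_likelihood(np.full(10, 1.0), np.full(10, 0.5),
                         0.1, 2.0, 0.5, 2000, rng)
```

With strong evidence and a low bound, first crossings pile up at the first sample, so the smoothed likelihood peaks μ_{U} samples later, which illustrates how the non-decision kernel shifts the predicted response times.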
Model fitting
First, the full model was fit to Type I behaviour in each task and condition separately. Responses in the Stopping task and the Free task were modelled by optimising parameters to minimise the negative log-likelihood of the observer making response r at sample n on each trial, using Bayesian Adaptive Direct Search^{51}. As there is no known analytic solution to the likelihood function of the model, the probability of the observer making each response at each sample, given the parameters, was numerically estimated using Monte Carlo simulation. The sensitivity of this approach was tested using parameter recovery: simulating 300 trials, we found a significant correlation between the simulated and recovered parameters (using Spearman’s correlation, all p < 10^{−5}; more details in Supplementary Note 4 and Supplementary Fig. 6). The numerical estimation approach was also applied in the Replay task for consistency, even though the model does have an analytic solution when only the Type I discrimination response is fit.
The full model was then simplified using a knock-out procedure, comparing the Bayesian Information Criterion (BIC) of the full model with the BIC of models with each parameter fixed, in turn, to a neutral value, using Bayesian Model Selection (implemented in SPM12^{52,53}), for each condition of each task. The full model contained response bias and lapse rate parameters (not described above) that could be removed, as they did not significantly improve the fit (the exceedance probability of the model with response bias = 0 was xp = 0.93; and fixing lapse = 0.001, xp > 0.99). There was also little evidence for response bias in the behavioural data (mean criterion = 0.03 [±0.05]). Thus, the final Type I model contained seven parameters for fitting both what and when observers responded, and just two parameters for fitting only the categorisation response. Next, we examined whether any of the parameters systematically varied across conditions within tasks. For the Replay task, this is described in the Results. For the Stopping task, this was used to assess which parameters observers were adjusting to control their bounds. The only parameter to vary significantly across conditions in the Stopping task was λ (Kruskal–Wallis test, χ^{2} = 14.34, p_{bonf*7} = 0.006), with all other p > 0.1 (uncorrected; in particular, b, the other important bound parameter, showed no evidence of adjustment across conditions: χ^{2} = 1.51, p = 0.47). We then fit all conditions of the Stopping task together, with three λ parameters, one for each condition.
Estimating optimal bounds
In order to calculate bound efficiency, observers’ actual bound separation was divided by the optimal bound separation. The optimal bound separation was estimated by simulating observers’ performance across all bounds and taking the bounds that produced the target performance. Performance was simulated by fixing all parameters except λ (the only parameter found to systematically adapt across target performance conditions) and producing responses to the orientations shown to observers over 1000 samples of noise from each observer’s ε_{n}. Bound efficiency was then calculated as the actual difference in observers’ λ between the 70% and 85% target performance conditions, divided by the simulated difference in λ that would have achieved 70% and 85% correct, leaving all other parameters the same. This means that bound efficiency corresponds to the observer’s ability to appropriately adjust their bound, irrespective of any absolute bias to set bounds too high or too low. The data from the 90% correct condition were not used in this calculation, as the model predicted that 13 observers would never have reached 90% correct, and indeed, no observer actually reached 90% correct overall in any task of the experiment.
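The bound-efficiency ratio reduces to a single line; the numbers below are hypothetical, chosen only to illustrate the computation:

```python
def bound_efficiency(lam_70, lam_85, lam_70_opt, lam_85_opt):
    """Observer's fitted lambda adjustment between the 70% and 85% target
    conditions, relative to the simulated optimal adjustment; 1 means a
    perfectly scaled adjustment, 0 means no adjustment at all."""
    return (lam_85 - lam_70) / (lam_85_opt - lam_70_opt)

# hypothetical values: the observer shifted lambda by 0.6 where 0.8 was optimal
eff = bound_efficiency(2.0, 2.6, 1.9, 2.7)
```

Because both numerator and denominator are differences, any constant offset in an observer's absolute bound placement cancels, which is exactly the property the text emphasises.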
Fitting Type II responses
To fit Type II confidence ratings, additional criteria were required to partition the evidence for each confidence rating. We examined the absolute evidence, \(\mathop {\sum}\nolimits_{n = 1}^N {\left( {\ell _n + \varepsilon _n} \right) \cdot v_n} \), that observers were exposed to in the Replay task, as a function of sequence length, for each confidence rating. We found that the evidence for each confidence rating tended to follow the same function as the ideal bound on the Type I decision (Eq. 9; see Supplementary Note 3 and Supplementary Fig. 3 for more details). Therefore, Type II responses were modelled by implementing three bounds (Λ_{1}, Λ_{2}, Λ_{3}) as the upper limits on the evidence for each confidence rating (with the highest confidence rating having no upper limit). The three bounds were modelled with the same a and b, but different λ. Model fitting was performed using the same method as for the Type I behaviour, except that both the Type I and the Type II responses were fit, such that the model would respond [Type I, Type II]:
First, a model was fit using the same z for Type I and Type II responses. Entirely separate parameters (leading to independent z for Type I and Type II responses) were fit in the parallel model, where the Type I parameters were fixed to those fit to only the Type I responses. In partially correlated models, some parameters for the Type II z were fixed to be the same as those affecting the Type I z. These models compared all combinations of fixed/varied noise and leak, and compared whether additional noise was added with each sample of evidence, or a single sample of noise irrespective of sequence length. Model comparison showed that a partially correlated model, in which the Type II z is affected by additional noise and a different leak, best accounted for Type I and Type II responses: the exceedance probability of this model over the model fit using the same z for Type I and Type II responses was xp > 0.99; the exceedance probability over the parallel model was xp > 0.999; and the exceedance probability over the next best partial model (a model with leak fixed to the Type I leak) was xp = 0.54. We then compared models in which the observer accumulates Type II evidence over all samples with models implementing a bound on Type II evidence accumulation (either the same bound as the Type I bound, or an independent bound). There was no evidence for a Type II accumulation bound (Fig. 4b); the Type II z accumulated evidence across all presented samples.
If the λ of a higher confidence bound was smaller than the λ of a lower confidence bound, this resulted in negative likelihoods (as it is paradoxical to require less evidence for higher confidence), and the model would sometimes become stuck in a local minimum. We therefore implemented plausible lower and upper bounds on the parameters, based on initial fits to participants for whom the model was able to apply the bounds successfully. These plausible bounds are used by Bayesian Adaptive Direct Search to design the initial mesh of the parameter search, and by specifying increasing but overlapping plausible bounds on the λ’s, the model was able to recover sensible parameters for all participants, while not limiting its ability to describe the behaviour of some of the more extreme participants.
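The partition of Type II evidence into ratings amounts to comparing the absolute evidence against the three increasing bounds; a sketch (the bound values are hypothetical, and whether the upper limit is inclusive is our assumption):

```python
import numpy as np

def confidence_rating(z2, bounds):
    """Map absolute Type II evidence to a 1-4 rating given the three
    increasing upper limits (Lambda_1 < Lambda_2 < Lambda_3) on the
    evidence for ratings 1-3; rating 4 has no upper limit."""
    # side='left' makes each upper limit inclusive for its rating
    return int(np.searchsorted(bounds, abs(z2))) + 1

bounds = [0.5, 1.2, 2.0]   # hypothetical fitted bounds
ratings = [confidence_rating(z, bounds) for z in (0.3, -0.9, 1.5, 4.0)]
```

The monotonicity constraint discussed above corresponds to requiring that the `bounds` vector stays sorted, which is what the overlapping plausible bounds on the λ's enforce during fitting.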
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
Data can be downloaded from a fork of the preregistration on Open Science Framework (https://osf.io/c9xfr/).
Code availability
Model fitting code can be downloaded from a fork of the preregistration on Open Science Framework (https://osf.io/s6zfb/).
References
1. Helmholtz, H. L. F. Treatise on Physiological Optics (Thoemmes Continuum, Bristol, 1856).
2. Galvin, S. J., Podd, J. V., Drga, V. & Whitmore, J. Type 2 tasks in the theory of signal detectability: discrimination between correct and incorrect decisions. Psychonomic Bull. Rev. 10, 843–876 (2003).
3. Mamassian, P. Visual confidence. Annu. Rev. Vis. Sci. 2, 459–481 (2016).
4. Stone, M. Models for choice-reaction time. Psychometrika 25, 251–260 (1960).
5. LaBerge, D. A recruitment theory of simple behavior. Psychometrika 27, 375–396 (1962).
6. Vickers, D. Evidence for an accumulator model of psychophysical discrimination. Ergonomics 13, 37–58 (1970).
7. Ratcliff, R. A theory of memory retrieval. Psychol. Rev. 85, 59–108 (1978).
8. Pleskac, T. J. & Busemeyer, J. R. Two-stage dynamic signal detection: a theory of choice, decision time, and confidence. Psychol. Rev. 117, 864 (2010).
9. Zylberberg, A., Roelfsema, P. R. & Sigman, M. Variance misperception explains illusions of confidence in simple perceptual decisions. Conscious. Cogn. 27, 246–253 (2014).
10. De Gardelle, V. & Mamassian, P. Weighting mean and variability during confidence judgments. PLoS ONE 10, e0120870 (2015).
11. Castañón, S. H. et al. Human noise blindness drives suboptimal cognitive inference. Nat. Commun. 10, 1719 (2019).
12. Barthelmé, S. & Mamassian, P. Evaluation of objective uncertainty in the visual system. PLoS Comput. Biol. 5, e1000504 (2009).
13. Zylberberg, A., Barttfeld, P. & Sigman, M. The construction of confidence in a perceptual decision. Front. Integr. Neurosci. 6, 79 (2012).
14. Aitchison, L., Bang, D., Bahrami, B. & Latham, P. E. Doubly Bayesian analysis of confidence in perceptual decision-making. PLoS Comput. Biol. 11, e1004519 (2015).
15. Li, H. H. & Ma, W. J. Confidence reports in decision-making with multiple alternatives violate the Bayesian confidence hypothesis. Preprint at https://www.biorxiv.org/content/10.1101/583963v1.full, https://doi.org/10.1038/s41467-020-15581-6.
16. Baranski, J. V. & Petrusic, W. M. Probing the locus of confidence judgments: experiments on the time to determine confidence. J. Exp. Psychol. Hum. Percept. Perform. 24, 929 (1998).
17. Rabbitt, P. & Vyas, S. Processing a display even after you make a response to it. How perceptual errors can be corrected. Q. J. Exp. Psychol. 33, 223–239 (1981).
18. Yeung, N. & Summerfield, C. in The Cognitive Neuroscience of Metacognition (eds Fleming, S. M. & Frith, C. D.) 147–167 (Springer, Berlin, Heidelberg, 2014).
19. Charles, L. & Yeung, N. Dynamic sources of evidence supporting confidence judgments and error detection. J. Exp. Psychol. Hum. Percept. Perform. 45, 39 (2019).
20. Resulaj, A., Kiani, R., Wolpert, D. M. & Shadlen, M. N. Changes of mind in decision-making. Nature 461, 263–266 (2009).
21. Knowlton, B. J., Mangels, J. A. & Squire, L. R. A neostriatal habit learning system in humans. Science 273, 1399–1402 (1996).
22. Drugowitsch, J., Wyart, V., Devauchelle, A. D. & Koechlin, E. Computational precision of mental inference as critical source of human choice suboptimality. Neuron 92, 1398–1411 (2016).
23. Murphy, P. R., Boonstra, E. & Nieuwenhuis, S. Global gain modulation generates time-dependent urgency during perceptual choice in humans. Nat. Commun. 7, 13526 (2016).
24. Lempert, K. M., Chen, Y. L. & Fleming, S. M. Relating pupil dilation and metacognitive confidence during auditory decision-making. PLoS ONE 10, e0126588 (2015).
25. Allen, M. et al. Unexpected arousal modulates the influence of sensory noise on confidence. eLife 5, e18103 (2016).
26. Aston-Jones, G. & Cohen, J. D. An integrative theory of locus coeruleus-norepinephrine function: adaptive gain and optimal performance. Annu. Rev. Neurosci. 28, 403–450 (2005).
27. Laeng, B., Sirois, S. & Gredebäck, G. Pupillometry: a window to the preconscious? Perspect. Psychol. Sci. 7, 18–27 (2012).
28. Fleming, S. M. & Lau, H. C. How to measure metacognition. Front. Hum. Neurosci. 8, 1–9 (2014).
29. Zhang, J., Bogacz, R. & Holmes, P. A comparison of bounded diffusion models for choice in time controlled tasks. J. Math. Psychol. 53, 231–241 (2009).
30. Maris, E. & Oostenveld, R. Nonparametric statistical testing of EEG- and MEG-data. J. Neurosci. Methods 164, 177–190 (2007).
31. Maniscalco, B. & Lau, H. The signal processing architecture underlying subjective reports of sensory awareness. Neurosci. Conscious. 2016, https://doi.org/10.1093/nc/niw002 (2016).
32. Barrett, A. B., Dienes, Z. & Seth, A. K. Measures of metacognition on signal-detection theoretic models. Psychol. Methods 18, 535–552 (2013).
33. Clayton, E. C., Rajkowski, J., Cohen, J. D. & Aston-Jones, G. Phasic activation of monkey locus coeruleus neurons by simple decisions in a forced choice task. J. Neurosci. 24, 9914–9920 (2004).
34. Einhäuser, W., Koch, C. & Carter, O. Pupil dilation betrays the timing of decisions. Front. Hum. Neurosci. 4, 18 (2010).
35. Satterthwaite, T. D. et al. Dissociable but inter-related systems of cognitive control and reward during decision making: evidence from pupillometry and event-related fMRI. Neuroimage 37, 1017–1031 (2007).
36. Preuschoff, K., ’t Hart, B. M. & Einhäuser, W. Pupil dilation signals surprise: evidence for noradrenaline’s role in decision making. Front. Neurosci. 5, 115 (2011).
37. Rabbitt, P. M. A. Errors and error correction in choice-response tasks. J. Exp. Psychol. 71, 264–272 (1966).
38. Botvinick, M. M., Braver, T. S., Carter, C. S., Barch, D. M. & Cohen, J. D. Conflict monitoring and cognitive control. Psychol. Rev. 108, 624–652 (2001).
39. Notebaert, W. et al. Post-error slowing: an orienting account. Cognition 111, 275–279 (2009).
40. Desender, K., Boldt, A. & Yeung, N. Subjective confidence predicts information seeking in decision making. Psychol. Sci. 29, 761–778 (2018).
41. Whiteley, L. & Sahani, M. Implicit knowledge of visual uncertainty guides decisions with asymmetric outcomes. J. Vis. 8, 2 (2008).
42. van Veen, V., Krug, M. K. & Carter, C. S. The neural and computational basis of controlled speed–accuracy tradeoff during task performance. J. Cogn. Neurosci. 20, 1952–1965 (2008).
43. Ivanoff, J., Branning, P. & Marois, R. fMRI evidence for a dual process account of the speed–accuracy tradeoff in decision-making. PLoS ONE 3, e2635 (2008).
44. Bogacz, R., Wagenmakers, E. J., Forstmann, B. U. & Nieuwenhuis, S. The neural basis of the speed–accuracy tradeoff. Trends Neurosci. 33, 10–16 (2010).
45. Fleck, M. S., Daselaar, S. M., Dobbins, I. G. & Cabeza, R. Role of prefrontal and anterior cingulate regions in decision-making processes shared by memory and nonmemory tasks. Cereb. Cortex 16, 1623–1630 (2006).
46. Fleming, S. M. & Dolan, R. J. The neural basis of metacognitive ability. Philos. Trans. R. Soc. B 367, 1338–1349 (2012).
47. Brainard, D. H. The Psychophysics Toolbox. Spat. Vis. 10, 433–436 (1997).
48. Pelli, D. G. The VideoToolbox software for visual psychophysics: transforming numbers into movies. Spat. Vis. 10, 437–442 (1997).
49. Kleiner, M. et al. What’s new in Psychtoolbox-3. Perception 36, 1 (2007).
50. Zénon, A. Time-domain analysis for extracting fast-paced pupil responses. Sci. Rep. 7, 41484, https://doi.org/10.1038/srep41484 (2017).
51. Acerbi, L. & Ma, W. J. Practical Bayesian optimization for model fitting with Bayesian adaptive direct search. In Advances in Neural Information Processing Systems (eds Guyon, I. et al.) 1836–1846 (2017).
52. Stephan, K. E., Penny, W. D., Daunizeau, J., Moran, R. J. & Friston, K. J. Bayesian model selection for group studies. NeuroImage 46, 1004–1017 (2009).
53. Rigoux, L., Stephan, K. E., Friston, K. J. & Daunizeau, J. Bayesian model selection for group studies revisited. NeuroImage 84, 971–985 (2014).
Acknowledgements
This project was supported by funding from Labex (ANR-10-LABX-0087 IEC), INSERM (Inserm U960), the CNRS (CNRS UMR 8248), and in part by the ANR-18-CE28-0015 grant ‘VICONTE’.
Author information
Contributions
T.B. contributed to experimental design, preregistration, data collection, analysis, and wrote the manuscript. V.W. contributed to experimental design, preregistration, analysis, writing of the manuscript, and sourcing funding and provision of materials. P.M. contributed to experimental design, preregistration, analysis, writing of the manuscript, and sourcing funding and provision of materials.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Communications thanks Benedetto De Martino, Brian Maniscalco and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Balsdon, T., Wyart, V. & Mamassian, P. Confidence controls perceptual evidence accumulation. Nat. Commun. 11, 1753 (2020). https://doi.org/10.1038/s41467-020-15561-w