Introduction

Metacognition, the ability to internally evaluate our cognitive processes, is critical for adaptive behavioral control, particularly as many real-life decisions lack immediate feedback. Specifically, action outcomes can be ambiguous1, delayed, occur only after a sequence of subsequent decisions2, or might never occur at all3,4. Yet behavioral and neural evidence indicates that subjects are able to evaluate their choices online in the absence of immediate feedback, forming estimates of decision confidence5,6 and detecting and correcting response errors7,8. Previous research on metacognition in both humans and animals has focused on mechanisms supporting “local” decision confidence, elicited at or around the time of a particular decision9,10. The formation of decision confidence is informed by stimulus evidence11,12,13,14, reaction times15,16,17, and the integration of post-decision evidence18. Theoretically, decision confidence is proposed to correspond to the probability that a choice was correct19, and, empirically, confidence computations are thought to depend on a network of prefrontal and parietal brain areas5,15,20,21.

Despite this intensive focus on the construction of “local” confidence at the level of individual decisions, it remains unclear whether and how local confidence estimates might be aggregated over time to form “global” self-performance estimates (SPEs). Global beliefs about our abilities play an important role in shaping our behavior22, determining the goals we choose to pursue, and the motivation and effort we put into our endeavors23,24. Put simply, if we believe we are unable to succeed at a particular task, we may be unlikely to try in the first place. In certain situations, such beliefs may exert stronger influences on our behavior than objective performance25,26, and distortions in global self-evaluation have been associated with various psychiatric symptoms27,28.

However, despite their widespread behavioral influence, little is known about the mechanisms supporting the formation of global SPEs on a given task. It is likely that global SPEs incorporate external feedback when it is available. For instance, when choosing our next career move, we may learn about our self-competence over multiple evaluations of performance (such as formal appraisals), and accumulate these local evaluations into coherent global beliefs. Critically, however, when external feedback is absent, it may prove adaptive to use decision confidence as a proxy for success, aggregating local confidence estimates over longer timescales to form global SPEs.

Here, we developed a paradigm to investigate how external feedback and local decision confidence relate to global SPEs, and whether local fluctuations in decision confidence inform SPEs when external feedback is unavailable. In three experiments, human subjects played short interleaved tasks and were subsequently asked to choose the task on which they thought they had performed best. These task choices provided a simple behavioral assay of global SPEs. Strikingly, subjects pervasively underestimated their performance in the absence of feedback compared with a condition with full feedback, despite objective performance being similar in the two cases. Moreover, we observed that local decision confidence influenced SPEs over and above accuracy and reaction times. Our findings create a bridge between local confidence signals and global SPEs, and support a functional role for confidence in higher-order behavioral control.

Results

Experimental design

In Experiment 1, conducted online, human subjects (N = 29) performed short learning blocks (24 trials) featuring random alternation of two tasks, which were signaled by two arbitrary color cues (Fig. 1a). Each task required a perceptual discrimination as to which of two boxes contained a higher number of dots (Fig. 1c). Two factors controlled the characteristics of each task: the task was either easy or difficult (according to the dot difference between boxes), and subjects received either veridical feedback (correct, incorrect) or no feedback following each choice (Fig. 1b). This factorial design resulted in six possible task pairings within a learning block; for instance, an Easy-Feedback task could be paired with a Difficult-Feedback task, or a Difficult-Feedback task could be paired with a Difficult-No-Feedback task, and so forth (Fig. 1a).

Behavioral results

We initially examined how our experimental factors affected subjects’ performance on the tasks. A 2 × 2 ANOVA on performance revealed a main effect of Difficulty (F(1,28) = 292.8, p < 10^-15), but no main effect of Feedback (F(1,28) = 0.02, p = 0.90) and no interaction (F(1,28) = 0.44, p = 0.51). In particular, subjects’ performance averaged 67 and 85% correct in the difficult and easy tasks, respectively (difference: t28 = −17.02, p < 10^-15); this difference in performance between difficulty levels was also present for every subject individually. Critically, within each of the two difficulty levels, objective performance was similar in the presence and absence of feedback (both difficulty levels t28 < 0.58, p > 0.57, BF < 0.22; substantial evidence for the null hypothesis), indicating that we were able to examine how feedback affects SPEs irrespective of variations in performance. A similar pattern was observed for reaction times (RTs) (main effect of Difficulty, F(1,28) = 23.87, p < 10^-4; no main effect of Feedback, F(1,28) = 0.16, p = 0.69, and no interaction, F(1,28) = 0.08, p = 0.78). RTs were significantly faster in easy (mean = 672 ms) as compared with difficult tasks (mean = 707 ms) (t28 = 4.88, p < 10^-4); this difference in RTs was observed in 24 out of 29 subjects. Likewise, within each of the two difficulty levels, RTs were similar in the presence and absence of feedback (both difficulty levels t28 < 0.41, p > 0.68; BF < 0.21, i.e., substantial evidence for the null hypothesis), allowing us to examine how feedback affects SPEs independently of variations in both objective performance and RTs (Fig. 2a).
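
The Bayes factors supporting the null here (e.g., BF < 0.22) are consistent with the unit-information (BIC) approximation for a paired t-test (Wagenmakers, 2007), BF01 = sqrt(n) * (1 + t^2/(n − 1))^(−n/2). A minimal sketch, assuming this particular Bayes factor variant (the paper's exact computation may differ):

```python
import math

def bf01_from_t(t, n):
    """Unit-information (BIC) Bayes factor in favour of the null for a
    one-sample/paired t-test: BF01 = sqrt(n) * (1 + t^2/(n-1))^(-n/2)."""
    return math.sqrt(n) * (1.0 + t * t / (n - 1)) ** (-n / 2.0)

# Feedback vs. no-feedback accuracy contrast reported above: t(28) = 0.58, N = 29
bf01 = bf01_from_t(0.58, 29)
bf10 = 1.0 / bf01
print(f"BF01 = {bf01:.2f}, BF10 = {bf10:.2f}")  # BF10 ~ 0.22: substantial evidence for the null
```

Plugging in the reported t(28) = 0.58 recovers the quoted bound of BF < 0.22 in favour of the alternative.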

Likewise, subjective ratings of overall performance were greater on easy tasks compared with difficult tasks, and again, despite task performance remaining stable in the presence and absence of feedback, performance was rated as worse in the absence of feedback (Fig. 2c). A 2 × 2 ANOVA revealed a significant influence of Feedback (F(1,28) = 32.1, p < 10^-5) and Difficulty (F(1,28) = 51.1, p < 10^-7) on subjective ratings together with a significant interaction (F(1,28) = 10.5, p = 0.003), indicating that tasks were rated higher in the presence of feedback and even more so for easy tasks. We note that the presence and direction of interactions between factors predicting SPEs differed across task choices and ratings (and also across experiments, see below), which may be due to the boundedness of the rating scale creating ceiling/floor effects that do not affect task choices. Importantly, however, the main effects of our Feedback and Difficulty manipulations reliably and consistently impacted SPEs across all measures.

Despite task choices being slightly more extreme than task ability ratings, their patterns were notably similar, with identical direction of effects in all six task pairings (Fig. 3a, b). Moreover, subjects rated chosen tasks more highly than unchosen tasks in 72% of the blocks, which reveals a high level of consistency between our two proxies for global SPEs (rating chosen vs. unchosen task: t28 = 6.92, p < 10^-6) (Fig. 3c). Accordingly, a logistic regression showed that the difference in task ratings strongly predicted task choice (β = 0.24, p < 10^-15, R² = 0.41), again indicating consistency across our two ways of operationalizing SPEs. Taken together, the results of Experiment 1 indicate that participants are sensitive to changes in task difficulty when constructing SPEs, and that self-performance is systematically underestimated in the absence of feedback as compared with when feedback is available, despite objective performance remaining stable.

Learning dynamics

We next sought to replicate these effects in an independent data set while additionally investigating the dynamics of SPE formation. In Experiment 2 (N = 29 new subjects), we varied the duration of blocks from 2 to 10 trials per task to ask how the amount of experience with each task influenced SPEs (Fig. 4a). Since task ratings followed a similar pattern to task choices in Experiment 1 (Fig. 2b, c and Fig. 3), they were omitted in Experiment 2 (see Methods). Replicating Experiment 1, a 2 × 2 ANOVA indicated a main effect of Feedback (F(1,28) = 44.5, p < 10^-6) and Difficulty (F(1,28) = 73.8, p < 10^-8) on task choice in the absence of an interaction (F(1,28) = 0.32, p = 0.57). In particular, we again found that in the absence of feedback, tasks were chosen less often (Fig. 4b and Supplementary Fig. 2b), despite objective performance remaining similar with and without feedback (for both difficulty levels: both t28 < 0.59, both p > 0.56, both BF < 0.218; substantial evidence for the null hypothesis) (Supplementary Fig. 2a and Supplementary Notes).

A potential explanation of this last observation is that subjects prefer to gamble on tasks on which they are informed about their performance, due to a value-of-information effect. We consider this explanation less likely than a true decrease in SPE in the absence of feedback (see Discussion), because the effect of feedback differentially affected easy/difficult task pairings, despite these two blocks being strictly equivalent in terms of information gain (Fig. 2b, c). Importantly, these differences in SPE were not confounded by trial number: blocks with a larger difference in performance between tasks did not have a greater number of trials than blocks with a smaller difference in performance (the only significant effect went in the opposite direction: for the Feedback-Difficult vs. No-Feedback-Difficult pairing, blocks with a smaller difference in performance had a greater number of trials, t28 = −2.43, p = 0.02; for all other pairings, all t28 < 0.65, p > 0.52). Taken together, these findings indicate that subjects consider local fluctuations in their accuracy when choosing between tasks.

To establish how SPEs emerge over the course of a block, we next unpacked data according to learning block duration. Overall, block duration had a small influence on task choice frequencies (Fig. 4c). Using logistic regression (see Methods), we found that block duration significantly influenced task choice in some of the task pairings (Feedback-Easy vs. Feedback-Difficult, No-Feedback-Easy vs. Feedback-Difficult and Feedback-Easy vs. No-Feedback-Difficult; all p < 0.01), but not others (Feedback-Difficult vs. No-Feedback-Difficult, Feedback-Easy vs. No-Feedback-Easy and No-Feedback-Easy vs. No-Feedback-Difficult; all p > 0.47) (see Methods). When feedback was given on both tasks, block duration significantly influenced task choice (regression coefficient: p = 0.005), such that subjects became better at discriminating between easy and difficult tasks as more feedback was accrued (Fig. 4c, first panel). In contrast, when feedback was absent for both tasks, there was no influence of block duration (regression coefficient: p = 0.69; Fig. 4c, last panel).
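
The duration analysis can be illustrated with a simple logistic regression of end-of-block task choice on trials per task. The sketch below uses simulated blocks; the generating slope, sample size, and variable names are assumptions for illustration, not the study's data or analysis code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical block-level data: x = trials per task, y = 1 if the easier
# task was chosen. The generating slope (0.15) is an assumption.
durations = rng.choice([2, 4, 6, 8, 10], size=1000).astype(float)
p_true = 1 / (1 + np.exp(-(0.2 + 0.15 * durations)))
chose_easy = rng.binomial(1, p_true).astype(float)

def fit_logistic(x, y, iters=25):
    """Two-parameter logistic regression fit by Newton-Raphson (IRLS)."""
    X = np.column_stack([np.ones_like(x), x])
    w = np.zeros(2)
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ w))
        grad = X.T @ (y - p)                    # score vector
        H = X.T @ (X * (p * (1 - p))[:, None])  # observed information
        w += np.linalg.solve(H, grad)
    return w

intercept, slope = fit_logistic(durations, chose_easy)
print(f"intercept = {intercept:.3f}, duration slope = {slope:.3f}")
```

With a positive generating slope, the fitted duration coefficient comes out positive, mirroring the improved easy/difficult discrimination with longer feedback blocks.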

Computational modeling

We next considered a candidate hierarchical learning model of how SPEs are constructed over the course of learning in order to explain end-of-block task choices (Fig. 3c and Supplementary Fig. 4a). The model is composed of two hierarchical levels: a perceptual module generating a perceptual choice and confidence on each trial, and a learning module updating global SPEs across trials from local decision confidence and feedback, which are then used to make task choices at the end of blocks (Supplementary Methods). An interesting property naturally emerging from this model is that over the course of trials, posterior distributions over SPEs become narrower around expected performance slightly more rapidly with feedback than without (Supplementary Fig. 4b). Model simulations provided a proof of principle that such a learning scheme is able to accommodate qualitative features of participants’ learned SPEs (Supplementary Fig. 4c and Supplementary Notes). In particular, (1) the model ascribed higher SPEs to easy tasks than difficult tasks and (2) the presence of feedback led to higher SPEs than the absence of feedback, even at the expense of objective performance (Supplementary Fig. 4c, third panel). We found that the extent to which a No-Feedback-Easy task was chosen over a Feedback-Difficult task correlated across individuals with the fitted kconf parameter (which captures each subject’s sensitivity to the input when making confidence judgments, allowing this to differ from their sensitivity kch to the input when making choices) (Spearman ρ = 0.77, p < 0.000005; Pearson r = 0.71, p < 0.0001). This result means that participants with more sensitive local confidence estimates were also better at tracking objective difficulty in their SPEs (Supplementary Fig. 4e).
However, we also found notable differences between model predictions and participants’ behavior: for instance, tasks providing external feedback were chosen more frequently by participants than predicted by the model (Supplementary Fig. 4c, lower-left panels), indicating that influences beyond those considered in the current model may affect SPE construction (see Discussion).
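
The key property of the hierarchical model, posteriors over SPEs narrowing faster with feedback than without, can be illustrated with a toy Beta-Bernoulli learner in which external feedback contributes full pseudo-counts while internal confidence, being a noisier evidence source, is down-weighted. This is a simplified stand-in for the model in Supplementary Methods; the confidence noise model and the down-weighting factor are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def spe_posterior_var(feedback, n_trials=10, true_acc=0.85, weight=0.6):
    """Beta posterior over task accuracy after one learning block.
    Feedback trials contribute full 0/1 pseudo-counts; on no-feedback
    trials, noisy confidence in (0,1) is down-weighted (weight < 1)
    as a less reliable evidence source. Toy stand-in for the paper's
    hierarchical model."""
    a, b = 1.0, 1.0                                  # Beta(1,1) prior
    for _ in range(n_trials):
        correct = rng.random() < true_acc
        if feedback:
            ev, w = float(correct), 1.0
        else:
            conf = (0.75 if correct else 0.55) + 0.1 * rng.standard_normal()
            ev, w = float(np.clip(conf, 0.01, 0.99)), weight
        a += w * ev
        b += w * (1.0 - ev)
    return a * b / ((a + b) ** 2 * (a + b + 1))      # Beta variance

fb_var = np.mean([spe_posterior_var(feedback=True) for _ in range(200)])
nf_var = np.mean([spe_posterior_var(feedback=False) for _ in range(200)])
print(f"mean posterior variance, feedback: {fb_var:.4f}; no feedback: {nf_var:.4f}")
```

Averaged over simulated blocks, the posterior variance is reliably smaller with feedback, reproducing the qualitative pattern in Supplementary Fig. 4b.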

Establishing a direct link between local confidence and global SPEs

Taken together, the results of Experiments 1 and 2 and associated model fits suggest that subjects trade off external feedback against internal estimates of confidence when estimating SPEs. However, these experimental findings and corresponding model fits provided only indirect evidence that subjects were sensitive to fluctuations in internal confidence when building global SPEs. In Experiment 3, we sought to obtain direct evidence that changes in local confidence were predictive of end-of-block SPEs. To this end, a new sample of subjects (N = 46) were instead asked to give confidence ratings in their perceptual judgments on no-feedback trials. All other experimental features remained identical to Experiment 2 (see Methods).

Replicating Experiments 1 and 2, we found in a 2 × 2 ANOVA that both the Feedback/Confidence manipulation (F(1,45) = 76.9, p < 10^-10) and Difficulty (F(1,45) = 87.6, p < 10^-11) impacted task choice, with an interaction between these factors (F(1,45) = 4.8, p = 0.03). Tasks with external feedback were again chosen more often than tasks in which subjects rated their confidence on each trial in the absence of feedback (Supplementary Fig. 3a). When blocks were separated according to the difference in objective performance between tasks, we again found that subjects’ task choices reflected fluctuations in local performance over and above differences in objective difficulty levels (Supplementary Fig. 3c). Overall, task choice patterns when rating confidence in Experiment 3 were similar to those found for the no-feedback condition of Experiment 2 (Supplementary Fig. 3a and 3c), suggesting these trials were treated similarly in the two experiments. Critically, and consistent with Experiments 1 and 2, performance was better in easy compared with difficult tasks (both t45 > 15.9, both p < 10^-19), but did not differ according to the feedback/confidence manipulation (both t45 < 1.44, both p > 0.16, both BF < 0.420; anecdotal evidence for the null hypothesis) (Supplementary Fig. 3a). However, unlike in Experiment 2, we found no significant influence of block duration on task choice (logistic regression; all p > 0.25, except a marginal effect for the third pairing, p = 0.03) (Supplementary Fig. 3d).

We next turned to the novel aspect of Experiment 3: local ratings of confidence. Subjects gave higher confidence ratings for easy (mean = 0.82) compared with difficult (mean = 0.76) trials (t45 = 8.90, p < 10^-10), and reported greater confidence for correct (mean = 0.80) than error (mean = 0.72) trials (t45 = 10.2, p < 10^-12) (Fig. 5a and Supplementary Fig. 3b), demonstrating a degree of metacognitive sensitivity to performance fluctuations. We also computed metacognitive efficiency (meta-d’/d’), an index of the ability to discriminate between correct and incorrect trials irrespective of performance and confidence bias29,30 (see Methods). We found that the posterior mean for group metacognitive efficiency was 0.80, close to the SDT-optimal value of 1, providing further evidence that participants’ confidence ratings effectively tracked their objective performance.
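
Full meta-d’ estimation requires fitting a signal detection model (see Methods), but a simpler nonparametric proxy for metacognitive sensitivity, the type-2 AUROC, captures the same intuition: how well confidence ratings discriminate correct from incorrect trials. A sketch on hypothetical trial data whose simulated means mirror the values reported above (this is not the meta-d’/d’ computation itself):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical trials: confidence is higher on correct (mean ~0.80)
# than error (mean ~0.72) trials, as in the pattern reported above.
n = 400
correct = rng.random(n) < 0.75
conf = np.where(correct, 0.80, 0.72) + 0.08 * rng.standard_normal(n)
conf = np.clip(conf, 0.5, 1.0)

def auroc2(confidence, accuracy):
    """Type-2 AUROC: probability that a random correct trial carries higher
    confidence than a random error trial (ties count 1/2). A nonparametric
    proxy for metacognitive sensitivity; 0.5 = none, 1.0 = perfect."""
    c, e = confidence[accuracy], confidence[~accuracy]
    greater = (c[:, None] > e[None, :]).mean()
    ties = (c[:, None] == e[None, :]).mean()
    return greater + 0.5 * ties

print(f"type-2 AUROC = {auroc2(conf, correct):.3f}")
```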

We sought to directly test whether fluctuations in local confidence affected end-of-block SPEs by splitting blocks according to differences in confidence level between tasks. In line with our hypothesis, we found that the larger the difference in confidence, the more often the objectively easier task was chosen (Fig. 5c), such that SPEs were consistent with local confidence ratings. To further quantify this effect, we asked whether the difference in confidence level between tasks explained subjects’ task choices over and above differences in objective performance and/or RTs. We found that fluctuations in confidence indeed explained significant variance in subjects’ task choices (β = 1.04, p < 0.0001), over and above variations in accuracy (β = −0.036, p = 0.81) and RTs (β = −0.009, p = 0.95) (Fig. 5b). Critically, this regression model was better able to explain subjects’ task choices than a reduced model including only the differences in accuracy and RTs as predictors (Bayesian Information Criterion: BIC = 282 for the regression model including confidence, BIC = 310 for the reduced model), confirming that local confidence fluctuations are important for explaining variance in participants’ global SPEs. Moreover, in additional analyses in which regressors were orthogonalized to each other, we found virtually identical results regarding the effect of confidence on end-of-block task choices, regardless of regressor order (confDiff: all β > 15.7, all p < 0.0001; accDiff and rtDiff: all |β| < 2.08, all p > 0.35). A regression model with only confidence as a predictor (BIC = 282) was also better at predicting task choices than the reduced model with only accuracy and RTs (BIC = 310).
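
The strength of evidence implied by the BIC difference can be made explicit via the standard approximation BF ≈ exp(ΔBIC/2). Treating the reported values this way (an interpretive gloss, not an analysis from the paper):

```python
import math

# BIC values reported above for the two task-choice regression models
bic_full, bic_reduced = 282, 310       # with vs. without the confidence predictor

delta = bic_reduced - bic_full
bf_full = math.exp(delta / 2)          # approximate Bayes factor favouring the full model
print(f"delta BIC = {delta}; approximate BF for the confidence model = {bf_full:.0f}")
```

A ΔBIC of 28 corresponds to a Bayes factor on the order of 10^6, i.e., overwhelming support for including confidence as a predictor.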

Finally, we considered that if subjects use local confidence to inform their SPEs, subjects who are better at discriminating between their correct and incorrect judgments should also form more accurate SPEs. In line with this hypothesis, we found that participants with higher metacognitive efficiency were also more likely to choose the easy task over the difficult task on blocks without feedback (Pearson r = 0.35, p = 0.02; non-parametric correlation coefficient: Spearman ρ = 0.43, p = 0.003; N = 46 participants) (Fig. 5d; see Supplementary Fig. 5 for correlation between global SPEs and other measures of metacognitive ability). Together, the findings of Experiment 3 support the hypothesis that fluctuations in local confidence are predictive of global SPEs and that the formation of accurate global SPEs is linked to metacognitive ability.

Discussion

Beliefs about our abilities play a crucial role in shaping behavior. These self-performance estimates influence our choices22, the motivation and effort we engage when pursuing our objectives31, and are thought to be distorted in many mental health disorders27. However, in contrast to the recent progress made in understanding the neural and computational basis of local estimates of decision confidence19,20,32, little is known about the formation of such global self-performance estimates (SPEs). Here, using a novel experimental design, we examined how human subjects construct SPEs over time in the presence and absence of external feedback—situations akin to many real-world contexts in which feedback is not always available.

Across three independent experiments, we observed that subjects were able to construct SPEs efficiently for short blocks of a perceptual decision task of variable difficulty. Because subjects were instructed that every new block featured two new tasks, indicated by two new color cues, they were encouraged to form their SPEs anew at the start of each block. Specifically, we found that subjects were sensitive to local fluctuations in performance and confidence within blocks when forming global SPEs. Indeed, despite stimulus evidence (i.e., difficulty level) remaining constant, variability in accuracy from block to block was reflected in subjects’ task choices (Fig. 4b). This observation rules out the possibility that subjects were merely using stimulus evidence as a cue to choose between tasks at the end of blocks. Critically, we further show that fluctuations in trial-by-trial confidence were related to end-of-block SPEs, over and above effects of objective performance and reaction times (Fig. 5).

How SPEs are represented and updated at a computational and neural level remains to be determined. As an initial step in this direction, in Supplementary Notes we outline a candidate hierarchical learning model, which links local confidence to the construction of SPEs. This model includes local confidence as an internal feedback signal, formalizing the fact that the evidence available for updating SPEs is more uncertain in the absence versus presence of external feedback, as illustrated through simulations of task choices (Supplementary Fig. 4). Despite the model qualitatively capturing subjects’ SPEs, more work is required to understand why subjects’ task choices favor tasks with external feedback to a greater extent than predicted by the model. Due to the global nature of SPEs (as compared with trial-by-trial estimates of local confidence) and the blocked structure of our experimental design, we only had access to a limited number of data points per subject (12 task choices in Experiment 1; 30 in Experiments 2 and 3), preventing us from reliably fitting model parameter(s) and discriminating between competing models41. Future work could attempt to combine manipulations of local confidence with a denser sampling of SPEs to allow such model development.

The ventromedial prefrontal cortex and adjacent perigenual anterior cingulate cortex, a key hub for performance monitoring, are candidate regions for maintaining long-run SPEs20,21,42. Subjects may either represent SPEs in an absolute format, separately for both tasks, or in a relative format, for instance how much better they are at one task compared with another. It is possible that our experimental design with interleaved tasks may encourage a relative representation of SPEs within a block. It also remains unclear how SPEs interact with notions of expected value. In the present protocol, participants received a monetary incentive for reporting the task they thought they were better at, and thus the expected value of task choices and SPEs were correlated. Moreover, value and confidence representations are often found to rely on similar brain areas20,33. Conceptually, subjects might therefore represent SPEs in the frame of expected accuracy (as postulated in the present Bayesian learning scheme) or in the frame of expected value (as a reinforcement learning framework would predict; e.g., refs. 6,43); further work is required to distinguish between these possibilities.

In Experiment 3, we found evidence that individual differences in metacognitive efficiency were related to the extent to which SPEs discriminated between easy and difficult tasks (Fig. 5d). This finding echoes a previous observation of a relationship between metacognitive efficiency and the ability to learn from predictive cues over time44. Metacognitive efficiency indexes the extent to which one’s confidence judgments are sensitive to objective performance. To the extent that local confidence informs SPEs, it is thus plausible that more sensitive confidence estimates translate into more accurate SPEs. However, although easy trials are more likely to be correct than difficult ones, there is only partial overlap between the determinants of metacognitive efficiency and our current measure of SPE sensitivity. More work is required to determine when and how individual variation in metacognitive efficiency influences the formation of global SPEs.

There is increasing recognition that local confidence estimates integrate multiple cues45,46. Interestingly, higher-order beliefs about self-ability—assayed here as SPEs—might in turn influence local judgments of confidence over and above bottom-up information obtained on individual trials22,44,47. This interplay of local and global confidence might be one mechanism for supporting transfer of SPEs to new tasks not encountered before40, in a way that could prove either adaptive or maladaptive48. Indeed, these global estimates may constitute useful internal priors on expected performance in other tasks, known as self-efficacy beliefs22,31. An overgeneralization of low SPEs between different tasks may even engender lowered self-esteem, leading to pervasive low mood49, as seen in depression, where subjects hold low domain-general self-efficacy beliefs22,50. Building models for understanding how humans learn about global self-performance from local confidence represents a first canonical step toward developing interventions for modifying this process24,31,51.

Methods

Participants

In Experiment 1, 39 human subjects were recruited online through the Amazon Mechanical Turk platform. Since we had no prior information about expected effect sizes, we based our sample size on similar studies conducted in the field of confidence and metacognition. Subjects were paid $3 plus up to $2 bonus according to their performance for a ~30-min experiment. They provided informed consent according to procedures approved by the UCL Research Ethics Committee (Project ID: 1260/003). The challenging nature of the perceptual stimuli, which appeared only briefly, ensured that it was impossible for subjects to perform above chance level if they were not paying careful and sustained attention during the experiment. To further ensure data quality, standard exclusion criteria were applied. Ten participants were excluded for responding at chance level and/or always selecting the same rating, leaving N = 29 participants (17 f/12 m, aged 22–31 and not color-blind according to self-reports) for data analysis.

In Experiment 2, 31 subjects were recruited using the same protocol as in Experiment 1. Identical exclusion criteria were applied leading to the exclusion of two subjects, leaving N = 29 subjects for analysis (9 f/20 m, aged 19–35).

In Experiment 3, to examine between-subjects relationships between the formation of self-performance estimates (SPEs) and metacognitive ability, 73 subjects were originally recruited online. After application of identical exclusion criteria to those used in Experiments 1 and 2, we additionally excluded subjects who failed comprehension questions about usage of the confidence scale (subjects passed if they rated “perfect” performance at least 10% greater than “chance” performance), leaving N = 46 subjects for analysis (16 f/30 m, aged 20–50).

Overall, our exclusion rates are consistent with recent online studies from our lab28 and a recent meta-analysis of online studies reporting typical exclusion rates of between 3 and 37%52. As Experiment 3 was slightly longer, subjects’ baseline pay was increased to $3.50 plus up to $2 bonus according to their performance. Subjects who participated in Experiment 1 were not permitted to take part in Experiments 2 and 3, and subjects who participated in Experiment 2 were not permitted to take part in Experiment 3.

Experiment 1

Subjects performed short learning blocks, which randomly interleaved two “tasks” identified by two arbitrary color cues (Fig. 1). Subjects were incentivised to learn about their own performance on each of the two tasks over the course of a learning block. Each block had 24 trials (12 trials from each task, presented in pseudo-random order). Each task required a perceptual judgment as to which of two boxes contained more dots (Fig. 1c). The difficulty level of the judgment was controlled by the difference in dot number between boxes. Any given task (as indicated by the color cue) was either easy or difficult and provided either veridical feedback or no feedback (Fig. 1b). The resulting four task types yielded six possible pairings of tasks in the learning blocks (Fig. 1a). Each possible pairing was repeated twice, and their order of presentation was randomized within participant.

At the end of each learning block, subjects were asked to choose the task on which they thought they had performed better (Fig. 1a). Specifically, they were asked to report which task they would like to perform in a short subsequent “test block” in order to gain a reward bonus. Therefore, subjects were incentivised to choose the task they thought they were better at (even if that task did not provide external feedback). This procedure aimed to reveal global self-performance estimates (SPEs), as subjects should choose the task they expect to be more successful at in the test block in order to gain maximum reward. To indicate their task choice, subjects responded with two response keys that differed from those assigned to perceptual decisions to avoid any carry-over effects. The subsequent test block contained six trials from the chosen task (not illustrated in Fig. 1). No feedback was provided during test blocks.

After the test block, subjects were asked to rate their overall performance on each of the two tasks on a rating scale ranging from 50% (“chance level”) to 100% (“perfect”) to obtain explicit, parametric reports of SPEs (Fig. 1a). Ratings were made with the mouse cursor and could be given anywhere on the continuous scale. Intermediate ticks for percentages 60, 70, 80, and 90% correct were indicated on the scale, but without verbal labels. Perceptual choices, task choices, and ratings were all unspeeded. After each learning block, subjects were offered a break and could resume at any time, with the next learning block featuring two new tasks cued by two new colors.

Subjects’ remuneration consisted of a base payment plus a monetary bonus proportional to their performance during test blocks (see Participants). Subjects were also encouraged to give accurate task ratings (although their actual remuneration did not depend on this feature): “Your bonus winnings will depend both on your performance during the bonus [i.e., test block] trials and on the accuracy of your ratings”. As data were collected online, instructions were as self-explanatory and progressive as possible, including practice trials with longer stimulus presentation times on one task (one color cue) at a time.

Each learning block featured two tasks, with each trial starting with a central color cue presented for 1200 ms, indicating which of the two tasks would be performed on the current trial (Fig. 1c). The stimuli were black boxes filled with white dots randomly positioned and presented for 300 ms, during which time subjects were unable to respond. We used two difficulty levels characterized by a constant dot difference, but the spatial configuration of the dots inside a box varied from trial to trial. One box was always half-filled (313 dots out of 625 positions), whereas the other contained 313 + 24 dots (difficult conditions) or 313 + 58 dots (easy conditions). These levels were chosen on the basis of previous online studies in order to target performance levels of around 70 and 85% correct, respectively28.
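
The dot stimuli described above can be sketched as follows. The 25 × 25 grid is inferred from the 625 possible positions; function and variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

GRID = 625          # 25 x 25 possible dot positions per box
BASE = 313          # the reference box is always half-filled
DELTAS = {"easy": 58, "difficult": 24}   # dot-difference levels in Experiment 1

def make_trial(difficulty, target_side):
    """Return two 25x25 binary dot arrays (left, right) for one trial.
    Dot positions are drawn uniformly without replacement, so the spatial
    configuration varies from trial to trial while the counts stay fixed."""
    counts = [BASE, BASE + DELTAS[difficulty]]
    if target_side == "left":
        counts.reverse()
    boxes = []
    for n_dots in counts:
        box = np.zeros(GRID, dtype=bool)
        box[rng.choice(GRID, size=n_dots, replace=False)] = True
        boxes.append(box.reshape(25, 25))
    return boxes

left, right = make_trial("difficult", target_side="right")
print(left.sum(), right.sum())  # 313 337
```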

The location of the box that contained more dots was pseudo-randomized within a learning block with half of the trials appearing on the left, and half on the right. Subjects were asked to judge which box contained more dots and responded by pressing Z (left) or M (right) on their computer keyboard. The chosen box was highlighted for 300 ms. Afterwards, a colored rectangle (cueing the color of the current task) was presented for 1500 ms. The rectangle was either empty (on no-feedback trials) or contained the word “Correct” or “Incorrect” (on feedback trials), followed by an inter-trial interval (ITI) of 600 ms. The experiment was coded in JavaScript, HTML, and CSS using jsPsych version 4.353, and hosted on a secure server at UCL. We ensured that subjects’ browsers were in full screen mode during the whole experiment.

Experiment 2

To investigate how SPEs emerge over the course of learning, in Experiment 2 each block contained either 2, 4, 6, 8, or 10 trials per task (Fig. 3a). These five possible learning durations were crossed with the six pairings of our experimental design, giving 30 blocks (= 360 trials) per participant. The dot difference for the easy conditions was changed from 313 + 58 to 313 + 60 dots, with all other experimental features remaining the same. At the end of each learning block, subjects were asked to report which task they believed they had performed better on. They were instructed that their reward bonus would depend on their average performance at the chosen task over the past learning block (instead of performance during a subsequent “test block” as in Experiment 1). Thus, in Experiment 2, task choice required retrospective rather than prospective evaluation of performance, thereby generalizing the findings of Experiment 1 to metacognitive judgments with a different temporal focus. Finally, given the consistency between the pattern of task ratings and task choices in Experiment 1, we omitted ratings in Experiment 2 to save time.
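The resulting design size can be verified in a few lines (illustrative Python, not the original experiment code):

```python
from itertools import product

durations = [2, 4, 6, 8, 10]   # trials per task within a learning block
pairings = 6                   # task pairings carried over from the design

# one block per (duration, pairing) combination
blocks = list(product(durations, range(pairings)))
n_blocks = len(blocks)                       # 5 durations x 6 pairings = 30
n_trials = sum(2 * d for d, _ in blocks)     # two tasks per block -> 360
```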

Experiment 3

In Experiments 1 and 2, we obtained evidence that local fluctuations in performance were linked to changes in global SPEs at the end of the block. Experiment 3 was designed to directly test whether trial-by-trial fluctuations in internal confidence were instrumental in updating global SPEs. To this end, on no-feedback trials, subjects were asked to provide a confidence rating about the likelihood of their perceptual judgment being correct. The rating scale was continuous from “50% correct (chance level)” to “100% correct (perfect)”, with intermediate ticks indicating 60, 70, 80, and 90% correct (without verbal labels). There was no time limit for providing confidence ratings. We did not add confidence ratings in the Feedback condition for two reasons. First, we wanted to be able to compare this condition directly to the corresponding condition in Experiments 1 and 2. Second, we sought to minimize the possibility that requiring a confidence judgment might affect subsequent feedback processing in a non-trivial manner. The experimental structure and other timings remained identical to Experiment 2.

Statistical analyses

In Experiment 1, trials from learning blocks with reaction times (RTs) beyond three standard deviations from the mean were removed from analyses (mean = 1.59% [min = 0.69%; max = 2.78%] of trials removed across subjects). Paired t tests were then performed to compare performance (mean percent correct), RTs, and end-of-block task ratings between experimental conditions. To examine the influence of our experimental factors on SPEs, we carried out a 2 × 2 ANOVA with Feedback (present, absent) and Difficulty (easy, difficult) as factors predicting performance, task choice, and task ratings. Note that because task choice frequencies are proportions, they were transformed using a classic arcsine square-root transformation before entering the ANOVA. Task ability ratings were z-scored per subject. As the absence of a difference in first-order performance (and RTs) between tasks with and without feedback is critical for interpreting differences in SPEs, we additionally conducted Bayesian paired-samples t tests using JASP version 0.8.1.2 with default prior values (zero-centered Cauchy distribution with a default scale of 0.707). Specifically, we evaluated the evidence in favor of the null hypothesis of no difference in performance between tasks with and without feedback, and report the corresponding Bayes factors (BF).
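The two preprocessing steps above, the 3-SD RT exclusion and the arcsine square-root transform for proportions, can be sketched as follows (an illustrative Python sketch; the released analysis code is MATLAB, and these function names are our own):

```python
import numpy as np

def remove_rt_outliers(rts, n_sd=3.0):
    """Return a boolean mask keeping trials whose RT lies within
    n_sd standard deviations of the mean RT."""
    rts = np.asarray(rts, dtype=float)
    return np.abs(rts - rts.mean()) <= n_sd * rts.std()

def arcsine_sqrt(p):
    """Classic variance-stabilising transform for proportions in [0, 1]."""
    return np.arcsin(np.sqrt(np.asarray(p, dtype=float)))

# example: one extreme RT among 100 trials is flagged for removal
keep = remove_rt_outliers(np.r_[np.ones(99), 100.0])
```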

Since objective performance may vary even within a given difficulty level, we performed a linear regression to further quantify the influence of fluctuations in objective performance on task ratings, entering objective performance and feedback presence as taskwise regressors (two per block). Regressors were z-scored to ensure comparability of regression coefficients. In addition, we examined potential recency effects to ask whether subjects weighted all trials equally when forming global SPEs. We performed a logistic regression with accuracy (X) predicting task choice (Y) (Supplementary Fig. 1). We included four regressors (X1–X4) corresponding to the four quartiles of each block, in chronological order, with all six pairings pooled together. Subjects were treated as fixed effects because the limited number of task choice data points per subject precluded the use of full random-effects models.
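Constructing the four chronological quartile regressors from a block's trial-by-trial accuracy might look like this (an illustrative Python sketch; `quartile_regressors` is our own name, not from the released code):

```python
import numpy as np

def quartile_regressors(accuracy):
    """Average accuracy within each chronological quartile of a block.

    Returns the four values (X1-X4) that serve as regressors predicting
    end-of-block task choice in the recency analysis.
    """
    acc = np.asarray(accuracy, dtype=float)
    quartiles = np.array_split(acc, 4)   # chronological quarters of the block
    return np.array([q.mean() for q in quartiles])
```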

To provide evidence that task choice and task ability ratings were consistent proxies for SPEs, we calculated how often the chosen task was rated higher than the unchosen one, and we compared the mean ratings given for chosen and unchosen tasks (paired t test). We also performed a logistic regression to examine whether the difference in ability rating between tasks predicted task choice, with subjects treated as fixed-effects due to a limited availability of task choice data points (12 blocks per subject).

To assess use of the confidence scale in Experiment 3, we compared mean confidence (subsequently labeled “Confidence level”) between correct and incorrect trials, and between easy and difficult trials (paired t tests). To further establish whether subjects’ confidence ratings were reliably related to objective performance, we computed metacognitive sensitivity (meta-d’54). Metacognitive sensitivity is a metric derived from signal detection theory (SDT), which indicates how well subjects’ confidence ratings discriminate between their correct and error trials, independent of their tendency to rate confidence high or low on the scale. When referenced to objective performance (d’), we can obtain a measure of metacognitive efficiency using the ratio meta-d’/d’. Because the meta-d’ framework assumes a constant signal strength across trials, we computed metacognitive efficiency separately for easy and difficult trials (corresponding to 90 trials each) and averaged the two values, which were in turn averaged at the group level. We applied a hierarchical Bayesian framework for fitting meta-d’, with all R̂ < 1.001 indicating satisfactory convergence30. We also compared the obtained metacognitive efficiency values to a classic maximum-likelihood fit54. Finally, we computed a third, non-parametric measure of metacognitive ability, the area under the type 2 receiver operating characteristic curve (AUROC2), although unlike meta-d’/d’, this measure does not control for performance differences between conditions or subjects54 (Supplementary Fig. 5).
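As a concrete illustration, AUROC2 can be computed as the probability that a randomly drawn correct trial received a higher confidence rating than a randomly drawn incorrect one, with ties counting one half (an illustrative Python sketch of the standard non-parametric estimator, not the authors' MATLAB code):

```python
import numpy as np

def auroc2(confidence, correct):
    """Area under the type 2 ROC curve.

    Equivalent to the Mann-Whitney U statistic normalised by the number
    of (correct, incorrect) trial pairs.
    """
    conf = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    c_hit, c_miss = conf[correct], conf[~correct]
    # pairwise comparison of every correct trial with every incorrect trial
    greater = (c_hit[:, None] > c_miss[None, :]).mean()
    ties = (c_hit[:, None] == c_miss[None, :]).mean()
    return greater + 0.5 * ties
```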

To investigate whether internal fluctuations in subjective confidence were related to end-of-block SPEs, task choices in blocks where both tasks required confidence ratings were additionally split according to the difference in confidence level between both tasks (Supplementary Fig. 3b). To examine whether the difference in confidence level between tasks (confDiff) explained task choices over and above differences in accuracy (accDiff) or reaction times (rtDiff), we conducted a logistic regression on data from blocks where confidence ratings were elicited from both tasks:

$${\hbox{Task choice}} \sim \beta_0 + \beta_1 \times {\mathrm{accDiff}} + \beta_2 \times {\mathrm{rtDiff}} + \beta_3 \times {\mathrm{confDiff}}$$

The regressors were not orthogonalised, meaning that all their common variance was placed in the residuals (Fig. 5b). Subjects were again treated as fixed effects because we had only five data points per subject. Regressors were z-scored to ensure comparability of regression coefficients. We also ran a series of regressions as described above, but with regressors ordered and orthogonalised with respect to each other (see Results).
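Sequential orthogonalisation of this kind, residualising each regressor against those entered before it so that shared variance is credited to earlier regressors, can be sketched as follows (illustrative Python; the published analyses used MATLAB):

```python
import numpy as np

def orthogonalise(X):
    """Sequentially orthogonalise the columns of design matrix X.

    Each column is replaced by its residual after regressing out all
    preceding columns (Gram-Schmidt by least-squares residualisation).
    """
    X = np.asarray(X, dtype=float).copy()
    for j in range(1, X.shape[1]):
        prev = X[:, :j]
        beta, *_ = np.linalg.lstsq(prev, X[:, j], rcond=None)
        X[:, j] = X[:, j] - prev @ beta
    return X

# example: a constant column followed by a trend; the trend is demeaned
Xo = orthogonalise(np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]))
```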

Finally, we hypothesized that participants who are better at judging their own performance on a trial-by-trial basis should also form more accurate global SPEs. To test this, we asked whether individual differences in metacognitive efficiency were related to the extent to which the easy task was chosen over the difficult task. Specifically, we examined the correlation between individual metacognitive efficiency scores and the task choice difference in blocks with only confidence ratings (Fig. 5d). Because there was a limited number of blocks per subject, the possible task choice proportions cluster in discrete levels in Fig. 5d; we thus calculated both parametric (Pearson) and non-parametric (Spearman) correlation coefficients for completeness.
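The two correlation coefficients can be computed as follows (illustrative Python; Spearman implemented as Pearson on ranks, without tie correction, which is appropriate only when values are distinct):

```python
import numpy as np

def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    return np.corrcoef(x, y)[0, 1]

def spearman(x, y):
    """Spearman rank correlation: Pearson applied to the ranks."""
    rank = lambda v: np.argsort(np.argsort(np.asarray(v)))
    return pearson(rank(x), rank(y))
```

For a monotone but nonlinear relation (e.g. y = x squared on positive x), Spearman equals 1 while Pearson falls below 1, illustrating why both were reported.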

Code availability

MATLAB code for reproducing the main figures, statistical analyses, and model simulations is freely available at: http://github.com/marionrouault/RouaultDayanFleming. Further requests can be addressed to the corresponding author: Marion Rouault (marion.rouault@gmail.com).