Introduction

In contrast to classic economic decisions in which individuals choose from a menu of well-defined options presented simultaneously1,2,3,4,5,6,7, the options in many real-world decisions are encountered serially and cannot directly be compared to one another. For instance, should I take a currently available option (e.g., hire this candidate for a position or accept this proposition of a romantic date)? Or should I forgo it in the expectation that a better option will come along? Insofar as accepting an option may require passing up unknown opportunities that might arise later, these choices involve opportunity cost and require comparing each prospect to some measure of the overall distribution of alternatives.

Such serial decisions arise canonically in several stylized tasks from animal ethology and optimal foraging theory, such as the prey selection task8,9. Here, a predator encounters a series of potential prey, which may each be accepted or forgone. If accepted, an option provides gain (e.g., calories) but pursuing and consuming it costs time that could instead be used to search for additional prey. The optimal choice rule, given by the Marginal Value Theorem (MVT)10, is to accept an option only if its local return (calories divided by time) exceeds the opportunity cost of the time spent. This is just the calories per timestep that otherwise would be expected to be earned on average: the overall, long-run reward rate in the environment. Thus, foragers should be pickier when the environment is richer: a mediocre target (e.g., a skinny, agile animal that takes time to chase down) may be worthwhile in a barren environment, but not in one rich with better alternatives.
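
Stated compactly in the notation used in Methods (option reward ri, handling time ti, and estimated environmental reward rate ρ), the MVT rule is to accept option i whenever

$$\frac{r_i}{t_i} \ge \rho \iff r_i \ge \rho\, t_i,$$

so the very same option can be worth accepting when ρ is low but worth rejecting when ρ is high.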

In this setting, choice turns largely on estimating the environment’s present rate of return, which establishes the threshold against which each prospective prey item can be assessed. Importantly, the MVT specifies the optimal static choice policy in an environment, and thus predicts differences in behavior between environments. But it does not prescribe the dynamics of how an organism might attain the optimum. Accordingly, its predictions have most often been studied in terms of asymptotic behavior in stable environments. But estimating an environment’s richness from experience with offers is in fact a learning problem, especially in dynamic environments in which the availability of resources fluctuates over time, as with the seasons11,12,13.

Here we ask how humans update beliefs about their environment’s rate of return in a prey selection task and extend a trial-by-trial learning model previously used for patch foraging11,13 to account for deviations from the MVT that we observe. Participants are tasked with choosing whether to accept or reject serially presented options which vary in terms of points earned and time expended (if accepted). By manipulating the rate of reward between blocks, we are able to examine how individuals respond to changes in the overall richness of their environment as well as examine dynamic trial-by-trial adjustments that occur within blocks. Across three experiments, participants were sensitive to both global and trial-to-trial changes in the environment, in the direction predicted by the MVT9,10. But in all experiments, model estimates indicated that information integration was greater for positive compared to negative prediction errors. This asymmetry captures a key deviation we observed from the MVT-predicted policy: a reluctance to revise beliefs and change choices when environments deteriorate, leading in that circumstance to an overoptimistic bias and a pattern of overselective choices.

Results

Participants adapt to global fluctuations

We first conducted two online experiments (see Fig. 1 and Methods for further details). On each trial, participants were offered an option (styled as an alien) and chose to accept or reject it. Accepting provided a reward (points, later converted to a bonus payment) but also incurred an opportunity cost in the form of a time delay. There were four possible options (low delay, high reward: LDHR; low delay, low reward: LDLR; high delay, high reward: HDHR; high delay, low reward: HDLR; Fig. 1b). Following an accept decision, participants had to wait for the delay to elapse before the points were accrued and the next trial began. Following a reject decision, the experiment progressed to the next trial.

Fig. 1: Behavioral task and variables.

a Timeline of the task. On each trial, an option (presented as one of four different aliens) approached one of two targets. Participants decided to accept the option by selecting the target the option approached (right target in this example) or reject the option by selecting the alternative target (i.e. the target the option did not approach, left target in this example). When options were accepted, the selected target changed color to red and participants were required to maintain a key press during which the option expanded within the target. The target then changed color to yellow, indicating that participants could release the key press, and the number of points obtained was displayed as a partially filled horizontal bar. When options were rejected, the experiment immediately progressed to the next trial. b On each trial, participants encountered one of four possible options, each of which provided either a low or a high reward (points, later converted to a bonus payment) and incurred either a short or a long opportunity cost in the form of a time delay. c The experiment was divided into two blocks (environments) and the frequency of the four different options varied in each of these. There was a rich environment in which the best option (low delay, high reward) outnumbered the other three options and a poor environment in which the worst option (high delay, low reward) outnumbered the other three options. The order of the blocks was counterbalanced between participants. Stimuli were designed by Sahua (https://www.123rf.com/profile_sahua) and are copyrighted property of 123RF Limited.

Participants were exposed to two blocks (environments, rich and poor), differing in the frequency of the best (LDHR) and worst (HDLR) options (Fig. 1c). In the rich environment, the best option outnumbered the others; in the poor environment, the worst option predominated. Block order was counterbalanced across participants, so that each participant completed either the rich environment first (RichPoor) or the poor environment first (PoorRich). Note that the two intermediate options, LDLR and HDHR, were identical in profitability (i.e., reward per second) and occurred with identical frequency. To simplify the analysis, we collapse these options into one intermediate option category. Separating them does not change the pattern of results (see Supplementary Information Fig. 1). Experiments 1 and 2 differed only in the duration of each block (Experiment 1: 15 min per block; Experiment 2: 10 min per block).

The MVT predicts that acceptance should depend on an option’s profitability and the overall quality of the environment. Accordingly, the decision whether to accept vs. reject an option was sensitive to both its profitability (i.e., reward per second) and also the environment (Fig. 2a, b). Specifically, a repeated measures ANOVA on the percentage of accept decisions with option (best, intermediate, worst) and environment (rich, poor) as repeated factors revealed a main effect of environment (Experiment 1: F(1, 39) = 15.40, p < 0.001, partial η2 = 0.28; Experiment 2: F(1, 37) = 8.38, p = 0.006, partial η2 = 0.19), a main effect of option (Experiment 1: F(2, 78) = 540.41, p < 0.001, partial η2 = 0.93; Experiment 2: F(2, 74) = 367.15, p < 0.001, partial η2 = 0.91) and an environment by option interaction (Experiment 1: F(2, 78) = 70.27, p < 0.001, partial η2 = 0.64; Experiment 2: F(2, 74) = 38.11, p < 0.001, partial η2 = 0.51). As predicted by the MVT, in each experiment the interaction was driven by a selective change between environments in acceptance of the intermediate options; this change was greater than the change in acceptance for the best or worst options (intermediate vs. best option, Experiment 1: t(39) = 8.34, p < 0.001, 95% CI [0.23, 0.38]; Experiment 2: t(37) = 5.60, p < 0.001, 95% CI [0.20, 0.42]; intermediate vs. worst, Experiment 1: t(39) = 9.89, p < 0.001, 95% CI [0.26, 0.40]; Experiment 2: t(37) = 7.59, p < 0.001, 95% CI [0.28, 0.48], two-tailed paired sample t-tests on the difference in acceptance rates between environments for intermediate versus either alternative). However, while this effect was in the direction predicted by the MVT (i.e., participants were more selective in the richer environment), the magnitude of the change was not as dramatic as theoretically predicted. This is due in part, as discussed below, to a slow adjustment in particular circumstances.

Fig. 2: Global and local effects of the environment.

a, b Participants (Experiment 1, N = 40, Experiment 2, N = 38) were more likely to accept an option the better its reward per second, and accepted more options overall in the poor compared to the rich environment. There was also an environment by option interaction driven by the change in acceptance between environments being greatest for the intermediate options compared to the best (Experiment 1: t(39) = 8.34, p = 3.3094E−10; Experiment 2: t(37) = 5.60, p = 0.000002, two-sided paired sample t-tests on the change in acceptance rates between environments for intermediate versus best option) and worst (Experiment 1: t(39) = 9.89, p = 3.5216E−12; Experiment 2: t(37) = 7.59, p = 4.6782E−9, two-sided paired sample t-tests on the change in acceptance rates between environments for intermediate versus worst option). c, d Acceptance rates were modulated by trial-to-trial dynamics. Participants (Experiment 1, N = 40, Experiment 2, N = 38) were more likely to accept the current option the worse the option on the previous trial had been. Repeated measures ANOVA with previous option (best, intermediate, worst) and environment (rich, poor) as factors (main effect of previous option: Experiment 1: F(2, 78) = 43.69, p = 1.864E−13; Experiment 2: F(2, 74) = 31.68, p = 1.1497E−10). Best = high reward, low delay option; intermediate = low reward, low delay and high reward, high delay options combined; worst = low reward, high delay option. Dots represent individual data points; bars represent the group mean. Error bars represent mean ± standard error of the mean. Source data are provided as a Source Data file.

Participants adapt to local fluctuations

Since the quality of environments was not explicitly instructed, the foregoing results imply that subjects learn from experience how selective to be. Although the MVT itself does not prescribe dynamics, it does suggest a simple class of learning model11,13: dynamically estimate the reward rate in the environment (e.g., by an incremental running average), and then use this as an acceptance threshold in the MVT choice rule. Such learning predicts that choices should be sensitive not just to block-wise manipulation of the environmental richness, but also to local variation in offers, since for instance receiving a poor offer will incrementally decrease the estimated reward rate, and incrementally decrease selectivity on the very next trial.
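
As a concrete illustration of this class of learning rule (a minimal sketch only, not the fitted model specified in Methods; all names and numbers below are illustrative), in Python:

    # Keep a running average of reward per second and use it as the MVT threshold.
    def update_reward_rate(rho, reward, alpha=0.1):
        """One delta-rule update of the estimated reward rate."""
        return rho + alpha * (reward - rho)

    def accept(reward, handling_time, rho):
        """MVT rule: accept if the option's reward covers its opportunity cost."""
        return reward >= rho * handling_time

    rho = 10.5                                  # current estimate (points per second)
    print(accept(20, 2, rho))                   # False: 20 < 10.5 * 2
    rho = update_reward_rate(rho, reward=0.0)   # a second elapses with no reward
    print(round(rho, 2), accept(20, 2, rho))    # 9.45 True: the threshold has dropped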

Accordingly, we examined evidence for such trial-to-trial learning by investigating whether the decision to accept an option fluctuated according to recent experience. We separated trials according to the option participants encountered on the previous trial, independent of the decision on the previous trial. To ensure any effect of local context was not merely driven by the block-wise environment type effect discussed above, we controlled for block type by entering environment (rich, poor) as a repeated factor in the analysis along with previous offer (best, intermediate, worst). This analysis revealed that the worse the option on the previous trial had been, the more participants increased their acceptance rates (over all options) on the current trial (main effect of previous option, Experiment 1: F(2, 78) = 43.69, p < 0.001, partial η2 = 0.53; Experiment 2: F(2, 74) = 31.68, p < 0.001, partial η2 = 0.46, Fig. 2c, d). In other words, participants became more selective (less likely to accept the current option) immediately following evidence that the environment was rich (previous encounter with the best option) and less selective when it was poor (previous encounter with the worst option), consistent with an MVT-inspired learning model.

Evidence of block-wise learning

One additional piece of evidence speaking to the learning process was apparent in block order effects. In particular, we examined whether global fluctuations in acceptance rates were modulated by the order in which the environments were encountered. We did this by implementing a new repeated measures ANOVA with option (best, intermediate, worst) and environment (rich, poor) as repeated factors, and this time included order condition (RichPoor, PoorRich) as a between-participant factor. This revealed an interaction between environment and order condition (Experiment 1: F(1, 38) = 12.11, p = 0.001, partial η2 = 0.24; Experiment 2: F(1, 36) = 7.24, p = 0.011, partial η2 = 0.17) as well as, as before, main effects of environment (Experiment 1: F(1, 38) = 18.22, p < 0.001, partial η2 = 0.32; Experiment 2: F(2, 35) = 8.00, p < 0.008, partial η2 = 0.18) and option (Experiment 1: F(2, 76) = 555.15, p < 0.001, partial η2 = 0.94; Experiment 2: F(2, 72) = 402.48, p < 0.001, partial η2 = 0.92).

The interaction reflected PoorRich participants (those who encountered the poor environment first) being more sensitive to the change in environments compared to RichPoor participants. That is, PoorRich participants (compared to the opposite order) showed a greater increase in their percentage of accept decisions in the poor environment compared to the rich environment (Supplementary Information Fig. 1). To better visualize this, we calculated a difference score for the change in acceptance rates (across all options, see Methods) between the poor environment and rich environment for each participant. Positive scores indicate an overall increase in acceptance rates in the poor environment relative to the rich environment. We then compared these difference scores for participants in the RichPoor condition versus the PoorRich condition (Fig. 3a, b). This revealed that PoorRich participants showed a significant increase in their percentage of accept decisions in the poor environment compared to the rich environment (Experiment 1: t(20) = 6.79, p < 0.001, 95% CI [0.122, 0.230]; Experiment 2: t(20) = 4.99, p < 0.001, 95% CI [0.104, 0.254], two-tailed one-sample t-test on the difference scores versus 0). In contrast, RichPoor participants exhibited only a marginally higher rate of acceptance in the poor compared to the rich environment in Experiment 1 (t(18) = 1.97, p = 0.07, 95% CI [−0.0035, 0.1048]), an effect that was not significant in Experiment 2 (t(16) = 1.17, p = 0.26, 95% CI [−0.046, 0.159]). The difference between the two groups in these difference scores was significant (Experiment 1: t(38) = 3.41, p = 0.002, 95% CI [0.051, 0.199]; Experiment 2: t(36) = 2.08, p = 0.045, 95% CI [0.003, 0.242], two-tailed independent sample t-tests comparing difference scores for PoorRich versus RichPoor participants).

Fig. 3: Order effect and model comparison.

a Participants who experienced the poor environment followed by the rich environment (PoorRich group, N = 21) changed choices between environments to a greater extent than participants who experienced a rich environment followed by a poor (RichPoor group, N = 19) environment (t(38) = 3.41, p = 0.002, two-tailed independent t-test comparing the change in acceptance rates between these two groups of participants). The Asymmetric Model was able to recapitulate this order effect by having separate learning rates for when reward rate estimates increased and when they decreased. The Symmetric Model, which had a single learning rate, predicted that the change in acceptance rates between environments ought to be the same for RichPoor and PoorRich participants. Gray dots represent individual data points and gray bars represent the group mean. Blue triangles represent the pattern of choices generated by simulations from the Asymmetric Model. Green squares represent the pattern of choices generated by simulations from the Symmetric Model. b This pattern of results replicated in a second experiment (t(36) = 2.08, p = 0.045, two-tailed independent t-test comparing the change in acceptance rates between these two groups of participants) (PoorRich group, N = 21; RichPoor group, N = 17). c The Asymmetric Model provided a superior fit to the data than the Symmetric Model in Experiment 1 (t(39) = 7.20, p = 1.1551E−8, two-tailed paired sample t-tests comparing LOOcv scores for the Asymmetric versus the Symmetric Model). Plotted are the mean LOOcv scores over participants (N = 40) for each model. d The Asymmetric Model provided a superior fit to the data than the Symmetric Model in Experiment 2 as well (t(37) = 6.04, p = 5.4824E−7, two-tailed paired sample t-tests comparing LOOcv scores for the Asymmetric versus the Symmetric Model). Plotted are the mean LOOcv scores over participants (N = 38) for each model. Dots represent individual data points and bars represent the group mean. Error bars represent mean ± standard error of the mean. +0.05 < p < 0.10; *p < 0.05; **p < 0.01; ***p < 0.001: independent sample t-test/paired sample t-test/one-sample t-test (vs. 0) as appropriate (all two tailed); n.s. non-significant. Source data are provided as a Source Data file.

Computational modeling

We reasoned that this asymmetry reflected the operation of the underlying learning rule by which subjects adjusted their behavior from the poor to the rich environment or vice versa. Accordingly, we fit the behavioral data to two reinforcement learning models: a Symmetric Model and an Asymmetric Model. In each of these models, participants were assumed to maintain an ongoing estimate of the reward rate (reward per second), ρ, which updated every second. This estimate was then used on each trial to calculate the opportunity cost of accepting an option (ρ multiplied by the option’s time delay) at the time of choice. Under this model specification, even though the absolute cost (in terms of the number of seconds of delay) imposed by accepting a specific option was the same throughout the task, the opportunity cost could vary according to the participant’s current estimate of ρ. The decision to accept or reject was modeled as a comparison between the opportunity cost of accepting an option (effectively the value of rejecting) and the reward that the option would collect (effectively the value of accepting). This was implemented using a softmax decision rule with an inverse temperature parameter (β1) governing the sensitivity of choices to the difference between these two quantities, and an intercept (β0) capturing any fixed, overall bias toward or against acceptance (see Methods for further details on the model fitting procedure). Both models used a delta-rule running average14 to update ρ according to positive and negative prediction errors. Negative prediction errors were generated every second that elapsed without a reward (for example, each second of a time delay). Positive prediction errors were generated on seconds immediately following a time delay when rewards were received. Note therefore that updates to the reward rate are dependent on the decision of the participant. This captures a key feature of the prey selection problem9: that the relevant decision variable is the participant’s average earnings given their status quo choice policy, so an option merely offered (but not accepted) should not count.
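
A minimal sketch of this trial-level choice rule (illustrative Python; the parameter values are made up, and the per-second updating of ρ is described in Methods):

    import numpy as np

    def p_accept(reward, delay, rho, beta0, beta1):
        """Softmax probability of accepting: the value of rejecting is the
        opportunity cost rho * delay; the value of accepting is the reward."""
        opportunity_cost = rho * delay
        return 1.0 / (1.0 + np.exp(beta0 - beta1 * (reward - opportunity_cost)))

    # The same option is more attractive when the estimated reward rate
    # (and hence the opportunity cost of its delay) is lower.
    print(p_accept(reward=20, delay=2, rho=12.0, beta0=0.0, beta1=0.3))  # ~0.23
    print(p_accept(reward=20, delay=2, rho=8.0, beta0=0.0, beta1=0.3))   # ~0.77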

The difference between the Symmetric Model and the Asymmetric Model was whether there were one or two learning rate parameters. The Symmetric Model contained just a single learning parameter, α. This meant that ρ updated at the same rate regardless of whether the update was in a positive or a negative direction. The Asymmetric Model had two learning parameters: α+ and α−. This enabled ρ to update at a different rate, according to whether the update was in a positive (α+) or a negative (α−) direction. This causes an overall (asymptotic) bias in ρ (e.g., if α+ > α−, then the environmental quality is overestimated), but also dynamic, path-dependent effects due to slower adjustment to one direction of change over the other.
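
In sketch form, the difference between the two models reduces to which learning rate is applied on each update (illustrative Python; the parameter values are made up):

    def update_rho(rho, reward, alpha_pos, alpha_neg):
        """Delta-rule update of the reward rate estimate with separate learning
        rates for upward (reward received) and downward (no reward) adjustments.
        Setting alpha_pos == alpha_neg recovers the Symmetric Model."""
        alpha = alpha_pos if reward > 0 else alpha_neg
        return rho + alpha * (reward - rho)

    # With alpha_pos > alpha_neg, rewards pull rho up faster than rewardless
    # seconds pull it down, yielding an optimistically biased estimate.
    rho = 10.0
    rho = update_rho(rho, reward=0.0, alpha_pos=0.3, alpha_neg=0.1)    # -> 9.0
    rho = update_rho(rho, reward=80.0, alpha_pos=0.3, alpha_neg=0.1)   # -> 30.3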

Asymmetric Model a better fit to choice data

We fit each model to the per-participant, per-trial choice timeseries and compared models by computing unbiased per-participant negative log marginal likelihoods via subject-level Leave One Out cross-validation (LOOcv) scores. The Asymmetric Model provided a superior fit to the choice data than the Symmetric Model both in Experiment 1 (t(39) = 7.20, p < 0.001, 95% CI [18.77, 33.45], two-tailed paired sample t-tests comparing LOOcv scores for the Asymmetric versus the Symmetric Model, Fig. 3c and Table 1) and in Experiment 2 (t(37) = 6.04, p < 0.001, 95% CI [9.39, 18.86], Fig. 3d and Table 1). At the population level, formally comparing the two learning rates (see Methods) in the Asymmetric Model revealed that this asymmetry was, on average, significantly biased toward α+ > α− (Experiment 1: z = 2.83, p < 0.01; Experiment 2: z = 3.73, p < 0.001, Table 1). This meant that prediction errors that caused ρ to shift upwards (following receipt of a reward) had a greater impact than prediction errors that caused ρ to shift downwards (following the absence of a reward). Individually, 88% of participants in Experiment 1 and 92% of participants in Experiment 2 were estimated to have a higher learning parameter for positive (α+) compared to negative (α−) errors. To examine whether asymmetric learning was a stable feature of learning, or rather emerged only after the environment switch (e.g., in response to expectations set up by the first block), we also reran both models, this time fitting only the data from the first environment participants encountered. The Asymmetric Model was again a better fit to the data than the Symmetric Model (Experiment 1: t(39) = 5.56, p < 0.001, 95% CI [4.83, 10.36]; Experiment 2: t(37) = 6.01, p < 0.001, 95% CI [3.90, 7.87], two-tailed paired sample t-test) with α+ > α− in the Asymmetric Model (Experiment 1: z = 5.64, p < 0.01; Experiment 2: z = 9.96, p < 0.01).
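
At the analysis level, this comparison amounts to a paired test on per-participant cross-validated scores, along these lines (illustrative Python; the scores below are synthetic stand-ins for the fitted LOOcv values):

    import numpy as np
    from scipy.stats import ttest_rel

    rng = np.random.default_rng(1)
    loocv_sym = rng.normal(300.0, 40.0, size=40)              # per-participant scores
    loocv_asym = loocv_sym - rng.normal(25.0, 10.0, size=40)  # lower = better fit

    t, p = ttest_rel(loocv_asym, loocv_sym)                   # paired, two-tailed
    print(t, p, np.mean(loocv_sym - loocv_asym))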

Table 1 Model fitting and parameters across the three experiments.

Asymmetric Model accounts for order effect

Next, we examined the impact of this asymmetry when the environment changed. Importantly, the model carried ρ over between environments rather than resetting at the start of a new block. (This feature was motivated by the block order effect, which rejects the otherwise identical model that resets values to some constant at each block and thus predicts order invariance15.) Accordingly, participants had to update beliefs that had been established in the first environment they encountered (poor for the PoorRich group, rich for the RichPoor group). Given that the learning asymmetry was revealed to be in a positive direction (α+ > α−), we reasoned that this update ought to occur faster when going from a poor environment into a rich environment, as large rewards become more commonplace (i.e. the PoorRich group), which would explain the block order effect.

Indeed, simulating the experiment using a population of subjects drawn according to the best-fitting parameter distributions for each model (Fig. 3a, b), we found that both models reproduced an overall effect of environment type, but only the Asymmetric Model captured the block order effect. Returning to the fits to actual participants’ choices, we further unpacked the model’s account of this effect by extracting trial-by-trial estimates of ρ from the model’s fit to each trial and participant and entering them into a repeated measures ANOVA with environment (rich, poor) as a repeated factor and order condition (RichPoor, PoorRich) as a between-participant factor. This revealed a pattern in accord with the intuition that the global effect arose from slower adjustment of the acceptance threshold by the RichPoor group. In particular, there was a significant environment by condition interaction (Experiment 1: F(1, 38) = 14.67, p < 0.001, partial η2 = 0.28, Fig. 4a; Experiment 2: F(1, 36) = 21.42, p < 0.001, partial η2 = 0.37, Supplementary Information Fig. 2). This arose out of a significant difference in ρ between environments for participants in the PoorRich condition (Experiment 1: t(20) = 8.64, p < 0.001, 95% CI [5.98, 9.79]; Experiment 2: t(20) = 6.08, p < 0.001, 95% CI [4.56, 9.32], two-sided paired sample t-test) which was absent among RichPoor participants (Experiment 1: t(18) = 1.16, p = 0.26, 95% CI [−1.32, 4.54]; Experiment 2: t(16) = 0.004, p = 0.997, 95% CI [−1.87, 1.88]).

Fig. 4: Reward rate change and Experiment 3.

a Extracting reward rate estimates (ρ) from the Asymmetric model for each participant in each experimental block in Experiment 1 (PoorRich group, N = 21; RichPoor group, N = 19) revealed a significant environment by condition interaction (F(1, 38) = 14.67, p = 0.000465). This arose out of a significant difference in ρ between environments for participants in the PoorRich condition (t(20) = 8.64, p = 3.4864E−8, two-tailed paired sample t-test) which was absent among participants assigned to the RichPoor condition (t(18) = 1.16, p = 0.26, two-tailed paired sample t-test). b In Experiment 3, both groups of participants (PoorRich group, N = 23; RichPoor group, N = 15) significantly changed their acceptance rates between environments (RichPoor: t(14) = 3.07, p = 0.008; PoorRich: t(22) = 4.47, p = 0.000193, two-tailed one-sample t-test vs 0) and there was no longer any difference in the change in acceptance rates between RichPoor and PoorRich participants (t(36) = 0.77, p = 0.45, two-tailed independent sample t-test). *p < 0.05, paired sample t-test; n.s. non-significant (p > 0.05, two sided); **p < 0.01, one sample t-test (vs 0, two tailed); ***p < 0.001, one sample t-test (vs 0, two tailed). Dots represent individual data points and bars represent the group mean. Error bars represent mean ± standard error of the mean. Source data are provided as a Source Data file.

Greater exposure to poor environment reduces order effect

To further probe whether asymmetric learning caused strategies to perseverate among RichPoor participants, we ran a third cohort of participants. The procedure was exactly as for Experiments 1 and 2 but with one key difference. Now, between the two environments (either between the Rich and the Poor environment or between the Poor and the Rich environment) we inserted a third environment in which participants only saw the worst option (HDLR, see Methods for full details). We predicted that with the addition of this environment, participants in the RichPoor condition would now have sufficient exposure to a lean environment to allow their expectations to adjust before entering the poor environment. Accordingly, we predicted that we would no longer observe a difference in the change in acceptance rates between environments for RichPoor and PoorRich participants.

The Asymmetric Model once again proved a better fit to choice behavior compared to the Symmetric Model (t(37) = 5.47, p < 0.001, 95% CI [6.39, 13.91], two-tailed paired sample t-tests comparing LOOcv scores for the Asymmetric Model versus the Symmetric Model, see also Table 1) with α+ being greater than α− (z = 2.28, p < 0.05). But in contrast to before, both groups significantly changed their acceptance rates between environments (RichPoor: t(14) = 3.07, p = 0.008, 95% CI [0.04, 0.21]; PoorRich: t(22) = 4.47, p < 0.001, 95% CI [0.09, 0.24], two-tailed one-sample t-test versus 0) and there was no difference in this change in acceptance rates between RichPoor and PoorRich participants (t(36) = 0.77, p = 0.45, 95% CI [−0.07, 0.15], two-tailed independent sample t-test, Fig. 4b).

Discussion

Across three experiments using a prey selection task, we find that individuals adjusted their choices both between environments—becoming less selective in a poor compared to a rich environment—and within environments—altering selectivity on a trial-by-trial basis according to the previous encounter. The block-wise effect is consistent with foraging theory9 and previous empirical data both from animals9,10,16,17,18,19 and humans12,20,21,22,23. The trial-wise effect is a rather direct prediction of incremental learning rules for the acceptance threshold, which have previously been proposed and studied in terms of a different class of foraging tasks, patch leaving tasks11,24,25,26, where choice-by-choice effects are harder to observe due to the task structure.

However, inconsistent with optimal foraging theory, we observed interesting suboptimalities in individuals’ choices. In particular, adjustments in choices between environments were attenuated when an individual’s environment became worse compared to when the environment improved. The experience-dependent nature of this bias strongly suggests that it is rooted in learning, and we present modeling showing that it can be understood in terms of an asymmetric learning rule, which scales positive and negative prediction errors differently. Such decomposition of learning by valence is a recurring theme in decision neuroscience, but its consequences in the foraging setting—path-dependent biases in estimates of opportunity cost, leading to systematic choice biases—have not previously been appreciated. Indeed, optimism biases of the sort implied by these models may have particularly important effects in many real-world foraging-like tasks (including hiring and employment decisions and mate selection) because these learned estimates play such a key role in choice: encountering opportunities serially forces the decision-maker to compare them to a learned (and potentially biased) estimate of the other fish in the sea, rather than directly to their alternatives.

We were able to account for the deviations from foraging theory by augmenting a learning rule that had previously been used in the patch foraging setting11,13. Across all three experiments, endowing the model with separate learning rates for positive and negative adjustments to the environment’s reward rate provided a better qualitative and quantitative fit to participants’ choices compared to a baseline model that did not distinguish between these two types of adjustments. This feature provided the model with the capacity to shift estimates of the environment’s reward rate up and down at different rates, enabling it to recapitulate the differences in choice adjustments depending on the sequence in which participants experienced rich and poor environments.

Interestingly, the order effect we observe, and the learning account of it, indicate that participants carry over information about the reward rate from one environment into the next. This is despite the fact that participants are explicitly told at the start of the experiment that they will experience different environments in the task and, during the experiment, new environments are clearly signaled. It may be that a variety of factors such as the introduction of more transitions, more environments, time spent in each environment, volatility, and/or greater contextual differences between environments might prompt a shift towards resetting or reinstating previously learned reward rates thereby mitigating the carryover biases we observe.

Our learning-based account is further supported, at least to a limited extent, by our third experiment, in which the block order effect was not significant in a version of the task that included additional experience with a poor reward rate. This negative finding is predicted by the learning model, and tends to militate against other explanations (e.g., ones in which the block order asymmetry relates to some sort of primacy or anchoring on the initial experience) that would predict equivalent effects to the (similarly designed and powered) Experiments 1 and 2. Of course, null effects (even predicted ones) should be interpreted with caution, and the difference in significance between experiments does not necessarily imply the effects are themselves different. Future experiments including both conditions in a single design will be required to permit their direct statistical comparison.

One reason that learning of a global environmental reward estimate, as studied here, is important is that this variable plays quite a ubiquitous role governing many aspects of choice. Apart from serving as the decision variable for prey foraging and patch leaving, the same variable also arises in theoretical and experimental work on physical vigor27 and cognitive effort28, deliberation29, action chunking30, risk sensitivity and time discounting31, mood and stress25. Furthermore, whereas in optimal foraging theory9 and many of these other examples, the average reward is thought to play a direct role in choice as a decision variable, a similar global average reward estimate can also affect choice more indirectly, by serving as a comparator during learning so as to relativize what is learned to a contextual reference32,33. In this respect, our study parallels another recent line of work34,35,36 that studies choices between options originally trained in different contexts to reveal effects of a global contextual average reward on relativizing learned bandit valuations. It seems likely that these learning-mediated effects are a different window on the same underlying average reward estimates driving choice in the current study, and (for instance) vigor in others37. It has not been tested whether hallmarks of learning asymmetry as we report here can also be detected in these other paradigms.

There is also another respect in which these results may reflect a broader feature of learning. In a variety of other domains, beliefs are more readily updated when people receive desirable compared to undesirable information. This pattern emerges when people receive information about the likelihood of different life events occurring in the future38,39,40,41, information about their financial prospects42, feedback about their intellectual abilities43,44, personality45, and physical traits such as attractiveness43. Although the results we present are consistent with these past instances of asymmetric learning, we believe it noteworthy that the asymmetry extends to a foraging setting. This setting captures a key aspect of many real-world human choices: an exhaustive, explicit menu of all available options does not exist. As a result, choices are heavily reliant on subjective beliefs about what options will materialize in the future. Optimistic or pessimistic biases in this aspiration level, should they occur, can cause costly errors due to excessive rejection of decent offers, or, respectively, over-acceptance of objectively inadequate ones. By comparison, although asymmetric learning has also been observed in bandit tasks where subjects choose between a set of options, its effects on earnings are less pernicious in that setting because it biases evaluations of all the options being compared in the same direction and tends therefore to wash out. Thus, overall, the finding of asymmetric biases in foraging is a mechanism by which a wide range of suboptimal decisions could potentially arise.

Neural accounts of learning have also stressed the separation of negative and positive information and updates, which (given that firing rates cannot be negative) are potentially represented in opponent systems or pathways3,33,46. Appetitive and aversive expectancies, and approach and avoidance, are also associated with engagement of partly or altogether distinct brain regions34,47,48,49,50,51. More particularly, the direct and indirect pathways through the basal ganglia have been argued to support distinct pathways for action selection versus avoidance3,46, with positive and negative errors driving updates toward either channel. These types of models also typically incorporate asymmetric updating, though not necessarily consistently biased in all circumstances toward positive information (α+ > α−).

One important interpretational caveat, at least with respect to the apparent analogy with other work on biased updating is that, compared to these other cases, in the foraging scenario we use here, upward and downward adjustments in reward rate estimates differ in ways other than valence. Specifically, upward errors are driven by reward receipt and downward ones by the passage of time, events which may be perceived and processed differently in ways that may also contribute to asymmetric learning. Future experiments will be required to uncover if valence—whether a piece of information is good or bad—is the key attribute that gives rise to the asymmetry we observe.

Another question that awaits future work is whether the learning-based mechanism we describe also contributes to biases that have been observed in other foraging scenarios. Notably, a number of other reported biases, especially in patch foraging tasks, tend to reflect over-acceptance or overstaying11,25,52. Asymmetric learning of the sort modeled here offers a class of explanations for this longstanding and puzzling pattern of deviations from the predictions of optimal foraging theory. However, if overstaying is to be explained by asymmetric learning, it would imply pessimistic rather than the optimistically biased opportunity costs we observe here, and a bias toward negative updating rather than positive updating.

This apparent contrast between over-selectivity in the current study, compared to over-acceptance in others, in turn, raises a further experimental and theoretical question: what features of experience or environment determine the balance in sensitivity to positive and negative prediction errors34? Recent theoretical work has proposed that positively (α+ > α−) and negatively (α− > α+) biased beliefs can each be advantageous, depending on circumstances53,54. For example, computational simulations have suggested that overconfidence about one’s own abilities can be beneficial in relatively safe environments (where the potential cost of error is lower than the potential gains). But such overconfidence is detrimental in threatening environments in which costs exceed gains53. In two-armed bandit tasks, it has also been shown that a positive learning asymmetry can be advantageous over the long run if payouts from each bandit are rare. But conversely a negative learning asymmetry is advantageous if payouts are common54. However, the logic of all of these predictions is specific to the modeled class of tasks, and does not appear to extend straightforwardly to the foraging setting. Thus, the current results underline the need for future theoretical work clarifying for what type of foraging situation such a bias might be adaptive or adapted. Relatedly, the demonstration (in other tasks) that different circumstances favor different asymmetries has led to the prediction that people should adjust their bias to circumstance via some type of meta-learning54,55. To date though, empirical work has failed to find evidence for such an adaptive adjustment in bandit tasks56,57, although there is evidence that biases in updating beliefs about oneself do flexibly adjust under threat38. Here, too, there remains the need for both theoretical and experimental work investigating how such adjustment plays out in the foraging scenario.

The possibility of biased reward rate evaluations, and particularly that they might be pessimistic in some circumstances, may also be of clinical relevance. Huys et al.58 propose that a pessimistic estimate of an environment’s rate of reward may offer a common explanation for a range of diverse symptoms of Major Depressive Disorder (MDD) such as anergia, excessive sleeping, lack of appetite, and psychomotor retardation. There is empirical evidence for learning rates becoming less positively biased in patients with MDD in the case of some (non-foraging related) tasks59,60 but not in the case of others61. However, to date no empirical work has examined learning asymmetries among MDD patients in foraging related tasks, where the content of the beliefs being updated (the reward rate of an environment) could underlie these different MDD symptoms if pessimistic. Whether the learning asymmetry we observe disappears or reverses toward a negative direction in MDD patients is therefore an important question for future research.

Methods

Participants

A total of 62 participants were recruited for Experiment 1. Twenty-two of these were excluded leaving a final sample of 40 participants (mean [standard deviation] age: 32.73 [7.94]; 11 female). A total of 55 participants were recruited for Experiment 2. Seventeen of these were excluded leaving a final sample of 38 participants (mean [standard deviation] age: 33.71 [8.83]; 16 female; 2 gender undisclosed). Participants were recruited online via Amazon Mechanical Turk. Following best practice for studies with this population62, several a priori exclusion criteria were applied to ensure data quality.

Participants were excluded if any of the following applied: (1) did not finish the task (n = 5); (2) made 20 or more missed responses (n = 14); (3) made 10 or more incorrect force trial responses (either accepting options when forced to reject or rejecting options when forced to accept, n = 1); (4) poor discriminability, defined as choosing the worst option (low reward, high delay) a greater percentage of times than the best option (high reward, low delay) or accepting/rejecting all options on every single trial (n = 2). Participants were paid $1.5 plus a bonus payment between $1.5 and $4 depending on performance in the task. All participants provided fully informed consent. The study design complied with all relevant ethical regulations and was approved by Princeton University’s Institutional Review Board.

Behavioral task

The task began by asking participants to provide basic demographic information (age and gender) (Fig. 1). After providing this information, participants read task instructions on screen at their own pace and undertook a practice session for approximately 4 min. The two options and the environment used in the training session were not used in the actual task. After completing the training session, participants were then required to attempt a multiple-choice quiz to check their understanding of the task and instructions. Participants needed to get all the questions correct to pass the quiz and continue to the task. Participants who failed the quiz were required to reread the instructions and attempt the quiz again. (They were not asked to retake the training.)

The main experiment comprised two different environments, each lasting 15 (Experiment 1) or 10 (Experiment 2) minutes in a block design (i.e., participants completed 15/10 min of one environment and then switched to the other environment for a new period of 15/10 min). Participants were told in the instructions that they would encounter the same options in each environment but that the frequency of the options could vary between the environments, and that their goal was to collect as many points as possible in the time available. There was a break in-between the two environments. The background color was different for each environment and, before continuing to the second environment, participants were explicitly told that they would now experience a new environment.

On each trial, one of four stimuli (options) appeared and moved toward one of two targets: one on the left side of the screen and another on the right. Participants accepted an option by selecting the target the option approached or rejected an option by selecting the alternate target. Participants had 2 s to respond by selecting the left/right target using separate keys on their computer keyboard. Following acceptance of an option, participants faced a time delay (1 or 7 s depending on the stimulus) during which they were required to keep the key used to accept the option pressed down. During this time the option on screen gradually approached them. At the end of the time delay, the target turned yellow and the number of points obtained appeared above the target (1 s) at which point participants could release the key press. Following rejection of an option, the experiment progressed to the next trial. If the participant failed to respond during the encounter screen or released a keypress before a time delay had finished, they faced a timeout of 8 s. This meant that having made the decision to accept an option, it was disadvantageous to then abort. Note that the delays and reward associated with each option were not previously learnt or instructed to participants. Participants sequentially learnt these during the task itself following an accept decision when they then observed the time required to capture the accepted option and, following this, observed the reward collected.

As an attention check and to encourage learning, 25% of trials were forced choice trials. On these trials, participants saw a red asterisk (*) appear over one of the two targets and had to choose that target. This meant that if it appeared over a target that a stimulus was approaching, they had to accept the stimulus on that trial. If it appeared over a target the stimulus was not approaching, they had to choose to reject the stimulus on that trial. Participants were told that more than five incorrect forced choice trials would see their bonus payment reduced by half. At the end of the experiment, participants were told how many points they had amassed in total and their corresponding bonus payment.

The task was programmed in JavaScript using the toolbox jsPsych63 version 5.0.3.

Stimuli

Four different stimuli provided participants with one of two levels of points (presented to participants as 20% or 80% of the length of a horizontal bar displayed on the reward screen (Fig. 1a), corresponding to 20/80 points) and incurred one of two time delays (2 or 8 s including the 1 s for reward screen display). These options therefore assumed a natural ordering in terms of their value from best (low delay high reward (LDHR)), intermediate (low delay low reward (LDLR), high delay high reward (HDHR)) to worst (high delay low reward (HDLR)). The frequency of each option varied between the two different environments. In the rich environment, the best option was encountered four times more than the other three options. In a sequence of seven trials, the participant would encounter (in a random order) LDHR four times and each of LDLR, HDHR, and HDLR once. In the poor environment, in a sequence of seven trials, the participant would encounter HDLR four times and each of the other three options (LDHR, LDLR, and HDHR) once (see Fig. 1c).
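
Summarized as a configuration (a descriptive restatement of the stimulus set above, not code from the task itself):

    # Reward (points) and total delay (seconds, including the 1 s reward screen)
    OPTIONS = {
        "LDHR": {"reward": 80, "delay": 2},  # best
        "LDLR": {"reward": 20, "delay": 2},  # intermediate
        "HDHR": {"reward": 80, "delay": 8},  # intermediate
        "HDLR": {"reward": 20, "delay": 8},  # worst
    }

    # Number of encounters with each option per seven-trial sequence
    FREQUENCY = {
        "rich": {"LDHR": 4, "LDLR": 1, "HDHR": 1, "HDLR": 1},
        "poor": {"LDHR": 1, "LDLR": 1, "HDHR": 1, "HDLR": 4},
    }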

Behavioral analysis

To examine changes in acceptance rates between environments, we calculated the percentage of accept decisions for each participant for each option they saw in each environment. Forced choice trials and missed responses were excluded from this analysis. They are not, however, excluded from the computational models. To simplify the analysis presented in the main paper we collapsed the two intermediate options—which have an identical profitability (i.e. reward per second) of 10 points per second—together. We then entered these percentages into a repeated measures ANOVA with option (best/intermediate/worst) and environment (rich/poor) as repeated factors. We also ran the same ANOVA treating the two intermediate options as separate levels. In this instance, option (LDHR/LDLR/HDHR/HDLR) and environment (rich/poor) were entered as repeated factors (Supplementary Information).
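
One way to run this ANOVA in Python (an illustrative sketch, not the authors' analysis software; the long-format data frame, its column names, and the random values are assumptions about how the data might be organized):

    import numpy as np
    import pandas as pd
    from statsmodels.stats.anova import AnovaRM

    rng = np.random.default_rng(0)
    rows = [{"subject": s, "option": o, "environment": e,
             "p_accept": rng.uniform(0.0, 1.0)}
            for s in range(40)
            for o in ["best", "intermediate", "worst"]
            for e in ["rich", "poor"]]
    df = pd.DataFrame(rows)

    res = AnovaRM(df, depvar="p_accept", subject="subject",
                  within=["option", "environment"]).fit()
    print(res.anova_table)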

To examine how changes in acceptance rates related to the order in which participants encountered the environments, we entered the percentage of accept decisions into a separate repeated measures ANOVA with option (LDHR/intermediate/HDLR) and environment (rich/poor) as repeated factors. Condition (RichPoor or PoorRich), which indicates which ordering of environments participants faced, was entered as a between-participant factor. As above, we also ran the same ANOVA treating the two intermediate options as separate levels (Supplementary Information Fig. 1).

To better characterize order effects, we calculated the difference in acceptance rates between environments (poor minus rich) for each option (LDHR, LDLR, HDHR, and HDLR). Hence positive scores indicate an increase in acceptance rates in the poor environment compared to the rich environment. We then calculated the mean change across the four options as a measure of overall change in acceptance rate between environments. To assess whether this change was significant for each group separately, we conducted a one-sample t-test (versus 0, two-tailed). We compared the overall change in acceptance scores between groups (RichPoor and PoorRich) using independent sample t-tests (two-tailed).
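
In sketch form (illustrative Python with synthetic difference scores standing in for the real per-participant values):

    import numpy as np
    from scipy.stats import ttest_1samp, ttest_ind

    rng = np.random.default_rng(2)
    diff_poorrich = rng.normal(0.18, 0.10, size=21)   # poor-minus-rich change, PoorRich group
    diff_richpoor = rng.normal(0.05, 0.10, size=19)   # poor-minus-rich change, RichPoor group

    print(ttest_1samp(diff_poorrich, 0.0))            # change vs. 0, two-tailed
    print(ttest_1samp(diff_richpoor, 0.0))
    print(ttest_ind(diff_poorrich, diff_richpoor))    # between-group comparison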

Finally, to examine trial-to-trial fluctuations, we first partitioned trials according to the option presented (best, intermediate, worst). We then calculated, separately for each participant and each environment, the percentage of times the decision on the next trial was to accept. We entered these acceptance scores into a 3 × 2 repeated measures ANOVA with previous option (best, intermediate, worst) and environment (rich, poor) as factors. To visualize these fluctuations (Fig. 2c, d), we calculated the average of the two acceptance scores—one score for each environment—for each option.
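
A sketch of how the trial-to-trial scores could be computed (illustrative pandas; the trial-level data frame and its columns are assumptions, and for simplicity the lag is not reset at the block boundary):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(3)
    trials = pd.DataFrame({
        "environment": ["rich"] * 100 + ["poor"] * 100,
        "option": rng.choice(["best", "intermediate", "worst"], size=200),
        "accepted": rng.integers(0, 2, size=200),
    })

    # Acceptance on trial t as a function of the option encountered on trial t-1
    trials["prev_option"] = trials["option"].shift(1)
    print(trials.dropna(subset=["prev_option"])
                .groupby(["environment", "prev_option"])["accepted"].mean())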

Computational models

The optimal policy from the MVT is to accept an option (indexed by i) whenever the reward (ri) that the option obtains exceeds the opportunity cost (ci) of the time taken to pursue the option. This opportunity cost (ci) is calculated as the time (ti) that the option takes to pursue (in seconds) multiplied by the estimated reward rate (per second) of the environment at the current point in time (ρt):

$$c_i = \rho_t t_i.$$
(1)

Participants should therefore accept an option whenever ri ≥ ci.

Note we assumed that the quantities ri and ti were known to participants from the outset since they were easily observable and each of the four options (i = {1, 2, 3, 4}) always provided the exact same ri and ti. But models that dropped this assumption (and instead assumed ri and ti were learned via experience rather than known) provided similar patterns of results (see Supplementary Information Learn Options Models).

We assumed that subjects learned in units of reward, using a Rescorla-Wagner learning rule14,64 which is applied at every second. After each second, the value of the environment is updated according to the following rule:

$$\rho_{t+1} = \rho_t + \alpha \delta_t.$$
(2)

Here t indexes time in seconds. δt is a prediction error, calculated as

$$\delta_t = r_t - \rho_t$$
(3)

rt is the reward obtained. This will either be 0 (for every second in which no reward is obtained, i.e. during search time, handling time, and timeouts from missed responses) or equal to ri (following receipt of a reward).

The learning rate α acts as a scaling parameter and governs how much participants change their estimate of the reward rate of the environment (ρ) from 1 s to the next. This estimate increases when r is positive (i.e. when a reward is obtained) and decreases every second that elapses without a reward.

We implemented two versions of this reinforcement learning model: a Symmetric Model, with only a single α, and a modified version, an Asymmetric Model, which had two learning rates, α+ and α−. In the Asymmetric Model, updates to ρ (Eq. (2)) apply α+ if rt > 0 and α− if rt ≤ 0. This allows updates to occur at different rates according to whether or not a reward is received. This is close to identical to updating ρ contingent on whether the prediction error, δt, is positive or negative65,66,67: in the vast majority of trials (95%) in which rt > 0, it was also the case that δt > 0, while in all trials (100%) in which rt = 0, it was also the case that δt < 0. We refer to the mean difference in learning rates as the learning bias (α+ − α−). A positive learning bias (α+ > α−) indicates that participants adjust their estimate of the environment’s reward rate to a greater extent when a reward is obtained compared to when rewards are absent. The converse is true when the learning bias is negative (α+ < α−). If there is no learning bias (α+ = α−), the model is equivalent to the simpler Symmetric Model with a single α.

To account for both the delay and the reward received in the final second of handling time (when participants received a reward), this was modeled as two separate updates to ρ: one update from the delay (in which δt = 0 − ρt) followed by a second update from the reward received (in which δt = ri − ρt). Swapping the order of these updates or omitting the first update (from the delay) altogether did not alter the pattern of results.
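
A sketch of the per-second updating, including this final-second handling (illustrative Python consistent with Eqs. (2) and (3); the function names and parameter values are not from the authors' implementation):

    def delta_update(rho, r, alpha_pos, alpha_neg):
        """One application of Eqs. (2)-(3); alpha depends on whether a reward arrived."""
        alpha = alpha_pos if r > 0 else alpha_neg
        return rho + alpha * (r - rho)

    def handle_accept(rho, reward, delay_seconds, alpha_pos, alpha_neg):
        """Updates during the handling time of an accepted option: one rewardless
        update per second of delay, plus a reward update in the final second."""
        for _ in range(delay_seconds):
            rho = delta_update(rho, 0.0, alpha_pos, alpha_neg)   # time passes, no reward
        return delta_update(rho, reward, alpha_pos, alpha_neg)   # reward arrives

    print(round(handle_accept(rho=10.0, reward=80, delay_seconds=2,
                              alpha_pos=0.3, alpha_neg=0.1), 2))  # 29.67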

The probability of choosing to accept an option is estimated using a softmax choice rule, implemented at the final (2nd) second of the encounter screen as follows:

$$P(\mathrm{accept}) = \frac{1}{1 + \exp\left(\beta_0 - \beta_1\left(r_i - c_i\right)\right)}.$$
(4)

This formulation frames the decision to accept an option as a stochastic decision rule in which participants (noisily) choose between two actions (accept/reject) according to the respective value of each of them. The inverse temperature parameter β1 governs participants’ sensitivity to the difference between these two values while the bias term β0 captures a participant’s general tendency towards accepting/rejecting options (independent of the values of each action). Note that under this formulation, negative values for β0 indicate a bias towards accepting options and positive values indicate a bias towards rejecting options.
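
The sign convention for β0 can be verified directly from Eq. (4) (illustrative Python):

    import numpy as np

    def p_accept(value_diff, beta0, beta1):
        """Eq. (4): probability of accepting, given the value difference r_i - c_i."""
        return 1.0 / (1.0 + np.exp(beta0 - beta1 * value_diff))

    # With the two actions equally valued (r_i = c_i), a negative beta0 biases
    # choice toward accepting and a positive beta0 toward rejecting.
    print(p_accept(0.0, beta0=-1.0, beta1=0.1))   # ~0.73
    print(p_accept(0.0, beta0=1.0, beta1=0.1))    # ~0.27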

At the beginning of the experiment, ρ was initialized to the average (arithmetic) reward rate across the experiment11 (although clearly not realistic as a process level model, this was included for simplicity to avoid estimation pathologies and special-case model features associated with initial conditions). In subsequent environments, the average reward rate carried over from the previous environment. In other words, there was no resetting when participants entered a new environment. Rather, they had to unlearn what they had learnt in the previous environment.

For each participant, we estimated the free parameters of the model by maximizing the likelihood of their sequence of choices, jointly with group-level distributions over the entire population using an Expectation Maximization (EM) procedure68 implemented in the Julia language69, version 0.7.0. Models were compared by first computing unbiased per subject marginal likelihoods via subject-level cross-validation and then comparing these likelihoods between models (Asymmetric versus Symmetric) using paired sample t-tests (two sided).
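
The actual fitting was hierarchical (EM over group-level distributions); purely as a simplified, non-hierarchical illustration of the kind of likelihood being maximized, a per-participant fit of the choice-stage parameters could look like this (Python; all data are synthetic):

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(4)
    value_diff = rng.normal(0.0, 20.0, size=300)              # r_i - c_i on each trial
    p_true = 1.0 / (1.0 + np.exp(0.5 - 0.1 * value_diff))     # true beta0 = 0.5, beta1 = 0.1
    choices = (rng.random(300) < p_true).astype(int)          # 1 = accept, 0 = reject

    def neg_log_lik(params):
        b0, b1 = params
        p_accept = 1.0 / (1.0 + np.exp(b0 - b1 * value_diff))
        lik = np.where(choices == 1, p_accept, 1.0 - p_accept)
        return -np.sum(np.log(lik + 1e-12))

    print(minimize(neg_log_lik, x0=[0.0, 0.05]).x)            # approximately [0.5, 0.1]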

To formally test for differences in learning rates (α+, α−) we estimated the covariance matrix over the group-level parameters using the Hessian of the model likelihood70 and then used a contrast to compute the standard error on the difference α+ − α−.
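
In sketch form, the contrast computation is (illustrative numpy; the group-level means and covariance below are hypothetical placeholders, not fitted values):

    import numpy as np

    mu = np.array([0.30, 0.12])            # hypothetical group means [alpha+, alpha-]
    Sigma = np.array([[0.0040, 0.0010],    # hypothetical group-level covariance
                      [0.0010, 0.0030]])
    c = np.array([1.0, -1.0])              # contrast: alpha+ minus alpha-

    diff = c @ mu
    se = np.sqrt(c @ Sigma @ c)
    print(diff / se)                       # z statistic for the learning bias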

Simulations

To examine the qualitative fit of each model to the data we ran simulations for both the Symmetric Model and the Asymmetric Model. For each simulation (n = 1000), we ran a group of 40 virtual participants. For each virtual participant, we randomly assigned a set of parameters (β0, β1 and α in the case of the Symmetric Model; β0, β1, α+ and α− in the case of the Asymmetric Model) from the best-fit parameters generated by the computational model (fit to actual participants’ choices) and randomly assigned the order with which they encountered the environments (either rich to poor or poor to rich). Then, we simulated the learning process by which ρ evolved as options were sequentially encountered and stochastically accepted/rejected. The learning process was exactly as described for the respective computational models. Crucially, ρ was initialized exactly as for the model and allowed to carry over between blocks. We then calculated the difference in average acceptance rates for each virtual participant over the four options (LDHR, LDLR, HDHR, and HDLR) between environments (poor minus rich), calculated the average difference score over participants in each order condition, and finally averaged these two values over the simulations.

To quantify the cost of the learning asymmetry under asymmetric and symmetric learning (see Supplementary Information) we ran another set of simulations for the Symmetric Model and the Asymmetric Model. For each simulation (n = 500), we once again ran a group of 40 virtual participants with half (n = 20) assigned to the RichPoor condition and the other half (n = 20) assigned to the PoorRich condition. Here we used the average learning rates (α, α+, and α−) and slope (β1) from Experiment 1 and fixed the intercept to 0 (β0 = 0) in order to isolate the pure cost of a learning bias (i.e. independent of differences between models in the general tendency to accept/reject options). We then calculated the average amount earnt for each of the 40 subjects in each model (excluding force trials) and compared earnings between symmetric and asymmetric learners using an independent sample t-test (two tailed).

Participants and procedure Experiment 3

A total of 59 participants were recruited for Experiment 3. Twenty-one of these were excluded (identical exclusion criteria as for Experiments 1 and 2) leaving a final sample of 38 participants (mean [standard deviation] age: 34.63 [8.58]; 12 females). The experiment was exactly as described for Experiments 1 and 2, with one key difference: there were three environments instead of two. There was a rich environment and a poor environment (exactly as for Experiments 1 and 2), as well as a third, HDLR environment. The only option participants encountered in the HDLR environment was the worst (HDLR) option. Participants had 10 min in each environment. The ordering was either Rich, HDLR, Poor or Poor, HDLR, Rich.

Statistics and reproducibility

Each of the three experiments reported was run once with an independent group of participants. Experiment 2 is a near replication of Experiment 1—the only difference being the duration of the two blocks (Experiment 1: 15 min, Experiment 2: 10 min).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.