Abstract
For decades now, normative theories of perceptual decisions, and their implementation as drift diffusion models, have driven and significantly improved our understanding of human and animal behaviour and the underlying neural processes. While similar processes seem to govern value-based decisions, we still lack the theoretical understanding of why this ought to be the case. Here, we show that, similar to perceptual decisions, drift diffusion models implement the optimal strategy for value-based decisions. Such optimal decisions require the models’ decision boundaries to collapse over time, and to depend on the a priori knowledge about reward contingencies. Diffusion models only implement the optimal strategy under specific task assumptions, and cease to be optimal once we start relaxing these assumptions, by, for example, using nonlinear utility functions. Our findings thus provide the much-needed theory for value-based decisions, explain the apparent similarity to perceptual decisions, and predict conditions under which this similarity should break down.
Introduction
In everyday ambiguous and noisy environments, decision-making requires the accumulation of evidence over time. In perceptual decision-making tasks (for example, discriminating a motion direction), choices and reaction times are well fit by drift diffusion models (DDMs)^{1,2,3}. These models represent the accumulated belief informed by sensory evidence as the location of a diffusing particle that triggers a decision once it reaches one of two decision boundaries. DDMs are known to implement theoretically optimal algorithms, such as the sequential likelihood ratio test^{4,5,6} and more general algorithms that handle varying task difficulty^{7}.
Recently, DDMs have been shown to also describe human behaviour in value-based decisions, where subjects compare the endogenous values of rewarding items (for example, deciding between two lunch options). This suggests that humans perform value-based decisions by computations similar to those used for standard perceptual decisions (such as visual discrimination of random dot motion directions). In this case, the DDMs are driven only by the difference in item values, and thus predict the choices to be insensitive to the absolute values of the compared items (Fig. 1)^{8,9,10}. In particular, relying only on the relative value means that it might take on average longer to decide between two equally good options than between items of very different values.
This raises an important question: do DDMs indeed implement the optimal strategy for value-based decisions? Intuitively, absolute values should also influence the decision strategy, such that relying only on relative values appears suboptimal. In particular, it seems unreasonable to wait for a long time to decide between two nearly identical, highly rewarding options. Nonetheless, DDMs or related models are generally better at explaining human behaviour than alternative models. For example, race models (RMs) assume independent ‘races’ to accumulate evidence for individual options. Once one of these races reaches a decision criterion, the corresponding choice is triggered^{11,12}. Even though RMs are sensitive to absolute choice values and as such predict more rapid choices for higher rewarded options, they fit human behaviour well neither in perceptual decision-making tasks^{12} nor in value-based decision tasks, in which decisions are usually better described by relying only on relative values^{13,14}. Does this mean that humans use DDMs even though these models implement suboptimal strategies, or that DDMs indeed implement the optimal strategy for value-based choices? What is clear is that we need to understand (i) what the optimal strategy for value-based decisions is, (ii) why value-based and perceptual decisions seem to be fitted by the same class of models (DDMs) despite the qualitative difference between these tasks and (iii) to which degree value-based and perceptual decisions differ in terms of their normative computational strategies.
In this paper, we derive the theoretically optimal strategy for value-based decisions, and show that this strategy is in fact equivalent to a particular class of DDMs that feature ‘collapsing boundaries’ whose distance shrinks over time. We show that the exact shape of these boundaries and the associated average reaction times depend on average-reward magnitudes even if decisions within individual trials are only guided by the relative reward between choice options. Finally, we highlight the difference between value-based and standard perceptual decisions, reveal specific conditions under which the optimality of DDMs is violated, and show how to reconcile the ongoing debate on whether decision makers are indeed using collapsing decision boundaries. In contrast to previous work that assumed particular a priori mechanisms underlying value-based choices, such as RMs or DDMs^{15,16,17,18}, our work instead deduces optimal decision-making mechanisms based solely on a description of the information available to the decision maker. Thus, the use of diffusion models for value-based choices is not an a priori assumption of our work, but rather a result that follows from the normative decision-making strategy.
Results
Problem setup and aim
Consider a decision maker choosing between options that yield potentially different rewards (or ‘values’), such as choosing between two lunch menu options in the local restaurant. If the decision maker knew these rewards precisely and immediately then she should instantly choose the more rewarding option. However, in realistic scenarios, the reward associated with either option is uncertain a priori. This uncertainty might, for example, arise if she has a priori limited information about the choice options. Then, it is better to gather more evidence about the reward associated with the compared options before committing to a choice (for example, when choosing among lunch menus, we can reduce uncertainty about the value of either menu by contemplating the composition of each menu course separately and how these separate courses complement each other). However, how much evidence should we accumulate before committing to a choice? Too little evidence might result in the choice of the lower-rewarding option (the less appreciated lunch menu), whereas long evidence accumulation comes at the cost of both time and effort (for example, missing the passing waiter yet another time). In what follows, we formalize how to best trade off speed and accuracy of such choices, and then derive how the decision maker ought to behave in such scenarios. We first introduce each component of the decision-making task in its most basic form, and discuss generalizations thereof in later sections.
We assume that, at the beginning of each trial, the two options have associated true rewards, z_{1} and z_{2}, which are each stochastically drawn from separate normal distributions with a fixed mean for each option j∈{1,2} and a common variance. These true rewards are unknown to the decision maker, as they are never observed directly. Instead, we assume that the decision maker observes some momentary evidence δx_{j,i} with mean z_{j}δt for both options j∈{1,2} simultaneously in small time steps i of duration δt. Note that variability (and associated ambiguity) of the momentary evidence can arise through noise sources that are either internal or external to the decision maker, sources that we discuss in more detail further below.
Before observing any momentary evidence, we assume that the decision maker holds a normally distributed belief about each true reward z_{j}, whose mean and variance are, respectively, the mean and variance of the distribution from which the rewards are drawn at the beginning of each trial. In other words, this a priori belief corresponds to the actual distribution from which the true rewards are drawn (that is, the decision maker uses the correct generative model), and entails that option j is most likely to yield its prior mean reward, but might also yield other rewards, with the spread of rewards around this mean controlled by the level of uncertainty (the prior variance) about z_{j}. For now, we only consider the case in which the amounts of reward associated with both options are uncorrelated and, on average, the same (that is, both prior means are equal). In terms of choosing between lunch menu options, either menu would a priori yield the same reward, and the true rewards of the two menu options are drawn independently of each other from the aforementioned normal distribution (Fig. 2a). Later, we discuss the consequences of a correlation between true option values.
As soon as she is presented with sensory evidence δx_{j,i}, the decision maker accumulates further information about the rewards associated with either choice option. This momentary evidence δx_{j,i} reveals noisy information about the true reward z_{j}, such that each additional piece of momentary evidence reduces the uncertainty about this reward. We emphasize that neither of the true rewards is ever observed without noise. As a result, the decision maker needs to accumulate evidence to reduce uncertainty about the underlying true rewards by averaging out the noise. Longer evidence accumulation results in a better average and lower associated uncertainty.
The noise in the momentary evidence itself can have both internal and external sources. External sources constitute the potentially stochastic nature of stimuli, perceptual noise, ambiguity and incomplete knowledge. For example, having not yet read the main course and dessert of a particular menu option causes uncertainty about the option’s value due to incomplete knowledge. Internal sources could result from uncertain memory, or value inference that extends over time. One example of such value inference would be to sequentially contemplate the value of different features of a particular menu course over time.
Formally, after observing the value-related evidence δx_{j}(0:t) from time 0 (onset of momentary evidence) to some time t, the decision maker’s posterior belief about the true reward, z_{j}, of option j is given by
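Under the Gaussian prior and Gaussian momentary evidence described above, this posterior takes the standard conjugate form. The expression below is a sketch consistent with the surrounding description rather than a verbatim statement of equation (1): the symbols \bar{z}_j and \sigma_z^2 for the prior mean and variance, and \sigma^2 for the variance of the momentary evidence per unit time, are notation assumed here.

```latex
z_j \mid \delta x_j(0{:}t) \;\sim\;
\mathcal{N}\!\left(
  \frac{\sigma_z^{-2}\,\bar{z}_j + \sigma^{-2}\,x_j(t)}{\sigma_z^{-2} + \sigma^{-2}\,t},\;
  \frac{1}{\sigma_z^{-2} + \sigma^{-2}\,t}
\right),
\qquad
x_j(t) = \sum_{i=1}^{t/\delta t} \delta x_{j,i}
```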
The posterior mean is an evidence-weighted combination of the a priori mean and the time-averaged accumulated evidence, x_{j}(t)/t, and the posterior variance (that is, uncertainty) decreases monotonically with time (see Methods section). Due to uncertainty in the momentary evidence, the accumulated evidence x_{j}(t) itself describes a stochastic process. Here, and in contrast to other models of decision-making (both perceptual^{19,20} and value-based^{15,16}), all stochasticity in the accumulated evidence results from ambiguity in the momentary evidence itself, rather than from noise in the mechanisms that implement the decision-making process. In other words, the process responsible for the accumulation of the evidence is assumed to be noiseless, an assumption consistent with recent neurophysiological recordings^{21}.
What are the costs and rewards that the decision maker incurs during the course of her decisions? In terms of costs we assume that the decision maker pays a cost c per second of accumulating evidence, from onset of the choice options until an option is chosen. This cost could, for example, be an explicit cost for delayed choices, or represent the effort induced by evidence accumulation. In the context of choosing between lunch menus, this cost might arise from missing the passing waiter yet again, or from being late for a post-lunch meeting. Choosing option j is associated with experiencing some reward r_{j} that is a function of the true reward z_{j} associated with this option, as, for example, when experiencing reward for consuming the lunch. For now, we assume experienced and true reward to be equivalent, that is r_{j}=z_{j}. For a single choice, the overall aim of the decision maker is to maximize expected reward minus expected cost,
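The evidence-weighted combination described above can be illustrated with a short simulation. This is a minimal sketch, assuming the standard conjugate Gaussian update; the function name and the parameter names (`prior_mean`, `prior_var`, `noise_var`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def simulate_posterior(prior_mean=0.0, prior_var=1.0, noise_var=1.0,
                       dt=0.01, t_max=2.0, rng=None):
    """Draw one true reward and track the posterior over it as evidence accrues."""
    rng = np.random.default_rng() if rng is None else rng
    n = int(round(t_max / dt))
    # True reward z_j, hidden from the decision maker (drawn from the prior).
    z = rng.normal(prior_mean, np.sqrt(prior_var))
    # Momentary evidence: mean z*dt, variance noise_var*dt (assumed form).
    dx = rng.normal(z * dt, np.sqrt(noise_var * dt), size=n)
    t = dt * np.arange(1, n + 1)
    x = np.cumsum(dx)                      # accumulated evidence x_j(t)
    precision = 1.0 / prior_var + t / noise_var
    post_mean = (prior_mean / prior_var + x / noise_var) / precision
    post_var = 1.0 / precision             # deterministic, shrinks with t
    return z, t, post_mean, post_var
```

The posterior variance depends only on elapsed time, not on the evidence itself, which is why the expected rewards and elapsed time together summarize the accumulated evidence.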
where the expectation is across choices j and evidence accumulation times T, given the flow of evidence δx_{j} (0:T) from time 0 to T. We first derive the optimal behaviour, or ‘policy’, that maximizes this objective function for single, isolated choices and later generalize it to the more realistic scenario in which the total reward in a long consecutive sequence is maximized.
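The single-choice objective stated above can be written compactly as follows. This is a sketch in notation assumed here (angle brackets for the expectation across choices j and accumulation times T), not a verbatim statement of equation (2).

```latex
\max_{\text{policy}} \;
\Big\langle\, r_j - c\,T \;\Big|\; \delta x(0{:}T) \Big\rangle
```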
Optimal decisions with DDMs with collapsing boundaries
To find the optimal policy, we borrow tools from dynamic programming (DP). One of these tools is the ‘value function’, which can be defined recursively through Bellman’s equation. In what follows, we show that the optimal policy resulting from this value function is described by two time-dependent parallel bounds in the two-dimensional space of current estimates of the true option rewards. These bounds are parallel with unity slopes, approach each other over time and together form a bound on the difference of reward estimates. This difference is efficiently inferred by diffusion models, such that DDMs can implement the optimal strategy for value-based decision-making.
Bellman’s equation for optimal value-based decision-making. To define the value function, assume that the decision maker has accumulated some evidence about the option rewards for some time t. Given this accumulated evidence, the value function returns the total reward the decision maker expects to receive when following the optimal policy. This value includes both the cost for evidence accumulation from time t onwards and the reward resulting from the final choice. The expected rewards of the two options (the means of the posterior beliefs) and elapsed time t are sufficient statistics of the accumulated evidence (see Methods section), such that the value function is defined over these quantities. At each point in time t during evidence accumulation we can either commit to a choice or accumulate more evidence and choose later. When committing to a choice, it is best to choose the option associated with the higher expected reward, such that the total expected reward for choosing immediately, the value for ‘deciding’, is the larger of the two expected rewards (Fig. 3a). When accumulating more evidence for a small duration δt, in contrast, the decision maker observes additional evidence on which she updates her belief about the true rewards while paying accumulation cost cδt. At this stage, she expects to receive a total reward given by the value function evaluated at time t+δt and at the updated expected rewards. Therefore, the total expected reward for accumulating more evidence, the value for ‘waiting’, is the expectation of this future value minus the cost cδt (Fig. 3b), where the expectation is over the distribution of the expected rewards at time t+δt, given their values at time t (see Methods section for an expression of this distribution). The decision maker ought to only accumulate more evidence if doing so promises more total reward, such that the value function can be written recursively as the maximum of the values for deciding and for waiting, a form called Bellman’s equation (Fig. 3a–c,e; see Supplementary Note 1 for formal derivation).
With knowledge of the value function, optimal choices are performed as follows. Before having accumulated any evidence, the subjective expected reward associated with option j equals the mean of the prior belief, such that the total expected reward at this point is given by the value function evaluated at time zero and at the prior means. Once evidence is accumulated, the expected rewards of both options evolve over time, reflecting the accumulated evidence and associated updated belief of the true reward of the choice options. It remains advantageous to accumulate evidence as long as the total expected reward for doing so is larger than that for deciding immediately. As soon as deciding and waiting become equally valuable, it is best to choose the option j associated with the higher expected reward. This optimal policy results in two decision boundaries in the space of expected rewards that might change with time (Fig. 3f). In between these boundaries it remains advantageous to accumulate more evidence, but as soon as either boundary is reached, the associated option ought to be chosen.
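In symbols, the recursion between the values for deciding and waiting can be sketched as below. The notation is assumed here for illustration: \hat{r}_j denotes the expected (posterior mean) reward of option j, and V_d and V_w label the values for deciding and waiting.

```latex
V(t, \hat{r}_1, \hat{r}_2) = \max\Big\{
  \underbrace{\max\{\hat{r}_1, \hat{r}_2\}}_{V_d(\hat{r}_1, \hat{r}_2)},\;
  \underbrace{\big\langle V\big(t+\delta t,\, \hat{r}_1(t+\delta t),\, \hat{r}_2(t+\delta t)\big)
     \,\big|\, \hat{r}_1, \hat{r}_2, t \big\rangle - c\,\delta t}_{V_w(t, \hat{r}_1, \hat{r}_2)}
\Big\}
```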
Parallel optimal decision boundaries. For the task setup considered above, the decision boundaries take a surprisingly simple shape. When plotted in the space of estimated option rewards for some fixed time t, the two boundaries are always parallel to the diagonal (Fig. 3f). Furthermore, they are always above and below this diagonal, reflecting that the diagonal separates the regions in which the choice of either option promises more reward. Here, we provide an informal argument why this is the case.
The argument relies on the fact that, for each time t, the decision boundaries are determined by the intersection between the value for deciding and that for waiting (Fig. 3c,d). Both of these values share the property that, in lines parallel to the diagonal, they are linearly increasing with slope one. Formally, for any fixed time t, any pair of reward estimates and an arbitrary scalar C, adding C to both reward estimates increases both functions by exactly C. This implies that, if they intersect at some point, thus forming part of the decision boundary, they will intersect along the whole line through that point that is parallel to the diagonal (Fig. 3c,e,f). Therefore both decision boundaries are parallel to the diagonal.
How can we guarantee that the values for both deciding and waiting are linearly increasing in lines parallel to the diagonal? For the value for deciding, the larger of the two reward estimates, this is immediately obvious from its definition (Fig. 3a and caption). Showing the same for the value for waiting requires more work, and is done by a backwards induction argument in time (see Methods section for details). Intuitively, after having accumulated evidence about reward for a long time (t→∞), the decision maker expects to gain little further insight from any additional evidence. Therefore, deciding is better than waiting, such that the value function will be that for deciding, which, as previously mentioned, is linearly increasing in lines parallel to the diagonal, providing the base case. Next, it can be shown that, if the value function at time t+δt is linearly increasing in lines parallel to the diagonal, then so is the value of waiting at time t, and, as a consequence, also the value function at time t, essentially because the uncertainty about how the reward estimate evolves over time is shift-invariant (it does not depend on the current expected rewards; see Methods section). The value function at time t is the maximum over the value for deciding and that for waiting. As both increase linearly in lines parallel to the diagonal, so does this value function (Fig. 3c,e). This completes the inductive step.
To summarize, an induction argument backward in time shows that both the values for deciding and waiting increase linearly in lines parallel to the diagonal for all t. As a consequence, the decision boundaries, which lie on the intersection between these two values, are parallel to this diagonal for all times t. In Supplementary Methods, we demonstrate the same property with an argument that does not rely on induction. In both cases, the argument requires, for any fixed t, a stochastic temporal evolution of our expected reward estimates that is shift-invariant with respect to our current estimates. In other words, whatever the current estimates are, the decision maker expects them to evolve in exactly the same way. This property holds for the task setup described above and some generalizations thereof (Supplementary Note 1), but might be violated under certain, more complex scenarios, as described further below.
Optimal decisions with collapsing boundaries, and by diffusion models. A consequence of parallel decision boundaries is that optimal choices can be performed by tracking only the difference in expected option rewards, rather than both expected rewards independently. To see this, consider rotating these boundaries in the space of expected rewards by −45° such that they come to be parallel to the horizontal axis in the new space (Fig. 4a,b). After the rotation they bound the difference of the expected rewards and are independent of their sum.
For Gaussian a priori rewards (Fig. 2a), numerical solutions reveal that the distance between the two boundaries decreases over time, resulting in ‘collapsing boundaries’ (Fig. 4c) that can be explained as follows. In the beginning of the decision, the true option rewards are highly uncertain due to a lack of information. Hence, every small piece of additional evidence will make the running reward estimates substantially more certain. This makes it worthwhile to withhold decisions, which is achieved by far-separated decision boundaries (Fig. 4c for small t). Once a significant amount of evidence is accumulated, further evidence will barely increase certainty about the true rewards. Thus, it becomes preferable to decide quickly rather than to withhold choice for an insignificant increase in choice accuracy (even for similar reward estimates and residual uncertainty about which option yields the higher reward). The narrowing boundary separation ensures such rapid decisions (Fig. 4c for large t).
We can further simplify the optimal decision procedure by implementing the computation of the expected option reward difference by a diffusion model. As long as both options share the same a priori mean reward, such an implementation remains statistically optimal, as the diffusing particle, x(t)≡x_{1}(t)−x_{2}(t) (recall that x_{j}(t) is the sum of all momentary evidence for option j up to time t), and elapsed time t form a set of sufficient statistics of the posterior r_{1}(t)−r_{2}(t) | δx(0:t) over this difference (see Methods section). Furthermore, x_{j}(t) can be interpreted as the sample path of a particle that diffuses with variance σ^{2} and drifts with rate z_{j}. For this reason, x(t) diffuses with variance 2σ^{2} and drifts with rate z_{1}−z_{2}, thus forming the particle in a diffusion model that performs statistically optimal inference. The same mapping between expected reward difference and diffusing particle allows us to map the optimal boundaries on the expected reward difference into boundaries on x(t) (Fig. 4c,d). Therefore, models as simple as diffusion models can implement optimal value-based decision-making.
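The slope-one property used in this argument can be stated compactly. In the sketch below, \hat{r}_1 and \hat{r}_2 denote the two reward estimates and V_d and V_w the values for deciding and waiting; these symbols are notation assumed here. The second line verifies the property for the value for deciding directly from its definition.

```latex
f(t,\, \hat{r}_1 + C,\, \hat{r}_2 + C) = f(t,\, \hat{r}_1,\, \hat{r}_2) + C,
\qquad f \in \{V_d, V_w\}
\\[4pt]
V_d:\quad \max\{\hat{r}_1 + C,\, \hat{r}_2 + C\} = \max\{\hat{r}_1,\, \hat{r}_2\} + C
```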
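A numerical solution of this kind can be sketched by backward induction on a grid over the difference of expected rewards. This is a minimal illustration under stated assumptions, not the procedure used for the figures: it assumes independent options with equal prior means, a common prior variance `sigma_z**2`, evidence variance `sigma**2` per unit time, and the slope-one property, which lets the value be written as the mean of the two estimates plus a function of their difference only. The function name, grid sizes and parameter values are hypothetical.

```python
import numpy as np

def optimal_bound(sigma_z=1.0, sigma=1.0, c=0.1,
                  t_max=3.0, dt=0.01, d_max=3.0, n_grid=301):
    """Backward induction for the bound on the expected-reward difference."""
    d = np.linspace(-d_max, d_max, n_grid)      # grid over r1_hat - r2_hat
    post_var = lambda t: 1.0 / (sigma_z**-2 + t * sigma**-2)
    v_decide = np.abs(d) / 2.0                  # max(r1, r2) minus their mean
    v = v_decide.copy()                         # base case: decide at t_max
    n_steps = int(round(t_max / dt))
    bound = np.empty(n_steps)
    for k in range(n_steps - 1, -1, -1):
        t = k * dt
        # Variance of the change in the difference of posterior means over dt:
        # each mean moves with variance post_var(t) - post_var(t + dt).
        step_var = 2.0 * (post_var(t) - post_var(t + dt))
        # Discretized Gaussian transition kernel, independent of the current
        # estimates (shift-invariance); rows normalized to sum to one.
        P = np.exp(-(d[None, :] - d[:, None])**2 / (2.0 * step_var))
        P /= P.sum(axis=1, keepdims=True)
        v_wait = P @ v - c * dt                 # expected future value minus cost
        v = np.maximum(v_decide, v_wait)        # Bellman recursion
        waiting = v_wait > v_decide
        bound[k] = d[waiting].max() if waiting.any() else 0.0
    return np.arange(n_steps) * dt, bound
```

With these illustrative parameters the returned bound is wide at small t and narrow near `t_max`, mirroring the qualitative collapse described above; the exact shape depends on the grid resolution and on the chosen cost and variances.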
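The mapping from a bound on the expected reward difference to a diffusing particle can be illustrated by simulating single trials. This is a sketch under assumptions: equal prior means, prior variance `sigma_z**2`, and a user-supplied, hypothetical `bound_fn(t)` giving the bound on the expected reward difference at time t. The conversion from x(t) to the expected reward difference uses the standard conjugate Gaussian update.

```python
import numpy as np

def simulate_trial(z1, z2, bound_fn, sigma=1.0, sigma_z=1.0,
                   dt=0.01, t_max=10.0, rng=None):
    """Simulate one choice; returns (chosen option, decision time)."""
    rng = np.random.default_rng() if rng is None else rng
    x, t = 0.0, 0.0
    while t < t_max:
        # Particle drifts with rate z1 - z2 and diffuses with variance 2*sigma^2.
        x += (z1 - z2) * dt + np.sqrt(2.0 * sigma**2 * dt) * rng.standard_normal()
        t += dt
        # Expected reward difference implied by x and t (equal prior means assumed).
        delta = (x / sigma**2) / (sigma_z**-2 + t / sigma**2)
        if abs(delta) >= bound_fn(t):
            break
    # If no bound was reached by t_max, choose by the sign of the particle.
    return (1 if x >= 0 else 2), t
```

For example, `simulate_trial(1.0, 0.5, lambda t: 0.5 * np.exp(-t))` runs one trial with an exponentially collapsing bound that is chosen purely for illustration and is not the optimal bound.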
Moving from single choices to a sequence thereof
So far we have focused on single choices in which the decision maker trades off the expected reward received for this choice with the cost associated with accumulating evidence about the true option rewards. This setup assumes a single choice and, besides the accumulation cost, infinite time to perform it. In realistic scenarios, however, such choices are usually embedded within a sequence of similar choices. Here, we consider how such embedding influences the form of the optimal policy.
Maximizing the reward rate across choices. We assume that each choice within the sequence follows the previous single-choice setup. That is, after onset of the choice options, the decision maker pays a cost c per second for accumulating evidence about the true option rewards. At choice, she receives the true reward associated with the chosen option. The choice is followed by a (possibly stochastic) waiting time of t_{w} seconds on average, after which two new choice options appear and new evidence is accumulated. The true reward associated with either option is drawn, before choice option onset, according to the previously described Gaussian prior (Fig. 2a), such that these rewards remain constant within individual choices, but vary across consecutive choices. Rather than maximizing the total expected reward for each individual choice, we assume that the aim is to maximize the total expected reward within a fixed time period, independent of how many choices are performed within this period. To avoid boundary effects, we assume the period duration to be close to infinite, such that maximizing the total expected reward within this period becomes equivalent to maximizing the reward rate ρ, given by
where the expectation is, as for equation (2), across choices j and evidence accumulation times T, given the flow of evidence. Here, it is critical that we fix the time period while leaving open the number of choices that can be performed. If we instead were to fix the number of choices while leaving open the time to make them, it would again become optimal to maximize the total expected reward for each of these choices separately, such that the optimal policy for each such choice is the same as that for single, isolated choices.
Infinite choice sequences make using the standard value function difficult. This value function returns the total expected reward for all current and future choices when starting from the current state. For an infinite number of such future choices, the value function might thus become infinite. One way to avoid this is to use instead the ‘average-adjusted value’ function, which, in addition to an accumulation cost, penalizes the passage of some time duration δt by −ρδt, where ρ is the reward rate. By equation (4), this reward rate is the total reward received (including accumulation costs) per second, averaged over the whole choice sequence. Penalizing the value function by this reward rate makes explicit the implicit loss of rewards due to potential future choices that the decision maker misses out on when accumulating too much evidence for the current choice. This penalization allows us to treat all choices in the sequence as if they were the same, unique choice. A further consequence of this penalization is that the value function for accumulating more evidence for some duration δt undergoes a more significant change, as accumulating this evidence now comes at a cost −(c+ρ)δt instead of the previous −cδt (see Methods section for the associated Bellman equation). For positive reward rates, ρ>0, this cost augmentation implies more costly evidence accumulation such that it becomes advantageous to accumulate less evidence than for single, isolated choices. This change is implemented by decision boundaries that collapse more rapidly (shown formally in Supplementary Note 1, see also Supplementary Fig. 1). Thus, collapsing decision boundaries implement the optimal policy for both single choices and sequences of choices, with the only difference that these boundaries collapse more rapidly for the latter.
The duration of inter-choice waiting t_{w} modulates this difference, as with t_{w}→∞, maximizing the reward rate described by equation (4) reduces to maximizing the expected reward for single, isolated choices, equation (2). Therefore the policy for single trials is a special case of that for maximizing the reward rate in which the waiting time between consecutive choices becomes close to infinite.
Dependency of the policy on the prior distribution of reward. As shown above, optimal value-based decisions are achieved by accumulating only the difference of reward estimates, as implementable by DDMs. However, this does not mean that the absolute reward magnitudes have no effect on the decision strategy; they affect the decision boundary shape. Figure 5a shows how the optimal decision boundaries depend on the mean of the a priori belief about the true rewards across trials. When both options are likely to be highly rewarding on average, the boundaries should collapse more rapidly to perform more choices within the same amount of time. In the light of a guaranteed high reward, this faster collapse promotes saving time and effort of evidence accumulation. The boundary shape does not change for trial-by-trial variations in true rewards (which are a priori unknown) for the same prior, but only when the prior itself changes. This sensitivity to the prior and associated average rewards also differentiates reward rate-maximizing value-based decision-making from decisions that aim at maximizing the reward for single, isolated choices (Supplementary Note 1), and from classic paradigms of perceptual decision-making (Fig. 5b, see also Discussion section). To summarize, for value-based decisions that maximize the reward rate, the a priori belief about average-reward magnitudes affects the strategy (and, as a consequence, the average reaction time) by modulating the speed of collapse of the decision boundaries, even if choices within individual decisions are only guided by the relative reward estimates between options.
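Following the verbal definition of the reward rate as total reward (including accumulation costs) per unit time, it can be sketched as below. The angle-bracket notation is assumed here, t_w is taken to be the average waiting time, and the expression is a reconstruction consistent with the text rather than a verbatim statement of equation (4).

```latex
\rho = \frac{\big\langle r_j \big\rangle - c\,\big\langle T \big\rangle}
            {\big\langle T \big\rangle + t_w}
```

As t_w grows, the denominator is dominated by the waiting time, so the policy that maximizes ρ approaches the one that maximizes the numerator alone, which is the single-choice objective.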
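One standard way to write such an average-adjusted recursion is sketched below. This is an assumption-laden illustration rather than the equation given in the Methods section: \hat{r}_j denotes the expected reward of option j, \bar{z}_j the prior mean, and the deciding branch is assumed to include the reward-rate penalty for the waiting time t_w and the value of starting the next choice from the prior.

```latex
V(t, \hat{r}_1, \hat{r}_2) = \max\Big\{
  \max\{\hat{r}_1, \hat{r}_2\} - \rho\, t_w + V(0, \bar{z}_1, \bar{z}_2),\;
  \big\langle V\big(t+\delta t,\, \hat{r}_1(t+\delta t),\, \hat{r}_2(t+\delta t)\big)
     \,\big|\, \hat{r}_1, \hat{r}_2, t \big\rangle - (c + \rho)\,\delta t
\Big\}
```

Because average-adjusted values are defined only up to an additive constant, a common normalization in this kind of formulation is V(0, \bar{z}_1, \bar{z}_2) = 0, which then pins down ρ self-consistently.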
The limits of diffusion models for value-based decisions
For all scenarios we have considered so far, diffusion models can implement the optimal decision-making policy. Here, we discuss that this is still the case for some, but not all generalizations of the task. For some tasks, the optimal policy will not even be representable by parallel boundaries in the space of expected reward estimates. This is, for example, the case when the prior/likelihood distributions of reward/evidence are correlated in a particular way (see Methods section and Supplementary Note 1), or when the utility function is nonlinear (see Fig. 6 for an example).
Thus, diffusion models only seem to implement the optimal decision strategy under very constrained circumstances. However, even beyond these circumstances, diffusion models might not be too far off from achieving close-to-optimal performance, but their loss of reward remains to be evaluated in general circumstances. Laboratory experiments could satisfy conditions for diffusion models to be close-to-optimal even in the presence of a nonlinear utility function. Such experiments often use moderate rewards (for example, moderately valued food items, rather than extreme payoffs) in which case a potentially nonlinear utility would be well approximated by a linear function within the tested range of rewards.
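Why a nonlinear utility can break the parallel-boundary structure may be illustrated numerically: the value for deciding no longer increases with slope one along the diagonal. The sketch below is an illustration under assumptions, not the analysis behind Fig. 6. It assumes Gaussian posteriors with a common variance, uses a saturating `np.tanh` utility as a hypothetical example, and evaluates expected utility by Gauss-Hermite quadrature.

```python
import numpy as np

def value_decide(mean1, mean2, post_var, utility, n_nodes=40):
    """Expected utility of the better option under Gaussian posteriors."""
    # Probabilists' Gauss-Hermite nodes/weights; normalizing the weights
    # turns the weighted sum into an expectation under N(0, 1).
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    weights = weights / weights.sum()
    expected = [np.sum(weights * utility(m + np.sqrt(post_var) * nodes))
                for m in (mean1, mean2)]
    return max(expected)

def shift_gain(utility, shift=1.0, mean1=0.2, mean2=-0.1, post_var=0.25):
    """Change in the value for deciding when both estimates shift by `shift`."""
    return (value_decide(mean1 + shift, mean2 + shift, post_var, utility)
            - value_decide(mean1, mean2, post_var, utility))
```

For a linear utility, `shift_gain(lambda r: r)` equals the shift itself, whereas for the saturating case `shift_gain(np.tanh)` is smaller than the shift, so the slope-one property underlying the parallel boundaries no longer holds.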
Discussion
We have theoretically derived the optimal behaviour for value-based decision-making with noisy evidence about rewards. Our analysis revealed that the optimal strategy in a natural problem setup (where values are linear in rewards) reduces to a DDM with time-varying boundaries. This result provides a theoretical basis for why human decision makers seem to feature behaviour in such tasks that, just as in accuracy-based (conventional perceptual) decisions, is well captured by DDMs, despite the profound qualitative difference in task structures (for example, a two-dimensional value function for value-based tasks, but not for accuracy-based ones). Furthermore, we found that the optimal strategy does not always reduce to DDMs if we assume nonlinear relationships between value and reward (Fig. 6), predicting that human behaviour may deviate from DDMs in specific experimental conditions (perceived utility following a nonlinear saturating function of this reward; Fig. 6d); interestingly, such decision boundary structure might be better approximated by ‘correlated RMs’^{11,12}.
Concurrently with our work, another theoretical study by Fudenberg et al. (unpublished work^{22}) has recently focused on optimal evidence accumulation and decision-making for value-based decisions. This study provides a more in-depth mathematical characterization of the optimal policy implemented by a diffusion model with collapsing boundaries. Their analysis, however, is restricted to single, isolated choices, and, unlike ours, does not consider policy changes for reward rate maximization, nor nonlinear utility functions that invalidate the use of diffusion models.
Whether humans and animals use collapsing decision boundaries is a topic of debate in recent accuracy-based^{23} and value-based^{9} decision-making studies. Interestingly, a recent meta-analysis study reports that whether subjects use collapsing boundaries varies strongly across tasks and individuals^{23}. Our theory suggests that the optimal boundary dynamics are sensitive to task demands (for example, reward-rate maximization or correct-rate maximization) as well as the absolute mean reward magnitude (in contrast to perceptual decision-making; see Supplementary Note 2). Thus, subjects might switch their decision strategies depending on those experimental factors, emphasizing the need to carefully control these factors in further studies.
Still, in both daily lives and laboratory experiments, humans can sometimes take a long time to decide between two valuable options, which might reflect suboptimal behaviour or an insufficiently fast collapse of the bound. For instance, a recent empirical study by Oud et al.^{24} reports slower-than-optimal value-based and perceptual choices of human decision makers in a reward rate maximization setting. These slow choices might arise, however, from incompletely or incorrectly learned priors (Supplementary Note 3), and warrant further investigation. Another slowing factor is insufficient time pressure induced by, for example, fixing the number of choices instead of the total duration of the experiment. In this case, the slow reaction times may not reflect a suboptimal strategy. For example, Milosavljevic et al.^{9} have found that subjects can take a surprisingly long time to decide between two high-valued items but, in this experiment, subjects had to perform a fixed number of choices without any time constraint. Their reward at the end of the experiment was determined by drawing one item among all the items selected by the subject^{9}. With such a task design, there is no explicit incentive for making fast choices and, therefore, the optimal strategy does allow for long reaction times. All of the above cases highlight that the seeming irrationality of slow choices between two high-valued options might in fact reflect a completely rational strategy under contrived laboratory settings. Thus, by revealing the optimal policy for value-based decisions, the present theory provides a critical step in studying the factors that determine our decisions about values.
What do collapsing boundaries in diffusion models tell us about the neural mechanisms involved in such decisions? Previous studies concerning perceptual decisions have linked such boundary collapse to a neural ‘urgency signal’ that collectively drives neural activity towards a constant threshold^{7,25}. However, note that in such a setup even a constant (that is, noncollapsing) diffusion model bound realizes a collapsing bound in the decision maker’s posterior belief^{7}. Analogously, a constant diffusion model bound in our setup realizes a collapsing bound on the value estimate difference. Furthermore, how accumulated evidence is exactly coded in the activity of individual neurons or neural populations remains unclear (for example, compare refs 6, 26), and even less is known about value encoding. For these reasons we promote diffusion models for behavioural predictions, but for now refrain from directly predicting neural activity and associated mechanisms. Nonetheless, our theory postulates what kind of information ought to be encoded in neural populations, and as such can guide further empirical research in neural value coding.
Methods
Structure of evidence and evidence accumulation
Here, we assume a slightly more general version of the task than the one we discuss throughout most of the main text, with a correlated prior and a correlated likelihood. Further below we describe how this version relates to the one in the main text. In particular, we assume the prior over true rewards, given by vector z≡(z_{1}, z_{2})^{T}, to be a bivariate Gaussian, z∼N(z̄, Σ_{z}), with mean z̄ and covariance Σ_{z}. In each small time step i of duration δt, the decision maker observes some momentary evidence δx_{i}∼N(zδt, Σδt) that informs her about these true rewards. After accumulating evidence for some time t=nδt, her posterior belief about the true rewards is found by Bayes’ rule, p(z|δx_{1:n})∝p(z)∏_{i=1}^{n}p(δx_{i}|z), and results in
z|δx_{1:n}∼N(Σ(t)(Σ_{z}^{−1}z̄+Σ^{−1}x(t)), Σ(t)),
where we have defined x(t)=∑_{i=1}^{n}δx_{i} as the sum of all momentary evidence up to time t, and Σ(t)=(Σ_{z}^{−1}+tΣ^{−1})^{−1} as the posterior covariance (hereafter, Σ(t) written as a function of time denotes the posterior covariance, rather than the covariance of the evidence, Σ). For the case that experienced reward r≡(r_{1}, r_{2})^{T} equals true reward z, that is r=z, the mean estimated option reward r̂(t) is the mean of the above posterior.
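The posterior update described above can be sketched numerically. This is a minimal illustration, not the authors' code: it assumes momentary evidence with mean zδt and covariance Σδt, so that the accumulated evidence x(t) and elapsed time t suffice to compute the posterior. The function name `posterior` and all numerical values are hypothetical.

```python
import numpy as np

def posterior(z_bar, Sigma_z, Sigma, x_t, t):
    """Posterior mean and covariance of z given accumulated evidence x_t at time t.

    Assumes a Gaussian prior N(z_bar, Sigma_z) and momentary evidence
    N(z*dt, Sigma*dt), so that precisions add linearly in time.
    """
    prior_prec = np.linalg.inv(Sigma_z)   # prior precision
    like_prec = np.linalg.inv(Sigma)      # evidence precision per unit time
    Sigma_t = np.linalg.inv(prior_prec + t * like_prec)
    mean_t = Sigma_t @ (prior_prec @ z_bar + like_prec @ x_t)
    return mean_t, Sigma_t

# Illustrative (made-up) parameters with correlated prior and likelihood.
z_bar = np.array([0.0, 0.0])
Sigma_z = np.array([[1.0, 0.3], [0.3, 1.0]])
Sigma = np.array([[2.0, 0.5], [0.5, 2.0]])

# Accumulate evidence in two stages and compare with a single update.
x1, t1 = np.array([0.4, -0.1]), 0.5
x2, t2 = np.array([0.7, 0.2]), 0.8
mean1, cov1 = posterior(z_bar, Sigma_z, Sigma, x1, t1)
mean_seq, cov_seq = posterior(mean1, cov1, Sigma, x2, t2)
mean_all, cov_all = posterior(z_bar, Sigma_z, Sigma, x1 + x2, t1 + t2)
```

Treating the posterior at time t1 as the prior for the remaining evidence gives the same result as a single update with the summed evidence, which is why x(t) and t can serve as sufficient statistics in this sketch.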
Expected future reward estimates
Finding the optimal policy by solving Bellman’s equation requires computing the distribution of expected future rewards r̂(t+δt) given the current expected rewards r̂(t). Assuming a small δt such that the probability of an eventual boundary crossing becomes negligible, we can find this distribution by the marginalization
p(r̂(t+δt)|r̂(t))=∫p(r̂(t+δt)|δx(t+δt), r̂(t)) p(δx(t+δt)|r̂(t)) dδx(t+δt).
As r̂(t+δt) is the mean of the posterior of z after having accumulated evidence up to time t+δt, it is given by
r̂(t+δt)=Σ(t+δt)(Σ_{z}^{−1}z̄+Σ^{−1}x(t+δt))=Σ(t+δt)(Σ(t)^{−1}r̂(t)+Σ^{−1}δx(t+δt)),
where we have used x(t+δt)=x(t)+δx(t+δt) and Σ_{z}^{−1}z̄+Σ^{−1}x(t)=Σ(t)^{−1}r̂(t), following from the definition of r̂(t). Furthermore, by the generative model for the momentary evidence we have δx(t+δt)|z∼N(zδt, Σδt), and our current posterior is z|r̂(t)∼N(r̂(t), Σ(t)), which, together, gives δx(t+δt)|r̂(t)∼N(r̂(t)δt, Σδt+Σ(t)δt^{2}). With these components, the marginalization results in
r̂(t+δt)|r̂(t)∼N(r̂(t), Σ(t)Σ^{−1}Σ(t)δt),
where we have only kept terms of order δt or lower. An extended version of this derivation is given in Supplementary Note 1.
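As a numerical sanity check on the marginalization, the sketch below compares a first-order covariance of the future estimate with the covariance obtained without dropping higher-order terms. It assumes the Gaussian model with momentary evidence of mean zδt and covariance Σδt; under that assumption the first-order covariance is Σ(t)Σ^{−1}Σ(t)δt. Function names and parameter values are illustrative only.

```python
import numpy as np

def posterior_cov(Sigma_z, Sigma, t):
    """Posterior covariance Sigma(t) = (Sigma_z^-1 + t * Sigma^-1)^-1."""
    return np.linalg.inv(np.linalg.inv(Sigma_z) + t * np.linalg.inv(Sigma))

def transition_cov_first_order(Sigma_z, Sigma, t, dt):
    """Covariance of r_hat(t+dt) given r_hat(t), keeping terms of order dt."""
    Sigma_t = posterior_cov(Sigma_z, Sigma, t)
    return Sigma_t @ np.linalg.inv(Sigma) @ Sigma_t * dt

def transition_cov_exact(Sigma_z, Sigma, t, dt):
    """Same covariance without dropping higher-order terms in dt."""
    Sigma_t = posterior_cov(Sigma_z, Sigma, t)
    Sigma_next = posterior_cov(Sigma_z, Sigma, t + dt)
    # Predictive covariance of the next momentary evidence given r_hat(t).
    cov_dx = Sigma * dt + Sigma_t * dt ** 2
    # r_hat(t+dt) is linear in the new evidence with this weight matrix.
    weight = Sigma_next @ np.linalg.inv(Sigma)
    return weight @ cov_dx @ weight.T

# Illustrative (made-up) parameters.
Sigma_z = np.array([[1.0, 0.3], [0.3, 1.5]])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
approx = transition_cov_first_order(Sigma_z, Sigma, t=0.7, dt=1e-4)
exact = transition_cov_exact(Sigma_z, Sigma, t=0.7, dt=1e-4)
```

For small δt the two matrices agree closely, and the gap shrinks further as δt decreases.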
More specific task setups
Here, we consider two more specific task setups. In the first one, the prior covariance is proportional to the likelihood covariance, that is Σ_{z}=αΣ. This causes the posterior of z to be given by
z|δx_{1:n}∼N((α^{−1}z̄+x(t))/(α^{−1}+t), Σ/(α^{−1}+t)).
In this case, the posterior mean becomes independent of the covariance, and is a weighted mixture of prior mean and accumulated evidence. The distribution over expected future reward estimates becomes r̂(t+δt)|r̂(t)∼N(r̂(t), Σδt/(α^{−1}+t)^{2}). In terms of choosing among lunch menus, a positively correlated prior could correspond to differently skilled cooks working on different days, such that the true rewards associated with the different options fluctuate jointly. A correlated likelihood might correspond to fresh produce in one menu option predicting the same in the other menu option. If the likelihood covariance is proportional to that of the prior, diffusion models still implement the optimal choice policy.
In the second more specific setup we assume both prior and likelihood to be uncorrelated, with covariance matrices given by Σ_{z}=σ_{z}^{2}I and Σ=σ^{2}I. This is the setup discussed throughout most of the work, and results in an equally uncorrelated posterior over z, which for option j is given by equation (1). The distribution over expected future reward estimates is also uncorrelated, and for option j is given by r̂_{j}(t+δt)|r̂_{j}(t)∼N(r̂_{j}(t), σ^{2}δt/(σ^{2}/σ_{z}^{2}+t)^{2}).
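The claim that the posterior mean no longer depends on the covariance when prior and likelihood covariances are proportional can be checked numerically. The sketch below assumes the Gaussian model with momentary evidence of mean zδt and covariance Σδt; the function names, the value of α and the two covariance matrices are made up for illustration.

```python
import numpy as np

def posterior_mean_general(z_bar, Sigma_z, Sigma, x_t, t):
    """Posterior mean for arbitrary prior and likelihood covariances."""
    prior_prec = np.linalg.inv(Sigma_z)
    like_prec = np.linalg.inv(Sigma)
    Sigma_t = np.linalg.inv(prior_prec + t * like_prec)
    return Sigma_t @ (prior_prec @ z_bar + like_prec @ x_t)

def posterior_mean_proportional(z_bar, alpha, x_t, t):
    """Posterior mean when Sigma_z = alpha * Sigma: a weighted mixture of
    the prior mean and the accumulated evidence, with no covariance terms."""
    return (z_bar / alpha + x_t) / (1.0 / alpha + t)

z_bar = np.array([1.0, 1.0])
x_t = np.array([2.5, 0.5])
t, alpha = 1.5, 0.8

# Two very different likelihood covariances, each with a proportional prior.
Sigma_a = np.array([[2.0, 0.5], [0.5, 1.0]])
Sigma_b = np.array([[1.0, -0.3], [-0.3, 3.0]])
mean_a = posterior_mean_general(z_bar, alpha * Sigma_a, Sigma_a, x_t, t)
mean_b = posterior_mean_general(z_bar, alpha * Sigma_b, Sigma_b, x_t, t)
mean_closed = posterior_mean_proportional(z_bar, alpha, x_t, t)
```

Both covariance choices yield the same posterior mean as the closed-form mixture, which is the property that lets a single diffusing particle track the reward difference in this setup.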
A more general scenario than the ones we have discussed so far is that both the decision maker’s a priori belief about the true rewards and the likelihood of the momentary evidence about these rewards are correlated, but the prior covariance is not proportional to the likelihood covariance. Once prior covariance and likelihood covariance are no longer proportional to each other, diffusion models fail to implement the optimal policy. Even then, the optimal policy in the space of expected reward estimates is still given by two boundaries parallel to the identity line, such that we again only need to bound the difference between these estimates. However, these are bounds on expected reward estimate differences, and not on a diffusing particle. Mapping the estimates into a single diffusing particle requires combining them linearly with combination weights that change over time, which is incompatible with the standard diffusion model architecture (although it can be implemented by an extended diffusion model, as shown in ref. 27). Thus, parallel decision boundaries on expected reward estimates do not automatically imply that diffusion models can implement optimal decisions.
Evidence accumulation and decisions with diffusion models
For the class of tasks in which the decision boundaries in the space of expected reward estimates are parallel to the diagonal, the optimal policy can be represented by two boundaries, ξ_{1}(t) and ξ_{2}(t), on the expected reward difference r̂_{1}(t)−r̂_{2}(t), such that evidence is accumulated as long as this difference remains between the two boundaries, and option 1 (option 2) is chosen as soon as it reaches ξ_{1}(t) (ξ_{2}(t)). To implement this policy with diffusion models, we need to find a possibly time-dependent function f(·, t) that maps the expected reward difference into a drifting/diffusing particle dx(t)=μdt+σ_{x}dW_{t}, which drifts with drift μ and diffuses with variance σ_{x}^{2}, and where W_{t} is a Wiener process. Such a mapping allows us to find the boundaries on x(t) that correspond to ξ_{1}(t) and ξ_{2}(t) and that implement the same policy by bounding the particle x(t).
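To make the particle picture concrete, the following sketch simulates a single trial of a diffusion model with time-dependent boundaries using a simple Euler-Maruyama step. It is only an illustration: the function `simulate_ddm`, its arguments and the example boundary shape 1/(1+t) are hypothetical and are not the optimal boundaries derived in this work.

```python
import math
import random

def simulate_ddm(mu, sigma_x, upper, lower, dt=0.001, t_max=10.0, rng=None):
    """Simulate one trial of dx = mu*dt + sigma_x*dW between two boundaries.

    `upper` and `lower` are functions of time returning the boundary heights.
    Returns (choice, decision_time); choice is 1 for the upper boundary,
    2 for the lower boundary, and None if no boundary is hit before t_max.
    """
    rng = rng or random.Random()
    x, t = 0.0, 0.0
    while t < t_max:
        # Euler-Maruyama step: deterministic drift plus Gaussian diffusion.
        x += mu * dt + sigma_x * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        t += dt
        if x >= upper(t):
            return 1, t
        if x <= lower(t):
            return 2, t
    return None, t

# Example with a hypothetical collapsing boundary (illustrative shape only).
choice, rt = simulate_ddm(
    mu=0.5, sigma_x=1.0,
    upper=lambda t: 1.0 / (1.0 + t),
    lower=lambda t: -1.0 / (1.0 + t),
    rng=random.Random(0),
)
```

Passing the boundaries as functions keeps the simulator agnostic about their shape, so constant and collapsing boundaries can be compared with the same code.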
For the general case of a correlated prior and correlated likelihood, as discussed further above, we have , where x(t) drifts and diffuses according to . Using , where , and a Γ that satisfies ΓΓ^{T}=Σ, we find
with a_{j}(t) denoting the elements of vector and b_{ij}(t) being the elements of matrix . The above describes a diffusion process with drift and diffusion that vary over time in different ways. Therefore, we cannot find a function f(·, t) that maps into a diffusing particle with constant drift and diffusion. As a result, we cannot use diffusion models for optimal decisionmaking in this case.
One reason for this incompatibility is that the posterior covariance changes from prior covariance, , to likelihood covariance, , over time and influences the relation between drift and diffusion. If we set the prior covariance proportional to the likelihood covariance, that is , then we can find a mapping to diffusion models. Using the mean of the posterior z from the previous section, we find that , which results in the expected reward difference
where γ_{ij} are the elements of Γ. Now, as long as (a priori, both options have the same true reward), we can use the mapping to map the boundaries in the diffusion model space, which features a particle that drifts with drift μ=z_{1}−z_{2} and diffuses with variance .
The setup that is discussed throughout the main text becomes even simpler, with a diagonal prior covariance, , and a diagonal likelihood covariance . Using the mean of the posterior in equation (1), and again assuming , a similar argument as before shows that the mapping allows us to implement optimal decisionmaking with a diffusion model with drift μ=z_{1}−z_{2} and diffusion variance .
Bellman’s equation for single isolated trials
The rationale behind optimal decision-making in single, isolated trials is explained in the main text and is here repeated only briefly. At each point in time t after onset of the choice options, the decision maker performs the action that promises the largest sum of expected rewards from that point onwards (including the cost for accumulating evidence). Given that at this time the decision maker holds a posterior belief over values with sufficient statistics (t, r̂_{1}(t), r̂_{2}(t)), the sum of expected rewards is denoted by the value function V(t, r̂_{1}, r̂_{2}). The available actions are to either choose option one or two, or to accumulate more evidence and decide later. When deciding immediately, the decision maker would choose the option that is expected to yield the higher reward, such that the value associated with deciding is max{r̂_{1}, r̂_{2}}. Accumulating evidence for another time period δt comes at cost cδt but is expected to yield reward ⟨V(t+δt, r̂_{1}(t+δt), r̂_{2}(t+δt))⟩. Here, the expectation is over how the sufficient statistics are expected to evolve when accumulating more evidence, and is given by the bivariate Gaussian that we have derived further above. Thus, the value for waiting is ⟨V(t+δt, r̂_{1}(t+δt), r̂_{2}(t+δt))⟩−cδt. At any time t, the decision maker chooses the action associated with the higher value, which leads to Bellman’s equation, as given by equation (3) in the main text. This equation on the one hand defines the value function, and on the other hand determines the optimal policy: as long as the value for waiting dominates, the decision maker ought to accumulate more evidence. Once the value for deciding becomes larger, it is best to choose the option that is expected to yield the higher reward.
Bellman’s equation for reward rate maximization
In order to find Bellman’s equation and the associated optimal policy that maximizes the reward rate, we borrow concepts from average-reward dynamic programming (DP; refs 7, 28). We do so to prevent the value function associated with the first trial from becoming infinite when this trial is followed by an infinite number of trials that, in total, promise infinite reward. Average-reward DP penalizes the passage of some time δt by cost ρδt, where ρ is the reward rate, equation (4), which equals the average expected reward per unit time. With this additional cost, the value function turns into the ‘average-adjusted value’ function Ṽ(t, r̂_{1}, r̂_{2}), which is the same for each trial in the sequence, and is defined as follows. Immediate decisions are expected to be rewarded by max{r̂_{1}, r̂_{2}}, followed by some waiting time t_{w} that comes at cost ρt_{w}. After this waiting time, the decision maker holds belief r̂_{j}=z̄_{j} (recall that z̄_{j} denotes the prior mean for option j) at the onset of the next trial, and therefore expects reward Ṽ(0, z̄_{1}, z̄_{2}) in this trial. Thus, the value for deciding immediately is given by max{r̂_{1}, r̂_{2}}−ρt_{w}+Ṽ(0, z̄_{1}, z̄_{2}). The value for accumulating more evidence is the same as for single, isolated trials (see previous section), except that the cost increases from cδt to (c+ρ)δt. Bellman’s equation is again given by taking the maximum over all values. In contrast to single, isolated trials, the policy arising from Bellman’s equation is invariant to global shifts in the value function. That is, we can add some constant C to the average-adjusted value associated with all sufficient statistics, such that Ṽ(t, r̂_{1}, r̂_{2})→Ṽ(t, r̂_{1}, r̂_{2})+C, and would recover the same policy^{28}. As a result, we can arbitrarily fix the average-adjusted value for one such statistic, and all other values follow accordingly. For convenience, we choose Ṽ(0, z̄_{1}, z̄_{2})=0, which results in Bellman’s equation
Ṽ(t, r̂_{1}, r̂_{2})=max{max{r̂_{1}, r̂_{2}}−ρt_{w}, ⟨Ṽ(t+δt, r̂_{1}(t+δt), r̂_{2}(t+δt))⟩−(c+ρ)δt},
where the expectation ⟨·⟩ is taken over the distribution p(r̂(t+δt)|r̂(t)) derived further above. This also gives us a recipe to find ρ: the condition Ṽ(0, z̄_{1}, z̄_{2})=0 will only hold for the correct ρ, such that we can compute Ṽ(0, z̄_{1}, z̄_{2}) for some arbitrary ρ, and then adjust ρ until Ṽ(0, z̄_{1}, z̄_{2})=0 holds. This is guaranteed to provide the desired solution, as Ṽ(0, z̄_{1}, z̄_{2}) is strictly decreasing in ρ as long as t_{w}>0 (rather than t_{w}=0; see Supplementary Note 1).
Bellman’s equation for maximizing the correct rate
We now assume that all that matters to the decision maker is to identify the higher-rewarded option, irrespective of the associated true reward. To do so, we abolish the identity between true and experienced reward, z and r, and instead assume that an experienced reward of r=R_{corr} is associated with choosing option j if z_{j}>z_{i}, i≠j, and a reward of r=R_{incorr} with the alternative choice. This captures the case of maximizing the correct rate for value-based decisions, and also relates closely to simpler perceptual decisions in which the decision maker only gets rewarded for correct choices (for example, ref. 29), as long as the momentary evidence is well-approximated by a Gaussian. Evidence accumulation in this setup remains unchanged from before, as the posterior contains all information required to compute the expected experienced reward. This posterior is fully specified by the sufficient statistics (t, ẑ_{1}(t), ẑ_{2}(t)), where we have defined ẑ_{j}(t) as the posterior mean of z_{j}.
The value function for single, isolated trials changes in two ways. First, it is now defined over ẑ instead of r̂ (previously we had r̂=ẑ, which does not hold anymore). Second, the value for deciding changes as follows. When choosing option one, the decision maker receives reward R_{corr} with probability p(z_{1}>z_{2}|t, ẑ) and reward R_{incorr} with probability p(z_{1}<z_{2}|t, ẑ). Thus, the expected reward associated with this choice is R_{corr}p(z_{1}>z_{2}|t, ẑ)+R_{incorr}p(z_{1}<z_{2}|t, ẑ). The expected reward for option two is found analogously, and results in the value for deciding
max{R_{corr}p(z_{1}>z_{2}|t, ẑ)+R_{incorr}p(z_{1}<z_{2}|t, ẑ), R_{corr}p(z_{1}<z_{2}|t, ẑ)+R_{incorr}p(z_{1}>z_{2}|t, ẑ)}.
As the posterior is Gaussian in all task setups we have considered, the probabilities in the above expression are cumulative Gaussian functions of the sufficient statistics (t, ẑ_{1}, ẑ_{2}). Besides these two changes, the value function and associated Bellman equation remain unchanged (see Supplementary Note 1 for explicit expressions).
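The value for deciding in the correct-rate setting can be sketched as follows. The example assumes a bivariate Gaussian posterior over the true rewards, so that the difference z_{1}−z_{2} is Gaussian and the probability of option one being better is a cumulative Gaussian. The function names, argument names and default rewards are hypothetical.

```python
import math

def normal_cdf(u):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def value_for_deciding(z_hat_1, z_hat_2, var_1, var_2, cov_12,
                       R_corr=1.0, R_incorr=0.0):
    """Expected reward of the better choice when only correctness is rewarded.

    (z_hat_1, z_hat_2) are posterior means; var_1, var_2 and cov_12 are the
    entries of the posterior covariance of the true rewards.
    """
    # The difference z_1 - z_2 is Gaussian with this standard deviation.
    sd_diff = math.sqrt(var_1 + var_2 - 2.0 * cov_12)
    p_one_better = normal_cdf((z_hat_1 - z_hat_2) / sd_diff)
    value_one = R_corr * p_one_better + R_incorr * (1.0 - p_one_better)
    value_two = R_corr * (1.0 - p_one_better) + R_incorr * p_one_better
    return max(value_one, value_two)
```

With equal posterior means both options are correct with probability one half, so the value for deciding reduces to the average of the two rewards, regardless of how large the means are.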
Moving from single, isolated trials to maximizing the correct rate over sequences of trials requires the same changes as when moving from single trials to maximizing the reward rate. In particular, the value function turns into the average-adjusted value function that penalizes the passage of some time δt by ρδt, where ρ is now the correct rate rather than the reward rate. The correct rate is still the average experienced reward (minus accumulation cost) per unit time but, owing to the changed definition of experienced reward, it no longer relates to the true reward, only to whether the option associated with the larger true reward was correctly identified. As before, the value for deciding is additionally penalized by ρt_{w}, and the value for waiting some more time δt to accumulate more evidence incurs an additional cost ρδt but remains unchanged otherwise. The average-adjusted value function is again invariant under addition of a constant, such that we again fix its value at the onset of each trial to zero. This fully specifies the value function and associated Bellman equation, which is provided in Supplementary Note 1.
Linearity of value function for waiting
Here, we show that the value function for waiting increases linearly along lines parallel to the diagonal within the (r̂_{1}, r̂_{2}) space, which is required to show that the optimal decision boundaries are parallel to the diagonal. We do so by a backward induction argument in time. The base case for the induction argument relies on the shape of the value function for large times, t→∞. For such times, the decision maker has incurred a large cost for accumulating evidence up until that time, and also expects to gain little further insight into the true rewards when accumulating more evidence. As a consequence, at such times it will always be better to decide immediately rather than to accumulate more evidence. Therefore, the value function will be given by the value for deciding, max{r̂_{1}, r̂_{2}}, which, as discussed in the previous paragraph, is linearly increasing along lines parallel to the diagonal.
The inductive step will show that, if the value function at time t+δt is linearly increasing along lines parallel to the diagonal, then so is the value of waiting at time t, and, as a consequence, also the value function at time t. The value of waiting at time t is given by ⟨V(t+δt, r̂_{1}(t+δt), r̂_{2}(t+δt))⟩−cδt, where the expectation is over future expected rewards r̂_{1}(t+δt) and r̂_{2}(t+δt), and reflects the uncertainty about how the reward estimate evolves over time. For our case, the distribution describing this uncertainty is a bivariate Gaussian (as described in the previous sections), centred on the current expected rewards, (r̂_{1}(t), r̂_{2}(t)), and with a covariance that only depends on t. Its shift-invariant shape causes the expectation to be a smoothed version of V(t+δt, ·, ·) that, like V(t+δt, ·, ·) itself, increases linearly along lines parallel to the diagonal. The value of waiting is this expectation shifted by the constant momentary cost −cδt, and therefore also has this property (Fig. 3b). This establishes that, if the value function at time t+δt is linearly increasing along lines parallel to the diagonal, then so is the value of waiting at time t. The value function at time t is the maximum over the value for deciding and that for waiting. As both increase linearly along lines parallel to the diagonal, so does this value function (Fig. 3c,e). This completes the inductive step.
The induction argument shows that both the value for deciding and that for waiting increase linearly with slope one along lines parallel to the diagonal for all t. This immediately means that, if they intersect at some point (r̂_{1}, r̂_{2}), then they will intersect along the whole line through this point that is parallel to the diagonal (Fig. 3c). As a consequence, the decision boundaries, which lie on the intersection between these two values, are parallel to this diagonal for all times t. See Supplementary Note 1 for a proof of the same property with an argument that does not rely on induction.
Finding the optimal policy numerically
To find the optimal policy for the above cases numerically, we computed the value function by backward induction^{30}, using Bellman’s equation. Bellman’s equation expresses the value function at time t as a function of the value function at time t+δt. Therefore, if we know the value function at some time T, we can compute it at time T−δt, then T−2δt, and so on, until time t=0. We usually chose some large T, significantly beyond the time horizon of interest, at which we set the value function to the value for deciding, independent of the value at any t>T. For any time t≤T, we represented the value function over the remaining two parameters, (ẑ_{1}, ẑ_{2}) (or (r̂_{1}, r̂_{2}) in the value-based task), numerically over an equally spaced two-dimensional grid. This grid allowed us to compute the integral that represents the expectation over the future value numerically by the two-dimensional convolution between the future value V(t+δt, ẑ_{1}(t+δt), ẑ_{2}(t+δt)) and the transition probability distribution p(ẑ(t+δt)|ẑ(t)). For any such time, the optimal decision boundaries were found on this grid by the intersection of the value for deciding and that for waiting. We handled boundary effects in space and time by significantly extending the grid beyond the area of interest and cropping the value function after fully computing it over the extended range.
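The backward induction described above can be sketched for single, isolated trials in the uncorrelated value-based setup. This is a simplified illustration rather than the authors' implementation: it assumes an isotropic Gaussian transition with per-option variance σ^{2}δt/(σ^{2}/σ_{z}^{2}+t)^{2}, replaces the two-dimensional convolution by two one-dimensional passes (valid for a separable kernel), and pads the grid edges instead of extending and cropping it. All function names and parameter values are made up.

```python
import numpy as np

def gaussian_kernel(std, step):
    """Discretised, normalised 1D Gaussian kernel on a grid of spacing `step`."""
    half = max(1, int(np.ceil(4.0 * std / step)))
    u = np.arange(-half, half + 1) * step
    k = np.exp(-0.5 * (u / std) ** 2)
    return k / k.sum()

def smooth_axis(v, kernel, axis):
    """Convolve a 2D array with `kernel` along one axis, using edge padding."""
    half = len(kernel) // 2
    pad = [(0, 0), (0, 0)]
    pad[axis] = (half, half)
    vp = np.pad(v, pad, mode="edge")
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="valid"), axis, vp)

def backward_induction(grid, T=2.0, dt=0.02, c=0.5, sigma2=1.0, sigma_z2=1.0):
    """Value function at t=0 on a grid of reward estimates (single trials)."""
    step = grid[1] - grid[0]
    r1, r2 = np.meshgrid(grid, grid, indexing="ij")
    v_decide = np.maximum(r1, r2)      # value for deciding immediately
    v = v_decide.copy()                # at time T, deciding is assumed best
    for n in range(int(round(T / dt)) - 1, -1, -1):
        t = n * dt
        # Standard deviation of the change in each estimate over dt (assumed form).
        std = np.sqrt(sigma2 * dt) / (sigma2 / sigma_z2 + t)
        k = gaussian_kernel(std, step)
        # Expected future value minus accumulation cost: value for waiting.
        v_wait = smooth_axis(smooth_axis(v, k, 0), k, 1) - c * dt
        v = np.maximum(v_decide, v_wait)
    return v, v_wait, v_decide

grid = np.linspace(-3.0, 3.0, 61)
v0, v_wait0, v_decide0 = backward_induction(grid)
# Grid points at which accumulating more evidence is preferred at t=0.
keep_sampling = v_wait0 > v_decide0
```

In this sketch the decision boundary at t=0 is the edge of the `keep_sampling` region. Values near the border of the grid are affected by the padding and would need to be cropped, in the spirit of the grid extension described in the text.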
In the reward rate and correct rate cases, computing the value function requires knowledge of the corresponding rate ρ. This ρ was unknown, but could be found from the condition that the average-adjusted value at trial onset equals zero. This value is strictly decreasing in ρ (Supplementary Note 1), such that we could initially assume an arbitrary ρ, for which we computed the value at trial onset. The correct ρ was then found by iterating this computation within a root-finding procedure until the condition was satisfied.
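The search for ρ can be illustrated with a simple bisection, which is one possible root-finding procedure; the text does not state which one was used. The sketch relies only on the stated monotonicity: the value at trial onset decreases strictly with ρ. The callable `value_at_onset` is a hypothetical stand-in for a full backward-induction run at a given ρ.

```python
def find_rate(value_at_onset, lo, hi, tol=1e-8, max_iter=200):
    """Find rho such that value_at_onset(rho) == 0 by bisection.

    `value_at_onset` must be strictly decreasing in rho, and the interval
    [lo, hi] must bracket the root.
    """
    if value_at_onset(lo) < 0.0 or value_at_onset(hi) > 0.0:
        raise ValueError("interval [lo, hi] does not bracket the root")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if value_at_onset(mid) > 0.0:
            lo = mid   # value still positive: the rate guess is too small
        else:
            hi = mid   # value non-positive: the rate guess is too large
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Toy stand-in for the expensive value computation (illustration only).
rho = find_rate(lambda r: 1.0 - 2.0 * r, lo=0.0, hi=2.0)
```

Because each evaluation of the value at trial onset requires a full backward-induction pass, a bracketing method that needs only the sign of the function is a natural, if not the fastest, choice.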
The following parameters were used to generate the figures. We set the prior mean as , except for Fig. 5 where we varied while fixing . The prior variance was , and observation noise , for both options. We used a grid spanning and , in steps of 0.4 in both dimensions. The maximum time to consider was set to T=5 s, with timesteps of size δt=0.005 s for backward induction. To focus on the effect of reward rate, we assumed no explicit cost of evidence accumulation, c=0 and a waiting time t_{w} set to 0.5 s.
Data availability
The authors declare that the data supporting the findings of this study are available within the article and its Supplementary Information File.
Additional information
How to cite this article: Tajima, S. et al. Optimal policy for valuebased decisionmaking. Nat. Commun. 7:12400 doi: 10.1038/ncomms12400 (2016).
References
1. Link, S. W. & Heath, R. A. A sequential theory of psychological discrimination. Psychometrika 40, 77–105 (1975).
2. Ratcliff, R. A theory of memory retrieval. Psychol. Rev. 85, 59–108 (1978).
3. Gold, J. I. & Shadlen, M. N. Neural computations that underlie decisions about sensory stimuli. Trends Cogn. Sci. 5, 10–16 (2001).
4. Wald, A. Sequential tests of statistical hypotheses. Ann. Math. Stat. 16, 117–186 (1945).
5. Wald, A. & Wolfowitz, J. Optimum character of the sequential probability ratio test. Ann. Math. Stat. 19, 326–339 (1948).
6. Kira, S. et al. A neural implementation of Wald’s sequential probability ratio test. Neuron 85, 861–873 (2015).
7. Drugowitsch, J., Moreno-Bote, R., Churchland, A. K., Shadlen, M. N. & Pouget, A. The cost of accumulating evidence in perceptual decision making. J. Neurosci. 32, 3612–3628 (2012).
8. Krajbich, I., Armel, C. & Rangel, A. Visual fixations and the computation and comparison of value in simple choice. Nat. Neurosci. 13, 1292–1298 (2010).
9. Milosavljevic, M., Malmaud, J., Huth, A., Koch, C. & Rangel, A. The drift diffusion model can account for the accuracy and reaction time of value-based choices under high and low time pressure. Judgm. Decis. Mak. 5, 437–449 (2010).
10. Krajbich, I. & Rangel, A. Multialternative drift-diffusion model predicts the relationship between visual fixations and choice in value-based decisions. Proc. Natl Acad. Sci. USA 108, 13852–13857 (2011).
11. Vickers, D. Evidence for an accumulator model of psychophysical discrimination. Ergonomics 1, 37–58 (1970).
12. Teodorescu, A. R. & Usher, M. Disentangling decision models: from independence to competition. Psychol. Rev. 120, 1–38 (2013).
13. Basten, U., Biele, G., Heekeren, H. R. & Fiebach, C. J. How the brain integrates costs and benefits during decision making. Proc. Natl Acad. Sci. USA 107, 21767–21772 (2010).
14. Louie, K., Khaw, M. W. & Glimcher, P. W. Normalization is a general neural mechanism for context-dependent decision making. Proc. Natl Acad. Sci. USA 110, 6139–6144 (2013).
15. Pirrone, A., Stafford, T. & Marshall, J. A. R. When natural selection should optimize speed-accuracy trade-offs. Front. Neurosci. 8, 1–5 (2014).
16. Pais, D. et al. A mechanism for value-sensitive decision-making. PLoS ONE 8, e73216 (2013).
17. Gao, J., Tortell, R. & McClelland, J. L. Dynamic integration of reward and stimulus information in perceptual decision-making. PLoS ONE 6, 5–7 (2011).
18. Feng, S., Holmes, P., Rorie, A. & Newsome, W. T. Can monkeys choose optimally when faced with noisy stimuli and unequal rewards? PLoS Comput. Biol. 5, e1000284 (2009).
19. Wang, X. Probabilistic decision making by slow reverberation in cortical circuits. Neuron 36, 955–968 (2002).
20. Wang, X. J. Decision making in recurrent neuronal circuits. Neuron 60, 215–234 (2008).
21. Brunton, B. W., Botvinick, M. M. & Brody, C. D. Rats and humans can optimally accumulate evidence for decision-making. Science 340, 95–98 (2013).
22. Fudenberg, D., Strack, P. & Strzalecki, T. Stochastic choice and optimal sequential sampling (2015). Available at SSRN: http://ssrn.com/abstract=2602927 or http://dx.doi.org/10.2139/ssrn.2602927.
23. Hawkins, G. E., Forstmann, B. U., Wagenmakers, E.-J., Ratcliff, R. & Brown, S. D. Revisiting the evidence for collapsing boundaries and urgency signals in perceptual decision-making. J. Neurosci. 35, 2476–2484 (2015).
24. Oud, B. et al. Irrational time allocation in decision-making. Proc. R. Soc. B Biol. Sci. 283, 20151439 (2016).
25. Churchland, A. K., Kiani, R. & Shadlen, M. N. Decision-making with multiple alternatives. Nat. Neurosci. 11, 693–702 (2008).
26. Beck, J. M. et al. Probabilistic population codes for Bayesian decision making. Neuron 60, 1142–1152 (2008).
27. Drugowitsch, J., DeAngelis, G. C., Klier, E. M., Angelaki, D. E. & Pouget, A. Optimal multisensory decision-making in a reaction-time task. eLife 2014, 1–19 (2014).
28. Mahadevan, S. Average reward reinforcement learning: foundations, algorithms, and empirical results. Mach. Learn. 22, 159–195 (1996).
29. Kim, J. N. & Shadlen, M. N. Neural correlates of a decision in the dorsolateral prefrontal cortex of the macaque. Nat. Neurosci. 2, 176–185 (1999).
30. Brockwell, A. E. & Kadane, J. B. A gridding method for Bayesian sequential decision problems. J. Comput. Graph. Stat. 12, 566–584 (2003).
Acknowledgements
This work was partially supported by Swiss National Foundation #31003A_143707 and a grant from the Simons Foundation (#325057).
Author information
Contributions
S.T., J.D. and A.P. conceived the study. S.T. and J.D. developed the theory and conducted the mathematical analysis. S.T. performed the simulations. S.T., J.D. and A.P. interpreted the results and wrote the paper.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Figure 1, Supplementary Notes 1–3 and Supplementary References (PDF 745 kb)
Rights and permissions
This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
About this article
Cite this article
Tajima, S., Drugowitsch, J. & Pouget, A. Optimal policy for valuebased decisionmaking. Nat Commun 7, 12400 (2016). https://doi.org/10.1038/ncomms12400