Optimal policy for value-based decision-making

For decades now, normative theories of perceptual decisions, and their implementation as drift diffusion models, have driven and significantly improved our understanding of human and animal behaviour and the underlying neural processes. While similar processes seem to govern value-based decisions, we still lack the theoretical understanding of why this ought to be the case. Here, we show that, similar to perceptual decisions, drift diffusion models implement the optimal strategy for value-based decisions. Such optimal decisions require the models' decision boundaries to collapse over time, and to depend on the a priori knowledge about reward contingencies. Diffusion models only implement the optimal strategy under specific task assumptions, and cease to be optimal once we start relaxing these assumptions, by, for example, using non-linear utility functions. Our findings thus provide the much-needed theory for value-based decisions, explain the apparent similarity to perceptual decisions, and predict conditions under which this similarity should break down.

In everyday ambiguous and noisy environments, decision-making requires the accumulation of evidence over time. In perceptual decision-making tasks (for example, discriminating a motion direction), choices and reaction times are well fit by drift diffusion models (DDMs) [1][2][3]. These models represent the accumulated belief informed by sensory evidence as the location of a diffusing particle that triggers a decision once it reaches one of two decision boundaries. DDMs are known to implement theoretically optimal algorithms, such as the sequential likelihood ratio test [4][5][6] and more general algorithms that handle varying task difficulty [7].
Recently, DDMs have been shown to also describe human behaviour in value-based decisions, where subjects compare the endogenous values of rewarding items (for example, deciding between two lunch options). This suggests that humans perform value-based decisions by computations similar to those used for standard perceptual decisions (such as visual discrimination of random dot motion directions). In this case, the DDMs are driven only by the difference in item values, and thus predict the choices to be insensitive to the absolute values of the compared items (Fig. 1) [8][9][10]. In particular, relying only on the relative value means that it might take longer on average to decide between two equally good options than between items of very different values.
This raises an important question: do DDMs indeed implement the optimal strategy for value-based decisions? Intuitively, absolute values should also influence the decision strategy, such that relying only on relative values appears suboptimal. In particular, it seems unreasonable to wait for a long time to decide between two nearly identical, highly rewarding options. Nonetheless, DDMs or related models are generally better at explaining human behaviour than alternative models. For example, race models (RMs) assume independent 'races' to accumulate evidence for individual options. Once one of these races reaches a decision criterion, the corresponding choice is triggered [11,12]. Even though RMs are sensitive to absolute choice values and as such predict more rapid choices for higher rewarded options, they fit human behaviour well neither in perceptual decision-making tasks [12] nor in value-based decision tasks, in which decisions are usually better described by relying only on relative values [13,14]. Does this mean that humans use DDMs even though these models implement suboptimal strategies, or that DDMs indeed implement the optimal strategy for value-based choices? What is clear is that we need to understand (i) what the optimal strategy for value-based decisions is, (ii) why value-based and perceptual decisions seem to be fitted by the same class of models (DDMs) despite the qualitative difference between these tasks and (iii) to which degree value-based and perceptual decisions differ in terms of their normative computational strategies.
In this paper, we derive the theoretically optimal strategy for value-based decisions, and show that this strategy is in fact equivalent to a particular class of DDMs that feature 'collapsing boundaries' whose distance shrinks over time. We show that the exact shape of these boundaries and the associated average reaction times depend on average-reward magnitudes even if decisions within individual trials are only guided by the relative reward between choice options. Finally, we highlight the difference between value-based and standard perceptual decisions, reveal specific conditions under which the optimality of DDMs is violated, and show how to reconcile the ongoing debate on whether decision makers are indeed using collapsing decision boundaries. In contrast to previous work that assumed particular a priori mechanisms underlying value-based choices, such as RMs or DDMs [15][16][17][18], our work instead deduces optimal decision-making mechanisms based solely on a description of the information available to the decision maker. Thus, the use of diffusion models for value-based choices is not an a priori assumption of our work, but rather a result that follows from the normative decision-making strategy.

Results
Problem setup and aim. Consider a decision maker choosing between options that yield potentially different rewards (or 'values'), for example, choosing between two lunch menu options in the local restaurant. If the decision maker knew these rewards precisely and immediately, then she should instantly choose the more rewarding option. However, in realistic scenarios, the reward associated with either option is uncertain a priori. This uncertainty might, for example, arise if she has a priori limited information about the choice options. Then, it is better to gather more evidence about the reward associated with the compared options before committing to a choice (for example, when choosing among lunch menus, we can reduce uncertainty about the value of either menu by contemplating the composition of each menu course separately and how these separate courses complement each other). However, how much evidence should we accumulate before committing to a choice? Too little evidence might result in the choice of the lower-rewarding option (the less appreciated lunch menu), whereas long evidence accumulation comes at the cost of both time and effort (for example, missing the passing waiter yet another time). In what follows, we formalize how to best trade off speed and accuracy of such choices, and then derive how the decision maker ought to behave in such scenarios. We first introduce each component of the decision-making task in its most basic form, and discuss generalizations thereof in later sections.
Figure 1 | The purple lines represent the decision boundaries. The red arrows indicate the mean drift rate. The grey fluctuating traces illustrate sample trajectories of a particle that drifts in the space of relative value between two given options. (left) If one option is preferred over the other (that is, yields higher reward), the mean drift is biased toward the boundary of the preferred option, causing the particle to hit the decision boundary within a relatively short time. (right) However, if the given options are equally good, the DDM assumes a mean drift without any bias, requiring a much longer time for the particle to hit either decision boundary, even if both options are highly rewarding.
We assume that, at the beginning of each trial, the two options have associated true rewards, $z_1$ and $z_2$, which are each stochastically drawn from separate normal distributions with a fixed mean $\bar{z}_j$ for option $j \in \{1,2\}$ and common variance $\sigma_z^2$. These true rewards are unknown to the decision maker, as they are never observed directly. Instead, we assume that the decision maker observes some momentary evidence with mean $z_j\,dt$, $dx_{j,i} \sim \mathcal{N}(z_j\,dt, \sigma^2 dt)$, for both options $j \in \{1,2\}$ simultaneously in small time-steps $i$ of duration $dt$. Note that variability (and associated ambiguity) of the momentary evidence can arise through noise sources that are internal or external to the decision maker; we discuss these sources in more detail further below.
Before observing any momentary evidence, we assume that the decision maker holds a normally distributed belief $z_j \sim \mathcal{N}(\bar{z}_j, \sigma_z^2)$ with mean $\bar{z}_j$ and variance $\sigma_z^2$, which are, respectively, the mean and variance of the distribution from which the rewards are drawn at the beginning of each trial. In other words, this a priori belief corresponds to the actual distribution from which the true rewards are drawn (that is, the decision maker uses the correct generative model), and entails that option $j$ is most likely to yield reward $\bar{z}_j$, but might also yield other rewards, with the spread of rewards around $\bar{z}_j$ controlled by the level of uncertainty $\sigma_z^2$ about $z_j$. For now, we only consider the case in which the amounts of reward associated with both options are uncorrelated and, on average, the same ($\bar{z}_1 = \bar{z}_2$). In terms of choosing between lunch menu options, either menu would a priori yield the same reward, and the true rewards of the two menu options are drawn independently of each other from the aforementioned normal distribution (Fig. 2a). Later, we discuss the consequences of a correlation between true option values.
As soon as she is presented with sensory evidence $dx_{j,i}$, the decision maker accumulates further information about the rewards associated with either choice option. This momentary evidence $dx_{j,i}$ reveals noisy information about the true reward $z_j$, such that each additional piece of momentary evidence reduces the uncertainty about this reward. We emphasize that neither of the true rewards is ever observed without noise. As a result, the decision maker needs to accumulate evidence to reduce uncertainty about the underlying true rewards by averaging out the noise. Longer evidence accumulation results in a better average and lower associated uncertainty.
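The way accumulation averages out noise can be sketched numerically. The snippet below is a minimal illustration, assuming the Gaussian prior and Gaussian momentary evidence described above and using the standard conjugate Gaussian update; the function name `posterior` and all parameter values are hypothetical choices made for this example only.

```python
import numpy as np

def posterior(x_t, t, z_bar, sigma_z, sigma):
    """Posterior mean and variance of the true reward z_j.

    x_t     : accumulated evidence, i.e. the sum of momentary evidence up to time t
    z_bar   : prior mean, sigma_z : prior standard deviation
    sigma   : standard deviation of the evidence noise per unit time
    """
    precision = sigma_z**-2 + t * sigma**-2          # prior precision + evidence precision
    mean = (z_bar * sigma_z**-2 + x_t * sigma**-2) / precision
    return mean, 1.0 / precision

# Illustrative (assumed) parameter values.
rng = np.random.default_rng(0)
z_bar, sigma_z, sigma, dt, n = 0.0, 1.0, 1.0, 0.01, 500

z_true = rng.normal(z_bar, sigma_z)                  # true reward, never observed directly
dx = rng.normal(z_true * dt, sigma * np.sqrt(dt), size=n)  # momentary evidence
x = np.cumsum(dx)                                    # accumulated evidence x(t)
t = dt * np.arange(1, n + 1)

means, variances = posterior(x, t, z_bar, sigma_z, sigma)
print(f"true reward {z_true:.3f}, final estimate {means[-1]:.3f}, "
      f"final variance {variances[-1]:.3f}")
```

The posterior variance depends only on elapsed time, not on the particular evidence observed, which is why time and the accumulated evidence together summarize everything the decision maker knows.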
The noise in the momentary evidence itself can have both internal and external sources. External sources constitute the potentially stochastic nature of stimuli, perceptual noise, ambiguity and incomplete knowledge. For example, having not yet read the main course and dessert of a particular menu option causes uncertainty about the option's value due to incomplete knowledge. Internal sources could result from uncertain memory, or value inference that extends over time.
One example of such value inference would be to sequentially contemplate the value of different features of a particular menu course over time.
Formally, after observing the value-related evidence $dx_j(0:t)$ from time 0 (onset of momentary evidence) to some time $t$, the decision maker's posterior belief about the true reward, $z_j$, of option $j$ is given by
$$z_j \mid dx_j(0:t) \sim \mathcal{N}\!\left( \frac{\sigma_z^{-2}\,\bar{z}_j + \sigma^{-2}\,x_j(t)}{\sigma_z^{-2} + t\,\sigma^{-2}},\; \frac{1}{\sigma_z^{-2} + t\,\sigma^{-2}} \right). \qquad (1)$$
The posterior mean is an evidence-weighted combination of the a priori mean $\bar{z}_j$ and the time-averaged accumulated evidence $x_j(t)/t$, where $x_j(t = n\,dt) = \sum_{i=1}^{n} dx_{j,i}$, and the posterior variance (that is, uncertainty) decreases monotonically with time (see Methods section). Due to uncertainty in the momentary evidence, the accumulated evidence $x_j(t)$ itself describes a stochastic process. Here, and in contrast to other models of decision-making (both perceptual [19,20] and value-based [15,16]), all stochasticity in the accumulated evidence results from ambiguity in the momentary evidence itself, rather than from noise in the mechanisms that implement the decision-making process. In other words, the process responsible for the accumulation of the evidence is assumed to be noiseless, an assumption consistent with recent neurophysiological recordings [21].
What are the costs and rewards that the decision maker incurs during the course of her decisions? In terms of costs, we assume that the decision maker pays a cost $c$ per second of accumulating evidence, from onset of the choice options until an option is chosen. This cost could, for example, be an explicit cost for delayed choices, or represent the effort induced by evidence accumulation. In the context of choosing between lunch menus, this cost might arise from missing the passing waiter yet again, or from being late for a post-lunch meeting. Choosing option $j$ is associated with experiencing some reward $r_j$ that is a function of the true reward $z_j$ associated with this option, as, for example, when experiencing reward for consuming the lunch. For now, we assume experienced and true reward to be equivalent, that is, $r_j = z_j$.
For a single choice, the overall aim of the decision maker is to maximize expected reward minus expected cost,
$$\langle r_j - cT \rangle, \qquad (2)$$
where the expectation is across choices $j$ and evidence accumulation times $T$, given the flow of evidence $dx_j(0:T)$ from time 0 to $T$. We first derive the optimal behaviour, or 'policy', that maximizes this objective function for single, isolated choices and later generalize it to the more realistic scenario in which the total reward in a long consecutive sequence of choices is maximized.
Optimal decisions with DDMs with collapsing boundaries. To find the optimal policy, we borrow tools from dynamic programming (DP). One of these tools is the 'value function', which can be defined recursively through Bellman's equation. In what follows, we show that the optimal policy resulting from this value function is described by two time-dependent parallel bounds in the two-dimensional space of current estimates of the true option rewards. These bounds are parallel with unity slopes, approach each other over time and together form a bound on the difference of reward estimates. This difference is efficiently inferred by diffusion models, such that DDMs can implement the optimal strategy for value-based decision-making.
Bellman's equation for optimal value-based decision-making. To define the value function, assume that the decision maker has accumulated some evidence about the option rewards for some time $t$. Given this accumulated evidence, the value function returns the total reward the decision maker expects to receive when following the optimal policy. This value includes both the cost for evidence accumulation from time $t$ onwards and the reward resulting from the final choice. The expected rewards, $\hat{r}_j(t) = \langle r_j \mid dx(0:t) \rangle$, and elapsed time $t$ are sufficient statistics of the accumulated evidence (see Methods section), such that the value function is defined over these quantities. At each point in time $t$ during evidence accumulation we can either commit to a choice or accumulate more evidence and choose later. When committing to a choice, it is best to choose the option associated with the higher expected reward, such that the total expected reward for choosing immediately is given by the value for 'deciding', $V_d(\hat{r}_1, \hat{r}_2) = \max\{\hat{r}_1, \hat{r}_2\}$ (Fig. 3a).
When accumulating more evidence for a small duration $dt$, in contrast, the decision maker observes additional evidence, on which she updates her belief about the true rewards, while paying accumulation cost $c\,dt$. At this stage, she expects to receive a total reward of $V(t+dt, \hat{r}_1(t+dt), \hat{r}_2(t+dt))$. Therefore, the total expected reward for accumulating more evidence is given by the value for 'waiting', $\langle V(t+dt, \hat{r}_1(t+dt), \hat{r}_2(t+dt)) \rangle - c\,dt$ (Fig. 3b), where the expectation is over the distribution of future expected rewards, $\hat{r}_1(t+dt)$ and $\hat{r}_2(t+dt)$, given that they are $\hat{r}_1$ and $\hat{r}_2$ at time $t$ (see Methods section for an expression of this distribution). The decision maker ought to only accumulate more evidence if doing so promises more total reward, such that the value function can be written recursively in a form called Bellman's equation (Fig. 3a-c,e; see Supplementary Note 1 for formal derivation),
$$V(t, \hat{r}_1, \hat{r}_2) = \max\left\{ V_d(\hat{r}_1, \hat{r}_2),\; \langle V(t+dt, \hat{r}_1(t+dt), \hat{r}_2(t+dt)) \rangle - c\,dt \right\}. \qquad (3)$$
With knowledge of the value function, optimal choices are performed as follows. Before having accumulated any evidence, the subjective expected reward associated with option $j$ equals the mean of the prior belief, $\hat{r}_j = \bar{z}_j$, such that the total expected reward at this point is given by $V(0, \bar{z}_1, \bar{z}_2)$. Once evidence is accumulated, $\hat{r}_1$ and $\hat{r}_2$ evolve over time, reflecting the accumulated evidence and associated updated belief of the true reward of the choice options. It remains advantageous to accumulate evidence as long as the total expected reward for doing so is larger than that for deciding immediately. As soon as deciding and waiting become equally valuable, that is, $V_d(\hat{r}_1, \hat{r}_2) = \langle V(t+dt, \hat{r}_1(t+dt), \hat{r}_2(t+dt)) \rangle - c\,dt$, it is best to choose the option $j$ associated with the higher expected reward $\hat{r}_j$. This optimal policy results in two decision boundaries in $(\hat{r}_1, \hat{r}_2)$-space that might change with time (Fig. 3f). In between these boundaries it remains advantageous to accumulate more evidence, but as soon as either boundary is reached, the associated option ought to be chosen.
Parallel optimal decision boundaries. For the task setup considered above, the decision boundaries take a surprisingly simple shape. When plotted in the $(\hat{r}_1, \hat{r}_2)$-space of estimated option rewards for some fixed time $t$, the two boundaries are always parallel to the diagonal $\hat{r}_1 = \hat{r}_2$ (Fig. 3f). Furthermore, they are always above and below this diagonal, reflecting that the diagonal separates the regions in which the choice of either option promises more reward. Here, we provide an informal argument why this is the case.
The argument relies on the fact that, for each time $t$, the decision boundaries are determined by the intersection between the value for deciding and that for waiting (Fig. 3c,d). Both of these values share the property that, along lines parallel to the diagonal, they are linearly increasing with slope one. Formally, both functions satisfy $f(t, \hat{r}_1 + C, \hat{r}_2 + C) = f(t, \hat{r}_1, \hat{r}_2) + C$ for any fixed time $t$, reward estimates $\hat{r}_1$ and $\hat{r}_2$, and arbitrary scalar $C$. This implies that, if they intersect at some point $(\hat{r}_1^*, \hat{r}_2^*)$, thus forming part of the decision boundary, they will intersect along the whole line $(\hat{r}_1^* + C, \hat{r}_2^* + C)$ that is parallel to the diagonal (Fig. 3c,e,f). Therefore both decision boundaries are parallel to the diagonal.
How can we guarantee that the values for both deciding and waiting are linearly increasing along lines parallel to the diagonal? For the value for deciding, $V_d(\hat{r}_1, \hat{r}_2) = \max\{\hat{r}_1, \hat{r}_2\}$, this is immediately obvious from its definition (Fig. 3a and caption). Showing the same for the value for waiting requires more work, and is done by a backwards induction argument in time (see Methods section for details). Intuitively, after having accumulated evidence about reward for a long time ($t \to \infty$), the decision maker expects to gain little further insight from any additional evidence. Therefore, deciding is better than waiting, such that the value function will be that for deciding, $V(t, \hat{r}_1, \hat{r}_2) = V_d(\hat{r}_1, \hat{r}_2)$, which, as previously mentioned, is linearly increasing along lines parallel to the diagonal, providing the base case. Next, it can be shown that, if the value function at time $t + dt$ is linearly increasing along lines parallel to the diagonal, then so is the value of waiting at time $t$, and, as a consequence, also the value function at time $t$, essentially because the uncertainty about how the reward estimate evolves over time is shift-invariant (it does not depend on the current expected rewards, $(\hat{r}_1, \hat{r}_2)$; see Methods section). The value function at time $t$ is the maximum over the value for deciding and that for waiting. As both increase linearly along lines parallel to the diagonal, so does this value function, $V(t, \hat{r}_1, \hat{r}_2)$ (Fig. 3c,e). This completes the inductive step.
To summarize, an induction argument backward in time shows that both the values for deciding and waiting increase linearly along lines parallel to the diagonal for all $t$. As a consequence, the decision boundaries, which lie on the intersection between these two values, are parallel to this diagonal for all times $t$. In Supplementary Methods, we demonstrate the same property with an argument that does not rely on induction. In both cases, the argument requires, for any fixed $t$, a stochastic temporal evolution of our expected reward estimates that is shift-invariant with respect to our current estimates $(\hat{r}_1, \hat{r}_2)$. In other words, for any estimates $(\hat{r}_1, \hat{r}_2)$, the decision maker expects them to evolve in exactly the same way. This property holds for the task setup described above and some generalizations thereof (Supplementary Note 1), but might be violated under certain, more complex scenarios, as described further below.
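The inductive step of this argument can be written out in a few lines. The following is only a sketch of the informal reasoning above, not the formal derivation of the Methods section; the symbols $\eta_1$ and $\eta_2$, denoting the random changes of the two reward estimates over $dt$, are notation introduced here for illustration, and their distribution is assumed not to depend on the current estimates (shift invariance).

```latex
% Assume V(t+dt, r1 + C, r2 + C) = V(t+dt, r1, r2) + C (induction hypothesis).
\begin{align*}
\big\langle V(t+dt,\ \hat{r}_1 + C + \eta_1,\ \hat{r}_2 + C + \eta_2) \big\rangle - c\,dt
  &= \big\langle V(t+dt,\ \hat{r}_1 + \eta_1,\ \hat{r}_2 + \eta_2) + C \big\rangle - c\,dt \\
  &= \Big[ \big\langle V(t+dt,\ \hat{r}_1 + \eta_1,\ \hat{r}_2 + \eta_2) \big\rangle - c\,dt \Big] + C , \\
\max\{\hat{r}_1 + C,\ \hat{r}_2 + C\}
  &= \max\{\hat{r}_1,\ \hat{r}_2\} + C .
\end{align*}
% Both arguments of the maximum in Bellman's equation shift by C,
% hence V(t, r1 + C, r2 + C) = V(t, r1, r2) + C.
```

The first equality uses the induction hypothesis inside the expectation, which is legitimate only because the distribution of $(\eta_1, \eta_2)$ is the same at $(\hat{r}_1, \hat{r}_2)$ and at $(\hat{r}_1 + C, \hat{r}_2 + C)$.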
Optimal decisions with collapsing boundaries, and by diffusion models. A consequence of parallel decision boundaries is that optimal choices can be performed by tracking only the difference in expected option rewards, $\hat{r}_1 - \hat{r}_2$, rather than both $\hat{r}_1$ and $\hat{r}_2$ independently. To see this, consider rotating these boundaries in $(\hat{r}_1, \hat{r}_2)$-space by $-45°$ such that they come to be parallel to the horizontal axis in the new $(\hat{r}_1 + \hat{r}_2, \hat{r}_2 - \hat{r}_1)$-space (Fig. 4a,b). After the rotation they bound $\hat{r}_1 - \hat{r}_2$ and are independent of $\hat{r}_1 + \hat{r}_2$.
Figure 3 | ... (b) Similarly, the value surface for 'waiting' (that is, the expected value after observing new evidence for a short period $dt$, minus the cost for waiting $c\,dt$) is defined as a function of $(\hat{r}_1, \hat{r}_2)$. Note that, around the diagonal, $\hat{r}_1 = \hat{r}_2$, the value for waiting is smoother than that for choosing due to the uncertainty about future evidence. (c,d) The value surfaces for choosing and waiting superimposed, and their sections at $\hat{r}_1 + \hat{r}_2 = 0.5$. The decision boundaries (dotted lines) are determined by points in the space of reward estimates in which the value for 'deciding' (blue) equals that for waiting (red). In the region where waiting has a higher value than choosing either option (blue below red curve/surface), the decision maker postpones the decision to accumulate more evidence; otherwise, she chooses the option that is expected to give the higher reward. Because the relationship between the two value surfaces is translationally symmetric in terms of the mean reward $(\hat{r}_1 + \hat{r}_2)/2$, their intersections are parallel and do not depend on this mean reward. (e) The expected value $V(t)$ is given by the maximum of the values for choosing and waiting. This surface determines the value for waiting (b) at the next-earlier time step, $t - dt$. (f) Decision boundaries and associated choices shown in the two-dimensional $(\hat{r}_1, \hat{r}_2)$ representation. Note that the two boundaries are always parallel to the diagonal, $\hat{r}_1 = \hat{r}_2$. This is because both value functions (for deciding and for waiting) are linearly increasing with slope one along lines parallel to the diagonal (a,b). For the value for deciding, for example, below the diagonal we have $\hat{r}_1 > \hat{r}_2$, such that $V_d(\hat{r}_1, \hat{r}_2) = \hat{r}_1$, and therefore $V_d(\hat{r}_1 + C, \hat{r}_2 + C) = \hat{r}_1 + C = V_d(\hat{r}_1, \hat{r}_2) + C$.
For Gaussian a priori rewards (Fig. 2a), numerical solutions reveal that the distance between the two boundaries decreases over time, resulting in 'collapsing boundaries' (Fig. 4c) that can be explained as follows. At the beginning of the decision, the true option rewards are highly uncertain due to a lack of information. Hence, every small piece of additional evidence will make the running reward estimates substantially more certain. This makes it worthwhile to withhold decisions, which is achieved by far-separated decision boundaries (Fig. 4c for small $t$). Once a significant amount of evidence is accumulated, further evidence will barely increase certainty about the true rewards. Thus, it becomes preferable to decide quickly rather than to withhold choice for an insignificant increase in choice accuracy (even for similar reward estimates, $\hat{r}_2 - \hat{r}_1 \approx 0$, and residual uncertainty about which option yields the higher reward). The narrowing boundary separation ensures such rapid decisions (Fig. 4c for large $t$).
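The collapse can be reproduced with a small backward-induction sketch. This is not the numerical procedure of the Methods section; it is a simplified illustration under several assumptions made here: equal prior means, a finite horizon `t_max` at which a choice is forced, a discretized grid over the difference $\hat{r}_1 - \hat{r}_2$, and illustrative parameter values. It exploits the shift invariance discussed above by writing the value function as $(\hat{r}_1 + \hat{r}_2)/2$ plus a function of the difference only, and uses the fact that the variance of the change in each posterior mean over $dt$ equals the drop in posterior variance.

```python
import numpy as np

def collapsing_boundary(sigma_z=1.0, sigma=1.0, c=0.1, dt=0.05,
                        t_max=5.0, d_max=5.0, d_step=0.02):
    """Backward induction on the difference d = r1_hat - r2_hat.

    Returns the time grid and the half-width theta(t) of the region
    |d| <= theta(t) in which waiting is better than deciding.
    All names and defaults are hypothetical choices for this sketch.
    """
    n_grid = int(round(2 * d_max / d_step)) + 1
    grid = np.linspace(-d_max, d_max, n_grid)
    v_decide = np.abs(grid) / 2.0             # max(r1, r2) minus the mean (r1 + r2) / 2
    n_steps = int(round(t_max / dt))

    def post_var(t):                          # posterior variance of each true reward
        return 1.0 / (sigma_z**-2 + t * sigma**-2)

    u = v_decide.copy()                       # terminal condition: forced choice at t_max
    theta = np.zeros(n_steps)
    for k in range(n_steps - 1, -1, -1):
        t = k * dt
        # Variance of the change in d over dt (two independent options).
        step_var = 2.0 * (post_var(t) - post_var(t + dt))
        kernel = np.exp(-(grid[None, :] - grid[:, None])**2 / (2.0 * step_var))
        kernel /= kernel.sum(axis=1, keepdims=True)
        v_wait = kernel @ u - c * dt          # expected future value minus accumulation cost
        waiting = v_wait > v_decide + 1e-12
        theta[k] = np.abs(grid[waiting]).max() if waiting.any() else 0.0
        u = np.maximum(v_decide, v_wait)      # Bellman update
    return dt * np.arange(n_steps), theta

times, theta = collapsing_boundary()
for k in (0, 20, 50, 99):
    print(f"t = {times[k]:.2f}  boundary half-width = {theta[k]:.2f}")
```

With these assumed settings the printed half-width is largest at the first time step and smallest near the horizon. Part of the late narrowing is an artefact of the forced choice at `t_max`, so the early and intermediate time steps are the more informative ones.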
We can further simplify the optimal decision procedure by implementing the computation of the expected option reward difference by a diffusion model. As long as $\bar{z}_1 = \bar{z}_2$, such an implementation remains statistically optimal, as the diffusing particle, $x(t) \equiv x_1(t) - x_2(t)$ (recall that $x_j(t = n\,dt) = \sum_{i=1}^{n} dx_{j,i}$), and elapsed time $t$ form a set of sufficient statistics of the posterior $r_1(t) - r_2(t) \mid dx(0:t)$ over this difference (see Methods section). Furthermore, $x_j(t)$ can be interpreted as the sample path of a particle that diffuses with variance $\sigma^2$ and drifts with rate $z_j$. For this reason, $x(t)$ diffuses with variance $2\sigma^2$ and drifts with rate $z_1 - z_2$, thus forming the particle in a diffusion model that performs statistically optimal inference. The same mapping between expected reward difference and diffusing particle allows us to map the optimal boundary on reward into boundaries on $x(t)$ (Fig. 4c,d). Therefore, models as simple as diffusion models can implement optimal value-based decision-making.
This setup assumes a single choice and, besides the accumulation cost, infinite time to perform it. In realistic scenarios, however, such choices are usually embedded within a sequence of similar choices. Here, we consider how such embedding influences the form of the optimal policy.
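The mapping from a bound on the expected reward difference to a bound on the diffusing particle can be sketched as follows. For equal prior means, the posterior-mean difference is $x(t) / (\sigma^2/\sigma_z^2 + t)$, so a bound $\theta(t)$ on the reward difference corresponds to a bound $\theta(t)\,(\sigma^2/\sigma_z^2 + t)$ on $x(t)$. The function names, the forced choice at `t_max`, and in particular the bound `toy_bound` are hypothetical stand-ins chosen for illustration; `toy_bound` is not the optimal boundary.

```python
import numpy as np

def x_bound(theta, t, sigma, sigma_z):
    """Map a bound on r1_hat - r2_hat to a bound on x(t) = x1(t) - x2(t)."""
    return theta * (sigma**2 / sigma_z**2 + t)

def simulate_trial(z1, z2, theta_fn, sigma=1.0, sigma_z=1.0,
                   dt=0.01, t_max=10.0, rng=None):
    """Simulate one diffusion-model trial; returns (choice, decision time).

    The particle drifts with rate z1 - z2 and diffuses with variance
    2 * sigma**2 per unit time. A choice is forced at t_max (an assumption
    of this sketch) according to the sign of x.
    """
    rng = np.random.default_rng() if rng is None else rng
    x, t = 0.0, 0.0
    while t < t_max:
        x += rng.normal((z1 - z2) * dt, np.sqrt(2.0 * sigma**2 * dt))
        t += dt
        if abs(x) >= x_bound(theta_fn(t), t, sigma, sigma_z):
            break
    return (1 if x >= 0 else 2), t

def toy_bound(t):
    """Hypothetical collapsing bound on the reward difference (illustration only)."""
    return 1.0 / (1.0 + t)

choice, rt = simulate_trial(1.0, 0.0, toy_bound, rng=np.random.default_rng(1))
print(f"chose option {choice} after {rt:.2f} s")
```

Note that a bound that collapses in the space of reward estimates need not collapse in the space of $x(t)$, because the conversion factor grows with time.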
Maximizing the reward rate across choices. We assume that each choice within the sequence follows the previous single-choice setup. That is, after onset of the choice options, the decision maker pays a cost $c$ per second for accumulating evidence about the true option rewards. At choice, she receives the true reward associated with the chosen option. The choice is followed by a (possibly stochastic) waiting time of $t_w$ seconds on average, after which two new choice options appear and new evidence is accumulated. Before choice option onset, the true reward associated with either option is drawn according to the previously described Gaussian prior (Fig. 2a), such that these rewards remain constant within individual choices, but vary across consecutive choices. Rather than maximizing the total expected reward for each individual choice, we assume that the aim is to maximize the total expected reward within a fixed time period, independent of how many choices are performed within this period. To avoid boundary effects, we assume the period duration to be close-to-infinite, such that maximizing the total expected reward within this period becomes equivalent to maximizing the reward rate $\rho$, given by
$$\rho = \frac{\langle r_j - cT \rangle}{\langle T + t_w \rangle}, \qquad (4)$$
where the expectation is, as for equation (2), across choices $j$ and evidence accumulation times $T$, given the flow of evidence. Here, it is critical that we fix the time period while leaving open the number of choices that can be performed. If we instead were to fix the number of choices while leaving open the time to make them, it again becomes optimal to maximize the total expected reward for each of these choices separately, such that the optimal policy for each such choice is the same as that for single, isolated choices.
Infinite choice sequences make using the standard value function difficult. This value function returns the total expected reward for all current and future choices when starting from the current state. For an infinite number of such future choices, the value function might thus become infinite. One way to avoid this is to use instead the 'average-adjusted value' function, which, in addition to an accumulation cost, penalizes the passage of some time duration $dt$ by $-\rho\,dt$, where $\rho$ is the reward rate. This reward rate is, by equation (4), the total reward received (including accumulation costs) per second, averaged over the whole choice sequence. Penalizing the value function by this reward rate makes explicit the implicit loss of rewards due to potential future choices that the decision maker misses out on when accumulating too much evidence for the current choice. This penalization allows us to treat all choices in the sequence as if they were the same, unique choice. A further consequence of this penalization is that the value function for accumulating more evidence for some duration $dt$ undergoes a more significant change, as accumulating this evidence now comes at a cost $-(c + \rho)\,dt$ instead of the previous $-c\,dt$ (see Methods section for the associated Bellman equation). For positive reward rates, $\rho > 0$, this cost augmentation implies more costly evidence accumulation, such that it becomes advantageous to accumulate less evidence than for single, isolated choices. This change is implemented by decision boundaries that collapse more rapidly (shown formally in Supplementary Note 1, see also Supplementary Fig. 1). Thus, collapsing decision boundaries implement the optimal policy for both single choices and sequences of choices, with the only difference that these boundaries collapse more rapidly for the latter. The duration of inter-choice waiting $t_w$ modulates this difference: with $t_w \to \infty$, maximizing the reward rate described by equation (4) becomes equivalent to maximizing the expected reward for single, isolated choices, equation (2).
Therefore the policy for single trials is a special case of that for maximizing the reward rate in which the waiting time between consecutive choices becomes close-to-infinite.
Figure 5 | ... (b) Here, the decision boundaries were derived with the same dynamic-programming procedure as for the value-based case, except that the rewards were assumed to be binary, and equal to one only if the decision maker correctly identified the option with the larger 'reward' cue $z_j$ (see Methods section). In contrast to the reward-rate maximization strategy for value-based decisions (a), the decision strategy maximizing the correct rate is invariant to the absolute values of mean reward/evidence strength, thus demonstrating a qualitative difference between value-based and perceptual decision-making in terms of the optimal strategy. In addition, the optimal boundaries in the value-based case approach each other more rapidly over time than for perceptual decisions. The faster boundary collapse for value-based decisions is consistent across a broad range of mean absolute rewards, showing that the distinction in boundary dynamics is not just due to the difference in expected reward rates, but reflects a qualitative difference between the geometries of value functions in these two tasks.
Dependency of the policy on the prior distribution of reward. As shown above, optimal value-based decisions are achieved by accumulating only the difference of reward estimates, as implementable by DDMs. However, this does not mean that the absolute reward magnitudes have no effect on the decision strategy; they affect the decision boundary shape. Figure 5a shows how the optimal decision boundaries depend on the mean of the a priori belief about the true rewards across trials. When both options are likely to be highly rewarding on average, the boundaries should collapse more rapidly to perform more choices within the same amount of time. In the light of a guaranteed high reward, this faster collapse promotes saving time and effort of evidence accumulation. The boundary shape does not change for trial-by-trial variations in true rewards (which are a priori unknown) for the same prior, but only when the prior itself changes. This sensitivity to the prior and associated average rewards also differentiates reward rate-maximizing value-based decision-making from decisions that aim at maximizing the reward for single, isolated choices (Supplementary Note 1), and from classic paradigms of perceptual decision-making (Fig. 5b, see also Discussion section). To summarize, for value-based decisions that maximize the reward rate, the a priori belief about average-reward magnitudes affects the strategy (and, as a consequence, the average reaction time) by modulating the speed of collapse of the decision boundaries, even if choices within individual decisions are only guided by the relative reward estimates between options.
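The dependence of the boundaries on the prior mean can be illustrated with a self-consistent reward-rate sketch. It is a simplified stand-in for the procedure of the Methods section, under assumptions made here: equal and non-negative prior means `z_bar`, a finite horizon with forced choice, a discretized difference grid, and illustrative parameters. The reward rate $\rho$ is found by bisection on the condition that the average-adjusted value of one full cycle (choice plus waiting time $t_w$) is zero, which is equivalent to the definition of the reward rate as expected net reward divided by expected cycle duration.

```python
import numpy as np

def make_kernels(grid, sigma_z, sigma, dt, n_steps):
    """Transition kernels for the reward-estimate difference at each time step."""
    def post_var(t):
        return 1.0 / (sigma_z**-2 + t * sigma**-2)
    kernels = []
    for k in range(n_steps):
        step_var = 2.0 * (post_var(k * dt) - post_var((k + 1) * dt))
        kern = np.exp(-(grid[None, :] - grid[:, None])**2 / (2.0 * step_var))
        kernels.append(kern / kern.sum(axis=1, keepdims=True))
    return kernels

def solve_dp(kernels, grid, cost_rate, dt):
    """Backward induction with total cost rate (c + rho); returns U(0, 0) and theta(t)."""
    v_decide = np.abs(grid) / 2.0
    u = v_decide.copy()
    theta = np.zeros(len(kernels))
    for k in range(len(kernels) - 1, -1, -1):
        v_wait = kernels[k] @ u - cost_rate * dt
        waiting = v_wait > v_decide + 1e-12
        theta[k] = np.abs(grid[waiting]).max() if waiting.any() else 0.0
        u = np.maximum(v_decide, v_wait)
    return u[len(grid) // 2], theta           # centre of the grid is difference = 0

def reward_rate(z_bar, t_w, c=0.1, sigma_z=1.0, sigma=1.0,
                dt=0.1, t_max=4.0, d_max=4.0, d_step=0.05):
    """Bisection for rho such that z_bar + U_rho(0, 0) - rho * t_w = 0 (assumes z_bar >= 0)."""
    grid = np.linspace(-d_max, d_max, int(round(2 * d_max / d_step)) + 1)
    kernels = make_kernels(grid, sigma_z, sigma, dt, int(round(t_max / dt)))

    def excess(rho):
        u0, _ = solve_dp(kernels, grid, c + rho, dt)
        return z_bar + u0 - rho * t_w

    lo, hi = 0.0, 1.0
    while excess(hi) > 0:                     # expand the bracket until the sign changes
        hi *= 2.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid
    rho = 0.5 * (lo + hi)
    return rho, solve_dp(kernels, grid, c + rho, dt)[1]

rho_low, theta_low = reward_rate(z_bar=0.0, t_w=2.0)
rho_high, theta_high = reward_rate(z_bar=0.5, t_w=2.0)
print(f"prior mean 0.0: rho = {rho_low:.3f}, initial half-width = {theta_low[0]:.2f}")
print(f"prior mean 0.5: rho = {rho_high:.3f}, initial half-width = {theta_high[0]:.2f}")
```

In this sketch the prior mean enters only through the reward rate, which raises the effective cost of waiting; the boundaries for the higher prior mean are therefore never wider than those for the lower one.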
The limits of diffusion models for value-based decisions. For all scenarios we have considered so far, diffusion models can implement the optimal decision-making policy. Here, we discuss that this is still the case for some, but not all, generalizations of the task. For some tasks, the optimal policy cannot even be represented by parallel boundaries in the $(\hat{r}_1, \hat{r}_2)$-space of expected reward estimates. This is, for example, the case when the prior/likelihood distributions of reward/evidence are correlated in a particular way (see Methods section and Supplementary Note 1), or when the utility function is non-linear (see Fig. 6 for an example).
Thus, diffusion models only seem to implement the optimal decision strategy under very constrained circumstances. However, even beyond these circumstances, diffusion models might not be too far off from achieving close-to-optimal performance, but their loss of reward remains to be evaluated in general circumstances. Laboratory experiments could satisfy conditions for diffusion models to be close-to-optimal even in the presence of a nonlinear utility function. Such experiments often use moderate rewards (for example, moderately valued food items, rather than extreme payoffs) in which case a potentially non-linear utility would be well-approximated by a linear function within the tested range of rewards.
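The claim that a non-linear utility is well approximated by a linear function over a moderate reward range can be checked numerically. The following is a minimal sketch, assuming a hyperbolic-tangent utility with unit scale, $u(r) = \tanh(r)$ (the scale and the function name `max_relative_deviation` are illustrative assumptions, not taken from the study); it reports the worst-case relative deviation of $u(r)$ from the linear utility $r$ over a symmetric range of rewards.

```python
import math

def max_relative_deviation(r_max, n=1000):
    """Largest |tanh(r) - r| / |r| over 0 < r <= r_max.

    tanh is odd, so checking positive rewards covers the symmetric range.
    The unit-scale tanh utility is an assumption made for illustration.
    """
    worst = 0.0
    for k in range(1, n + 1):
        r = r_max * k / n
        worst = max(worst, abs(math.tanh(r) - r) / r)
    return worst

# Moderate rewards stay close to linear; a wide range does not.
small_range = max_relative_deviation(0.3)
large_range = max_relative_deviation(2.0)
```

Under this assumed utility, the deviation stays at a few per cent for rewards well inside the linear regime and grows to tens of per cent once rewards reach the saturating regime.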

Discussion
We have theoretically derived the optimal behaviour for value-based decision-making with noisy evidence about rewards. Our analysis revealed that the optimal strategy in a natural problem setup (where values are linear in rewards) reduces to a DDM with time-varying boundaries. This result provides a theoretical basis for why human behaviour in such tasks, just as in accuracy-based (conventional perceptual) decisions, is well captured by DDMs, despite the profound qualitative difference in task structures (for example, two-dimensional value functions for value-based tasks, but not for accuracy-based ones). Furthermore, we found that the optimal strategy does not always reduce to DDMs if we assume non-linear relationships between value and reward (Fig. 6), predicting that human behaviour may deviate from DDMs in specific experimental conditions (for example, when perceived utility follows a non-linear saturating function of reward; Fig. 6d); interestingly, such a decision boundary structure might be better approximated by 'correlated RMs' 11,12 . Concurrently with our work, another theoretical study by Fudenberg et al. (unpublished work 22 ) has focused on optimal evidence accumulation and decision-making for value-based decisions. This study provides a more in-depth mathematical characterization of the optimal policy implemented by a diffusion model with collapsing boundaries. Their analysis, however, is restricted to single, isolated choices and, unlike ours, considers neither policy changes for reward rate maximization nor non-linear utility functions that invalidate the use of diffusion models.

Figure 6 | In some scenarios the optimal policy becomes even more complex than two parallel boundaries in the space of expected reward estimates. This property might, for example, break down if the utility that the decision maker receives from her choices is not the reward itself but instead a non-linear function of this reward. If this utility grows sub-linearly in the reward, as is frequently assumed, the decision boundaries approach each other with increasing expected reward, as higher rewards yield comparably less utility. In such circumstances, optimal choices require tracking both expected reward estimates, $\hat{r}_1$ and $\hat{r}_2$, independently rather than only their difference. To demonstrate this, here we assumed a saturating utility function, Utility $= u(r)$, which saturates as $r \to \infty$ and $r \to -\infty$. This could be the case, for example, if rewards vary over a large range over which the subjectively perceived utility follows a non-linear saturating function of this reward. (In this figure, $u$ is modelled with a hyperbolic tangent function, but the exact details of the functional form do not qualitatively change the results.) The logic of the different panels follows that of Fig. 2. (a) The value function surface for choosing either of two options. (b) The value surfaces for postponing the decision to accumulate more evidence for a period of $\delta t$. (c) The two value surfaces superimposed. (d) The decision boundary and choice represented in the two-dimensional space of $(\hat{r}_1, \hat{r}_2)$. Note that the distance between decision boundaries is narrower in the regime where estimated rewards are high on average, resembling 'RMs' 11,12 , which are more sensitive to absolute reward magnitudes than DDMs.
Whether humans and animals use collapsing decision boundaries is a topic of debate in recent accuracy-based 23 and value-based 9 decision-making studies. Interestingly, a recent meta-analysis reports that whether subjects use collapsing boundaries varies strongly across tasks and individuals 23 . Our theory suggests that the optimal boundary dynamics are sensitive to task demands (for example, reward-rate maximization or correct-rate maximization) as well as to the absolute mean reward magnitude (in contrast to perceptual decision-making; see Supplementary Note 2). Thus, subjects might switch their decision strategies depending on those experimental factors, emphasizing the need to carefully control these factors in further studies.
Still, in both daily lives and laboratory experiments, humans can sometimes take a long time to decide between two valuable options, which might reflect suboptimal behaviour or an insufficiently fast collapse of the bound. For instance, a recent empirical study by Oud et al. 24 reports slower-than-optimal value-based and perceptual choices of human decision makers in a reward rate maximization setting. These slow choices might arise, however, from incompletely or incorrectly learned priors (Supplementary Note 3), and warrant further investigation. Another slowing factor is insufficient time pressure induced by, for example, fixing the number of choices instead of the total duration of the experiment. In this case, the slow reaction times may not reflect a suboptimal strategy. For example, Milosavljevic et al. 9 have found that subjects can take a surprisingly long time to decide between two high-valued items but, in this experiment, subjects had to perform a fixed number of choices without any time constraint. Their reward at the end of the experiment was determined by drawing one item among all the items selected by the subject 9 . With such a task design, there is no explicit incentive for making fast choices and, therefore, the optimal strategy does allow for long reaction times. All of the above cases highlight that the seeming irrationality of slow choices between two high-valued options might in fact reflect a completely rational strategy under contrived laboratory settings. Thus, by revealing the optimal policy for value-based decisions, the present theory provides a critical step in studying the factors that determine our decisions about values.
What do collapsing boundaries in diffusion models tell us about the neural mechanisms involved in such decisions? Previous studies concerning perceptual decisions have linked such boundary collapse to a neural 'urgency signal' that collectively drives neural activity towards a constant threshold 7,25 . However, note that in such a setup even a constant (that is, non-collapsing) diffusion model bound realizes a collapsing bound in the decision maker's posterior belief 7 . Analogously, a constant diffusion model bound in our setup realizes a collapsing bound on the value estimate difference. Furthermore, how accumulated evidence is exactly coded in the activity of individual neurons or neural populations remains unclear (for example, compare refs 6,26), and even less is known about value encoding. For these reasons we promote diffusion models for behavioural predictions, but for now refrain from directly predicting neural activity and associated mechanisms. Nonetheless, our theory postulates what kind of information ought to be encoded in neural populations, and as such can guide further empirical research in neural value coding.

Methods
Structure of evidence and evidence accumulation. Here, we assume a slightly more general version of the task than the one we discuss throughout most of the main text, with a correlated prior and a correlated likelihood. Further below we describe how this version relates to the one in the main text. In particular, we assume the prior over true rewards, given by the vector $z \equiv (z_1, z_2)^T$, to be a bivariate Gaussian, $z \sim N(\bar{z}, \Sigma_z)$, with mean $\bar{z}$ and covariance $\Sigma_z$. In each small time step $i$ of duration $\delta t$, the decision maker observes some momentary evidence $\delta x_i \equiv (\delta x_{1,i}, \delta x_{2,i})^T \sim N(z\,\delta t, \Sigma\,\delta t)$ that informs her about these true rewards. After accumulating evidence for some time $t = n\,\delta t$, her posterior belief about the true rewards is found by Bayes' rule,

$$p(z \mid \delta x(0:t)) \propto N(z \mid \bar{z}, \Sigma_z) \prod_{i=1}^{n} N(\delta x_i \mid z\,\delta t, \Sigma\,\delta t) \propto N\left(z \mid \Sigma(t)\left(\Sigma_z^{-1}\bar{z} + \Sigma^{-1}x(t)\right), \Sigma(t)\right),$$

where we have defined $x(t) = \sum_{i=1}^{n} \delta x_i$ as the sum of all momentary evidence up to time $t$, and $\Sigma(t) = \left(\Sigma_z^{-1} + t\,\Sigma^{-1}\right)^{-1}$ as the posterior covariance (hereafter, when $\Sigma(t)$ is written as a function of time it denotes the posterior covariance, rather than the covariance of the evidence, $\Sigma$). For the case that the experienced reward $r \equiv (r_1, r_2)^T$ equals the true reward $z$, that is $r = z$, the mean estimated option reward $\hat{r}(t) \equiv \langle z \mid \delta x(0:t) \rangle$ is the mean of the above posterior.
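The posterior update described above can be sketched in a few lines. This is a minimal illustration, assuming NumPy and the standard Gaussian conjugate update with posterior covariance $(\Sigma_z^{-1} + t\,\Sigma^{-1})^{-1}$; the function names `posterior` and `accumulate` and all argument names are choices made here for illustration, not part of the original study.

```python
import numpy as np

def posterior(x_t, t, z_bar, Sigma_z, Sigma):
    """Posterior mean and covariance of the true rewards z at time t.

    x_t     : accumulated evidence (sum of all momentary evidence up to t)
    z_bar   : prior mean, Sigma_z : prior covariance
    Sigma   : covariance of the momentary evidence per unit time
    """
    Sigma_z_inv = np.linalg.inv(Sigma_z)
    Sigma_inv = np.linalg.inv(Sigma)
    Sigma_t = np.linalg.inv(Sigma_z_inv + t * Sigma_inv)
    r_hat = Sigma_t @ (Sigma_z_inv @ z_bar + Sigma_inv @ x_t)
    return r_hat, Sigma_t

def accumulate(z, Sigma, dt, n_steps, rng):
    """Sum n_steps momentary-evidence samples, each drawn from N(z dt, Sigma dt)."""
    dx = rng.multivariate_normal(z * dt, Sigma * dt, size=n_steps)
    return dx.sum(axis=0)

# Example with an uncorrelated prior (variance 16) and likelihood (variance 4).
z_bar = np.zeros(2)
Sigma_z = 16.0 * np.eye(2)
Sigma = 4.0 * np.eye(2)
r_hat, Sigma_t = posterior(np.array([2.0, -1.0]), 1.0, z_bar, Sigma_z, Sigma)
```

With no evidence at $t = 0$ the function returns the prior, and as $t$ grows the estimate is increasingly dominated by the accumulated evidence.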
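Expected future reward estimates. Finding the optimal policy by solving Bellman's equation requires computing the distribution of expected future rewards $\hat{r}(t + \delta t)$ given the current expected rewards $\hat{r}(t)$. Assuming a small $\delta t$ such that the probability of an eventual boundary crossing becomes negligible, we can find this distribution by the marginalization

$$p(\hat{r}(t + \delta t) \mid \hat{r}(t)) = \int p(\hat{r}(t + \delta t) \mid \delta x(t + \delta t), \hat{r}(t))\, p(\delta x(t + \delta t) \mid \hat{r}(t))\, \mathrm{d}\delta x(t + \delta t).$$

As $\hat{r}(t + \delta t)$ is the mean of the posterior of $z$ after having accumulated evidence up to time $t + \delta t$, it is given by

$$\hat{r}(t + \delta t) = \Sigma(t + \delta t)\left(\Sigma_z^{-1}\bar{z} + \Sigma^{-1}x(t + \delta t)\right) = \Sigma(t + \delta t)\left(\Sigma(t)^{-1}\hat{r}(t) + \Sigma^{-1}\delta x(t + \delta t)\right),$$

where we have used $x(t + \delta t) = x(t) + \delta x(t + \delta t)$ and $x(t) = \Sigma\left(\Sigma(t)^{-1}\hat{r}(t) - \Sigma_z^{-1}\bar{z}\right)$, following from the definition of $\hat{r}(t)$. Furthermore, by the generative model for the momentary evidence we have $\delta x(t + \delta t) \mid z \sim N(z\,\delta t, \Sigma\,\delta t)$, and our current posterior is $z \mid \hat{r}(t) \sim N(\hat{r}(t), \Sigma(t))$, which, together, gives $\delta x(t + \delta t) \mid \hat{r}(t) \sim N\left(\hat{r}(t)\,\delta t, \Sigma\,\delta t + \Sigma(t)\,\delta t^2\right)$. With these components, the marginalization results in

$$\hat{r}(t + \delta t) \mid \hat{r}(t) \sim N\left(\hat{r}(t), \Sigma(t)\,\Sigma^{-1}\,\Sigma(t)\,\delta t\right),$$

where we have only kept terms of order $\delta t$ or lower. An extended version of this derivation is given in Supplementary Note 1.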
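The first-order approximation in this derivation can be checked numerically. The sketch below assumes NumPy and compares two covariances for the change in the reward estimate over one time step: the one obtained by propagating the covariance of the next momentary evidence (evidence noise plus remaining uncertainty about the true rewards) through the linear posterior-mean update, and the first-order expression, posterior covariance times inverse evidence covariance times posterior covariance times the step size. The example matrices and the function names are illustrative assumptions.

```python
import numpy as np

def posterior_cov(t, Sigma_z, Sigma):
    """Posterior covariance after accumulating evidence for time t."""
    return np.linalg.inv(np.linalg.inv(Sigma_z) + t * np.linalg.inv(Sigma))

def exact_step_cov(t, dt, Sigma_z, Sigma):
    """Covariance of the estimate at t + dt given the estimate at t, all orders in dt."""
    S_next = posterior_cov(t + dt, Sigma_z, Sigma)
    Sigma_inv = np.linalg.inv(Sigma)
    # Next evidence sample given the current estimate: noise plus uncertainty about z.
    evidence_cov = Sigma * dt + posterior_cov(t, Sigma_z, Sigma) * dt ** 2
    return S_next @ Sigma_inv @ evidence_cov @ Sigma_inv @ S_next

def first_order_step_cov(t, dt, Sigma_z, Sigma):
    """Same covariance, keeping only terms of order dt."""
    S_t = posterior_cov(t, Sigma_z, Sigma)
    return S_t @ np.linalg.inv(Sigma) @ S_t * dt

# Hypothetical correlated prior and likelihood covariances.
Sigma_z = np.array([[16.0, 4.0], [4.0, 9.0]])
Sigma = np.array([[4.0, 1.0], [1.0, 4.0]])
exact = exact_step_cov(0.5, 1e-3, Sigma_z, Sigma)
approx = first_order_step_cov(0.5, 1e-3, Sigma_z, Sigma)
rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(approx)
```

The relative error shrinks in proportion to the step size, which is what keeping only terms of first order in the time step implies.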
More specific task setups. Here, we consider two more specific task setups. In the first one, the prior covariance is proportional to the likelihood covariance, that is $\Sigma_z = \alpha\Sigma$. This causes the posterior of $z$ to be given by

$$z \mid \delta x(0:t) \sim N\left(\frac{\alpha^{-1}\bar{z} + x(t)}{\alpha^{-1} + t}, \frac{\Sigma}{\alpha^{-1} + t}\right).$$

In this case, the posterior mean becomes independent of the covariance, and is a weighted mixture of prior and accumulated evidence. The distribution over expected future reward estimates becomes $\hat{r}(t + \delta t) \mid \hat{r}(t) \sim N\left(\hat{r}(t), (\alpha^{-1} + t)^{-2}\,\Sigma\,\delta t\right)$. In terms of choosing among lunch menus, a positively correlated prior could correspond to differently skilled cooks working on different days, such that the true rewards associated with the different options fluctuate jointly. A correlated likelihood might correspond to fresh produce in one menu option predicting the same in the other menu option. If the likelihood covariance is proportional to that of the prior, diffusion models still implement the optimal choice policy.
In the second more specific setup we assume both prior and likelihood to be uncorrelated, with covariance matrices given by $\Sigma_z = \sigma_z^2 I$ and $\Sigma = \sigma^2 I$. This is the setup discussed throughout most of the work, and results in an equally uncorrelated posterior of $z$, which for option $j$ is given by equation (1). The distribution over expected future reward estimates is also uncorrelated, and for option $j$ is given by $\hat{r}_j(t + \delta t) \mid \hat{r}_j(t) \sim N\left(\hat{r}_j(t), \sigma^{-2}\left(\sigma_z^{-2} + t\,\sigma^{-2}\right)^{-2}\delta t\right)$.
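For the uncorrelated setup with equal prior means for both options, a bound on the difference between the two reward estimates can be mapped onto a bound on a single diffusing particle, the difference of accumulated evidence. With the ratio of prior to evidence variance written as `alpha`, the estimate difference equals the evidence difference divided by `1/alpha + t`, so a bound `theta` on the estimate difference becomes `theta * (1/alpha + t)` on the particle. The sketch below simulates one trial under these assumptions using NumPy. The exponentially collapsing bound (`theta0`, `tau`) is a hypothetical placeholder shape, not the optimal boundary, which has to be computed numerically.

```python
import numpy as np

def particle_bound(theta, t, alpha):
    """Bound on x1 - x2 equivalent to a bound theta on the estimate difference.

    Assumes equal prior means for both options, so the prior terms cancel.
    """
    return theta * (1.0 / alpha + t)

def simulate_trial(z, rng, sigma2=4.0, sigma2_z=16.0, dt=0.005, t_max=5.0,
                   theta0=2.0, tau=1.0):
    """Simulate one choice; returns (chosen option, decision time)."""
    alpha = sigma2_z / sigma2
    drift = (z[0] - z[1]) * dt
    noise_sd = np.sqrt(2.0 * sigma2 * dt)  # difference of two independent samples
    n_steps = int(round(t_max / dt))
    x_diff, t = 0.0, 0.0
    for n in range(1, n_steps + 1):
        t = n * dt
        x_diff += drift + noise_sd * rng.standard_normal()
        theta = theta0 * np.exp(-t / tau)  # hypothetical collapsing bound
        if abs(x_diff) >= particle_bound(theta, t, alpha):
            break
    # If no bound was reached by t_max, choose by the sign of the evidence.
    choice = 1 if x_diff > 0 else 2
    return choice, t

rng = np.random.default_rng(1)
choice, rt = simulate_trial(np.array([1.0, -1.0]), rng)
```

Only the evidence difference is tracked here, which is the sense in which a single diffusing particle suffices in this setup.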
A more general scenario than the ones we have discussed so far is that both the decision maker's a priori belief about the true rewards and the likelihood of the momentary evidence about these rewards are correlated, but the prior covariance is not proportional to the likelihood covariance. Once prior covariance and likelihood covariance are no longer proportional to each other, diffusion models fail to implement the optimal policy. Even then, the optimal policy in the $(\hat{r}_1, \hat{r}_2)$-space of expected reward estimates is still given by two boundaries parallel to the identity line, such that we can again only bound the difference between these estimates. However, these are bounds on expected reward estimate differences, and not on a diffusing particle. Mapping the estimates into a single diffusing particle requires combining them linearly with combination weights that change over time, which is incompatible with the standard diffusion model architecture (although it can be implemented by an extended diffusion model, as shown in ref. 27). Thus, parallel decision boundaries on expected reward estimates do not automatically imply that diffusion models can implement optimal decisions.

... maximizing the reward rate. In particular, the value function turns into the average-adjusted value function that penalizes the passage of some time $\delta t$ by $\rho\,\delta t$, where $\rho$ is now the correct rate rather than the reward rate. The correct rate is still the average experienced reward (minus accumulation cost) per unit time, but, due to the changed definition of experienced reward, it no longer depends on the true reward itself, only on whether the option with the larger true reward was correctly identified. This causes the value for deciding to be additionally penalized by $\rho\,t_w$. The value for waiting some more time $\delta t$ to accumulate more evidence incurs an additional cost $\rho\,\delta t$, but remains unchanged otherwise.
The average-adjusted value function is again invariant under addition of a constant, such that we choose $\tilde{V}(0, \bar{z}_1, \bar{z}_2, \rho) = 0$. This fully specifies the value function and associated Bellman equation, which is provided in Supplementary Note 1.
Linearity of value function for waiting. Here, we show that the value function for waiting increases linearly along lines parallel to the diagonal within the $(\hat{r}_1, \hat{r}_2)$-space, which is required to show that the optimal decision boundaries are parallel to the diagonal. We will do so by a backwards induction argument in time. The base case for the induction argument relies on the shape of the value function for large times, $t \to \infty$. For such times, the decision maker has incurred a large cost for accumulating evidence up until that time, and also expects to gain little further insight into the true rewards when accumulating more evidence. As a consequence, at such times it will always be better to decide immediately rather than to accumulate more evidence. Therefore, the value function will be given by the value for deciding, $V(t, \hat{r}_1, \hat{r}_2) = V_d(\hat{r}_1, \hat{r}_2)$, which, as discussed in the previous paragraph, is linearly increasing along lines parallel to the diagonal.
The inductive step will show that, if the value function at time $t + \delta t$ is linearly increasing along lines parallel to the diagonal, then so is the value of waiting at time $t$, and, as a consequence, also the value function at time $t$. The value of waiting at time $t$ is given by $\langle V(t + \delta t, \hat{r}_1(t + \delta t), \hat{r}_2(t + \delta t)) \rangle - c\,\delta t$, where the expectation is over future expected rewards $\hat{r}_1(t + \delta t)$ and $\hat{r}_2(t + \delta t)$, and reflects the uncertainty about how the reward estimate evolves over time. For our case, the distribution describing this uncertainty is a bivariate Gaussian (as described in the previous sections), centred on the current expected rewards, $(\hat{r}_1, \hat{r}_2)$, and with a covariance that only depends on $t$. Its shift-invariant shape causes the expectation $\langle V(t + \delta t, \hat{r}_1(t + \delta t), \hat{r}_2(t + \delta t)) \rangle$ to be a smoothed version of $V(t + \delta t, \hat{r}_1, \hat{r}_2)$ that, like $V(t + \delta t, \hat{r}_1, \hat{r}_2)$, increases linearly along lines parallel to the diagonal. The value of waiting is this expectation shifted by the constant momentary cost $-c\,\delta t$, and therefore also has this property (Fig. 3b). This establishes that, if the value function at time $t + \delta t$ is linearly increasing along lines parallel to the diagonal, then so is the value of waiting at time $t$. The value function at time $t$ is the maximum over the value for deciding and that for waiting. As both increase linearly along lines parallel to the diagonal, so does this value function, $V(t, \hat{r}_1, \hat{r}_2)$ (Fig. 3c,e). This completes the inductive step.
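The smoothing step of this argument can be written out explicitly. The following is a short worked derivation, assuming that the change in the reward estimates over one time step, denoted here by $\varepsilon = (\varepsilon_1, \varepsilon_2)$ (a symbol introduced only for this sketch), is zero-mean Gaussian with a distribution that does not depend on the current estimates, and that the value function at the later time increases with slope one along the diagonal.

```latex
% Assumption (induction hypothesis): for any constant C,
%   V(t + \delta t, \hat{r}_1 + C, \hat{r}_2 + C) = V(t + \delta t, \hat{r}_1, \hat{r}_2) + C.
\begin{align}
\big\langle V(t + \delta t,\ \hat{r}_1 + C + \varepsilon_1,\ \hat{r}_2 + C + \varepsilon_2) \big\rangle_{\varepsilon}
  &= \big\langle V(t + \delta t,\ \hat{r}_1 + \varepsilon_1,\ \hat{r}_2 + \varepsilon_2) + C \big\rangle_{\varepsilon} \\
  &= \big\langle V(t + \delta t,\ \hat{r}_1 + \varepsilon_1,\ \hat{r}_2 + \varepsilon_2) \big\rangle_{\varepsilon} + C .
\end{align}
% Subtracting the constant cost c \delta t on both sides leaves the shift by C intact,
% so the value of waiting at time t also increases with slope one along the diagonal.
```

The first equality applies the induction hypothesis pointwise for each realization of $\varepsilon$, and the second uses linearity of the expectation, which is valid because the distribution of $\varepsilon$ is the same at both starting points.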
The induction argument shows that both the value for deciding and that for waiting increase linearly with slope one along lines parallel to the diagonal for all $t$. This immediately means that, if they intersect at some point $(\hat{r}_1^*, \hat{r}_2^*)$, then they will intersect along the whole line $(\hat{r}_1^* + C, \hat{r}_2^* + C)$ that is parallel to the diagonal (Fig. 3c). As a consequence, the decision boundaries, which lie on the intersection between these two values, are parallel to this diagonal for all times $t$. See Supplementary Note 1 for the proof of the same property with an argument that does not rely on induction.
Finding the optimal policy numerically. To find the optimal policy for the above cases numerically, we computed the value function by backward induction 30 , using Bellman's equation. Bellman's equation expresses the value function at time $t$ as a function of the value function at time $t + \delta t$. Therefore, if we know the value function at some time $T$, we can compute it at time $T - \delta t$, then $T - 2\delta t$, and so on, until time $t = 0$. We usually chose some large $T$, significantly beyond the time horizon of interest, at which we set $V(T, \hat{z}_1, \hat{z}_2) = V_d(\hat{z}_1, \hat{z}_2)$, independent of the value at any $t > T$. For any time $t \le T$, we represented the value function over the remaining two parameters $(\hat{z}_1, \hat{z}_2)$ (or $(\hat{r}_1, \hat{r}_2)$ in the value-based task) numerically over an equally spaced two-dimensional grid. This grid allowed us to compute the integral that represents the expectation over the future value numerically by the two-dimensional convolution between the future value $V(t + \delta t, \hat{z}_1(t + \delta t), \hat{z}_2(t + \delta t))$ and the transition probability distribution $P(\hat{z}(t + \delta t) \mid \hat{z}(t))$. For any such time, the optimal decision boundaries were found on this grid by the intersection of the value for deciding and that for waiting. We handled boundary effects in space and time by significantly extending the grid beyond the area of interest and cropping the value function after fully computing it over the extended range.
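The backward-induction procedure can be sketched compactly for the simplest case. The code below is a minimal illustration with NumPy, not the authors' implementation. It assumes single, isolated value-based choices with an uncorrelated prior and likelihood, a value for deciding equal to the larger of the two reward estimates, and a hypothetical accumulation cost `c = 0.5` with a coarse time step so that it runs quickly (these values differ from the figure parameters). The transition kernel is the per-option Gaussian with variance equal to the squared posterior variance divided by the evidence variance, times the time step, discretized into grid-cell probabilities. Edge padding stands in for the grid extension and cropping described in the text.

```python
import math
import numpy as np

def step_sd(t, dt, sigma2=4.0, sigma2_z=16.0):
    """SD of the change of one reward estimate between t and t + dt."""
    post_var = 1.0 / (1.0 / sigma2_z + t / sigma2)
    return math.sqrt(post_var ** 2 / sigma2 * dt)

def cell_masses(sd, step, half_width):
    """Probability of N(0, sd^2) falling into each grid cell around zero."""
    edges = (np.arange(-half_width, half_width + 2) - 0.5) * step
    cdf = np.array([0.5 * (1.0 + math.erf(e / (sd * math.sqrt(2.0)))) for e in edges])
    masses = np.diff(cdf)
    return masses / masses.sum()

def smooth(v, kernel):
    """Separable two-dimensional convolution with edge padding."""
    h = len(kernel) // 2
    p = np.pad(v, h, mode="edge")
    p = np.apply_along_axis(lambda a: np.convolve(a, kernel, mode="valid"), 0, p)
    p = np.apply_along_axis(lambda a: np.convolve(a, kernel, mode="valid"), 1, p)
    return p

def optimal_wait_region(T=2.0, dt=0.05, c=0.5, lim=10.0, step=0.4, half_width=12):
    """Return the grid axis and, per time step, where waiting beats deciding."""
    n = int(round(2 * lim / step)) + 1
    r = np.linspace(-lim, lim, n)
    r1, r2 = np.meshgrid(r, r, indexing="ij")
    v_decide = np.maximum(r1, r2)       # expected reward of the better-looking option
    v = v_decide.copy()                 # terminal condition: V(T) = V_d
    n_steps = int(round(T / dt))
    wait = np.zeros((n_steps, n, n), dtype=bool)
    for k in range(n_steps - 1, -1, -1):
        kernel = cell_masses(step_sd(k * dt, dt), step, half_width)
        v_wait = smooth(v, kernel) - c * dt   # expected future value minus cost
        wait[k] = v_wait > v_decide
        v = np.maximum(v_decide, v_wait)      # Bellman update
    return r, wait

r, wait = optimal_wait_region()
# Number of "wait" cells along the anti-diagonal (r1 = -r2) at each time step.
idx = np.arange(len(r))
band_width = wait[:, idx, idx[::-1]].sum(axis=1)
```

The array `band_width` gives a rough view of how the waiting region around the diagonal narrows from the first to the last time step in this toy setting. Because of the coarse grid and the edge padding, only the central part of the grid should be interpreted.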
In the reward rate and correct rate case, computing the value function requires knowledge of the corresponding rate $\rho$. This $\rho$ was unknown, but could be found by the condition $\tilde{V}(0, \bar{z}_1, \bar{z}_2, \rho) = 0$. $\tilde{V}(0, \bar{z}_1, \bar{z}_2, \rho)$ is strictly decreasing in $\rho$ (Supplementary Note 1), such that we could initially assume an arbitrary $\rho$ for which we computed $\tilde{V}(0, \bar{z}_1, \bar{z}_2, \rho)$. The correct $\rho$ was then found by iterating the computation of $\tilde{V}(0, \bar{z}_1, \bar{z}_2, \rho)$ within a root finding procedure until $\tilde{V}(0, \bar{z}_1, \bar{z}_2, \rho) = 0$.

The following parameters were used to generate the figures. We set the prior mean as $\bar{z}_1 = \bar{z}_2 = 0$, except for Fig. 5 where we varied $\bar{z}_1 + \bar{z}_2$ while fixing $\bar{z}_1 = \bar{z}_2$. The prior variance was $\sigma_z^2 = 16$, and the observation noise $\sigma_x^2 = 4$, for both options. We used a grid spanning $-10 \le \hat{z}_1 \le 10$ and $-10 \le \hat{z}_2 \le 10$, in steps of 0.4 in both dimensions. The maximum time to consider was set to $T = 5$ s, with timesteps of size $\delta t = 0.005$ s for backward induction. To focus on the effect of reward rate, we assumed no explicit cost of evidence accumulation, $c = 0$, and a waiting time $t_w$ set to 0.5 s.
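The root-finding step for the rate can be sketched with plain bisection, which is sufficient because the average-adjusted value at time zero is strictly decreasing in the rate. The text does not state which root finder was used, so bisection is an assumption here. The callable `value_at_origin` is a hypothetical stand-in for a full backward-induction run that returns this value for a candidate rate.

```python
def find_rate(value_at_origin, lo, hi, tol=1e-8, max_iter=200):
    """Bisection for the rate at which a strictly decreasing function crosses zero.

    value_at_origin : callable mapping a candidate rate to the average-adjusted
                      value at time zero (hypothetical interface)
    lo, hi          : bracket with value_at_origin(lo) >= 0 >= value_at_origin(hi)
    """
    if not (value_at_origin(lo) >= 0.0 >= value_at_origin(hi)):
        raise ValueError("bracket must satisfy f(lo) >= 0 >= f(hi)")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if value_at_origin(mid) > 0.0:
            lo = mid   # rate too small: value at time zero still positive
        else:
            hi = mid   # rate too large: value at time zero non-positive
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Toy stand-in for the expensive value computation (illustration only).
rate = find_rate(lambda rho: 1.0 - 2.0 * rho, 0.0, 2.0)
```

In practice each evaluation of `value_at_origin` would require one complete backward-induction pass, so a bracketing method that needs few evaluations is a natural choice.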
Data availability. The authors declare that the data supporting the findings of this study are available within the article and its Supplementary Information File.