# Flexible combination of reward information across primates

## Abstract

A fundamental but rarely contested assumption in economics and neuroeconomics is that decision-makers compute subjective values of risky options by multiplying functions of reward probability and magnitude. By contrast, an additive strategy for valuation allows flexible combination of reward information required in uncertain or changing environments. We hypothesized that the level of uncertainty in the reward environment should determine the strategy used for valuation and choice. To test this hypothesis, we examined choice between risky options in humans and rhesus macaques across three tasks with different levels of uncertainty. We found that whereas humans and monkeys adopted a multiplicative strategy under risk when probabilities are known, both species spontaneously adopted an additive strategy under uncertainty when probabilities must be learned. Additionally, the level of volatility influenced relative weighting of certain and uncertain reward information, and this was reflected in the encoding of reward magnitude by neurons in the dorsolateral prefrontal cortex.

## Main

Most models of decision-making assume that while evaluating risky options, we combine information about reward probability and stakes multiplicatively1,2,3. This approach, however, suffers from a major limitation: it cannot readily accommodate flexible weighting of reward information unless utility and probability weighting functions change dynamically, both of which are assumed to be fixed. An additive strategy for value construction (for example, a linear combination of reward probability and magnitude) can produce a behaviour similar to that of a multiplicative strategy when proper weighting is adopted4, but more importantly, allows greater flexibility in differential weighting of reward information. Such flexible weighting is necessary when reward outcomes are probabilistic and corresponding probabilities must be estimated, resulting in different levels of uncertainty that need to be considered for optimal combination of information5,6. Additionally, statistics of reward outcomes (for example, mean reward probabilities) can change over time, giving rise to unexpected uncertainty (volatility) that requires further adjustments in learning and decision-making7,8.

One could argue, however, that if representations of reward attributes follow an exponential form, a multiplicative model becomes an additive one (Supplementary Note 1). Although this scenario seems plausible, the combination of reward information for valuation cannot be considered separately from the subsequent decision-making processes4. This is because an additive but not a multiplicative strategy implies decision-making based on direct comparisons of reward attributes. Therefore, the fundamental difference between additive and multiplicative strategies is whether or not different reward attributes of each option are fused before the onset of decision-making processes. This distinction is important because fusion of reward attributes hinders further adjustments of the weight of each attribute on valuation and/or choice.
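
The exponential-form argument at the start of this paragraph can be sketched as follows. The notation is ours and is not taken from Supplementary Note 1: $$u(m)$$ and $$w(p)$$ stand for the represented magnitude and probability, $$x_m$$ and $$x_p$$ for the underlying quantities, and $$a$$ and $$b$$ for arbitrary scaling constants. If $$u(m) = e^{a x_m}$$ and $$w(p) = e^{b x_p}$$, then

$$u(m) \times w(p) = e^{a x_m + b x_p}, \qquad \log \left( {u(m) \times w(p)} \right) = a x_m + b x_p$$

so the product is a monotonic function of a weighted sum. The two strategies can therefore rank options identically, and what separates them is whether attributes are fused before the decision stage.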

Considering the importance of flexibility in value-based choice, we hypothesized that the level of uncertainty in the reward environment should strongly influence the strategy used for valuation and choice. More specifically, we assumed that progressing from choice under risk when reward probabilities are known, to choice under uncertainty when probabilities must be learned, decision-makers should shift from a multiplicative to an additive strategy. In addition, we hypothesized that the levels of uncertainty associated with different pieces of reward information should further affect how they are weighted and combined for computing subjective value or making decisions.

## Results

### Identifying adopted strategies under risk and uncertainty

To determine valuation strategies adopted by individual human participants or by each monkey during individual sessions of the experiment, we fitted choice behaviour using various models. These models assume different functions for how reward probability and magnitude are assessed subjectively and rely on either an additive or multiplicative strategy for combination of reward information. More specifically, we compared four variations of additive and multiplicative models in which linear and nonlinear transformations of reward probability (via a probability weighting function), reward magnitude (through a utility function), or their combinations were included. Additionally, we also considered hybrid models that include both additive and multiplicative components. We used the Bayesian model selection (BMS)15 method and Akaike information criterion (AIC) to identify the model that best captures choice data (see Methods).
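
As a minimal sketch of the AIC part of this comparison (not the authors' fitting code), the snippet below computes AIC from a negative log-likelihood and a parameter count, then picks the lowest score. The function name `aic`, the model labels and all numerical values are hypothetical and chosen only for illustration.

```python
def aic(neg_log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln(L) = 2k + 2 * NLL."""
    return 2 * n_params + 2 * neg_log_likelihood

# Hypothetical (negative log-likelihood, number of free parameters) per model.
# These numbers are made up for illustration and are not results from the paper.
fits = {
    "additive": (210.4, 3),
    "multiplicative": (215.9, 2),
    "hybrid": (209.8, 4),
}

scores = {name: aic(nll, k) for name, (nll, k) in fits.items()}
best = min(scores, key=scores.get)  # lower AIC indicates a better trade-off
print(scores, best)
```

In this made-up example the hybrid model fits slightly better, but the extra parameter is penalized enough that the additive model has the lowest AIC.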

We first ensured that our fitting procedure can correctly identify the specific strategy adopted by individual participants and can accurately estimate relevant parameters. To that end, we generated choice data using a hybrid model over a wide range of model parameters in the gambling task and the three environments of the PRL tasks and fitted the simulated data using the same hybrid model (see Methods for details). We found that the hybrid model can successfully retrieve the two main parameters used for generating data (Supplementary Fig. 1): relative weight of the multiplicative component ($$\beta _{\mathrm{mult}}$$); and the ratio of relative weight of reward magnitude to that of reward probability ($$\beta _m/\beta _p$$), which we refer to as magnitude-to-probability weighting. We also fitted the same data using the linear additive and multiplicative models and used BMS across all three models to examine model identification as a function of the parameters used to generate the data.

Overall, our fitting method was able to identify the hybrid model as the most likely model followed by the additive and multiplicative models when $$\beta _{\mathrm{mult}}$$ is close to 0 and 1, respectively (Supplementary Fig. 2). We found small errors in identification of the more dominant model for data generated for the ML task. In the PRL task, however, this error was larger and depended on $$\beta _m/\beta _p$$ but mostly occurred when $$\beta _{\mathrm{mult}}$$ was around 0.5. For very small $$\beta _m/\beta _p$$ values, the model identification was biased towards a multiplicative one but identification bias shifted towards an additive model as $$\beta _m/\beta _p$$ became closer to 1 ($$\log ({\beta _m/\beta _p}) = 0$$). Considering that $$\beta _m/\beta _p$$ values estimated from our participants were very small ($$\log({\beta _m/\beta _p}) < - 1$$) in the PRL task (see below), these results suggest that error in model identification is small. Overall, fitting simulated data illustrates that the correct model, parameters of the hybrid model and the dominant component of a hybrid model can be accurately retrieved using our fitting procedure.

### Flexible adoption of valuation strategies in humans and monkeys

We next used all four variations of additive, multiplicative and hybrid models to fit individual participants’ choice data in each experiment. We found that during the gambling task (choice under risk), multiplicative models and hybrid models with a dominant multiplicative component were the most likely models adopted by both monkeys and humans (Fig. 2a,d, Supplementary Fig. 3a,f and Table 1). By contrast, during the ML task (choice under uncertainty), additive models and hybrid models with a dominant additive component were the most likely models adopted by both monkeys and humans (Fig. 2b,c,e,f, Supplementary Fig. 3b,c,g,h and Table 1). Interestingly, this was true for both the stable and volatile environments indicating that an additive strategy was adopted when reward probabilities must be learned.

We found consistent results for the PRL task that also required learning of reward probabilities. Specifically, fitting choice behaviour in the PRL task revealed that additive models and hybrid models with a dominant additive component were the most likely models adopted by both humans and monkeys (Fig. 3, Supplementary Fig. 3d,e,i,j and Table 1). Together, these results across three tasks illustrate that both monkeys and humans adopt a predominately multiplicative strategy under risk, whereas both switch to a predominately additive strategy under uncertainty.

To ensure that all human participants included in our data analyses actually learned reward probabilities associated with the two options during the ML and PRL tasks, we removed participants who overall did not choose the option with the higher probability of reward more often than chance level (0.5; see Methods). This resulted in exclusion of 4 and 12 participants in the ML and PRL tasks, respectively. To confirm that our exclusion criteria did not bias our results in terms of the strategy adopted, we also fitted the choice data from the excluded participants in the PRL task (the ML task had too few excluded participants). We did not find any credible evidence that the excluded participants adopted a strategy qualitatively different from that used by the remaining participants (Supplementary Fig. 4). The most parsimonious explanation for our data was that the excluded participants simply failed to learn reward probabilities.

### Flexible combination of reward information under uncertainty

Together, the results based on the fit of choice behaviour suggest that there is a major and previously undetected effect of expected uncertainty on the strategies that participants use to combine information across reward dimensions. Why might an additive approach be favoured under uncertainty? We hypothesized that the additive strategy may allow more flexibility and may therefore be favoured in uncertain environments in which decision-makers must learn reward attributes and associated uncertainty to adjust the weight of each attribute with a different level of uncertainty on the overall value or choice. If true, this also predicts that weighting of a given piece of information should depend on its level of uncertainty and thus the relative weighting of reward information should change according to volatility of the environment.

To test this prediction, we compared the effect of volatility on how different pieces of reward information were combined using the estimated model parameters based on the simplest hybrid model in the two environments of the ML as well as PRL tasks. In this model, the additive component was a linear function of reward magnitude and probability and the multiplicative component was equal to the expected value (EV). We focused on the simplest hybrid model to allow us directly to compare behavioural and neural adjustments in monkeys (see below) and interpret our results more clearly. We found that the relative weighting of reward information differed in the two environments of the ML task in both species (Fig. 4a,c). More specifically, both monkeys and humans exhibited significantly larger magnitude-to-probability weighting in the volatile compared with the stable environment (one-sided Wilcoxon signed-rank test; monkeys: median ± interquartile range (IQR) = 0.53 ± 1.34, P < 0.001, d = 0.54, N = 316, 95% confidence interval (CI) = [0.49, 0.76]; humans: median ± IQR = 0.47 ± 1.54, P < 0.001, d = 0.45, N = 46, 95% CI = [0.14, 0.89]). Similarly, magnitude-to-probability weighting was larger in the more volatile compared with the less volatile environment of the PRL task (Fig. 4b,d; one-sided Wilcoxon signed-rank test; monkeys: median ± IQR = 0.16 ± 0.57, P < 0.001, d = 0.22, N = 118, 95% CI = [0.08, 0.26]; humans: median ± IQR = 0.20 ± 1.81, P = 0.04, d = 0.29, N = 38, 95% CI = [−0.01, 0.27]). These results confirm our prediction and illustrate that the relative weighting of the certain (reward magnitude) to that of the uncertain (reward probability) information increased as increasing volatility in the reward environment increased uncertainty in reward probability.

If changes in volatility cause adjustments in the relative weighting of reward information, then one would predict that there would be larger changes between the stable and volatile environments of the ML task compared with the less and more volatile environments of the PRL task. This is because the range of uncertainty in reward probabilities was larger in the ML task than in the PRL task. Consistent with this prediction, we found that the differences between magnitude-to-probability weighting in the volatile and stable environments of the ML task were larger than the differences between magnitude-to-probability weighting in the more and less volatile environments of the PRL task; this effect, however, was only significant in monkeys (one-sided Wilcoxon rank-sum test; monkeys: P < 0.001, d = 0.33, N = 432, 95% CI = [0.07, 0.55]; humans: P = 0.1, d = 0.06, N = 82, 95% CI = [−0.2, 0.68]). Interestingly, we did not find any consistent evidence for effects of volatility on the extent to which an additive strategy was adopted using the likelihood of sessions in monkeys (or fraction of subjects in humans) with additive and hybrid strategies, or using estimated $$\beta _{\mathrm{mult}}$$ values (Figs. 2,3; see Table 2 for detailed statistics). Together, these results suggest that volatility associated with reward probability can strongly influence how this information is weighted relative to reward magnitude.

### Adjustments of learning to uncertainty

It has been shown previously that volatility of the environment influences learning rates16. Therefore, we also compared the estimated learning rates between the two environments of the ML and PRL tasks. We found that in the ML task, learning rates were greater in the volatile compared with the stable environment (Fig. 5; compare with Fig. 2 of ref. 14). By contrast, we did not find any credible evidence for an increase in learning rates between the less and more volatile environments of the PRL task for monkeys or humans (Supplementary Fig. 5). Less-consistent effects of volatility on learning rates compared with effects of volatility on the relative weighting of reward information suggest that changes in weighting of reward information might be a more fundamental adjustment to volatility in the reward environment.

The absence of volatility effects on learning rates should not be considered as a lack of evidence for adjustments in learning processes in the PRL task. As shown previously8, learning of better and worse options (in terms of reward probability) follows different dynamics in the less and more volatile environments of the PRL task (Supplementary Fig. 6a). Nevertheless, we performed additional analyses of choice behaviour in the PRL task to show directly (without using model fitting) that volatility influences both the speed of learning and the relative weighting of reward information (Supplementary Fig. 6c; Supplementary Note 2).

### Flexible representations of reward information in the prefrontal cortex

We looked for neural correlates of the observed behavioural adjustments in the activity of the dorsolateral prefrontal cortex (dlPFC) neurons recorded during the PRL task consisting of two comparable environments with different levels of volatility13. We first used two separate multiple linear regression models for the two environments to characterize neural responses to various events/signals in the current and previous trials and any changes in these responses due to volatility (see Methods). We found a significant difference in a few regression coefficients between the two environments across all neurons: the relative positions of target colours (POSRG), the previous chosen colour (CRG(t − 1)), the interaction of the positions of target colours and previous chosen colour (POSRG × CRG(t − 1)) and the difference in and sum of reward magnitudes of the two options presented on each trial ($$m_{\mathrm{r}} - m_{\mathrm{l}}$$ and $$m_{\mathrm{r}} + m_{\mathrm{l}}$$) (Fig. 6a–d, Supplementary Fig. 7a–f; see equation (11) in Methods).

The regression coefficient for the difference in reward magnitude quantifies how strongly this variable is encoded. Considering the relevance of this encoding for an additive integration or direct comparison of reward attributes, we next examined the relationship between adjustments in dlPFC activity and behaviour in response to changes in volatility of the environment. Among ‘magnitude-difference selective’ neurons, we found a significant positive correlation between behavioural adjustments and changes in encoding of the difference in reward magnitudes for the two options (Kendall correlation (N = 45): r = 0.27, P = 0.018, 95% CI = [0.045, 0.43]; Fig. 6f). Specifically, stronger dlPFC encoding of the difference in magnitudes accompanied larger behavioural weighting of reward magnitude relative to reward probability (magnitude-to-probability weighting) in the more volatile environment. By contrast, we did not find credible evidence for correlation between behavioural adjustments and changes in encoding of any other variables that were significantly represented in neural activity, including the sum in reward magnitudes of the two options (Kendall correlation (N = 45); $$m_{\mathrm{r}} + m_{\mathrm{l}}$$: r = −0.052, P = 0.5, 95% CI = [−0.21, 0.10]; POSRG: r = −0.082, P = 0.4, 95% CI = [−0.29, 0.11]; CRG(t − 1): r = 0.063, P = 0.5, 95% CI = [−0.14, 0.26]; POSRG × CRG(t − 1): r = 0.003, P = 1, 95% CI = [−0.18, 0.20]; Fig. 6e, Supplementary Fig. 7g–i). Together, analyses of neural data suggest a direct link between observed behavioural adjustments and adjustments in the representation of the most relevant variable (that is, the difference in reward magnitudes) in the dlPFC neurons.

## Discussion

A fundamental difference between the multiplicative and additive strategies is that different reward attributes have to be fused for each option before the onset of decision-making processes in the former but not necessarily in the latter. This is because an additive strategy for the construction of subjective value (followed by choice) is equivalent to decision-making based on the weighted sum of the differences in each dimension; choice can be made by direct comparisons of reward attributes in each dimension separately17. The difference between the two strategies has important implications for the flexibility of choice behaviour because fusion of reward attributes results in an integrated value that hinders further independent adjustments of the weight of each attribute on valuation and choice.
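
As a sketch of this equivalence, consider a linear additive value of the form $$\mathrm{SV} = \omega _m u(m) + \omega _p w(p)$$ for the left (l) and right (r) options, where $$u$$ and $$w$$ denote the utility and probability weighting functions and $$\omega _m$$ and $$\omega _p$$ their weights (the subscript r for the right option is our notation). The difference in subjective values then decomposes by attribute:

$$\mathrm{SV}_{\mathrm{l}} - \mathrm{SV}_{\mathrm{r}} = \omega _m\left[ {u(m_{\mathrm{l}}) - u(m_{\mathrm{r}})} \right] + \omega _p\left[ {w(p_{\mathrm{l}}) - w(p_{\mathrm{r}})} \right]$$

Each bracketed term can be computed and reweighted on its own. The corresponding multiplicative difference, $$\omega _{\mathrm{mult}}\left[ {u(m_{\mathrm{l}})w(p_{\mathrm{l}}) - u(m_{\mathrm{r}})w(p_{\mathrm{r}})} \right]$$, has no such attribute-wise decomposition.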

Traditional approaches to behavioural economics and neuroeconomics hold that laboratory measures of economic attitudes (especially with regards to risk and time) measure stable and universal preferences and strategies. However, recent empirical evidence supports the idea that, in humans and other animals, economic preferences are constructed without planning and vary substantially based on ostensibly small contextual factors18. For example, it has been shown that humans adaptively adjust their choice behaviour according to statistics of attended variables, time to receive the reward and current resources19,20,21. Similarly, animal studies of decision-making have demonstrated time-dependent, task-dependent and sequence-dependent choice preferences11,22,23. Therefore, the present findings are consistent with the broader evidence that human and animal decision-makers are not hard-wired to follow fixed strategies assumed by normative models, but instead are endowed with the flexibility needed for learning and choice under uncertainty.

Finally, previous studies have shown that the anterior cingulate cortex (ACC) carries signals related to volatility16 and is crucial for learning from reward feedback under uncertainty24. Here, we find that the dlPFC neurons change their encoding of reward magnitude according to volatility of the environment, suggesting that volatility information may be routed from the ACC to the dlPFC to support flexible behaviour under uncertainty. However, future manipulation studies are required to test this prediction and to elucidate circuit-level mechanisms of adaptive learning25.

## Methods

### Ethics statement

All experimental procedures in monkeys were approved by the Institutional Animal Care and Use Committee (IACUC) at Yale University, the University Committee on Animal Resources at the University of Rochester, or the Institutional Animal Care and Use Committee at the University of Minnesota. All experimental procedures in humans were approved by the Dartmouth College Institutional Review Board and informed consent was obtained from all participants before the experiment.

### Neurophysiological recording

For the PRL task, activity of individual neurons in the dlPFC was recorded extracellularly (left hemisphere in both monkeys) using a 16-channel multi-electrode recording system (Thomas Recording) and a multichannel acquisition processor (Plexon). On the basis of magnetic resonance images, the recording chamber was centred over the principal sulcus and located anterior to the genu of the arcuate sulcus (monkey O, 4 mm; monkey U, 10 mm). All neurons selected for analysis were located anterior to the frontal eye field, which was defined by eye movements evoked by electrical stimulation (current <50 μA) in monkey O. The recording chamber in monkey U was located sufficiently anterior to the frontal eye field, so stimulation was not performed in this animal. Each neuron in the dataset was recorded for a minimum of 320 trials (77 and 149 neurons in monkeys O and U, respectively) and on average for 518.8 trials (s.d. = 147.2 trials). We did not preselect neurons on the basis of activity and all neurons that could be sufficiently isolated for the minimum number of trials were included in the analyses.

### Human participants

For the gambling task in humans, 64 participants (38 females; ages 18–22 years) were recruited from the Dartmouth College student population. No participant was excluded from data analyses for the gambling task. For the ML and PRL tasks in humans, 50 participants (35 females; ages 18–22 years) were recruited from the Dartmouth College student population. Because both the ML and PRL tasks involved learning reward probabilities associated with the two colour targets, we used a criterion to remove participants whose performance (in terms of selecting the option with the higher probability of reward) was not significantly better than chance (0.5). More specifically, we used a performance threshold of 0.5513, equal to 0.5 + 2 s.e.m., based on the average of 380 trials after excluding the first 10 trials of each environment. This resulted in the exclusion of data from 4 out of 50 participants in the ML task and 12 out of 50 participants in the PRL task. No participants had a history of neurological or psychiatric illness.
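
The threshold arithmetic can be checked with a short sketch. It assumes that the s.e.m. is the binomial standard error at chance, sqrt(0.5 × 0.5 / n) with n = 380 trials. The text does not state this formula explicitly, but the assumption reproduces the reported value. The function name `chance_threshold` is hypothetical.

```python
import math

def chance_threshold(n_trials: int, p_chance: float = 0.5, n_sem: float = 2.0) -> float:
    """Performance threshold p_chance + n_sem * s.e.m.

    Assumption: s.e.m. is the binomial standard error of a proportion
    at chance level, sqrt(p * (1 - p) / n).
    """
    sem = math.sqrt(p_chance * (1.0 - p_chance) / n_trials)
    return p_chance + n_sem * sem

threshold = chance_threshold(380)
print(round(threshold, 4))
```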

Participants in all experiments were compensated with a combination of money and ‘T-points’, which are extra credit points for classes within the Department of Psychological and Brain Sciences at Dartmouth College. The base rate for compensation was US$10 per hour or one T-point per hour. Participants were additionally rewarded based on their performance by up to US$10 per hour.

### Gambling task in monkeys

Three male monkeys performed 70,700 (monkey B), 24,700 (monkey C) and 12,872 trials (monkey J) of a gambling task for a total of 146 sessions and 108,272 trials. On each trial of this task, they selected one of two options (Fig. 1a). Options offered either a gamble or a safe (100% probability) bet for liquid (water or dilute cherry juice, depending on the animal’s preference) reward. Gamble offers were defined by two parameters, reward size and probability. Each gamble rectangle was divided into two portions: one red and the other either blue or green. The size of the green or blue portions signified the probability of winning a medium (0.165 ml) or large (0.24 ml) reward, respectively. Probabilities were drawn from a uniform distribution between 0 and 100%, with 1% precision. The rest of the bar was coloured red; the size of the red portion indicated the probability of no reward. A safe option was offered in 11.11% of trials; it was entirely grey and carried a 100% probability of a small reward (0.125 ml).

On each trial, one offer appeared on the left side of the screen and the other appeared on the right. The side of the first and second offer (left and right) was randomized by trial. Following presentation of both offers individually, both offers appeared simultaneously and the animal indicated its choice by shifting its gaze to its preferred offer and maintaining fixation on it. Following a successful fixation, the gamble was immediately resolved and the reward delivered10.

### Gambling task in humans

Each participant performed a gambling task in which he/she selected between a pair of offers on every trial and was provided with reward feedback (Fig. 1c). Gambles were presented as rectangular bars divided into one or two portions. A portion’s colour indicated the reward magnitude of that outcome and its size signalled its probability. The task consisted of either choice between a safe option and a gamble that yields either a reward larger than that of the safe option or no reward with complementary probabilities, or choice between two gambles. Participants evaluated and selected between a total of 63 unique gamble pairs, each of which was shown four times in a random order (total of 252 trials).

Before the beginning of the task, participants completed a training session in which they selected between two safe options. These training sessions were used to familiarize participants with the associations between four different colours (purple, magenta, green and grey) and their corresponding reward values. Reward values were always 0, 1, 2 and 4 points; no reward (0 points) was always assigned to the grey colour. The colour–reward assignment remained consistent for each participant throughout both the training session and its corresponding task. The colour–reward assignments, however, were randomized among participants9.

### PRL task in monkeys

Monkeys O and U completed 45 and 73 sessions (a total of 118 sessions and 66,148 trials) of the PRL task, respectively, in which they had to choose between a red and a green circle on each trial (Fig. 1b). A set of yellow tokens was presented around each target to indicate the magnitude of potential reward on a given target. On each trial, one of the target colours was associated with a high reward probability (80%) whereas the other was associated with the complementary low reward probability (20%). These reward probabilities were fixed within a block of trials and alternated across blocks of 20 (more volatile) or 80 (less volatile) trials to induce different levels of volatility. Thus, block length L was used to manipulate volatility of the environment. If an animal’s choice on a given trial was rewarded, it was given the amount of apple juice associated with the magnitude of the chosen target. Each token corresponded to one drop of juice (0.1 ml). The reward magnitudes associated with each target colour were drawn from the following ten possible pairs: {(1,1), (1,2), (1,4), (1,8), (2,1), (2,4), (4,1), (4,2), (4,4), (8,1)}. Each magnitude pair was counter-balanced across target locations so that reward magnitude did not provide any information about the location of reward. We did not find any systematic differences between the two animals’ behaviour; therefore, we combined the data from both monkeys. Additional details about the task and behaviours of these animals have been reported previously13.
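
The block structure described above can be sketched as a per-trial schedule of the red target's reward probability, with the green target taking the complement. This is an illustrative reconstruction rather than the original task code. The function name `reward_schedule` and its arguments are hypothetical, and which colour starts as the better option is an arbitrary choice here.

```python
def reward_schedule(n_trials, block_length, p_high=0.8, red_better_first=True):
    """Return the reward probability of the red target on each trial.

    Probabilities are fixed within a block and swap between p_high and
    1 - p_high at every block boundary (block_length = 20 or 80 in the text).
    """
    p_low = 1.0 - p_high
    schedule = []
    red_better = red_better_first
    for t in range(n_trials):
        # Reverse the better option at the start of every new block.
        if t > 0 and t % block_length == 0:
            red_better = not red_better
        schedule.append(p_high if red_better else p_low)
    return schedule

more_volatile = reward_schedule(160, block_length=20)
less_volatile = reward_schedule(160, block_length=80)

def count_reversals(schedule):
    """Count trials on which the probability differs from the previous trial."""
    return sum(1 for a, b in zip(schedule, schedule[1:]) if a != b)

print(count_reversals(more_volatile), count_reversals(less_volatile))
```

Over the same number of trials, the shorter block length produces more reversals, which is how block length controls volatility.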

### ML and PRL tasks in humans

Each participant completed two sessions of the experiment, corresponding to the ML and PRL tasks, in which she/he was asked to choose between blue and red or cyan and magenta squares, respectively (Fig. 1d). The magnitude of potential reward (reward points) on a given target was presented as yellow numbers inside each target. Participants were told to select between the two targets on the basis of both presented reward magnitude on each trial and experienced outcomes associated with each colour in the preceding trials to maximize the total number of reward points.

The first session of the experiment (the ML task) started with 200 trials in which the probability of either the red or blue target being rewarded was fixed at 80% or 20% (stable environment). This was followed by a super-block of 200 trials in which reward probabilities associated with the two targets switched between 80% and 20% every 20 or 40 trials (volatile environment). The second session of the experiment (the PRL task) started with either a super-block of 160 trials in which reward probabilities for the two targets switched every 20 trials (more volatile environment) followed by a super-block of 240 trials in which reward probabilities for the two targets switched every 80 trials (less volatile environment), or vice versa. The order of the less and more volatile environments was counter-balanced across participants. Throughout the experiment, reward magnitudes were selected from the following ten possible combinations: {(1,1), (1,2), (1,4), (1,8), (2,1), (2,4), (4,1), (4,2), (4,4), (8,1)} similar to the ML and PRL tasks in monkeys. The target colour associated with the higher probability of reward during the initial block of each session of the experiment was randomly assigned and counter-balanced across participants.

### Analysis of behavioural data

For both the ML and PRL tasks, we first fitted choice data using a reinforcement learning (RL) model to estimate reward probabilities learned by each participant over time. More specifically, we tested an RL model with two learning rates for rewarded and unrewarded trials ($$\alpha _{\mathrm{rew}}$$ and $$\alpha _{\mathrm{unr}}$$). In this model, the two options (coloured targets or distinct shapes) are assigned with complementary probabilities: say $$\hat p_{\mathrm{R}}$$ and $$\hat p_{\mathrm{G}} = 1 - \hat p_{\mathrm{R}}$$ for the red and green targets, respectively. We made this assumption because actual reward probabilities were complementary in all our experiments and models based on complementary estimates provided better fits than those based on independent estimates8. If the red target is selected, the estimated probability for the red target is updated as follows:

$$\hat p_{\mathrm{R}}\left( {t + 1} \right) = \hat p_{\mathrm{R}}\left( t \right) + (\delta _{r\left( t \right),1}\alpha _{\mathrm{rew}} + \delta _{r\left( t \right),0}\alpha _{\mathrm{unr}})\left( {r(t) - \hat p_{\mathrm{R}}\left( t \right)} \right)$$
(1)

where t represents the trial number, $$\hat p_{\mathrm{R}}(t)$$ is the estimated reward probability of the red target on trial t, $$r(t)$$ is the trial outcome (1 for rewarded, 0 for unrewarded), $$\alpha _{\mathrm{rew}}$$ and $$\alpha _{\mathrm{unr}}$$ are the learning rates for rewarded and unrewarded trials, respectively, and $$\delta _{r\left( t \right),X}$$ is the Kronecker delta function ($$\delta _{r\left( t \right),X} = 1$$, if $$r(t) = X$$ and 0 otherwise). On trials when the green target is selected, the update rule is equal to:

$$\hat p_{\mathrm{R}}\left( {t + 1} \right) = \hat p_{\mathrm{R}}\left( t \right) - (\delta _{r\left( t \right),1}\alpha _{\mathrm{rew}} + \delta _{r\left( t \right),0}\alpha _{\mathrm{unr}})\left( {r(t) - \hat p_{\mathrm{G}}\left( t \right)} \right)$$
(2)

because of the assumption about the complementary nature of reward probabilities.
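
A minimal sketch of this update rule is given below. It is not the authors' code. For red choices it follows equation (1) directly. For green choices it applies the same prediction-error update to the green estimate and then takes the complement, which is our reading of the complementarity assumption and keeps the estimate within [0, 1]. The function name `update_p_red` and the parameter values in the example are hypothetical.

```python
def update_p_red(p_red, chose_red, rewarded, alpha_rew, alpha_unr):
    """One-trial update of the estimated reward probability of the red target.

    The learning rate depends on the outcome (alpha_rew if rewarded,
    alpha_unr otherwise). The green estimate is always 1 - p_red.
    """
    alpha = alpha_rew if rewarded else alpha_unr
    r = 1.0 if rewarded else 0.0
    if chose_red:
        # Equation (1): move the red estimate towards the outcome.
        return p_red + alpha * (r - p_red)
    # Assumption: update the chosen (green) estimate, then take the complement.
    p_green = 1.0 - p_red
    p_green = p_green + alpha * (r - p_green)
    return 1.0 - p_green

# Illustrative learning rates, not fitted values.
p = 0.5
p_after_red_win = update_p_red(p, chose_red=True, rewarded=True, alpha_rew=0.2, alpha_unr=0.1)
p_after_green_win = update_p_red(p, chose_red=False, rewarded=True, alpha_rew=0.2, alpha_unr=0.1)
p_after_green_loss = update_p_red(p, chose_red=False, rewarded=False, alpha_rew=0.2, alpha_unr=0.1)
print(p_after_red_win, p_after_green_win, p_after_green_loss)
```

A rewarded red choice raises the red estimate, a rewarded green choice lowers it, and an unrewarded green choice raises it by a smaller amount because a separate learning rate applies to unrewarded trials.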

To examine systematically how each participant combined reward magnitude and estimated (in the ML and PRL tasks) or given (in the gambling task) reward probability into a subjective value, we compared several variations of models in which probabilities and magnitudes were combined additively or multiplicatively. We also considered hybrid models that combine both additive and multiplicative models.

In the additive models, the subjective value of each gamble is computed as follows:

$${{\mathrm{SV}}_{l}} = {\omega_m}u(m_{\mathrm{l}}) + {\omega _p}w(p_{\mathrm{l}})$$
(3)

where $${{\mathrm{SV}}_{l}}$$ is the subjective value of the left gamble, $$m_{\mathrm{l}}$$ is the magnitude of the left gamble, $$p_{\mathrm{l}}$$ is the provided or estimated (equations (1) and (2)) reward probability of the left gamble, $$u\left( m \right)$$ is the utility function, $$w(p)$$ is the probability weighting function (see below) and $$\omega _m$$ and $$\omega _p$$ are the weights assigned to the magnitude and probability, respectively.

In the multiplicative models, the subjective value of each gamble is computed as follows:

$${{\mathrm{SV}}_{l}} = \omega _{\mathrm{mult}}\left( {u(m_{\mathrm{l}}) \times w(p_{\mathrm{l}})} \right)$$
(4)

where $$\omega _{\mathrm{mult}}$$ determines the sensitivity of choice to subjective value based on the multiplicative strategy.

In the hybrid models, the subjective value of each gamble is computed as follows:

$${{\mathrm{SV}}_{l}} = ({\omega _m}u(m_{\mathrm{l}}) + {\omega _p}w(p_{\mathrm{l}})) + \omega _{\mathrm{mult}}\left( {u(m_{\mathrm{l}}) \times w(p_{\mathrm{l}})} \right)$$
(5)

where $$\omega _{\mathrm{mult}}$$ is the weight of the multiplicative component on the subjective value. We normalize these weights to define a set of relative weights ($$\beta _{\mathrm{mult}}$$, $$\beta _m$$, and $$\beta _p$$) as follows:

$$\left\{ {\begin{array}{*{20}{c}} {\beta _{\mathrm{mult}} = \frac{{\omega _{\mathrm{mult}}}}{{\omega _m + \omega _p + \omega _{\mathrm{mult}}}}} \\ {\beta _m = \frac{{\omega _m}}{{\omega _m + \omega _p}}} \\ {\beta _p = \frac{{\omega _p}}{{\omega _m + \omega _p}}} \end{array}} \right.$$
(6)

where $$\beta _{\mathrm{mult}}$$ is the relative weight assigned to the multiplicative component, and $$\beta _m$$ and $$\beta _p$$ measure the relative weight of reward magnitude and probability in the additive component, respectively. Using these definitions, the subjective value in the hybrid models can be written as:

$${{\mathrm{SV}}_{l}} = \omega _{\mathrm{sum}} \times \left( {\left( {1 - \beta _{\mathrm{mult}}} \right)({\beta _m}u(m_{\mathrm{l}}) + {\beta _p}w(p_{\mathrm{l}})) + \beta _{\mathrm{mult}}\left( {u(m_{\mathrm{l}}) \times w(p_{\mathrm{l}})} \right)} \right)$$
(7)

where $$\omega _{\mathrm{sum}} = \omega _m + \omega _p + \omega _{\mathrm{mult}}$$. The model with $$\beta _{\mathrm{mult}} = 0$$ is purely additive and the model with $$\beta _{\mathrm{mult}} = 1$$ is purely multiplicative.
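The equivalence of the raw-weight form (equation (5)) and the normalized form (equation (7)) can be checked numerically. The Python sketch below is illustrative only; the function names are assumptions, and the arguments `u` and `w` stand for the already transformed values $$u(m_{\mathrm{l}})$$ and $$w(p_{\mathrm{l}})$$.

```python
def relative_weights(w_m, w_p, w_mult):
    """Equation (6): map raw weights to (w_sum, beta_mult, beta_m, beta_p)."""
    w_sum = w_m + w_p + w_mult
    beta_mult = w_mult / w_sum
    beta_m = w_m / (w_m + w_p)
    beta_p = w_p / (w_m + w_p)
    return w_sum, beta_mult, beta_m, beta_p


def sv_raw(u, w, w_m, w_p, w_mult):
    """Equation (5): additive component plus multiplicative component."""
    return (w_m * u + w_p * w) + w_mult * (u * w)


def sv_normalized(u, w, w_sum, beta_mult, beta_m, beta_p):
    """Equation (7): the same value expressed with relative weights."""
    additive = beta_m * u + beta_p * w
    multiplicative = u * w
    return w_sum * ((1.0 - beta_mult) * additive + beta_mult * multiplicative)
```

For example, `sv_raw(0.5, 0.8, 1.0, 3.0, 1.0)` and `sv_normalized(0.5, 0.8, *relative_weights(1.0, 3.0, 1.0))` agree up to floating-point rounding, because $$\omega _{\mathrm{sum}}(1 - \beta _{\mathrm{mult}})\beta _m = \omega _m$$ and likewise for $$\omega _p$$.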

The estimated subjective values are then used to compute the probability of selecting left and right based on a logistic function:

$${\mathrm{logit}}\;p\left( {\mathrm{Left}} \right) = \beta _0 + \left( {\mathrm{SV}_{\mathrm{l}} - \mathrm{SV}_{\mathrm{r}}} \right) + \beta _{\mathrm{stay}}D_{\mathrm{pc}}{\mathrm{POS}_{\mathrm{RG}}}$$
(8)

where $$p({\mathrm{Left}})$$ denotes the probability of choosing the left gamble. The first and third terms were used only to fit choice behaviour in the volatile environments, where they capture the bias towards choosing the option on the left or right ($$\beta _0$$) and the tendency to repeat the previously chosen target colour ($$\beta _{\mathrm{stay}}$$), respectively. These terms were confounded with reward values and thus were not used when probabilities were known, as in the gambling task, or fluctuated very little, as in the stable environment of the ML task. Finally, $$D_{\mathrm{pc}}$$ is a dummy variable ($$D_{\mathrm{pc}} = - 1$$ or 1 if the previous choice was green or red, respectively), and $${\mathrm{POS}_{\mathrm{RG}}}\left( t \right)$$ is the relative position of the red and green targets (1 if red is on the right, and −1 otherwise).
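A minimal Python sketch of the choice rule in equation (8) follows. The function name and default arguments are assumptions for illustration; setting `beta0` and `beta_stay` to zero reproduces the reduced form used when the bias and stay terms are omitted.

```python
import math


def p_left(sv_left, sv_right, beta0=0.0, beta_stay=0.0, d_pc=0, pos_rg=0):
    """Probability of choosing the left gamble under equation (8).

    d_pc and pos_rg take values -1 or 1 as defined in the text; the
    zero defaults simply switch the stay term off.
    """
    logit = beta0 + (sv_left - sv_right) + beta_stay * d_pc * pos_rg
    # Inverse of the logit function.
    return 1.0 / (1.0 + math.exp(-logit))
```

With equal subjective values and no bias or stay term, the function returns 0.5; the sensitivity of choice to value differences is carried by the weights inside the subjective values rather than by a separate slope parameter.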

We examined four variations of the additive, multiplicative, and hybrid models (EV, EV+PW, EU and SU) in which the actual values or nonlinear transformations of probabilities and magnitudes were combined additively or multiplicatively. In the EV models, linear functions of reward probabilities and magnitudes were used to estimate the subjective value of each gamble ($$u\left( m \right) = m,w\left( p \right) = p$$). In the EU model, we considered a nonlinear function of reward magnitude to determine the subjective utility of a given reward outcome:

$$u(m) = m^{\rho}$$
(9)

where $$\rho$$ is the exponent of the power law function and determines risk aversion. However, the probability weighting was linear in this model. In the EV+PW model, we considered a linear function of magnitude and a nonlinear probability weighting function. The PW was computed using the one-parameter Prelec function as follows:

$$w(p) = \mathrm{e}^{ - ( - {\mathrm{log}}(p))^\gamma }$$
(10)

where $$\gamma$$ is a parameter that determines the amount and direction of distortion in the probability weighting function. Finally, in the SU model, we used both nonlinear utility and nonlinear probability weighting functions to estimate the subjective value of each gamble. This procedure was similar for the ML, PRL, and gambling tasks.
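The two transformations in equations (9) and (10) can be sketched in Python as below. The function names are illustrative assumptions, and the handling of the endpoints p = 0 and p = 1 is a convention chosen for this sketch rather than something stated in the text.

```python
import math


def utility(m, rho=1.0):
    """Power-law utility of equation (9); rho = 1 gives u(m) = m."""
    return m ** rho


def prelec(p, gamma=1.0):
    """One-parameter Prelec weighting of equation (10); gamma = 1 gives w(p) = p."""
    if p <= 0.0:
        return 0.0  # limiting value as p approaches 0 (assumed convention)
    if p >= 1.0:
        return 1.0  # w(1) = 1 for any positive gamma
    return math.exp(-((-math.log(p)) ** gamma))
```

In this parameterization the four model variants correspond to fixing or freeing the two exponents: EV uses `rho = 1` and `gamma = 1`, EU frees `rho`, EV+PW frees `gamma`, and SU frees both. For `gamma < 1` the function overweights small probabilities and underweights large ones, with a fixed point at p = 1/e.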

All models were fitted to experimental data by minimizing the negative log likelihood of the predicted choice probability given different model parameters, using the fminsearch function in MATLAB (MathWorks Inc.). To avoid overfitting and to deal with different numbers of parameters, we applied a variational Bayesian model selection (BMS) approach to identify the most likely models that could account for our data. We calculated the likelihood of each model using the estimated Dirichlet density from which models are sampled to generate participant-specific data15. The procedure was repeated 50 times using 80% of the data for a given monkey or human participant in a given task to calculate the mean and s.d. of model likelihood in capturing the data. We confirmed our results by computing the Akaike information criterion (AIC), which penalizes the use of additional parameters in a given model. A smaller value of this measure indicates a better fit of choice behaviour. Finally, we found similar results using the Bayesian information criterion for all experimental data and cross-validation for monkey data for which this method could be applied (data not shown).
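The quantities involved in the fit can be sketched in Python as follows. The text does not spell out the formulas, so the standard definitions of the negative log likelihood for binary choices, the AIC and the Bayesian information criterion are assumed here; the function names are illustrative, and the BMS step is not reproduced.

```python
import math


def negative_log_likelihood(chose_left, p_left_pred):
    """Sum of -log p(observed choice) over trials.

    chose_left: iterable of 0/1 choices; p_left_pred: model-predicted p(Left).
    """
    nll = 0.0
    for c, p in zip(chose_left, p_left_pred):
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        nll -= math.log(p) if c else math.log(1.0 - p)
    return nll


def aic(nll, n_params):
    """Akaike information criterion (standard definition, assumed)."""
    return 2.0 * n_params + 2.0 * nll


def bic(nll, n_params, n_trials):
    """Bayesian information criterion (standard definition, assumed)."""
    return n_params * math.log(n_trials) + 2.0 * nll
```

Under these definitions, adding a parameter improves the AIC only if it lowers the negative log likelihood by more than one unit.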

To avoid local minima, the fitting procedure was repeated 20 times for data from each monkey and human participant. For more complex models (EV+PW, EU and SU), we used the estimated parameters of the simplest model (EV) as the initial values for the parameter search. We adopted this method to ensure that the more complex models achieved a negative log likelihood no larger than that of the corresponding simplest model; a larger value could otherwise arise from convergence to local minima when the number of parameters is large.

### Validation of the fitting procedure

To investigate whether our fitting procedure can be used to distinguish between alternative models and accurately estimate model parameters, we simulated choice data using a hybrid model of value construction (equation (7) with linear utility and probability weighting functions) in the gambling and the PRL tasks and fitted these data using additive, multiplicative, and hybrid models. The simulated data in the PRL task were generated using an RL model with two learning rates. We constrained $$\beta _m$$ and $$\beta _{\mathrm{mult}}$$ to be in the range [0, 0.5] and [0, 1], respectively, but kept $$\omega _{\mathrm{sum}}$$ equal to 5 for all simulations. We simulated 10 sets of choice data, each with 4,000 trials, in the gambling task and in the three environments of the PRL task: a stable environment with block length of 200, a less volatile environment with block length of 80, and a more volatile environment with block length of 20. To ensure proper learning, we set the learning rates to 0.4, 0.2 and 0.1 for rewarded trials and 0.1, 0.05 and 0.025 for unrewarded trials in the more volatile, less volatile and stable environments, respectively. We then fitted the simulated data to estimate model parameters and compute the BMS likelihood. The BMS likelihood and the estimated model parameters were computed by averaging over all fits.

### Analysis of neural data

A linear regression model was used to investigate how individual neurons encode various types of information in the PRL task. We included the terms that were shown to have a neural representation in the dorsolateral prefrontal cortex13. To analyse how volatility influences the way single neurons encode these features, we compared the fits of two simple regression models to activity in the less and more volatile blocks, using the following equation:

$$\begin{array}{l} y\left( t \right) = \beta _0 + \beta _1C_{\mathrm{lr}}\left( t \right) + \beta _2C_{\mathrm{lr}}\left( {t - 1} \right) + \beta _3R\left( {t - 1} \right) + \beta _4{\mathrm{POS}}_{\mathrm{RG}}\left( t \right)\\\qquad\quad + \beta _5\left( {m_{\mathrm{r}}\left( t \right) + m_{\mathrm{l}}(t)} \right) + \beta _6\left( {m_{\mathrm{r}}\left( t \right) - m_{\mathrm{l}}(t)} \right) + \beta _7C_{\mathrm{RG}}\left( t \right) + \beta _8C_{\mathrm{RG}}\left( {t - 1} \right)\\\qquad\quad + \beta _9R\left( {t - 1} \right) \times \mathrm{POS}_{\mathrm{RG}}\left( t \right) + \beta _{10}C_{\mathrm{RG}}\left( {t - 1} \right) \times \mathrm{POS}_{\mathrm{RG}}\left( t \right)\\\qquad\quad + \beta _{11}C_{\mathrm{RG}}\left( {t - 1} \right) \times R\left( {t - 1} \right) + \beta _{12}{\mathrm{PRL}}\left( t \right) + \beta _{13}{\mathrm{HVL}}\left( t \right) \end{array}$$
(11)

where $$y\left( t \right)$$ is the firing rate of a neuron for a given epoch on trial t, $$C_{\mathrm{lr}}(t)$$ is the location of the chosen target on trial t, $$R\left( t \right)$$ is the outcome on trial t, $${\mathrm{POS}}_{\mathrm{RG}}(t)$$ is the position of the red and green targets on trial t, $$\left( {m_{\mathrm{r}}\left( t \right) + m_{\mathrm{l}}(t)} \right)$$ and $$\left( {m_{\mathrm{r}}(t) - m_{\mathrm{l}}(t)} \right)$$ are the sum and the difference in reward magnitude of the right and left targets on trial t, respectively, and $$C_{\mathrm{RG}}(t)$$ is the colour of the chosen target on trial t. The $${\mathrm{PRL}}\left( t \right)$$ term stands for the location associated with the high-reward-probability target ($${\mathrm{PRL}}\left( t \right) = C_{\mathrm{lr}}\left( {t - 1} \right) \times R\left( {t - 1} \right)$$), and the $${\mathrm{HVL}}\left( t \right)$$ term indicates the location of the colour associated with the high-reward-probability target ($${\mathrm{HVL}}\left( t \right) = C_{\mathrm{RG}}\left( {t - 1} \right) \times R\left( {t - 1} \right) \times \mathrm{POS}_{\mathrm{RG}}\left( t \right)$$).

To compare the regression coefficients across the two volatility conditions (less and more volatile environments), we randomly removed a subset of trials in each pair of reward magnitudes so that the proportion of trials in which the animal chose the high-reward-probability target was equated for the two conditions. While reducing the difference between these proportions, we ensured that the lowest number of trials was removed for each condition in each session. We repeated this procedure 50 times for each session (removing different sets of trials in each repetition) and averaged the regression coefficients across the repetitions. ‘Magnitude-difference’ selective neurons are defined as neurons that were selective to the difference in magnitudes considering both sessions together. Finally, we did not correct for multiple comparisons to identify the bins at which a given regressor was significantly different from 0 because of the overlap between spikes in neighbouring bins (due to the sliding window). Nevertheless, to assign two successive bins as significant, we required those bins to have significant values and the fraction of neurons with a significant regressor to be larger than 0.15. The latter step was to avoid false positives due to small numbers of neurons. Statistical comparisons were performed using two-sided Wilcoxon signed-rank tests. Finally, for correlation analysis, we only considered spikes between 750 and 1,250 ms after target onset when magnitude information was presented on the screen13.

### Relative modulation due to volatility

To quantify the modulations due to volatility, we computed different quantities for behavioural and neural estimates. Specifically, we defined a relative neural modulation index using estimated standardized regression coefficients as below:

$${\mathrm{Relative}}\;{\mathrm{neural}}\;{\mathrm{modulation}} = {\mathrm{sign}}\left( {\beta _{i\left( {\mathrm{mvol}} \right)} + \beta _{i\left( {\mathrm{lvol}} \right)}} \right) \times (\beta _{i\left( {\mathrm{mvol}} \right)} - \beta _{i\left( {\mathrm{lvol}} \right)})$$
(12)

where $$i = \{ 1, \ldots, 13\}$$ is the regressor index, and $$\beta _{i({\mathrm{lvol}})}$$ and $$\beta _{i({\mathrm{mvol}})}$$ are the estimated regression coefficients for the less and more volatile environments, respectively. Similarly, we defined a relative behavioural modulation index using the behavioural estimate of the ratio of weights for reward magnitude and probability as below:

$$\begin{array}{*{20}{l}}{\mathrm{Relative}}\;{\mathrm{behavioural}}\;{\mathrm{modulation}} = {\mathrm{sign}}\left(\frac{{\beta _{m\left( {\mathrm{mvol}} \right)}}}{{\beta _{p\left( {\mathrm{mvol}} \right)}}} + \frac{{\beta _{m\left( {\mathrm{lvol}} \right)}}}{{\beta _{p\left( {\mathrm{lvol}} \right)}}}\right)\\ \qquad\qquad\qquad\qquad\qquad\qquad\qquad\quad \times \left(\frac{{\beta _{m\left( {\mathrm{mvol}} \right)}}}{{\beta _{p\left( {\mathrm{mvol}} \right)}}} - \frac{{\beta _{m\left( {\mathrm{lvol}} \right)}}}{{\beta _{p\left( {\mathrm{lvol}} \right)}}}\right)\end{array}$$
(13)

where $$\frac{{\beta _{m\left( {\mathrm{lvol}} \right)}}}{{\beta _{p\left( {\mathrm{lvol}} \right)}}}$$ and $$\frac{{\beta _{m\left( {\mathrm{mvol}} \right)}}}{{\beta _{p\left( {\mathrm{mvol}} \right)}}}$$ are the ratios of estimated weights for reward magnitude and probability (magnitude-to-probability weighting) in the less and more volatile environments, respectively, based on the fit of behavioural data using the simplest hybrid model.
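Equations (12) and (13) share the same form and differ only in the quantity they are applied to, which the Python sketch below makes explicit. The function names are illustrative assumptions: for the neural index the arguments are the regression coefficients of one regressor in the two environments, and for the behavioural index they are the magnitude-to-probability weight ratios.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def relative_modulation(x_mvol, x_lvol):
    """Common form of equations (12) and (13).

    x_mvol, x_lvol: the quantity estimated in the more and less
    volatile environments, respectively.
    """
    return sign(x_mvol + x_lvol) * (x_mvol - x_lvol)
```

When the two estimates have the same sign, the index equals the change in their absolute value, so it is positive whenever the effect is stronger in the more volatile environment regardless of whether the coefficient itself is positive or negative. For example, `relative_modulation(-0.3, -0.1)` and `relative_modulation(0.3, 0.1)` both give approximately 0.2.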

### Data analysis

Data collection and analysis were not performed blind to the conditions of the experiments. Unless otherwise noted, data distribution was assumed to be non-normal but this was not formally tested. Statistical comparisons were performed using Wilcoxon signed-rank tests to test the hypothesis of zero median for one sample or the difference between paired samples. No statistical methods were used to pre-determine sample sizes but our sample sizes are similar to those reported in previous similar publications7,16. We used a significance level of 0.05 for all statistical tests. Reported effect sizes are Cohen’s d values. All behavioural analyses, model fitting, and simulations were done using MATLAB 2018a (MathWorks Inc.).

### Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

## Data availability

The data that support the findings of this study are available from the corresponding author upon request.

## Code availability

Custom computer codes that support the findings of this study are available from the corresponding author upon request.

## References

1. Bernoulli, D. Exposition of a new theory on the measurement of risk. Econometrica 22, 23–36 (1954).
2. Edwards, W. The theory of decision making. Psychol. Bull. 51, 380 (1954).
3. Kahneman, D. & Tversky, A. On the psychology of prediction. Psychol. Rev. 80, 237–251 (1973).
4. Stewart, N. Information integration in risky choice: identification and stability. Front. Psychol. 2, 301 (2011).
5. Ernst, M. O. & Banks, M. S. Humans integrate visual and haptic information in a statistically optimal fashion. Nature 415, 429 (2002).
6. Hunt, L. T., Dolan, R. J. & Behrens, T. E. Hierarchical competitions subserving multi-attribute choice. Nat. Neurosci. 17, 1613–1622 (2014).
7. Farashahi, S., Rowe, K., Aslami, Z., Lee, D. & Soltani, A. Feature-based learning improves adaptability without compromising precision. Nat. Commun. 8, 1768 (2017).
8. Farashahi, S. et al. Metaplasticity as a neural substrate for adaptive learning and choice under uncertainty. Neuron 94, 401–414 (2017).
9. Spitmaan, M., Chu, E. & Soltani, A. Salience-driven value construction for adaptive choice under risk. J. Neurosci. 39, 5195–5209 (2019).
10. Strait, C. E., Blanchard, T. C. & Hayden, B. Y. Reward value comparison via mutual inhibition in ventromedial prefrontal cortex. Neuron 82, 1357–1366 (2014).
11. Farashahi, S., Azab, H., Hayden, B. & Soltani, A. On the flexibility of basic risk attitudes in monkeys. J. Neurosci. 38, 4383–4398 (2018).
12. Hayden, B., Heilbronner, S. & Platt, M. Ambiguity aversion in rhesus macaques. Front. Neurosci. 4, 166 (2010).
13. Donahue, C. H. & Lee, D. Dynamic routing of task-relevant signals for decision making in dorsolateral prefrontal cortex. Nat. Neurosci. 18, 295–301 (2015).
14. Massi, B., Donahue, C. H. & Lee, D. Volatility facilitates value updating in the prefrontal cortex. Neuron 99, 598–608 (2018).
15. Stephan, K. E., Penny, W. D., Daunizeau, J., Moran, R. J. & Friston, K. J. Bayesian model selection for group studies. NeuroImage 46, 1004–1017 (2009); erratum 48, 311–311 (2009).
16. Behrens, T. E. J., Woolrich, M. W., Walton, M. E. & Rushworth, M. F. S. Learning the value of information in an uncertain world. Nat. Neurosci. 10, 1214–1221 (2007).
17. Tversky, A. Intransitivity of preferences. Psychol. Rev. 76, 31 (1969).
18. Lichtenstein, S. & Slovic, P. The Construction of Preference (Cambridge Univ. Press, 2006).
19. Ariely, D., Loewenstein, G. & Prelec, D. “Coherent arbitrariness”: stable demand curves without stable preferences. Q. J. Econ. 118, 73–106 (2003).
20. Frederick, S., Loewenstein, G. & O’Donoghue, T. Time discounting and time preference: a critical review. J. Econ. Lit. 40, 351–401 (2002).
21. Kolling, N., Wittmann, M. & Rushworth, M. F. Multiple neural mechanisms of decision making and their competition under changing risk pressure. Neuron 81, 1190–1202 (2014).
22. Ferrari-Toniolo, S., Bujold, P. M. & Schultz, W. Probability distortion depends on choice sequence in rhesus monkeys. J. Neurosci. 39, 2915–2929 (2019).
23. Hayden, B. Y. Time discounting and time preference in animals: a critical review. Psychon. Bull. Rev. 23, 39–53 (2016).
24. Kennerley, S. W., Walton, M. E., Behrens, T. E. J., Buckley, M. J. & Rushworth, M. F. S. Optimal decision making and the anterior cingulate cortex. Nat. Neurosci. 9, 940–947 (2006).
25. Soltani, A. & Izquierdo, A. Adaptive learning under expected and unexpected uncertainty. Nat. Rev. Neurosci. https://doi.org/10.1038/s41583-019-0180-y (2019).
26. Brainard, D. H. The psychophysics toolbox. Spat. Vis. 10, 433–436 (1997).
27. Cornelissen, F. W., Peters, E. M. & Palmer, J. The Eyelink Toolbox: eye tracking with MATLAB and the Psychophysics Toolbox. Behav. Res. Methods Instrum. Comput. 34, 613–617 (2002).

## Acknowledgements

We thank E. Chu, S. Nichols-Worley and L. Tran for collecting human data, and C. Strait and M. Mancarella for collecting monkey data in the gambling task. This work is supported by the National Science Foundation (CAREER Award no. BCS1253576 to B.Y.H. and EPSCoR Award no. 1632738 to A.S.), and the National Institutes of Health (grant no. R01 DA038615 to B.Y.H., grant nos. R01 DA029330 and R01 MH108629 to D.L., and grant no. R01 DA047870 to A.S.). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

## Author information

### Contributions

A.S. conceived the project. C.H.D., B.Y.H. and D.L. designed the experiments in monkeys. S.F. and A.S. designed the human experiments. S.F. and A.S. performed model simulations and analysed the data. C.H.D. and S.F. conducted the experiments. C.H.D., S.F., D.L., B.Y.H. and A.S. analysed and interpreted the experimental data. D.L., B.Y.H. and A.S. wrote the manuscript and all authors contributed to revising the manuscript.

### Corresponding author

Correspondence to Alireza Soltani.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

Peer review information: Primary Handling Editor: Marike Schiffer.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Supplementary information

### Supplementary Information

Supplementary Notes 1 and 2 and Figs. 1–7.

## Rights and permissions

Farashahi, S., Donahue, C.H., Hayden, B.Y. et al. Flexible combination of reward information across primates. Nat Hum Behav 3, 1215–1224 (2019). https://doi.org/10.1038/s41562-019-0714-3
