Supplementary Material: Learning the value of information and reward over time when solving exploration-exploitation problems

To adapt flexibly to the demands of their environment, animals constantly face a conflict between choosing predictably rewarding familiar options (exploitation) and risky novel options whose value lies essentially in the new information they provide about the space of possible rewards (exploration). Despite extensive research, the mechanisms by which animals solve this exploitation-exploration dilemma are still poorly understood. Here, we investigate human decision-making in a gambling task in which the informational value of each trial and the reward potential were separately manipulated. To better characterize the mechanisms that underlie the observed behavioural choices, we introduce a computational model that augments the standard reward-based reinforcement learning formulation by associating a value to information. We find that both reward and information gained during learning influence the balance between exploitation and exploration, and that this influence depends on the reward context. Our results shed light on the mechanisms that underpin decision-making under uncertainty, and suggest new approaches for investigating the exploration-exploitation dilemma throughout the animal kingdom.


Performance
To investigate whether participants were actively engaged in the task (i.e., learning reward outcomes to increase their total gain), we examined the degree to which participants played strategically to maximize their total gain. To do so, we computed the overall-exploitative choices (i.e., choosing the deck with the highest average of points obtained in the previous trials) made by participants during the entire free-choice task. All participants scored above the chance level of 33%. A Wilcoxon Signed Rank Test on subjects' average overall-exploitative choices revealed a significant difference between exploitative choices (M = 0.650, SD = 0.060) and chance level, p < 10⁻⁷, indicating that participants played strategically during the task. We also asked whether participants' performance was affected by the 4 conditions adopted in the task. To examine choice behaviour across conditions, a Friedman Rank Sum Test was conducted on overall-exploitative choices under the Baseline, Reward, Information and Mixed conditions, yielding χ²(3, N = 21) = 49.971, p < 10⁻¹¹. Pairwise comparisons using the Wilcoxon Signed Rank Test revealed a significant increase in overall-exploitative choices in the Reward condition compared to the Baseline, Information and Mixed conditions, all p values < 10⁻⁵. Overall-exploitative choices in the Mixed condition were greater than those in the Information and Baseline conditions, all p values < 10⁻³. Participants' performance was compromised in situations containing equal generative means, which make the identification of the best option challenging.

[...] correlation was found in the equal information condition, p = 0.202. These results seem to suggest that, because α affects only reward values, participants who were more prone to directed exploration of the environment tend to have lower α values. Indeed, a negative correlation was observed between the probability of directed exploration and participants' learning rate α, r = -0.563, n = 21, p = 0.008. Overall, these results indicate that, in those participants, lower values of α seem not to reflect an absence of learning during the forced-choice task, but rather a preference profile toward unknown options.
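The overall-exploitative choice measure described in the Performance analysis above can be illustrated with a minimal sketch. The function name, the data layout (deck indices paired with points) and the handling of ties and of trials with no prior history are assumptions made for illustration, not details taken from the original analysis.

```python
def exploitative_fraction(choices, rewards, n_decks=3):
    """Fraction of scored trials on which the chosen deck had the highest
    running average of points over the previous trials.

    Assumptions (not from the source): ties count as exploitative, decks
    never sampled so far have no average, and a trial is scored only once
    at least one deck has been sampled.
    """
    totals = [0.0] * n_decks   # summed points per deck so far
    counts = [0] * n_decks     # number of samples per deck so far
    exploit = 0
    scored = 0
    for deck, points in zip(choices, rewards):
        sampled = [d for d in range(n_decks) if counts[d] > 0]
        if sampled:
            means = {d: totals[d] / counts[d] for d in sampled}
            best = max(means.values())
            scored += 1
            if deck in means and means[deck] == best:
                exploit += 1
        # Update the history only after the trial has been classified.
        totals[deck] += points
        counts[deck] += 1
    return exploit / scored if scored else float("nan")


# Hypothetical sequence of six trials over three decks.
example = exploitative_fraction([0, 1, 2, 2, 2, 0], [10, 20, 30, 30, 30, 10])
```

With three decks, a value near 1/3 would correspond to the 33% chance level mentioned above. Per-participant values computed this way could then be passed to a signed-rank test against chance.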

Qualitative model comparison analysis
Overall behaviour in the unequal information condition
To inspect participants' overall behaviour in the unequal information condition, we computed directed exploration, random exploration and exploitation in the first free-choice trials of the unequal information condition (Information + Mixed). Trials were classified as directed exploratory when participants chose the option that was never selected during forced-choice trials, exploitative when participants chose the deck with the highest average of points, and random exploratory when neither criterion was met. Averaged values were entered into a Friedman Rank Sum Test.
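The trial classification just described can be sketched as follows. This is an illustrative reading of the three criteria, not the authors' analysis code: the function names, the integer deck indices and the tie handling are assumptions.

```python
from collections import Counter


def classify_first_free_choice(forced_choices, forced_rewards, free_choice, n_decks=3):
    """Label the first free-choice trial as 'directed', 'exploit' or 'random'.

    Assumption (not from the source): a tie for the highest average counts
    as exploitative.
    """
    counts = [0] * n_decks
    totals = [0.0] * n_decks
    for deck, points in zip(forced_choices, forced_rewards):
        counts[deck] += 1
        totals[deck] += points
    # Directed exploration: the deck was never shown during forced-choice trials.
    if counts[free_choice] == 0:
        return "directed"
    # Exploitation: the deck has the highest average among sampled decks.
    means = [totals[d] / counts[d] for d in range(n_decks) if counts[d] > 0]
    if totals[free_choice] / counts[free_choice] == max(means):
        return "exploit"
    # Anything else is treated as random exploration.
    return "random"


def strategy_proportions(labels):
    """Proportion of each strategy across a list of trial labels."""
    tally = Counter(labels)
    return {k: tally[k] / len(labels) for k in ("directed", "exploit", "random")}


# Hypothetical forced-choice history in which deck 2 is never sampled.
forced = [0, 0, 0, 0, 1, 1]
points = [40, 50, 60, 50, 70, 80]
labels = [classify_first_free_choice(forced, points, c) for c in (2, 1, 0)]
```

The per-participant proportions returned by `strategy_proportions` are the kind of averaged values that would enter the rank test.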

Results revealed an effect of information on decision strategies, χ²(2, N = 21) = 35.88, p < 10⁻⁸ (Figure S1).

Reward context and utility vs. cost trade-off
In the analysis of the effect of reward context on decision strategies, we observed that the two exploratory strategies showed diametrically opposite effects following changes in reward context, as predicted by the two models. To better understand why random exploration was higher in the High Reward context, we conducted an additional analysis in the equal information condition (Baseline + Reward conditions). We hypothesized that the reason might be due to an effect of utility vs.

Correlations among model parameters
After fitting the kRL model (Table S1), we observed correlations between model parameters.

Specifically, we observed a negative correlation between the learning rate α and the softmax inverse

Learning dynamics
After evaluating whether participants were learning reward outcomes (using a global estimate of Q prior) and accumulating information during the decision process, we questioned the type of learning that took place during the task. We assumed that participants used the δ learning rule to update reward