Abstract
Organisms appear to learn and make decisions using different strategies known as model-free and model-based learning; the former is mere reinforcement of previously rewarded actions and the latter is a forward-looking strategy that involves evaluation of action-state transition probabilities. Prior work has used neural data to argue that both model-based and model-free learners implement a value comparison process at trial onset, but model-based learners assign more weight to forward-looking computations. Here, using eye-tracking, we report evidence for a different interpretation of prior results: model-based subjects make their choices prior to trial onset. In contrast, model-free subjects tend to ignore model-based aspects of the task and instead seem to treat the decision problem as a simple comparison process between two differentially valued items, consistent with previous work on sequential-sampling models of decision making. These findings illustrate a problem with assuming that experimental subjects make their decisions at the same prescribed time.
Introduction
It is becoming clear that there are multiple modes of learning and decisionmaking. For instance, when learning which sequence of actions to choose, some decisionmakers behave as if they are ‘modelfree’, simply repeating actions that previously yielded rewards, while others behave as if they are ‘modelbased’, additionally taking into account whether those outcomes were likely or unlikely given their actions^{1,2,3,4,5,6}. Modelfree behaviour is thought to reflect reinforcement learning^{7}, guided by reward prediction errors^{8,9,10,11,12}. On the other hand, modelbased behaviour is thought to reflect consideration of the task structure, that is, decisionmakers form a ‘model’ of the environment. Modelbased choice has been linked to goalbased behaviour^{13,14}, cognitive control^{15}, slower habit formation^{16}, declarative memory^{17} and extraversion^{18}. What remains unclear is whether modelfree behaviour is a distinct approach to decisionmaking or whether it simply reflects suboptimal inference.
Behavioural and neural evidence seem to support a hybrid model where individuals exhibit mixtures of both modelbased and modelfree behaviour^{4,5,19}. These models assume that the brain uses reward outcomes to update the expected values (‘Qvalues’) of the alternatives and then compares those Qvalues at the time of choice. There is an implicit assumption in these models that the evaluation and choice stages occur at the same time for the modelbased and modelfree components, using a similar reinforcement process, but with the modelbased component incorporating actionstate transition probabilities.
In support of this view, recent neural evidence suggests that the brain employs an arbitration system that compares the reliability of the two systems and adjusts the relative contribution of the modelbased component at the time of choice^{20}. Other evidence argues for a cooperative architecture between the modelbased and modelfree systems^{21}. Finally, modelbased choices have been linked to prospective neural signals reflecting the desired state, supporting the hypothesis of goalbased evaluation^{22}.
At the same time, a parallel literature has used eye tracking and sequentialsampling models to better understand valuebased decisionmaking^{23,24,25,26,27,28,29}. This work has shown that evidence accumulation and comparison drives choices and that the process depends on overt attention. Krajbich et al.^{23} developed the attentional driftdiffusion model (aDDM) to capture this phenomenon. The idea behind this work is that gaze is an indicator of attentional allocation^{30} and that attention to an option magnifies its value relative to the other alternative(s). Subsequent studies have identified similar relationships between option values, eye movements, response times (RT) and choices^{24,31}. Despite this work and the vast literature on oculomotor control and visual search, the connection between selective attention and reinforcement learning, particularly modelbased learning, remains unclear^{26,32,33,34,35}. Some recent evidence does suggest a link between attention and choice in a simple reinforcementlearning task^{26}, but in general the use of eyetracking in this literature is just beginning^{35,36,37}.
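The value-comparison process described above can be illustrated with a minimal simulation sketch. It assumes the commonly described aDDM form, in which a relative decision value drifts towards the looked-at option while the unattended option's value is discounted by a factor theta. The function name, the parameter values (`d`, `theta`, `sigma`) and the fixed dwell duration are illustrative assumptions, not estimates from this study or from prior work.

```python
import random

def simulate_addm_trial(value_left, value_right, d=0.002, theta=0.3,
                        sigma=0.02, dwell_ms=400, max_ms=20000, rng=None):
    """Simulate one binary choice; returns (choice, response_time_ms).

    All parameter defaults are illustrative assumptions.
    """
    if rng is None:
        rng = random.Random()
    evidence = 0.0  # relative decision value: +1 -> left, -1 -> right
    # First gaze location is drawn independently of the option values.
    looking_left = rng.random() < 0.5
    for t in range(1, max_ms + 1):
        if looking_left:
            # The unattended (right) value is discounted by theta.
            drift = d * (value_left - theta * value_right)
        else:
            # The unattended (left) value is discounted by theta.
            drift = d * (theta * value_left - value_right)
        evidence += drift + rng.gauss(0.0, sigma)
        if evidence >= 1.0:
            return "left", t
        if evidence <= -1.0:
            return "right", t
        # Simplification: gaze alternates after a fixed dwell duration.
        if t % dwell_ms == 0:
            looking_left = not looking_left
    # No boundary crossing: fall back on the sign of the evidence.
    return ("left" if evidence > 0 else "right"), max_ms
```

In this sketch, an option that is looked at longer accumulates more evidence in its favour, which is the source of the dwell-time and last-gaze effects discussed in the text.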
Here we seek to investigate whether modelbased and modelfree behaviour reflect a common choice mechanism that utilizes the same information uptake and valuecomparison process (but with varying degrees of accuracy), or whether these choice modes rely on distinct processes. To do so, we use eyetracking to study human subjects in a twostage learning task designed by Daw et al.^{3} to distinguish between modelfree and modelbased learning. Gaze data allow us to test whether modelfree and modelbased subjects engage in the same choice process, and whether modelfree subjects ignore taskrelevant information or simply misinterpret that information. Interestingly, we find that the choices of modelfree subjects show clear signs of an aDDMlike comparison process, whereas modelbased subjects appear to already know which option they will choose, showing signs of directed visual search. Furthermore, modelfree subjects often ignore taskrelevant information, suggesting that they approach the task in a different way than the modelbased subjects.
Results
Behavioural results
We carried out an eyetracking experiment using a twostage decisionmaking task that discriminates between modelbased and modelfree learning^{3}. Fortythree subjects completed the experiment, which consisted of two conditions with 150 trials each.
In the first condition, we replicated the standard design. Each trial had two stages. In the first stage subjects had to make a choice between two Tibetan symbols (arbitrarily labelled ‘A’ and ‘B’ for further analyses) that could lead to one of two secondstage states, ‘purple’ and ‘blue’ (Fig. 1a). The transition was stochastic: one symbol was more likely to lead to the blue state, and the other one was more likely to lead to the purple state. Thus each firststage symbol had a ‘common’ state (probability 0.7) and a ‘rare’ state (probability 0.3) associated with it. Once one of the states was reached, subjects had another choice between two symbols of the respective colour (Fig. 1b). Each of these four symbols was rewarded with a different probability that slowly drifted over the course of the experiment, independently for each symbol and irrespective of the subjects’ choices.
In this task, a pure modelfree learner typically repeats their firststage choice if it led to a reward in the previous trial, irrespective of the state reached in that trial. This behaviour can be captured with a temporaldifference model that updates action values with a reward prediction error^{7}. A pure modelbased learner, on the other hand, typically repeats a rewarded action only if there was a common transition. We model this strategy in the standard way, using a forwardlooking computational model that involves evaluation of prospective states’ values using the empirically estimated transition probabilities^{38}. To fit the data, we adapted a hybrid model that combines both modelfree and modelbased learning, implying a weight w between modelbased and modelfree action values for the firststage symbols^{3,14,39} (w=1 for pure modelbased and w=0 for pure modelfree; see Methods).
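The weighting between the two kinds of action value can be sketched as follows. This is a simplified illustration only: the fitted model has five free parameters and is specified in Methods, whereas the function names, the assumption that symbol 'A' commonly leads to the blue state, and the single learning rate `alpha` below are hypothetical choices made for this example.

```python
def td_update(q_value, reward, alpha):
    """Model-free update: move the value towards the reward by a
    fraction alpha of the reward prediction error."""
    return q_value + alpha * (reward - q_value)

def model_based_values(q_stage2, p_common):
    """Forward-looking first-stage values.

    q_stage2 maps each second-stage state to the values of its two
    symbols, e.g. {'blue': [0.8, 0.2], 'purple': [0.4, 0.1]}.
    Assumed labelling: 'A' commonly leads to blue, 'B' to purple.
    """
    best_blue = max(q_stage2["blue"])
    best_purple = max(q_stage2["purple"])
    return {
        "A": p_common * best_blue + (1 - p_common) * best_purple,
        "B": p_common * best_purple + (1 - p_common) * best_blue,
    }

def hybrid_values(q_mf, q_mb, w):
    """Weighted mixture: w=1 is pure model-based, w=0 pure model-free."""
    return {s: w * q_mb[s] + (1 - w) * q_mf[s] for s in q_mf}
```

For example, with `q_stage2 = {'blue': [0.8, 0.2], 'purple': [0.4, 0.1]}` and `p_common = 0.7`, the model-based value of 'A' is 0.7 × 0.8 + 0.3 × 0.4 = 0.68 and that of 'B' is 0.7 × 0.4 + 0.3 × 0.8 = 0.52.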
The modelfitting procedure uses each subject’s history of choices, transitions and rewards to estimate five free parameters. The resulting parameter distributions were in line with previous findings^{3,22} (Supplementary Table 1). With these parameters, the model assigns socalled ‘Qvalues’ to each option on a trialbytrial basis. As usual, we assume that these Qvalues are the basis for subjects’ choices. In all further analyses, we refer to these hybrid Qvalues as simply ‘Qvalues’.
Subjects exhibited varying degrees of modelbased behaviour, with the value of w ranging from 0 to 1. To illustrate the differences between modelbased and modelfree behaviour, we split our subjects into two groups based on the median w (0.3). In all further analyses, we use the labels ‘modelfree’ and ‘modelbased’ for the two groups defined by this median split. Consistent with prior findings we observed that the more modelfree learners tended to merely repeat previously rewarded choices (mixedeffects regression, N=22, P=10^{−5}, Fig. 2a, Supplementary Table 2), while the more modelbased learners tended to only do so after a common transition (mixedeffects regression, N=21, P=0.005, Fig. 2b, Supplementary Table 2). Note that these results simply confirm that the model parameter w is indeed capturing modelbased behaviour.
Distinct gaze patterns between types
Across multiple prior experiments, the aDDM value comparison process has been linked to several patterns in subjects’ gaze data^{23,24,26,31}. While the model predicts some of them (for example, subjects tend to choose items that they are currently looking at and items that they have looked at longer over the course of the trial), other patterns are merely associated with the process (for example, first gaze location and dwell time (referred to in previous aDDM work as ‘first fixation location’ and ‘fixation duration’) are uncorrelated with value). Here we sought to test whether these patterns were present in the behaviour of our modelfree and modelbased subjects.
The initial gaze location in other experiments, both with learned values^{26} and idiosyncratic preferences^{23} has been found to be unaffected by the values of the choice options. Using a mixedeffects logit regression of initial gaze location on the higher Qvalue item as a function of the Qvalue difference in the firststage decisions, we found no effects for the modelfree subjects (mixedeffects regression estimates: N=22, intercept=0.007, P=0.9; absolute Qvalue difference=0.16, P=0.39, no significant difference from 0.5: N=22, Mann–Whitney test, W=297, P=0.16, Fig. 3a), but effects significantly different from 0.5 (N=21, Mann–Whitney test, W=368, P=0.01) and increasing in the difference of Qvalues, for modelbased learners (mixedeffects regression estimates: N=21, intercept=−0.01, P=0.85; absolute Qvalue difference=0.65, P=0.04, Fig. 3a); the difference between the Qvalue effects for two groups was also significant (N=43, mixedeffects regression group dummy coefficient, P=0.03; regressions that include w’s presented in Supplementary Table 3). These results suggest that unlike modelfree subjects, modelbased subjects often knew ahead of time which symbols they would choose and used peripheral vision to locate them.
Consistent with this idea, modelbased subjects were also more likely to look at only one of the symbols before making their firststage choice (45% versus 32% for modelfree subjects, N=43, P=0.05, controlling for Qvalue difference using a mixedeffects regression). As a result, the two groups had significantly different distributions of the number of gazes per trial (χ^{2}(11)=108.9, P<0.001; Fig. 3b,c).
In line with previous findings^{23} and characteristic of the aDDM process, during the firststage choices, modelfree middlegaze dwells (those that were neither first nor last, so this analysis included only trials with three or more gazes) were shorter if the choice was easier (that is, the difference in Qvalues was larger) (Fig. 3d; mixedeffects regression estimates: N=22, intercept=5.35, P=10^{−12}; absolute Qvalue difference=−2.62, P=0.03), while for modelbased subjects this effect was only marginal (mixedeffects regression estimate: N=21, intercept=5.14, P=10^{−13}; absolute Qvalue difference=−1.79, P=0.07), although the difference in the coefficients between the two groups was not significant. This correlation between dwell time and choice difficulty has been noted in prior aDDM experiments and is likely due to the fact that long gazes in easier trials are more likely to result in a boundary crossing, terminating the decision.
Also consistent with previous aDDM findings, for both groups, in the firststage choices, middlegaze dwell time was independent of the Qvalue for the lookedat symbol (Fig. 3e; mixedeffects regression estimates; N=22 and 21, P=0.31 for modelfree and P=0.71 for modelbased). This finding is important for thinking about causality, as it suggests that overt attention is not drawn to highvalue stimuli.
At the same time, we did not find any evidence that median dwell times (twosided ttest, t(39)=0.85, P=0.4) or RTs (twosided ttest, t(41)=−0.26, P=0.8) were different across the two groups or varied with w (dwell time: Pearson r(41)=0.2, P=0.2; RTs: Pearson r(41)=0.14, P=0.39). Both groups exhibited similarsized effects of log(RT) decreasing with the absolute Qvalue difference (mixedeffects regression estimate; modelfree subjects: N=22, intercept=6.84, P=10^{−16}; absolute Qvalue difference=−0.2, P=0.02; modelbased subjects: N=21, intercept=6.81, P=10^{−16}; absolute Qvalue difference=−0.36, P=0.02), but the interaction between the Qvalue effect and modelbased behaviour was not significant (N=43, mixedeffects group dummy regression coefficient, P=0.3).
In 78% of firststage choices, there were only one or two gazes, so if the symbol that was looked at first was not chosen during that first gaze, it was less likely to be chosen afterwards. This bias was significant for both modelfree and modelbased learners, but was stronger for the latter. Here we estimated a mixedeffects regression for all trials where both symbols were viewed, with the firststage choice as the dependent variable and the following variables as predictors: absolute Qvalue difference, firstgaze location, modelbased dummy and the interaction between the firstgaze location and the modelbased dummy. The results showed a negative effect of the firstgaze location with a stronger effect for modelbased learners (N=43, intercept=0.36, P=0.002; Qvalue difference (B−A)=5.03, P=10^{−16}; firstgaze location=−0.85, P=10^{−13}; modelbased group dummy=0.22, P=0.19; interaction between first gaze location and modelbased group dummy: −0.51, P=0.02; note that the significant intercept does not indicate a preference for the arbitrarily chosen symbol ‘B’, rather it is a consequence of the other significant effects). This is consistent with the hypothesis that a modelbased learner is looking for a particular symbol and so if they look beyond the first symbol, that symbol is unlikely to be chosen.
Model-free behaviour is driven more by dwell time
Choices in both stages were affected not only by the predicted Qvalues, but by gaze patterns as well (Table 1).
First, we restricted our analyses to first-stage choices. Both groups’ choices were affected both by the difference in Q-values and by the location of their final gaze (Fig. 4a). Analogous to prior work with the aDDM, when the Q-value difference is small, last-gaze location strongly predicts subjects’ choices, whereas when the Q-value difference is large, attention has relatively less effect on the choice outcome and subjects overwhelmingly choose the best item irrespective of their last-gaze location. At the same time, model-based subjects’ choices were more affected by the gaze location: for instance, if the last gaze was on the worse symbol (in terms of Q-values), they were more likely to choose that symbol, unlike the model-free subjects (two-group Mann–Whitney test, N=22 and 21, W=150, P=0.02).
On the other hand, modelfree subjects were significantly more influenced by dwell time: they were 27% more likely to choose the lastseen symbol if the total gaze time for that symbol during the trial was longer than the gaze time for the other symbol, while this effect was only 13% in modelbased subjects (Fig. 4b, twogroup Mann–Whitney test, N=22 and 21, W=138, P=0.02). The effect of dwell time on choice is a robust effect in previous work on valuebased choice, so its relative weakness in the modelbased subjects again suggests that they are often using a different choice process.
To account for all the relevant factors, we used a mixedeffects logistic regression model (Table 1) to model subjects’ choices. We included the following choice predictors: trialbytrial Qvalues generated by the computational learning model, the choice on the previous trial, lastgaze dwell time, and first and lastgaze location, as well as interactions between the gaze variables. The two groups exhibited significant differences both in the lastgazeduration effect (Table 1, columns 1 and 3, last gaze duration coefficients) and the lastgazelocation effect (Table 1, columns 1 and 3, last gaze on B coefficients). We also observed a significant positive effect of the interaction between gaze location and a modelbased dummy (N=43, P=0.02) and a significant negative effect of the interaction between these two variables and dwell time (N=43, P=0.02). These regression results were also robust to using w in place of the dummy variable (Table 1, column 3), and to replacing the Qvalues with cumulative rewards for each symbol (Supplementary Table 4).
Next, we restricted this analysis to trials with only one gaze to each symbol (Table 1, columns 2 and 4; these trials constituted 40% of all trials used for eyetracking analysis). These are the most important trials in which to look for differences between modelbased and modelfree subjects, since the visual search process should not require more than two gazes while the aDDM process should generally require at least two gazes. Thus the twogaze trials are where we are most likely to observe both processes. Further supporting our earlier findings, we found that in these trials the lastgaze location effect was significantly stronger for modelbased subjects (N=42, P=0.004, Table 1, column 2), while the dwell time effect was significantly stronger for modelfree subjects (N=42, P=0.007, Table 1, column 2). These results were also robust to continuous w specification (Table 1, column 4).
Finally, if indeed modelbased subjects know ahead of time what they intend to choose, we should also expect to see a similar pattern in the secondstage choices, but weaker for rare transitions than common transitions. To test this hypothesis we repeated the same analyses for the secondstage choices, taking all trials with 2 or more gazes on the secondstage symbols, and found qualitatively similar effects (Supplementary Table 5), which were indeed weaker for raretransition trials (Supplementary Tables 6 and 7).
Taken together, these findings indicate that the behaviour of modelfree subjects is consistent with an aDDM comparison process. On the other hand, modelbased learners seem to often know what they are looking for (a particular symbol leading to the desired state) and thus seem to rely more on a simple visual search process. Naturally, the correspondence between the two learning types and these two processes is imperfect, but the data suggest that subjects that are more likely to employ the modelbased strategy are also more likely to engage in directed visual search. Moreover, the aDDM is mostly able to capture the choice, RT, and gaze patterns (the last gaze effect and the dwelltime effect) observed in the modelfree data, but it cannot account for the shifts in the modelbased data: the stronger last gaze location effect (Fig. 4a) but the weaker dwelltime effects on choice (Fig. 4b; Supplementary Note 1 and Supplementary Figs 1–4). The aDDM predicts that with a change in the model parameters these effects should go in the same direction; thus the aDDM cannot seem to capture the visualsearch process employed primarily by the modelbased subjects.
Visual transition cues affect choice behaviour
Our behavioural results, consistent with previous findings, suggest that modelfree subjects do not properly take the transition structure into account during their firststage choices. What is not yet known is whether these subjects are trying to track this information (and failing) or simply ignoring this aspect of the task. To answer this question, we designed a second condition of the experiment with visual cues to convey trialtotrial variations in the transition probabilities.
In this condition, subjects completed another 150 trials of the same task, with one important difference: the transition probabilities varied randomly and independently across trials. Each trial, the probability of the common transition varied uniformly from 0.4 to 1. Mathematically the mean objective probability of the common transition was 0.7 in both conditions, but in the second condition, subjects had to update the average transition probabilities with the trialtotrial changes in those probabilities. Here we provided subjects with onscreen visual cues indicating the deviations of the transition probabilities from their means (Fig. 1c). For simplicity we refer to the deviation of a symbol’s common colour as its ‘colour deviation’. We conveyed this information with two horizontal bars (one for each symbol). Each bar was coloured partly blue and partly purple. For example, for a symbol that on average leads to the blue state with P=0.7, a halfblue and halfpurple bar would indicate that on this trial the probability of reaching the blue state is P=0.7 (colour deviation=0), while a fullpurple bar would indicate that on this trial the probability of reaching the blue state is only P=0.4 (colour deviation=–0.3). Thus a modelbased subject looking to reach a particular state should utilize both the identities of the symbols and the bars.
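The relationship between a bar and the trial's transition probability can be sketched as below. The two examples in the text (half-and-half bar gives deviation 0; a bar fully in the rare colour gives deviation −0.3) are taken from the source; the linear interpolation between them, and the function names, are assumptions made for illustration.

```python
def colour_deviation(common_fraction):
    """Deviation of the common-transition probability from its mean.

    common_fraction is the share of the bar (0 to 1) drawn in the
    symbol's common colour. Assumes a linear mapping: 0.5 -> 0,
    0.0 -> -0.3, 1.0 -> +0.3.
    """
    return 0.6 * (common_fraction - 0.5)

def trial_common_probability(common_fraction, mean_p=0.7):
    """Probability of the common transition on this trial."""
    return mean_p + colour_deviation(common_fraction)
```

Under this assumed mapping the trial probability spans 0.4 to 1, matching the range stated for the second condition.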
Using a mixedeffects regression, we found that subjects indeed made use of the bars in their choices. Subjects were more likely to choose the same symbol as in the previous trial if that choice led to a rewarded common transition and the symbol’s current colour deviation was greater than the other symbol’s negative colour deviation. In other words, a symbol that typically leads to the blue state would be more likely to be chosen again if it led to a rewarded blue state in the last trial and its current bar contains more blue than the other bar (see Methods). This effect was highly significant (mixedeffects regression, N=43, P=10^{−8}, Fig. 2c,d, Supplementary Table 2, triple interaction between reward, transition type and colour deviation difference). As before, we also found a weak modelfree effect of pure reinforcement (mixedeffects regression, N=43, P=0.08) on the probability of repeating one’s firststage choice, however, we no longer observed a pure modelbased effect of reward interacted with transition type (mixedeffects regression, N=43, P=0.57).
To better understand the change in these effects, relative to the first condition, we fit a modified hybrid-learning model that incorporated an additional weight parameter v for the colour deviations (see Methods). Optimally, a subject should weight the colour information equally to the baseline transition probability information (w=v). Instead we found that subjects heavily overweighted the colour deviations (v=0.67) relative to the baseline probabilities (w=0.16) and the model-free information (1−v−w=0.17; see Supplementary Table 1). This helps to explain the greatly diminished effects of reward and the reward × transition type interaction on subjects’ choices.
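One way to read the three weights, assuming the modified model mixes its components linearly (the exact specification is given in Methods and may differ), is shown below. The labels Q_MB, Q_colour and Q_MF are hypothetical names for the baseline model-based, colour-deviation and model-free components; the numbers are the reported estimates.

```latex
Q_{\mathrm{hybrid}}(a) = w\,Q_{\mathrm{MB}}(a) + v\,Q_{\mathrm{colour}}(a) + (1 - v - w)\,Q_{\mathrm{MF}}(a),
\qquad 1 - v - w = 1 - 0.67 - 0.16 = 0.17 .
```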
Additionally, we observed that, across subjects, the modelfree regression coefficient in this second condition of the experiment was negatively correlated with both v (Pearson r(41)=−0.5, P=0.001) from the second condition and the modelbased weight w (Pearson r(41)=−0.39, P=0.01) from the first condition.
We can also ask whether introducing the transition information encouraged subjects to adopt a more model-based strategy. Indeed, we observed a decrease in the average model-free behaviour (1−w in the first condition and 1−w−v in the second condition) from 0.6 to 0.2 (N=43, Mann–Whitney test, W=350, P=10^{−6}). While we cannot say whether the break between conditions or the additional instructions were partly responsible for this effect, we can rule out a simple effect of decreasing model-free behaviour over time. We fit a model that used two different w’s for the first and the second halves of the first condition, and actually observed a slight increase in the model-free weight (N=43, Mann–Whitney test, W=798, P=0.08).
Model-based learners look more at the transition cues
We hypothesized that modelfree subjects might pay less attention to the coloured bars, indicating that they do not make full use of the task structure. To test this hypothesis, we compared subjects’ gaze patterns with their choice behaviour in both conditions of the experiment.
To measure subjects’ modelbased attention, we calculated the total share of dwell time on the bars compared to the symbols, as well as the probabilities of first and last gaze to the bars. All three variables were strongly correlated (Pearson r(41)>0.9, P<0.001), so we used gaze share in all further analyses.
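The gaze-share measure can be sketched as a simple ratio of dwell times. The data layout (a list of `(region, dwell_ms)` pairs with regions labelled `'bar'` or `'symbol'`) and the function name are hypothetical; the source does not describe how the gaze data were stored.

```python
def bar_gaze_share(gazes):
    """Share of total dwell time spent on the bars rather than the symbols.

    gazes: list of (region, dwell_ms) tuples, region in {'bar', 'symbol'}.
    Returns None when there is no recorded dwell time.
    """
    bar_time = sum(dwell for region, dwell in gazes if region == "bar")
    symbol_time = sum(dwell for region, dwell in gazes if region == "symbol")
    total = bar_time + symbol_time
    if total == 0:
        return None
    return bar_time / total
```

For instance, 300 ms on the bars and 100 ms on the symbols gives a gaze share of 0.75.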
Similar to our previous analyses, we measured the effect of the bars on subjects’ choices in two ways, one with a regression model and one with the hybrid-learning model. For the first analysis, we estimated a mixed-effects logit regression predicting subjects’ choices in response to the pure reinforcement effect (reward coefficient, Supplementary Table 2), the model-based effect (interaction between reward and transition) and the effect of the bar information (interaction between reward, transition and colour deviation). For the second analysis, we included the parameter v that captures the weight that subjects put on the colour deviations in their choices.
We found that the bar gaze share was positively correlated, across subjects, with both the bar-colour choice effect in the mixed-effects regression (Pearson r(41)=0.53, P=10^{−5}; Fig. 5a) and the model parameter v (Pearson r(41)=0.7, P=10^{−7}). The gaze share was also negatively correlated with the model-free (reward) coefficient in the mixed-effects model (Pearson r(41)=−0.65, P=10^{−6}; Fig. 5b) and the model-free weight 1−w−v (Pearson r(41)=−0.7, P=10^{−7}) of the hybrid-learning model, indicating that model-free subjects were considerably less likely to look at the visual cues.
Finally, this same bargazeshare measure was positively correlated with the modelbased weight w from the first condition of the experiment (Pearson r(41)=0.37, P=0.015; Fig. 5c): on average, subjects that were classified as modelfree learners were looking at the indicator bars 57% of the time, while modelbased learners were looking at these bars 75% of the time. These outofsample results indicate that modelfree subjects ignore crucial aspects of the decision task, rather than simply misinterpreting that information.
Discussion
These results provide new insights into the intrinsic differences between modelbased and modelfree learning. Gaze data revealed that modelbased learners seem to know what they will choose before the options even appear, while modelfree learners employ an onthespot value comparison process that ignores the structure of the environment. The modelbased learners were more likely to look at the best option first, were most likely to look at only one option, and their choices were relatively unaffected by gaze time. On the other hand, the modelfree learners mostly looked at both options, often multiple times, made choices that were strongly influenced by relative gaze time, and ignored visual cues that provided information about transition probabilities. We propose that there are two distinct processes being observed at the ‘time of choice’ in these multistage decision tasks. One is a stimulusdriven comparison process exhibited primarily by modelfree subjects and the other is a simple visual search process to find an already chosen item, more typical for modelbased subjects. Our novel condition with visual cues conveying modelbased information was able to significantly increase subjects’ reliance on transitionprobability information, suggesting that the mere presence of explicit information may encourage modelbased behaviour.
Our findings highlight the need to study the dynamical properties of decisions rather than treating them as static processes. The random initial gaze location and effects of gaze time and final gaze location on model-free choice align closely with previous aDDM research on decisions between food items and consumer goods^{23,24,31} as well as subsequent studies using conditioned stimuli^{26} and monetary gambles^{29}. This work has demonstrated that these relationships between attention and choice are a natural consequence of a value-comparison process that is biased towards the currently attended option^{23,40,41,42}. These studies typically find that gaze location and dwell time are independent of the values of the stimuli (at least in binary choice^{24,43}), suggesting a causal effect of attention on choice (see also^{25,42,44}). Other research has argued for the opposite direction of causality, that is, that the reward process might be able to bias attention towards more valuable stimuli^{32,45,46,47}.
Our results also have implications for the research into the neural underpinnings of modelfree and modelbased behaviour. Some of these studies have shown that modelbased computations, particularly state prediction error evaluations, are performed at the time of reward^{48,49}, while others show prospective computation at the ‘time of choice’^{3,22,50,51}. Because our findings suggest that the time of choice is systematically different across groups, this means that stimulus driven neural activity must be interpreted with some caution, as it could reflect the decision process or the postdecision search process. Future neuroimaging experiments could instead investigate activity during the intertrial interval to see if it is possible to detect modelbased planning then.
There are two prominent mechanisms for how memories might be integrated to guide decisions^{52}. One is prospective integration, which involves retrieving memories at the time a response is required^{22}. The other is retrospective integration, which involves learning at feedback time, before the next decision is confronted^{48,53,54}. The DYNA framework suggests how such learning could occur^{21} and successor representation describes how this could be extended to multistage environments^{55,56}. Our findings provide support for a retrospective mechanism, which occurs between trials. However, we depart from these established models in arguing that what occurs between trials is not only learning, but the actual choice as well. It may be possible to test this directly: for instance, one would predict that a decrease in intertrial interval might decrease modelbased behaviour as there will be less time to retrospectively learn and plan for the next trial.
Finally, the distinction between modelbased and modelfree learners is reminiscent of the distinction between proactive and reactive cognitive control^{57}. In the dualmechanismsofcontrol framework, proactive control involves anticipatory goalrelated activity, while reactive control is purely stimulus driven. Similarly, we have argued that modelbased behaviour involves formulating a plan before the coming decision, while modelfree behaviour is stimulus driven. However, a key difference is that in most cognitive control settings the stimuli are unpredictable, whereas in our setting the same two options are present in every trial. This suggests that our results may possibly extend to more complex settings where the options vary from trial to trial^{22}.
Our study also demonstrates another use for eyetracking data in the study of decisionmaking: determining whether subjects make use of all of the available information. The eyetracking results from the second condition of our experiment showed that modelfree subjects do not simply misinterpret or miscalculate the task structure information, but rather tend to ignore it, even when it is presented in explicit visual form. On the other hand, modelbased subjects clearly attend to this information and use it to inform their model of the task. These results corroborate previous findings in MouseLab studies of strategic bargaining^{58}, where subjects who were better at forwardlooking strategic thinking were also more likely to collect information about payoffs in future states. Moreover, providing these visual cues seemed to reduce the overall amount of modelfree learning, suggesting a potential remedy to this suboptimal behaviour.
Further research into the dynamical properties of modelbased and modelfree behaviour is clearly needed to gain a better understanding of the distinction between these modes of learning. Our study has provided an initial glimpse into the different mechanisms underlying these behaviours, but more work is needed to link the eyetracking data to neuroimaging results. We hope that this research will fuel further investigation into the ties between static models of learning and dynamical models of choice, which will certainly yield deeper insights into these core topics in decision science.
Methods
Subjects
Fortyfive students (19 female) at The Ohio State University were recruited from the Department of Economics subject pool. Subjects were paid based on their overall performance in the decision task, at a rate of 5¢ for one reward point, with a minimum payment of $5 as a showup fee. Subjects earned an average of $16. One subject experienced software crashes during the experiment, and another one explicitly failed to understand the task, so these two subjects’ data were excluded from the analysis. The Ohio State University Internal Review Board approved the experiment, and all subjects provided written informed consent.
Twostage decision task
In the first condition of the experiment, subjects completed 150 trials of a twostage Markov decision process task^{3}, with a short break after every 50 trials (two breaks in total; Fig. 1a). On each trial, the first stage involved a choice between two Tibetan symbols that had different probabilities of transition to two possible secondstage states (blue and purple, by the colour of the boxes that contained the symbols). One symbol was more likely to lead (on average) to the blue state, while the other one was more likely to lead to the purple state (Fig. 1b). On every trial, for each firststage symbol the transition probability to its common state was independently and randomly sampled from a uniform distribution on the interval between 0.4 and 1, resulting in an average of P=0.7 for each symbol. The other, rare state was therefore reached with an average probability of 0.3. Subjects were instructed that each symbol was more likely to lead to one of the secondstage states, but they had to identify the transition probabilities on their own.
In the second stage, subjects were required to choose between two symbols in the state they reached (blue or purple). Each of the four secondstage symbols had an independent probability of yielding a fixed reward. During the course of the experiment, these probabilities drifted independently in the range from 0.25 to 0.75 according to slow Gaussian walks with mean=0 and s.d.=0.025 to facilitate learning and exploring different states.
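The task statistics described above can be illustrated with a short Python sketch that samples the per-trial transition probabilities and lets the four reward probabilities drift. The function names, the random initial reward probabilities and the reflecting treatment of the 0.25 and 0.75 bounds are assumptions made for illustration; the text does not state how the walks were kept within range.

```python
import random

def sample_common_prob(rng):
    """Per-trial probability of the common transition, uniform on [0.4, 1]."""
    return rng.uniform(0.4, 1.0)

def drift(p, rng, sd=0.025, lo=0.25, hi=0.75):
    """One Gaussian random-walk step; reflection at the bounds is an assumption."""
    p += rng.gauss(0.0, sd)
    # Reflect until the value is back inside [lo, hi].
    while p < lo or p > hi:
        p = 2 * lo - p if p < lo else 2 * hi - p
    return p

rng = random.Random(0)
reward_probs = [rng.uniform(0.25, 0.75) for _ in range(4)]  # initial values assumed
common_probs = []
for trial in range(150):
    common_probs.append(sample_common_prob(rng))
    reward_probs = [drift(p, rng) for p in reward_probs]

mean_common = sum(common_probs) / len(common_probs)  # close to 0.7 on average
```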
In each stage, the position of the symbols on the screen was randomized. Choices were made using a keyboard, and every choice was followed by a white frame around the chosen symbol for 0.5 s. All choices had free RT. After the secondstage symbol was chosen, it was displayed at the center of the screen, with the outcome shown in the bottom part of the screen (either ‘+1 point’ or ‘0 points’).
Before starting the task, subjects were introduced to the rules of the task, including a short practice on each part of the task, and a 30trial practice session with different stimuli.
Twostage decision task with visual transition cues
In the second condition of the experiment, subjects completed another 150 trials of the twostage task with a modified firststage decision screen. Under each symbol we displayed visual cues for the respective deviations of the transition probabilities (Fig. 1c). The cues were presented in the form of coloured (blue and purple) bars. The horizontal size of each bar was equal to the horizontal size of the symbol boxes. Each bar showed the deviation of the particular trial transition probability from the average (0.7). At the average, each bar had blue and purple segments of equal size. If the probability of transition to a common state was sampled closer to 1, the segment of that state’s colour had a larger share of the bar, proportional to the absolute deviation from 0.7. On the other hand, if the transition probability approached 0.4, that segment’s share was smaller. Subjects went through additional instructions and training with the bars to ensure comprehension of the task.
The second stage decision screen was exactly the same as in the first condition of the experiment (see above). Subjects were instructed that the reward probabilities for all four secondstage symbols were randomly reset for this condition, but that the firststage symbols retained their transition probabilities.
Eyetracking methods
Subjects’ gaze data were recorded using an EyeLink 1000+ desktopmounted eyetracker with a chin rest and sampled at 1000 Hz. Before every choice, subjects were required to fixate at the center of the screen for 2 s; otherwise the software did not allow them to proceed. This ensured unbiased initial gaze positions. The task was created and displayed using Matlab and Psychtoolbox^{59}. The chin rest was placed 65 cm away from the screen, and the screen resolution was set at 1920 × 1080.
Eyetracking data analysis
The following procedure was applied to the gaze position data. The size of a symbol was set to 400 × 290 pixels. A gaze on a symbol was recorded if the gaze position was within a region of interest (ROI) that included the symbol itself and a 50pixel margin, so the horizontal distance between ROIs was 460 pixels. The ROIs for the bar indicators were set at the same size as the symbol ROIs centered around the bars. Vertically, the symbols were centered at 33% of the distance from the top of the screen, and the bars were placed at 80% distance from the top of the screen.
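The ROI geometry above can be sketched in Python as follows. This is an illustration only: it assumes that the two ROIs in each row are placed symmetrically about the horizontal centre of the screen, and the ROI names and helper functions are hypothetical.

```python
SCREEN_W, SCREEN_H = 1920, 1080
ROI_W, ROI_H = 400 + 2 * 50, 290 + 2 * 50   # symbol plus 50-pixel margin
GAP = 460                                    # horizontal distance between ROIs

def build_rois():
    """Return {name: (x_min, y_min, x_max, y_max)}; symmetric layout assumed."""
    left_x = (SCREEN_W - (2 * ROI_W + GAP)) / 2
    right_x = left_x + ROI_W + GAP
    rois = {}
    # Symbols are centred at 33% and bars at 80% of the screen height.
    for row, frac in (("symbol", 0.33), ("bar", 0.80)):
        y_min = frac * SCREEN_H - ROI_H / 2
        for side, x_min in (("left", left_x), ("right", right_x)):
            rois[f"{side}_{row}"] = (x_min, y_min, x_min + ROI_W, y_min + ROI_H)
    return rois

def classify(x, y, rois):
    """Name of the ROI containing the gaze sample, or None if it is off-ROI."""
    for name, (x0, y0, x1, y1) in rois.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return name
    return None

ROIS = build_rois()
```

Under these assumptions a sample at the screen centre falls between the two symbol ROIs and is classified as off-ROI, which is consistent with the central fixation required before each choice.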
Trials with no gaze on the ROIs were excluded from all gaze analyses (the mean number of such trials in the first stage was 20 out of 150 per subject, with 30 subjects having fewer than 20 trials excluded; there was no significant difference in the number of excluded trials between modelbased and modelfree learners, twosided ttest, N=22 and 21, P=0.77). The main results were also robust to restricting the analysis to these 30 subjects (Supplementary Figs 5–7, Supplementary Table 8).
Gaps between two gazes on the same ROI were interpreted as a blink or a technical error (for example, eyetracker losing the pupil) and treated as one gaze to the same item. Gaps between gazes on two different ROIs were discarded.
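A minimal Python sketch of this gap-handling rule is given below, assuming that each eyetracker sample has already been labelled with an ROI name or with None when it falls outside every ROI. The function name is hypothetical, and counting the gap samples towards the duration of the merged gaze is an assumption of the sketch.

```python
from itertools import groupby

def merge_gazes(labels):
    """Collapse per-sample ROI labels into gazes as (roi, n_samples).

    None marks samples outside every ROI. A gap flanked by the same ROI is
    absorbed into one gaze; a gap between different ROIs is dropped.
    """
    runs = [(roi, len(list(group))) for roi, group in groupby(labels)]
    gazes = []
    pending_gap = 0
    for roi, n in runs:
        if roi is None:
            pending_gap = n if gazes else 0   # leading gaps are ignored
            continue
        if gazes and gazes[-1][0] == roi:
            # Same ROI on both sides: treat the gap as a blink or tracker loss.
            gazes[-1] = (roi, gazes[-1][1] + pending_gap + n)
        else:
            gazes.append((roi, n))
        pending_gap = 0
    return gazes

example = merge_gazes(["L", "L", None, "L", "R", None, None, "L"])
```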
First condition choice analysis
We implement a variant of a wellknown hybrid learning model^{38} that assigns socalled actionstate Qvalues to every action and combines the SARSA(λ) (stateactionrewardstateaction) modelfree reinforcement and a forwardlooking modelbased strategy that makes use of the empirical transition probabilities to evaluate expected values of the firststage choices.
The modelfree learning strategy uses only reward information to update the Qvalues. These values are initialized to zero at the beginning of the experiment. Let a_{1} be the symbol chosen in the first stage of the task, and a_{2} be the secondstage choice (subscripts generally indicate stage number). Then, after a trial t is completed and a reward r(t)∈{0,1} is received, the modelfree Qvalue Q_{2}^{MF} of the chosen secondstage symbol a_{2} is updated in the following way:
$$Q_{2}^{MF}(a_{2},t+1)=Q_{2}^{MF}(a_{2},t)+\alpha\left[r(t)-Q_{2}^{MF}(a_{2},t)\right] \qquad (1)$$
where α is a learning rate parameter, and r(t)−Q_{2}^{MF}(a_{2},t) is a reward prediction error. This process is identical for both modelfree and modelbased decision makers as there is no stochastic transition after this stage.
The value of the symbol chosen in the first stage is also updated through the reinforcement process, using both the secondstage reward prediction error and the prediction error that comes from the difference between the obtained and expected value of the secondstage state:
$$Q_{1}^{MF}(a_{1},t+1)=Q_{1}^{MF}(a_{1},t)+\alpha\left[Q_{2}^{MF}(a_{2},t)-Q_{1}^{MF}(a_{1},t)\right]+\alpha\lambda\left[r(t)-Q_{2}^{MF}(a_{2},t)\right] \qquad (2)$$
where λ is an eligibility trace parameter that captures the effect of the secondstage prediction error on the firststage action value.
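The modelfree update rules described above can be sketched in Python as follows. The sketch assumes the standard SARSA(λ) form used in hybrid models of this task, in which the firststage value moves towards the value of the chosen secondstage symbol and additionally receives a λ-weighted share of the secondstage reward prediction error; the function and symbol names are hypothetical.

```python
def model_free_update(q1, q2, a1, a2, reward, alpha, lam):
    """Apply one trial of SARSA(lambda)-style updates in place.

    q1: dict mapping first-stage symbols to model-free values.
    q2: dict mapping second-stage symbols to model-free values.
    """
    delta2 = reward - q2[a2]          # second-stage reward prediction error
    delta1 = q2[a2] - q1[a1]          # first-stage prediction error
    q1[a1] += alpha * delta1 + alpha * lam * delta2
    q2[a2] += alpha * delta2
    return delta1, delta2

# All values start at zero, as at the beginning of the experiment.
q1 = {"A": 0.0, "B": 0.0}
q2 = {"blue1": 0.0, "blue2": 0.0, "purple1": 0.0, "purple2": 0.0}
model_free_update(q1, q2, "A", "blue1", reward=1, alpha=0.5, lam=0.5)
```

With λ=0 the firststage value would be updated only by the difference between the secondstage and firststage values, so a reward would reach the firststage symbol only on later trials.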
The modelbased learning strategy incorporates the empirical transition probabilities into the updating process^{38}:
$$Q_{1}^{MB}(a_{1},t)=P(\mathrm{blue}\mid a_{1})\max_{a\in\mathrm{blue}}Q_{2}^{MF}(a,t)+P(\mathrm{purple}\mid a_{1})\max_{a\in\mathrm{purple}}Q_{2}^{MF}(a,t) \qquad (3)$$
where P(blue|a_{1}) and P(purple|a_{1}) are the respective transition probabilities after choosing action a_{1}, which are calculated using BetaBinomial Bayesian updating:
$$P(\mathrm{blue}\mid a)=\frac{1+N(\mathrm{blue}\mid a)}{2+N(\mathrm{blue}\mid a)+N(\mathrm{purple}\mid a)},\qquad P(\mathrm{purple}\mid a)=1-P(\mathrm{blue}\mid a) \qquad (4)$$
where N(blue|a) and N(purple|a) are the numbers of times the blue or purple state was reached after making a choice a.
The hybrid model Qvalue for each firststage choice is calculated using a convex combination of the modelfree and modelbased action values:
$$Q_{1}(a_{1},t)=w\,Q_{1}^{MB}(a_{1},t)+(1-w)\,Q_{1}^{MF}(a_{1},t) \qquad (5)$$
where w is a weight parameter restricted between 0 (pure modelfree strategy) and 1 (pure modelbased strategy).
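As an illustration of how the modelbased and hybrid values could be computed, the following Python sketch combines an empirical transition estimate with the secondstage values. It assumes a uniform Beta(1, 1) prior for the transition estimate and takes the better of the two symbols in each secondstage state as that state's value; both are common choices for this task rather than details confirmed here, and all names and numbers are hypothetical.

```python
def transition_prob(n_blue, n_purple):
    """Posterior mean of P(blue | a) under a uniform Beta(1, 1) prior (assumed)."""
    return (1 + n_blue) / (2 + n_blue + n_purple)

def hybrid_value(q1_mf, q2, counts, a1, w):
    """Convex combination of model-based and model-free first-stage values."""
    p_blue = transition_prob(*counts[a1])
    best_blue = max(q2["blue1"], q2["blue2"])
    best_purple = max(q2["purple1"], q2["purple2"])
    q_mb = p_blue * best_blue + (1 - p_blue) * best_purple
    return w * q_mb + (1 - w) * q1_mf[a1]

q2 = {"blue1": 0.6, "blue2": 0.2, "purple1": 0.4, "purple2": 0.1}
q1_mf = {"A": 0.3, "B": 0.5}
counts = {"A": (7, 1), "B": (2, 6)}   # (times blue reached, times purple reached)
value_a = hybrid_value(q1_mf, q2, counts, "A", w=0.5)
```

Setting w=0 in this sketch returns the modelfree value unchanged, and w=1 returns the purely modelbased value.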
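A logit discrete choice model is assumed for both stage choices, with probability of the second stage choice computed as
$$P(a_{2},t)=\frac{\exp\left[\beta Q_{2}^{MF}(a_{2},t)\right]}{\sum_{a'}\exp\left[\beta Q_{2}^{MF}(a',t)\right]} \qquad (6)$$
and the first stage as
$$P(a_{1},t)=\frac{\exp\left\{\beta\left[Q_{1}(a_{1},t)+p\cdot 1(a_{1})\right]\right\}}{\sum_{a'}\exp\left\{\beta\left[Q_{1}(a',t)+p\cdot 1(a')\right]\right\}} \qquad (7)$$
where the sums run over the two symbols available at the respective stage, β is a traditional choice ‘inverse temperature’ parameter, 1(a_{1}) is an indicator function that returns 1 if the same symbol was chosen in trial t−1, and 0 otherwise, and p is a parameter that captures ‘stickiness’ in firststage choices.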
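A logit choice rule with a stickiness term can be sketched in Python as follows. The sketch places the stickiness bonus inside the inverse-temperature scaling, which is an assumption, and the function name and example values are hypothetical.

```python
import math

def first_stage_probs(q, prev_choice, beta, p):
    """Logit choice probabilities with a stickiness bonus for the repeated symbol.

    Placing the stickiness term inside the inverse-temperature scaling is an
    assumption of this sketch.
    """
    utilities = {a: beta * (v + (p if a == prev_choice else 0.0))
                 for a, v in q.items()}
    top = max(utilities.values())               # subtract max for stability
    weights = {a: math.exp(u - top) for a, u in utilities.items()}
    total = sum(weights.values())
    return {a: wgt / total for a, wgt in weights.items()}

# Two equally valued symbols: stickiness alone favours repeating symbol A.
probs = first_stage_probs({"A": 0.4, "B": 0.4}, prev_choice="A", beta=3.0, p=0.2)
```

The secondstage probabilities follow from the same function with p=0, since no stickiness term applies there.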
The hybrid model has 5 free parameters: α,β,λ,p,w. We do not use different α’s and β’s for the two stages for several reasons: (a) a larger number of parameters adds more noise to the estimation of our parameter of interest, w, (b) previous studies do not provide conclusive statistical evidence of a difference between first and second stage α and β values in the population^{3,22}, and (c) we find that for almost all of the subjects (40/43) the simpler model fits the data better, in terms of the Bayesian information criterion (BIC), than the model with two additional parameters.
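The BIC comparison in point (c) can be illustrated with a short Python sketch. The loglikelihood values below are invented for illustration, and counting two choices per trial as observations is an assumption.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; lower values indicate a better fit."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

# Hypothetical log-likelihoods for one subject with 300 choices (two per trial).
simple = bic(log_likelihood=-170.0, n_params=5, n_obs=300)
extended = bic(log_likelihood=-168.5, n_params=7, n_obs=300)
prefers_simple = simple < extended
```

In this made-up case the two extra parameters improve the loglikelihood by 1.5, which is less than the penalty they incur, so the simpler model has the lower BIC.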
We fit the model individually to each subject’s data using a maximum likelihood estimator and the probability formulas defined above. We restrict α, λ and w to lie between 0 and 1, β to be positive, and use a Nelder–Mead optimization procedure with 10,000 random starting points to help ensure that the global maximum of the likelihood is reached. The fitted parameter values were used to derive hybrid Qvalues on a trialbytrial basis for each subject individually.
Second condition choice analysis
In the second condition of the experiment, choices were influenced by visual cues presented on the screen. For this condition, we used a modified model that assumed two weight parameters instead of one. For notational simplicity, let p_{state}=P(state|a_{1}) be the empirical probability of transition to a particular state after the firststage action a_{1}, estimated via the Bayesian updating formula in (4), and Δp_{state}=p(t)−0.7 be the deviation of the trial transition probability from the mean. Then the hybrid Qvalue for action a_{1} is defined as
$$Q_{1}(a_{1},t)=(1-w)\,Q_{1}^{MF}(a_{1},t)+\sum_{state}\left(w\,p_{state}+v\,\Delta p_{state}\right)\max_{a\in state}Q_{2}^{MF}(a,t) \qquad (8)$$
where w is a weight assigned to the mean transition probability that has to be inferred from the symbol’s identity, and v is a weight assigned to the correction provided by the colour bars.
In all other aspects, the computational model in this part of the experiment is equivalent to the model in the first part.
Regression analysis
In addition, following previous literature^{3,18}, we ran a hierarchical logistic regression. All data were fit using mixedeffects regression in the lme4 package^{60} in R, and all coefficients were treated as random effects at the subject level. The dependent variable was the choice of the same firststage symbol that was chosen in the previous trial (stay=1, switch=0).
In the first condition of the experiment, it was regressed on the previous trial reward (1 or 0) and type of the previous trial state transition: 1 if the state reached was common for the chosen symbol and 0 if it was rare, as well as the interaction of these two variables. In this regression, the coefficient on the reward reflects modelfree choice, and the coefficient on the interaction reflects modelbased choice: a modelbased subject would repeat their choice if the transition was common, and would switch to the other symbol if the transition was rare.
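The regressors for the firstcondition analysis can be assembled as in the following Python sketch, which only builds the design rows and does not fit the mixedeffects model. The dictionary keys and the function name are hypothetical.

```python
def stay_regressors(trials):
    """Build rows for the stay/switch regression from consecutive trials.

    Each trial is a dict with keys 'choice', 'reward' (0 or 1) and
    'common' (1 if the transition was common, 0 if rare); the key names
    are hypothetical.
    """
    rows = []
    for prev, curr in zip(trials, trials[1:]):
        rows.append({
            "stay": int(curr["choice"] == prev["choice"]),
            "reward": prev["reward"],
            "transition": prev["common"],
            "reward_x_transition": prev["reward"] * prev["common"],
        })
    return rows

trials = [
    {"choice": "A", "reward": 1, "common": 1},
    {"choice": "A", "reward": 0, "common": 0},
    {"choice": "B", "reward": 1, "common": 0},
]
rows = stay_regressors(trials)
```

Each row pairs the outcome of one trial with the stay or switch decision on the next, so a sequence of n trials yields n−1 rows.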
In the second condition of the experiment, we used an additional continuous variable, colour, that was equal to the difference between the colour deviation for the common transition for the symbol chosen in the previous trial and the colour deviation for the same state for the other symbol. For example, if symbol A was chosen on the previous trial, and its common state was blue, then the colour variable in the current trial would be equal to the blue area for symbol A minus the blue area for symbol B. In this regression, we used all three regressors as well as their interactions. In addition to the effects described above, the interaction between reward, transition type and colour measures the effect of the visual cues on choices in the first stage of the task.
Data availability
The data that support the findings of this study are available from the corresponding author upon request.
Additional information
How to cite this article: Konovalov, A. et al. Gaze data reveal distinct choice processes underlying modelbased and modelfree reinforcement learning. Nat. Commun. 7:12438 doi: 10.1038/ncomms12438 (2016).
References
Gläscher, J., Daw, N., Dayan, P. & O’Doherty, J. P. States versus rewards: dissociable neural prediction error signals underlying modelbased and modelfree reinforcement learning. Neuron 66, 585–595 (2010).
Beierholm, U. R., Anen, C., Quartz, S. & Bossaerts, P. Separate encoding of modelbased and modelfree valuations in the human brain. Neuroimage 58, 955–962 (2011).
Daw, N. D., Gershman, S. J., Seymour, B., Dayan, P. & Dolan, R. J. Modelbased influences on humans’ choices and striatal prediction errors. Neuron 69, 1204–1215 (2011).
Daw, N. D. & Dayan, P. The algorithmic anatomy of modelbased evaluation. Philos. Trans. R. Soc. B Biol. Sci. 369, 20130478 (2014).
Daw, N. D. Modelbased reinforcement learning as cognitive search: neurocomputational theories. Cogn. Search Evol. Algorithms Brain at http://citeseerx.ist.psu.edu/viewdoc/download?rep=rep1&type=pdf&doi=10.1.1.216.209 (2012).
Wunderlich, K., Symmonds, M., Bossaerts, P. & Dolan, R. J. Hedging your bets by learning reward correlations in the human brain. Neuron 71, 1141–1152 (2011).
Sutton, R. S. & Barto, A. G. Reinforcement Learning: an Introduction MIT Press (1998).
Schultz, W. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997).
Glimcher, P. W. Understanding dopamine and reinforcement learning: the dopamine reward prediction error hypothesis. Proc. Natl Acad. Sci. 108, 15647–15654 (2011).
O’Doherty, J. Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science 304, 452–454 (2004).
Wimmer, G. E., Daw, N. D. & Shohamy, D. Generalization of value in reinforcement learning by humans. Eur. J. Neurosci. 35, 1092–1104 (2012).
Niv, Y., Edlund, J. A., Dayan, P. & O’Doherty, J. P. Neural prediction errors reveal a risksensitive reinforcementlearning process in the human brain. J. Neurosci. 32, 551–562 (2012).
Dolan, R. J. & Dayan, P. Goals and habits in the brain. Neuron 80, 312–325 (2013).
Eppinger, B., Walter, M., Heekeren, H. R. & Li, S.C. Of goals and habits: agerelated and individual differences in goaldirected decisionmaking. Front. Neurosci. 7, 253 (2013).
Otto, A. R., Skatova, A., MadlonKay, S. & Daw, N. D. Cognitive control predicts use of modelbased reinforcement learning. J. Cogn. Neurosci. 27, 319–333 (2015).
Gillan, C. M., Otto, A. R., Phelps, E. A. & Daw, N. D. Modelbased learning protects against forming habits. Cogn. Affect. Behav. Neurosci. 15, 523–536 (2015).
Doll, B. B., Shohamy, D. & Daw, N. D. Multiple memory systems as substrates for multiple decision systems. Neurobiol. Learn. Mem. 117, 4–13 (2015).
Skatova, A., Chan, P. A. & Daw, N. D. Extraversion differentiates between modelbased and modelfree strategies in a reinforcement learning task. Front. Hum. Neurosci. 7, 525 (2013).
Doll, B. B., Simon, D. A. & Daw, N. D. The ubiquity of modelbased reinforcement learning. Curr. Opin. Neurobiol. 22, 1075–1081 (2012).
Lee, S. W., Shimojo, S. & O’Doherty, J. P. Neural computations underlying arbitration between modelbased and modelfree learning. Neuron 81, 687–699 (2014).
Gershman, S. J., Markman, A. B. & Otto, A. R. Retrospective revaluation in sequential decision making: A tale of two systems. J. Exp. Psychol. Gen. 143, 182–194 (2014).
Doll, B. B., Duncan, K. D., Simon, D. A., Shohamy, D. & Daw, N. D. Modelbased choices involve prospective neural activity. Nat. Neurosci. 18, 767–772 (2015).
Krajbich, I., Armel, C. & Rangel, A. Visual fixations and the computation and comparison of value in simple choice. Nat. Neurosci. 13, 1292–1298 (2010).
Krajbich, I. & Rangel, A. Multialternative driftdiffusion model predicts the relationship between visual fixations and choice in valuebased decisions. Proc. Natl Acad. Sci. 108, 13852–13857 (2011).
Towal, R. B., Mormann, M. & Koch, C. Simultaneous modeling of visual saliency and value computation improves predictions of economic choice. Proc. Natl Acad. Sci. 110, E3858–E3867 (2013).
Cavanagh, J. F., Wiecki, T. V., Kochar, A. & Frank, M. J. Eye tracking and pupillometry are indicators of dissociable latent decision processes. J. Exp. Psychol. Gen. 143, 1476–1488 (2014).
Ashby, N. J., Dickert, S. & Glöckner, A. Focusing on what you own: Biased information uptake due to ownership. Judgm. Decis. Mak. 7, 254–267 (2012).
Ashby, N. J., Walasek, L. & Glöckner, A. The effect of consumer ratings and attentional allocation on product valuations. Judgm. Decis. Mak. 10, 172–184 (2015).
Stewart, N., Hermens, F. & Matthews, W. J. Eye movements in risky choice. J. Behav. Decis. Mak. 29, 116–136 (2015).
Hoffman, J. E. & Subramaniam, B. The role of visual attention in saccadic eye movements. Percept. Psychophys. 57, 787–795 (1995).
Krajbich, I., Lu, D., Camerer, C. & Rangel, A. The attentional driftdiffusion model extends to simple purchasing decisions. Front. Psychol. 3, 193 (2012).
Gottlieb, J. Attention, learning, and the value of information. Neuron 76, 281–295 (2012).
Hayhoe, M. & Ballard, D. Eye movements in natural behavior. Trends Cogn. Sci. 9, 188–194 (2005).
Wills, A. J., Lavric, A., Croft, G. S. & Hodgson, T. L. Predictive learning, prediction errors, and attention: evidence from eventrelated potentials and eye tracking. J. Cogn. Neurosci. 19, 843–854 (2007).
Hu, Y., Kayaba, Y. & Shum, M. Nonparametric learning rules from bandit experiments: the eyes have it! Games Econ. Behav. 81, 215–231 (2013).
Niv, Y. et al. Reinforcement learning in multidimensional environments relies on attention mechanisms. J. Neurosci. 35, 8145–8157 (2015).
Knoepfle, D. T., Wang, J. T. & Camerer, C. F. Studying learning in games using eyetracking. J. Eur. Econ. Assoc. 7, 388–398 (2009).
Otto, A. R., Gershman, S. J., Markman, A. B. & Daw, N. D. The curse of planning: dissecting multiple reinforcementlearning systems by taxing the central executive. Psychol. Sci. 24, 751–761 (2013).
Dezfouli, A. & Balleine, B. W. Actions, action sequences and habits: evidence that goaldirected and habitual action control are hierarchically organized. PLoS Comput. Biol. 9, e1003364 (2013).
Shimojo, S., Simion, C., Shimojo, E. & Scheier, C. Gaze bias both reflects and influences preference. Nat. Neurosci. 6, 1317–1322 (2003).
Gottlieb, J., Hayhoe, M., Hikosaka, O. & Rangel, A. Attention, reward, and information seeking. J. Neurosci. 34, 15497–15504 (2014).
Milosavljevic, M., Navalpakkam, V., Koch, C. & Rangel, A. Relative visual saliency differences induce sizable bias in consumer choice. J. Consum. Psychol. 22, 67–74 (2012).
Towal, R. B., Mormann, M. & Koch, C. Simultaneous modeling of visual saliency and value computation improves predictions of economic choice. Proc. Natl Acad. Sci. 110, E3858–E3867 (2013).
Armel, K. C., Beaumel, A. & Rangel, A. Biasing simple choices by manipulating relative visual attention. Judgm. Decis. Mak. 3, 396–403 (2008).
Peck, C. J., Jangraw, D. C., Suzuki, M., Efem, R. & Gottlieb, J. Reward modulates attention independently of action value in posterior parietal cortex. J. Neurosci. 29, 11182–11191 (2009).
Yasuda, M., Yamamoto, S. & Hikosaka, O. Robust representation of stable object values in the oculomotor basal ganglia. J. Neurosci. 32, 16917–16932 (2012).
Lee, J. & Shomstein, S. Rewardbased transfer from bottomup to topdown search tasks. Psychol. Sci. 25, 466–475 (2013).
Wimmer, G. E. & Shohamy, D. Preference by association: how memory mechanisms in the hippocampus bias decisions. Science 338, 270–273 (2012).
Shohamy, D. & Wagner, A. D. Integrating memories in the human brain: hippocampalmidbrain encoding of overlapping events. Neuron 60, 378–389 (2008).
Frank, M. J. et al. fMRI and EEG predictors of dynamic decision parameters during human reinforcement learning. J. Neurosci. 35, 485–494 (2015).
Simon, D. A. & Daw, N. D. Neural correlates of forward planning in a spatial decision task in humans. J. Neurosci. 31, 5526–5539 (2011).
Shohamy, D. & Daw, N. D. Integrating memories to guide decisions. Curr. Opin. Behav. Sci. 5, 85–90 (2015).
Foster, D. J. & Wilson, M. A. Reverse replay of behavioural sequences in hippocampal place cells during the awake state. Nature 440, 680–683 (2006).
KurthNelson, Z., Barnes, G., Sejdinovic, D., Dolan, R. & Dayan, P. Temporal structure in associative retrieval. eLife 4, e04919 (2015).
Dayan, P. Improving generalization for temporal difference learning: the successor representation. Neural Comput. 5, 613–624 (1993).
Gershman, S. J., Moore, C. D., Todd, M. T., Norman, K. A. & Sederberg, P. B. The successor representation and temporal context. Neural Comput. 24, 1553–1568 (2012).
Braver, T. S. The variable nature of cognitive control: a dual mechanisms framework. Trends Cogn. Sci. 16, 106–113 (2012).
Johnson, E. J., Camerer, C., Sen, S. & Rymon, T. Detecting failures of backward induction: monitoring information search in sequential bargaining. J. Econ. Theory 104, 16–47 (2002).
Cornelissen, F., Peters, E. & Palmer, J. The eyelink toolbox: eye tracking with MATLAB and the psychophysics toolbox. Behav. Res. Methods Instrum. Comput. 34, 613–617 (2002).
Bates, D., Maechler, M. & Bolker, B. lme4: Linear mixedeffects models using S4 classes. R package version 1.1-10. http://CRAN.R-project.org (2012).
Acknowledgements
We thank N.D. Daw for sharing experiment code and tutorials.
Author information
Contributions
Both authors designed the experiment and analyses. A.K. programmed and conducted the experiment, performed the data analysis, and cowrote the paper. I.K. cowrote the paper and supervised the project.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Figures 1–7, Supplementary Tables 1–8 and Supplementary Note 1
Rights and permissions
This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/