Gaze data reveal distinct choice processes underlying model-based and model-free reinforcement learning

Organisms appear to learn and make decisions using different strategies known as model-free and model-based learning; the former is mere reinforcement of previously rewarded actions and the latter is a forward-looking strategy that involves evaluation of action-state transition probabilities. Prior work has used neural data to argue that both model-based and model-free learners implement a value comparison process at trial onset, but model-based learners assign more weight to forward-looking computations. Here, using eye-tracking, we report evidence for a different interpretation of prior results: model-based subjects make their choices prior to trial onset. In contrast, model-free subjects tend to ignore model-based aspects of the task and instead seem to treat the decision problem as a simple comparison process between two differentially valued items, consistent with previous work on sequential-sampling models of decision making. These findings illustrate a problem with assuming that experimental subjects make their decisions at the same prescribed time.

excluding the 13 subjects whose zero-gaze-trial rate was above average.

Figure 3 excluding the 13 subjects whose zero-gaze-trial rate was above average.

Figure 4 excluding the 13 subjects whose zero-gaze-trial rate was above average.

Means and standard deviations of model parameters across subjects, in condition 1 and condition 2 of the experiment. The computational hybrid learning model was fit to each subject individually using a maximum likelihood estimator. α is the learning rate, β is the inverse temperature parameter, and λ is the eligibility trace parameter. The remaining parameters are the stickiness of choice, the weight w of the model-based strategy (w = 0 for pure reinforcement and w = 1 for pure model-based behavior), and the weight on the color deviations in the second condition of the experiment. LL is the maximized posterior likelihood.

P (stay)
Part 1 all data

Supplementary Table 2. Probability of choosing the same symbol as in the previous trial (1 = stay, 0 = switch) conditional on the previous trial's reward outcome (1 = rewarded, 0 = unrewarded) and the type of transition (1 = common, 0 = rare), fixed effect at population level. All regression models are mixed-effects logistic regressions performed in the lme4 R package (formula: stay ~ reward*transition + (1 + reward*transition | subject)). Column 1 shows the pooled data exhibiting a mixture of pure reinforcement and model-based learning. Columns 2 and 3 show the results for model-free and model-based learners, classified using a full hybrid computational model. Column 4 shows the pooled data for the second part of the experiment (formula: stay ~ reward*transition*color + (1 + reward*transition*color | subject)). The color variable is the difference between the color deviation for the chosen symbol and the negative color deviation for the other symbol.

Supplementary Table 3. Coefficients from logit regression models (P(look at higher Q first) ~ abs(QB - QA) x Model-based) with standard errors clustered at the subject level (the mixed-effects model did not converge in the continuous w case, so we used this more conservative method instead). Columns 1 and 2 report the results that use a group dummy; columns 3 and 4 report the results with a continuous w per subject. Columns 2 (reported in the main text) and 4 display the results without one subject whose individual coefficient for abs(QB - QA) was an extreme outlier (~100 times higher than for the rest of the subjects).

P (B chosen)
Group

Supplementary Table 5. Fixed effects coefficient estimates of second-stage choice regressions in condition 1 of the experiment (symbols are arbitrarily labeled A and B) using mixed effects logistic models, trials with two or more symbols attended. Model-based behavior is more affected by the gaze site, while model-free behavior is more influenced by gaze durations.

Supplementary Table 6. The same analysis as in Supplementary Table 5, with only common transition trials included. Fixed effects coefficient estimates of second-stage choice regressions in condition 1 of the experiment (symbols are arbitrarily labeled A and B) using mixed effects logistic models, trials with two or more symbols attended, only common transition trials included. Model-based subjects are more influenced by the last gaze location (in bold).

Supplementary Table 7. The same analysis as in Supplementary Table 5, with only rare transition trials included. Fixed effects coefficient estimates of second-stage choice regressions in condition 1 of the experiment (symbols are arbitrarily labeled A and B) using mixed effects logistic models, trials with two or more symbols attended, only rare transition trials included. We observe no significant difference in coefficients for the two groups (in bold).

Supplementary Table 8. Table 1 excluding the 13 subjects whose zero-gaze-trial rate was above average.

Supplementary Note 1: Exploratory aDDM analyses
Following Krajbich et al. (2010), we fit an attentional drift-diffusion model (aDDM), which assumes that a decision variable evolves over time as a random walk that starts at 0 and stops when it reaches a barrier at $-1$ or $+1$. If the subject is looking at symbol A, the decision variable changes with a constant drift rate equal to $d(Q_A - \theta Q_B) + \varepsilon_t$, where $d$ is a scale parameter, $Q_A$ and $Q_B$ are Q-values estimated using the hybrid-learning computational model, $\theta$ (from 0, reflecting full gaze bias, to 1, the regular DDM case) is a parameter that reflects the bias towards the item currently looked at, and $\varepsilon_t$ is normally distributed noise with mean 0 and standard deviation $\sigma$. If the subject is looking at symbol B, the drift rate is equal to $d(\theta Q_A - Q_B) + \varepsilon_t$. The model assumes that the first gaze goes randomly to one of the symbols with probability estimated from the data, that gazes then alternate between the two symbols until one of the barriers is reached, and that every trial has a fixed non-decision time.
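The process described above can be illustrated with a short simulation. The following Python sketch is not the code used for the reported analyses; it assumes 1 ms time steps, a fixed dwell time per gaze (`dwell_ms`), and that the $+1$ barrier corresponds to choosing symbol A, none of which is specified in this note. The function name and all default values are hypothetical.

```python
import numpy as np


def simulate_addm_trial(q_a, q_b, d, theta, sigma, rng,
                        p_first_a=0.5, dwell_ms=300, non_decision_ms=0,
                        max_ms=10_000):
    """Simulate one aDDM trial in 1 ms steps (illustrative sketch).

    Returns (choice, rt_ms, last_gaze). choice is None if no barrier is
    reached within max_ms.
    """
    value = 0.0  # the decision variable starts at 0
    # First gaze goes to A with probability p_first_a (estimated from data
    # in the paper; 0.5 here is a placeholder).
    gaze = "A" if rng.random() < p_first_a else "B"
    for t in range(1, max_ms + 1):
        if gaze == "A":
            drift = d * (q_a - theta * q_b)   # unattended value discounted
        else:
            drift = d * (theta * q_a - q_b)
        value += drift + rng.normal(0.0, sigma)
        if value >= 1.0:    # assumption: upper barrier means "choose A"
            return "A", t + non_decision_ms, gaze
        if value <= -1.0:   # assumption: lower barrier means "choose B"
            return "B", t + non_decision_ms, gaze
        if t % dwell_ms == 0:  # assumption: gazes alternate at fixed intervals
            gaze = "B" if gaze == "A" else "A"
    return None, max_ms + non_decision_ms, gaze


rng = np.random.default_rng(0)
trials = [simulate_addm_trial(0.6, 0.4, d=0.002, theta=0.3, sigma=0.02, rng=rng)
          for _ in range(200)]
last_gaze_chosen = np.mean([c == g for c, _, g in trials if c is not None])
print(f"P(last gaze on chosen item) = {last_gaze_chosen:.2f}")
```

With $\theta < 1$ the drift favors whichever symbol is currently fixated, so simulated choices tend to coincide with the last gaze location; with $\theta = 1$ the gaze has no effect and the model reduces to a regular DDM.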
First, to fit the model, following the standard DDM approach (Ratcliff and McKoon, 2008), we calculated the empirical distribution of response times (RTs) binned by 5 quantiles (0.1, 0.3, 0.5, 0.7, 0.9) and 11 choice difficulty bins ($Q_A - Q_B$ ranging from $-1$ to $+1$ in steps of 0.2) for the pooled subject data. We compared it with the simulated aDDM RT-difficulty distributions produced by 5000 randomly drawn sets of parameters (the scale $d$, the gaze bias $\theta$, the noise standard deviation $\sigma$, and the non-decision time) and selected the set that minimized the chi-square ($\chi^2$) statistic. These fits provided the expected fits to the RT distributions, but did not match the choice probabilities (especially for the model-based group) and largely missed the key trends in the eye-tracking data (Supplementary Figures 1 and 2).
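As an illustration of this kind of quantile-based fit, the sketch below computes one common form of the chi-square statistic for the RTs in a single difficulty bin. The exact statistic used for the reported fits is not given in this note, so the formula, the function names, and the omission of choice proportions are all assumptions made for the example.

```python
import numpy as np

QUANTILES = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
# Five quantile cut points split the observed RTs into six bins
# holding these proportions of the data.
OBSERVED_PROPS = np.array([0.1, 0.2, 0.2, 0.2, 0.2, 0.1])


def difficulty_bin(q_a, q_b):
    """Index (0..10) of the nearest difficulty bin centered at -1, -0.8, ..., +1."""
    return int(np.clip(np.rint((q_a - q_b) / 0.2), -5, 5)) + 5


def chi_square_rt(observed_rts, simulated_rts, eps=1e-6):
    """Chi-square distance between observed and simulated RT distributions.

    Assumed form: N * sum((p_obs - p_sim)^2 / p_sim) over the six RT bins
    defined by the observed quantiles.
    """
    observed_rts = np.asarray(observed_rts, dtype=float)
    simulated_rts = np.asarray(simulated_rts, dtype=float)
    edges = np.quantile(observed_rts, QUANTILES)
    # Assign each simulated RT to one of the six bins.
    idx = np.searchsorted(edges, simulated_rts, side="right")
    sim_counts = np.bincount(idx, minlength=6)
    # eps guards against division by zero when a bin is empty.
    sim_props = np.maximum(sim_counts / simulated_rts.size, eps)
    return observed_rts.size * np.sum((OBSERVED_PROPS - sim_props) ** 2 / sim_props)


rng = np.random.default_rng(0)
observed = rng.gamma(shape=4.0, scale=200.0, size=500)    # synthetic RTs in ms
good_sim = rng.gamma(shape=4.0, scale=200.0, size=5000)   # same distribution
fast_sim = rng.gamma(shape=4.0, scale=150.0, size=5000)   # RTs too fast
print(chi_square_rt(observed, good_sim), chi_square_rt(observed, fast_sim))
```

In a full fit, this statistic would be summed over the 11 difficulty bins for each candidate parameter set, and the set with the smallest total would be retained.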
Next, we relied on previous work that showed that the gaze-bias parameter $\theta$ typically takes on a value of 0.3 across several choice domains. Using this value, we adjusted the other model parameters (the scale $d$ and the noise standard deviation $\sigma$) to achieve the best fit to the model-free data. We identified a set of parameters that provided a substantially improved fit to the choice and eye-tracking data without much of a detriment to the RT fits (see Supplementary Figure 3). The model underestimates the overall probability that the last gaze is to the chosen item (by approximately 10%), which may be because these data contain some visual search trials.
Finally, we performed a grid parameter search based on the model-free fits to fit the model-based subjects' data. By reducing the gaze-bias parameter $\theta$ to zero, drastically increasing the drift rate $d$, and reducing the noise $\sigma$, we were able to mimic the visual search process by producing a strong bias towards choosing the last-seen item (Supplementary Figure 4). But this adjustment reduces the sensitivity of the choices to the Q-value difference, and the produced RTs are significantly faster (~200 ms) than the data. This indicates that the aDDM cannot simultaneously capture the choice accuracies and short RTs together with the significant bias to choose the last-seen item and the many single-gaze trials displayed by the model-based subjects' data.