Abstract
An extensive reinforcement learning literature shows that organisms assign credit efficiently, even under conditions of state uncertainty. However, little is known about credit-assignment when state uncertainty is subsequently resolved. Here, we address this problem within the framework of an interaction between model-free (MF) and model-based (MB) control systems. We present and support experimentally a theory of MB retrospective inference. Within this framework, an MB system resolves uncertainty that prevailed when actions were taken, thus guiding MF credit-assignment. Using a task in which there was initial uncertainty about the lotteries that were chosen, we found that when participants’ momentary uncertainty about which lottery had generated an outcome was resolved by provision of subsequent information, participants preferentially assigned credit within an MF system to the lottery they retrospectively inferred was responsible for this outcome. These findings extend our knowledge about the range of MB functions and the scope of system interactions.
Introduction
Efficient adaptation to the environment requires that organisms solve a credit-assignment problem (i.e. learn which actions are rewarding in different world-states). Previous research has demonstrated that organisms use flexible, efficient learning strategies to cope with situations that entail uncertainty about the state of the world^{1,2,3,4,5,6,7}. However, little attention has been paid to a common case where there is uncertainty about a state at the time an action is executed and an outcome is received, but where this state uncertainty can subsequently be resolved by an inference process. This retrospective resolution can dramatically color and explain (away) the implications of action-outcome pairings. Indeed, whole genres of detective fiction depend on this very scenario, as does the dawning realisation of an unwitting victim fleeced by a devious card shark, who had initially seduced the victim into thinking they are skilled or blessed with good luck by providing early rewards. The question we address here concerns the effect of this retrospective inference on credit-assignment and whether, and how, it modulates fundamental signatures of reinforcement learning.
Our experimental approach was framed within the perspective of dual reinforcement learning (RL) systems. Here, an extensive body of psychological and neuroscientific literature indicates that behaviour is governed by two distinct systems^{8,9,10,11,12,13,14,15,16,17,18,19,20,21,22}: a rigid, retrospective model-free (MF) system^{23,24} and a flexible, prospective model-based (MB) system^{23,25}. Unlike the MF system, which tends to repeat actions that were successful in the past, the MB system deliberates upon the likely future effects of potential actions. Recent findings suggest that when making decisions, an agent’s behaviour reflects contributions from both systems^{16,25}. A range of theories highlights diverse principles underlying dual-system interactions, such as speed-accuracy trade-offs^{26}, an actor-trainer dichotomy^{17,27}, reliability-based arbitration^{9,28} and a plan-to-habit strategy^{29}. A separate rich body of research shows that when RL occurs in the face of state uncertainty, beliefs about states underlie the calculation of prediction errors and guide learning^{1,2,3,4,5,6,7}.
Here we develop and test a theory of retrospective MB inference. We propose that, in addition to prospective planning, an MB system performs retrospective inference about states in order to resolve uncertainty operative at the actual time of a decision. We can summarise this as posing a question that asks not only “where should I go?” but also “where did I come from?” Our theory draws inspiration from a dual role that cognitive models play in real life, not only in predicting future states and outcomes, but also in retrospectively inferring past hidden states. In a social domain, for example, one relies on models of close others not only to predict their future actions but also to cast light on motives that underlay their past actions. Our key proposal is that in situations involving temporary state uncertainty, an MB system exploits its model of task structure to resolve uncertainty retrospectively, i.e., following a choice. Furthermore, we suggest that an MF system can exploit the outcome of an MB inference to assign the credit from a choice preferentially to an inferred state, thus underpinning a form of interaction between the MB and MF systems.
To test this hypothesis, we designed a task with state uncertainty in which it was possible to dissociate MB and MF control of behaviour. At selected points, subjects had to choose between pairs of bandits (i.e., lotteries), and were informed that only one of the two bandits of their chosen pair would be executed. Critically, participants were not informed explicitly which of the two bandits this actually was, but they could retrospectively infer the identity of the executed bandit once they observed which outcomes ensued. We found evidence for MF learning that was greater for the executed bandit, supporting a hypothesis that an MB system retrospectively infers the correct state and that this inference directs MF credit assignment. In our discussion we consider potential algorithmic and mechanistic accounts of these findings.
Results
Behavioural task
We developed a dual-outcome bandit task in which we introduced occasional instances of uncertainty with respect to which bandit was actually executed. In brief, at the outset, participants were introduced to a treasure castle that contained pictures of four different objects. Subjects were trained as to which pair of rooms (out of four castle rooms characterised by different colours) each object would open. Each individual room was opened by two distinct objects, while each object opened a unique pair of rooms (Fig. 1a).
Following training, each participant played 504 bandit trials. Two out of every three trials were “standard trials”, in which a random pair of objects that shared a room as an outcome was offered for 2 s. Subjects had to choose between this pair. Once a choice was made, the two corresponding rooms opened, one after the other, and each could be either empty or contain a bonus point’s worth of treasure (Fig. 1b). Every third trial was an “uncertainty trial” in which, rather than two objects, two disjoint pairs of objects were presented for choice. Crucially, each of these presented pairs of objects shared one Common-room outcome. Participants were informed that a “transparent ghost” who haunted the castle would randomly nominate, with equal probability, one of the objects within the chosen pair. Since the ghost was transparent, participants could not see which object it nominated. Nevertheless, in this arrangement subjects still visited the opened rooms and collected any treasure they found. Importantly, when the ghost nominated an object, the room Common to both objects in the chosen pair opened first, and the room that was Unique to the ghost-nominated object opened second. Thus, it was at this second point that participants could infer the object that the ghost had nominated. Across the time course of the experiment, room reward probabilities were governed by four independently drifting random walks (Fig. 1c).
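The transition structure and the ghost's room sequence can be made concrete with a minimal sketch. The object-to-room mapping is reconstructed from the examples used in the text (key, stove, bulb, phone); the name `room4` is a placeholder, since the colour of the fourth room is not stated here, and the function names are hypothetical.

```python
# Object -> rooms mapping reconstructed from the examples in the text.
# "room4" is a placeholder name (assumption): the fourth colour is not given.
OPENS = {
    "key": {"green", "brown"},
    "stove": {"green", "blue"},
    "phone": {"brown", "room4"},
    "bulb": {"blue", "room4"},
}


def common_room(pair):
    """Return the single room opened by both objects of a pair."""
    first, second = pair
    (room,) = OPENS[first] & OPENS[second]
    return room


def room_sequence(pair, nominee):
    """Rooms in the order they open on an uncertainty trial:
    the Common room first, then the room Unique to the nominee."""
    common = common_room(pair)
    (unique,) = OPENS[nominee] - {common}
    return common, unique


def infer_nominee(pair, second_room):
    """Retrospective inference: the nominee is the only object
    of the chosen pair that opens the second room."""
    candidates = [obj for obj in pair if second_room in OPENS[obj]]
    if len(candidates) != 1:
        raise ValueError("second room does not identify a unique object")
    return candidates[0]


first_room, second_room = room_sequence(("key", "stove"), "key")
inferred = infer_nominee(("key", "stove"), second_room)
```

The first room carries no information about the nominee because both objects of the pair open it; only the second room disambiguates.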
Model-free and model-based contributions in standard trials
Because our main analysis concerns the effect of MB inference on MF learning, it is important to show first that participants in the task relied on contributions from both systems. Thus, we begin by specifying how each putative system contributes to performance and, in so doing, show significant MF and MB contributions to choices.
The MF system caches the rewards associated with previous object choices. In the choice phase of standard trials, the MF system feeds retrieved Q^{MF}-values of the objects offered for choice into a decision module. In the learning phase on standard trials, the simplest form of MF system performs a Rescorla–Wagner^{30} update with learning rate lr, based on a prediction error corresponding to the total reward of the two consequent rooms (ignoring the sequence):
Q^{MF}(chosen object) ← Q^{MF}(chosen object) + lr × (total reward − Q^{MF}(chosen object))
Importantly, as the MF system lacks a model of the task transitions, on each trial its credit-assignment is restricted to updates generated by the object chosen on that trial (as in Doll et al., 2015^{25}). For example, a blue-room reward following a choice of the stove will update the MF Q^{MF}-value of the stove but not of the lightbulb, which also happens to open the blue room.
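The MF update described above can be sketched in a few lines. This is an illustrative sketch rather than the fitted model: the function name `mf_update`, the zero initial values and the learning rate of 0.5 are assumptions chosen for the example.

```python
def mf_update(q_mf, chosen, room_rewards, lr):
    """Rescorla-Wagner update of the chosen object's cached value.

    The prediction error uses the total reward of the two rooms,
    ignoring the order in which they opened.
    """
    total_reward = sum(room_rewards)
    q_mf[chosen] += lr * (total_reward - q_mf[chosen])
    return q_mf


# Illustrative starting values (assumption): all cached values at zero.
q_mf = {"key": 0.0, "stove": 0.0, "phone": 0.0, "bulb": 0.0}

# The stove is chosen; the green room is empty and the blue room is rewarded.
mf_update(q_mf, "stove", room_rewards=(0, 1), lr=0.5)
```

After this update only the stove's value has changed. The bulb also opens the blue room, but its cached value is untouched because the MF system has no access to the transition structure.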
There are various possible MB systems. The most important difference between them concerns whether the MB system learns directly about the rooms, and uses its knowledge of the transition structure to perform an indirect prospective calculation of the values of the objects presented on a trial based on the values of the rooms to which they lead (henceforth, a ‘room value learning’ MB system), or whether it uses knowledge of the transition structure to learn indirectly about the objects, and then uses these for direct evaluation (henceforth, an ‘object value learning’ MB system). While these two formulations of an MB system are similar in that they both allow generalization of observations about rewards to objects that were not chosen or nominated, they nevertheless differ in their value calculations and generate different experimental predictions.
Until the very end of this section, our presentation relies on the room value learning formulation. This is justified because a model comparison (Supplementary Fig. 5) revealed that it was superior to an object-value learning formulation. According to this model, rather than maintaining and updating Q^{MB}-values for the objects, an MB system instead does so for the rooms, and prospectively calculates on-demand Q^{MB}-values for the offered objects. In standard trials, this is (normatively) based on the arithmetic sum of the values of their corresponding rooms:
Q^{MB}(object) = Q^{MB}(first corresponding room) + Q^{MB}(second corresponding room)
During the learning phase of standard trials, the system performs Rescorla–Wagner updates for the values of the observed rooms:
Q^{MB}(room) ← Q^{MB}(room) + lr_{MB} × (room reward − Q^{MB}(room))
Consequently, unlike MF, MB credit-assignment generalizes across objects that share a common outcome. To continue the example, when a blue room is rewarded, Q^{MB}(blue room) increases and in following calculations, the on-demand Q^{MB}-values for both the stove and lightbulb will benefit.
We next show, focusing on model-agnostic qualitative behavioural patterns alone, both MF and MB contributions to choices. These analyses are accompanied by model simulations of pure MF and pure MB contributions to choices. The main purpose of these simulations is to support the reasoning underlying our analyses. A full description of these models is deferred to a later section.
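A minimal sketch of the room value learning formulation illustrates this generalization. The mapping is reconstructed from the examples in the text; `room4` is a placeholder for the fourth room, and the function names, zero initial values and learning rate of 0.5 are assumptions made for the example.

```python
# Object -> rooms mapping reconstructed from the examples in the text.
# "room4" is a placeholder name (assumption).
OPENS = {
    "key": ("green", "brown"),
    "stove": ("green", "blue"),
    "phone": ("brown", "room4"),
    "bulb": ("blue", "room4"),
}


def mb_update(q_room, observed, lr_mb):
    """Rescorla-Wagner update of each observed room's value."""
    for room, reward in observed.items():
        q_room[room] += lr_mb * (reward - q_room[room])
    return q_room


def mb_object_value(q_room, obj):
    """On-demand object value: sum of the values of the rooms it opens."""
    return sum(q_room[room] for room in OPENS[obj])


# Illustrative starting values (assumption): all room values at zero.
q_room = {"green": 0.0, "brown": 0.0, "blue": 0.0, "room4": 0.0}

# The stove is chosen; green is empty and blue is rewarded.
mb_update(q_room, {"green": 0, "blue": 1}, lr_mb=0.5)

stove_value = mb_object_value(q_room, "stove")
bulb_value = mb_object_value(q_room, "bulb")
key_value = mb_object_value(q_room, "key")
```

Both the stove and the bulb gain value from the blue-room reward, although only the stove was chosen, whereas the key (which does not open the blue room) is unaffected.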
In this behavioural analysis we confine ourselves to Standard→Standard trial transitions. Consider a trial n + 1, which offers for choice the trial-n chosen object (e.g. key), against another object (e.g. stove; Fig. 2a). The two offered objects open a “Common room” (e.g. green, denoted C) but the trial-n chosen object also opens a “Unique room” (brown, denoted U). We tested if the probability of repeating a choice depended on the trial-n Common-room outcome, while controlling for the Unique-room outcome (Fig. 2b). From the perspective of the MB system (Fig. 2c), the value of the Common room, and in particular, whether it had been rewarded on trial n, exerts no influence on the relative Q^{MB}-values of the currently offered objects, because this value “cancels out”. For example, the calculated MB value of the key on trial n + 1 is the sum of the MB Q-values of the green and brown rooms. Similarly, the calculated MB value of the stove on trial n + 1 is the sum of the MB Q-values for the green and the blue rooms. The MB contribution to choice depends only on the contrast between these calculated key and stove values, which equals the difference between the MB values of the brown and blue rooms. Notably, the value of the green room is absent from this contrast and hence does not affect MB contributions to choice on trial n + 1. From the perspective of the MF system, however, a Common-room reward on trial n reinforces the chosen object alone, leading to an increase in its repetition probability, as compared to nonreward for this Common room (Fig. 2d).
Using a logistic mixed effects model, in which we regressed the probability of repeating the trial-n choice on Common and Unique trial-n outcomes, we found (Fig. 2b) a main effect for the Common outcome (b = 0.98, t(3331) = 8.22, p = 2.9e–16; this is evident in the red, CRew, line being above the blue, CNon, line), supporting an MF contribution. Additionally, we found a main effect for the Unique reward (b = 2.03, t(3331) = 14.8, p = 4.9e–48; evident in the increase of both red and blue lines from UNon to URew), as predicted by both MF and MB contributions, and a significant interaction between Common and Unique outcomes (b = 0.46, t(3331) = 2.49, p = 0.013), indicating that the effect of the Common outcome was modestly larger when the Unique room was rewarded than when it was unrewarded. An analysis of simple effects revealed that the Common room had a positive effect irrespective of whether the Unique room was unrewarded (b = 0.75, t(3331) = 4.60, p = 4e–6) or rewarded (b = 1.21, t(3331) = 8.73, p = 4e–18; see Supplementary Note 1 for clarifications about Fig. 2).
Turning next to an MB contribution, consider a trial n + 1, which excludes the trial-n chosen object (e.g., key; Fig. 2f) from the choice set. In this case, the trial-n chosen object shares a Common room (e.g. green) with only one of the trial n + 1 offered objects (e.g. stove), whose choice we label a “generalization”. Additionally, the trial-n chosen object shares no outcome with the other trial n + 1 offered object (e.g. bulb). We examined whether the probability of generalizing the choice depended on the Common outcome on trial n (Fig. 2g). An MB contribution (Fig. 2h) predicts a higher choice generalization probability when the Common room was rewarded on trial n, as compared to nonrewarded, because this reward increases the calculated Q-values of all objects (including the stove) that open that room.
Considering the MF system (Fig. 2i), trial-n reward events cannot causally affect choices on trial n + 1, because learning on trial n was restricted to the chosen object, which is not present on trial n + 1. However, MF predictions are somewhat complicated by the fact that a Common green outcome on trial n (reward vs. nonreward) is positively correlated with the MF Q-value of the stove on trial n + 1. To understand this correlation, note that the reward probability time series for each room is autocorrelated since it follows a random walk. Roughly, this means that the green room’s reward probability time series alternates between temporally extended epochs during which the green room is better or worse than its average in terms of reward probability. When the green room is rewarded vs. unrewarded on trial n, it is more likely that the green room is currently in one of its better epochs. Importantly, this also means that the stove was more likely to earn a reward from the green room when it had recently been chosen prior to trial n. Thus, a Common-room reward for the key on trial n is positively correlated with the MF value of the stove on trial n + 1. It follows that an MF contribution predicts a higher generalization probability when the Common room is rewarded as compared to nonrewarded (Fig. 2i). Critically, because this MF prediction is mediated by the reward probability of the Common room (i.e., how good the green room is in the current epoch), a control for this probability washed away an MF contribution to the effect of the Common trial-n outcome on choice generalization (Fig. 2m), hence implicating the contribution of an MB system (Fig. 2l).
A logistic mixed effects model showed (Fig. 2k) a positive main effect for the Common outcome (b = 0.40, t(3225) = 3.328, p = 9e–4) on choice generalization, supporting an MB contribution to bandit choices. Additionally, we found a significant main effect for the Common outcome’s reward probability (b = 1.94, t(3225) = 7.08, p = 2e–12), as predicted by both systems (Fig. 2l, m), and no interaction between the Common trial-n outcome and the Common room’s reward probability (b = −0.75, t(3225) = −1.76, p = 0.079). Note that unlike our analysis which pertained to an MF contribution (Fig. 2a–e), the analysis pertaining to MB contributions did not control for the effect of the Unique room (e.g. brown), because it was an outcome of neither choice object on trial n + 1. Hence, this room’s outcome was expected to exert no influence on choice generalization from the perspective of either MB or MF. Indeed, when we added the Unique-room trial-n outcome to the mixed effects model, none of the effects involving the Unique-room outcome were significant (all p > 0.05), and the effects of the Common room’s outcome and Common reward probability remained significantly positive (both p < 0.001) with no interaction. Consequently, this room was not considered in our analyses pertaining to an MB contribution to performance.
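The cancelling-out argument can be checked numerically. In this sketch the room values (0.75 for brown, 0.25 for blue) are arbitrary illustrative numbers and the function names are hypothetical; only the key/stove room assignments come from the text.

```python
# Rooms opened by the two offered objects, as in the example in the text.
OPENS = {"key": ("green", "brown"), "stove": ("green", "blue")}


def mb_value(q_room, obj):
    """On-demand MB value: sum of the values of the rooms the object opens."""
    return sum(q_room[room] for room in OPENS[obj])


def mb_contrast(q_room):
    """MB contribution to the key-vs-stove choice."""
    return mb_value(q_room, "key") - mb_value(q_room, "stove")


# Vary the Common (green) room value while holding brown and blue fixed.
contrasts = [
    mb_contrast({"green": green, "brown": 0.75, "blue": 0.25})
    for green in (0.0, 0.5, 1.0)
]
```

Every entry of `contrasts` equals the brown-minus-blue difference, whatever the green room is worth, which is why any effect of the Common-room outcome on choice repetition is attributed to the MF system.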
MB inference guides MF learning on uncertainty trials
Having established that both systems contribute to performance, we next addressed our main set of questions: do people infer the ghost-nominated object and, if so, does this inference guide the expression of MF learning? To address these questions we probed whether, on uncertainty trials, MF learning discriminated between the constituent chosen-pair objects, a finding that would implicate a retrospective inference of the ghost-nominated object. MF learning in this instance was gauged by probing the effect of uncertainty-trial outcomes on follow-up standard trial choices (Fig. 3).
Recall that following a ghost nomination, agents observe first the room that is common to the objects in their chosen pair, and then the room unique to the ghost-nominated object. Therefore, agents have a 50–50% belief with respect to the ghost-nominated object when they choose a pair, and this belief is maintained upon observing the first outcome, which is noninformative with respect to inference of the ghost’s nominee. Critically, following the second outcome, an MB system can infer the ghost-nominated object, with perfect certainty, based upon a representation of task transition structure. The second outcome is therefore informative with respect to inference of the ghost’s nominee. Henceforth, we denote the first and second outcomes by “N” (for Noninformative) and “I” (for Informative). We hypothesised that inferred object information is shared with the MF system and that this directs learning towards the inferred (ghost-nominated) object. Here we assume that after outcomes are observed, the MF system updates the Q^{MF}-values of both objects in the chosen pair, possibly to a different extent (i.e., with different learning rates). In the absence of inference about the ghost’s nomination, the MF system should update the Q^{MF}-values of both objects to an equal extent. Thus, a finding that learning occurred at a higher rate for the ghost-nominated, as compared to the ghost-rejected, object would support a retrospective inference hypothesis.
Consequently, we examined MF learning in uncertainty trials by focusing on Uncertain→Standard trial transitions. The task included three sub-transition types, which were analysed in turn. We first present the findings in support of our hypothesis. In the discussion we address possible mechanistic accounts for these findings.
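The assumed MF update on uncertainty trials can be sketched as follows. The sketch assumes that both objects of the chosen pair are updated towards the total reward of the two rooms, as on standard trials; that assumption, the function name and all numerical values are illustrative and not taken from the fitted model.

```python
def mf_update_uncertainty(q_mf, nominated, rejected, room_rewards,
                          lr_ghostnom, lr_ghostrej):
    """Update both objects of the chosen pair, each with its own learning rate.

    Assumption: the prediction error of each object is based on the
    total reward of the two rooms.
    """
    total_reward = sum(room_rewards)
    q_mf[nominated] += lr_ghostnom * (total_reward - q_mf[nominated])
    q_mf[rejected] += lr_ghostrej * (total_reward - q_mf[rejected])
    return q_mf


# With retrospective inference: more learning for the nominated object.
with_inference = mf_update_uncertainty(
    {"key": 0.0, "stove": 0.0}, "key", "stove",
    room_rewards=(1, 1), lr_ghostnom=0.5, lr_ghostrej=0.25,
)

# Without inference: the system cannot tell the objects apart,
# so both are updated to an equal extent.
without_inference = mf_update_uncertainty(
    {"key": 0.0, "stove": 0.0}, "key", "stove",
    room_rewards=(1, 1), lr_ghostnom=0.25, lr_ghostrej=0.25,
)
```

Under this sketch, a difference between the two learning rates is the behavioural signature of retrospective inference, and equal rates correspond to its absence.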
Informative outcome credit assignment to nominated object
First, we show that MF assigns credit from the Informative outcome to the ghost-nominated object. Consider a standard trial n + 1 (following an uncertainty trial n) that offered for choice the ghost-nominated object (e.g. key) alongside an object (e.g. phone) from the trial-n nonselected pair that shared the previously inference-allowing, I, outcome (e.g. brown) with the ghost-nominated object (Fig. 4a). We label such trials “repeat trials”. A choice repetition is defined as a choice of the previously ghost-nominated object. We tested whether a tendency to repeat a choice depended on the trial-n Informative outcome. Note that from the perspective of MB evaluations on trial n + 1, the Informative room’s value cancels out because this room is associated with both offered objects. From the perspective of the MF system, however, if Informative outcome credit was assigned to the ghost-nominated object on trial n, then reward vs. nonreward on an Informative room should increase the repetition probability.
A logistic mixed effects analysis, in which we regressed the probability of repeating the choice on trial-n outcomes, showed a main effect for the Informative (I) outcome (b = 0.84, t(2204) = 7.17, p = 1e–12), supporting a hypothesis that MF assigns credit from this outcome to the ghost-nominated object. Additionally, we found a main effect for the Noninformative (N) outcome (b = 1.49, t(2204) = 10.63, p = 9e–26), as predicted by both MF credit-assignment to the ghost-nominated object and by MB contributions, and no significant interaction between the Informative and Noninformative outcomes (b = 0.35, t(2204) = 1.66, p = 0.098).
Informative outcome credit assignment to rejected object
Second, we tested whether credit from the Informative outcome was also assigned by an MF system to the ghost-rejected object. Consider a standard trial n + 1 that offered for choice the ghost-rejected object (e.g. stove) alongside an object from the trial-n nonselected pair (e.g. bulb) that shares an outcome with the ghost-rejected object (Fig. 4b). We label such trials “switch trials”. A choice generalization is defined as a choice of the previously ghost-rejected object. Does the tendency to generalize a choice depend on the trial-n Informative outcome? An MB contribution predicts no effect, because the Informative room is an outcome of neither trial n + 1 choice object. From the perspective of the MF system, however, if credit had been assigned to the ghost-rejected object on trial n, then reward versus nonreward on an Informative outcome should increase the generalization probability.
A logistic mixed effects model, in which we regressed the choice generalization probability on the trial-n outcomes, showed a main effect for the Informative outcome (b = 0.32, t(2207) = 2.92, p = 0.004), supporting the hypothesis that an MF system assigns credit to the ghost-rejected object. This result is striking because it shows that the MF system assigned credit from the Informative outcome to an object that is not related to that outcome. This challenges any notion of perfect MB guidance of MF credit-assignment. However, it is consistent with the possibility that some participants, at least some of the time, do not rely on MB inference, because when MB inference does not occur, or when it fails to guide MF credit-assignment, the MF system has no basis to assign credit unequally to the two objects in the selected pair. Additionally, we found a main effect for the Noninformative outcome (b = 1.15, t(2207) = 8.80, p = 3e–18), as predicted not only by an account in which MF assigns credit to the ghost-rejected object but also by MB contributions. We found no significant interaction between the Informative and Noninformative outcomes (b = −0.07, t(2207) = −0.31, p = 0.755).
Informative outcome preferential credit assignment
Hitherto we showed that on uncertainty trials, credit obtained from an Informative, inference-allowing, outcome was assigned in an MF manner to both the ghost-nominated and the ghost-rejected objects. We hypothesized that the Informative outcome would support MB retrospective inference, and boost MF learning for the ghost-nominated object. Thus, we compared (Fig. 4d) effects that the Informative outcome exerted over choice on follow-up repeat and switch trials, i.e. the effects from the previous two analyses (Fig. 4a, b). For each participant, we calculated the contrast for the probability of repeating a choice in repeat follow-ups when the Informative trial-n outcome was rewarded vs. nonrewarded (M = 0.159, SE = 0.024). Additionally, we computed the corresponding contrast for the probability of generalizing the choice in switch follow-ups (M = 0.086, SE = 0.024). Importantly, the former contrast was larger than the latter (t(39) = 2.29, p = 0.028), implicating enhanced MF credit-assignment to the ghost-nominated object.
Noninformative outcome preferential credit assignment
We next examine MF credit assignment for the first, Noninformative, outcome. Consider a standard trial n + 1 that offered for choice the trial-n ghost-nominated object alongside the ghost-rejected object, i.e. the trial-n chosen pair (Fig. 4c). We label such trials “clash trials”. We defined a choice repetition as a trial n + 1 choice of the object the ghost nominated on trial n. As in previous cases, the MB contribution predicts no effect of the trial-n Noninformative outcome due to cancelling out. From the perspective of the MF system, however, if credit was assigned preferentially to the ghost-nominated object on trial n, then a Noninformative reward, as compared to nonreward, should increase the repetition probability.
A logistic mixed effects model, in which we regressed the choice repetition probability on trial-n outcomes, showed a main effect for the Noninformative outcome (b = 0.20, t(2197) = 1.98, p = 0.048), supporting the hypothesis that MF credit assignment is mainly directed to the ghost-nominated object. This finding is striking, because during reward administration for the first room, participants have a 50–50% belief about which object had been nominated by the ghost. We suggest that this supports the idea of an MF credit-assignment mediated by a later retrospective inference. Additionally, we found a main effect for the Informative outcome (b = 1.40, t(2197) = 9.41, p = 1e–20), as predicted by both the enhanced MF credit-assignment for the ghost-nominated object hypothesis and by an MB contribution. We found no significant interaction between Noninformative and Informative outcomes (b = 0.48, t(2197) = 1.89, p = 0.059) (see Supplementary Fig. 1 for supporting model simulations demonstrating that preferential credit-assignment for the ghost-nominated object is necessary to account for the empirical effects).
Computational modelling
One limitation of the analyses reported above is that they isolate the effects of the immediately preceding trial on a current choice. However, the actions of RL agents are influenced by the entire task history. To account for such extended effects on behaviour, we formulated a computational model that specified the likelihood of choices. The model allowed for a mixture of contributions from MB and MF processes. Critically, our model included three free MF learning-rate parameters, which quantified the extent of MF learning for standard trials (lr_{standard}), for the ghost-nominated object on uncertainty trials (lr_{ghostnom}) and for the ghost-rejected object on uncertainty trials (lr_{ghostrej}). Additionally, we formulated four sub-models of interest: (1) a pure MB model, which was obtained by setting the contribution of the MF system and its learning rates to 0 (i.e. w_{MB} = 1; lr_{standard} = lr_{ghostnom} = lr_{ghostrej} = 0), (2) a pure MF-action model, which was obtained by setting the contribution of the MB system to choices and its learning rate to 0 (i.e. w_{MB} = 0; lr_{MB} = 0; note that in this model, MB inference was still allowed to guide MF learning), (3) a ‘non-inference’ sub-model obtained by constraining equality between the learning rates for the ghost-nominated and ghost-rejected objects, lr_{ghostnom} = lr_{ghostrej}, and (4) a ‘no-learning for ghost-rejected’ sub-model, which constrained the learning rate for the ghost-rejected object to lr_{ghostrej} = 0. We fitted these models to each participant’s data using a maximum-likelihood method (see the methods for full details about the models; see Supplementary Table 1 for the full model’s fitted parameters).
We next compared our full model separately with each of these sub-models. Our model comparisons were all based on a bootstrap generalized-likelihood ratio test (BGLRT^{31}) between the full model and each of its sub-models in turn (see methods for details). In brief, this method is based on hypothesis testing, where, in each of our model comparisons, a sub-model serves as the H0 null hypothesis and the full model as the alternative H1 hypothesis. The results were as follows. First, we rejected the pure MB and the pure MF-action sub-models for 26 and 34 individuals, respectively, at p < 0.05, and at the group level (both p < 0.001) (Fig. 5a, b). These results support a conclusion that both MB and MF systems contribute directly to choices in our task. Next, we rejected the ‘no-learning for ghost-rejected’ sub-model for 10 individuals at p < 0.05, and at the group level (p < 0.001) (Fig. 5c), showing that in uncertainty trials, learning occurs for ghost-rejected objects. Additionally, and most importantly, we rejected the ‘non-inference’ sub-model for 12 individuals at p < 0.05, and at the group level (p < 0.001) (Fig. 5d), showing that learning differs between the retrospectively inferred ghost-nominated object and the ghost-rejected object. We note that although the ‘non-inference’ and the ‘no-learning for ghost-rejected’ models were each rejected for a minority of the participants (12 and 10 participants, respectively), the size of these minorities is nevertheless substantial, considering that our task was not optimally powered to detect individual participant effects, and given that significance testing at p = 0.05 should yield on average two rejections (out of 40 participants) when the null hypothesis holds for all participants. Finally, applying the Benjamini–Hochberg procedure^{32} to control for the false discovery rate, the ‘non-inference’ and the ‘no-learning for ghost-rejected’ models were rejected for 10 and four individuals, respectively.
We next ran a mixed effects model in which we regressed the MF learning rates from the full model on the learning context (standard/ghost-nominated/ghost-rejected). This analysis (Fig. 6a) showed that the three learning rates differed from each other (F(2,117) = 3.43, p = 0.036). Critically, as expected, the learning rate for the ghost-nominated object was greater than for the ghost-rejected object (F(1,117) = 6.83, p = 0.010). Additionally, the learning rate for the standard condition was larger than for the ghost-rejected object (F(1,117) = 4.05, p = 0.047), with no significant difference between the learning rate for the ghost-nominated object and for standard trials (F(1,117) = 0.920, p = 0.340). These findings provide additional support for the hypothesis that a retrospective inference process directs MF learning towards the object that could be inferred to have been nominated, and away from the one that was rejected.
Finally, because the models reported in the main text did not include MF eligibility traces^{33}, we examined whether such traces, rather than preferential learning based on retrospective inference, might account for the qualitative “preferential MF credit assignment” patterns presented in Fig. 4c, d. We found that models based on eligibility-driven MF learning failed to account for the observed patterns (see methods for full model descriptions; see Supplementary Fig. 1I–L for the predictions of these models).
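The nesting of the sub-models within the full model can be illustrated with a short sketch. The parameter constraints follow the text; the choice rule, however, is an assumption made for illustration (a weighted sum of MB and MF values passed through a two-option softmax with a hypothetical inverse temperature `beta`), and the actual rule is the one specified in the methods.

```python
import math


def choice_probability(q_mb, q_mf, w_mb, beta):
    """P(choosing option 0 of two) under an assumed mixture-plus-softmax rule."""
    net = [w_mb * mb + (1 - w_mb) * mf for mb, mf in zip(q_mb, q_mf)]
    return 1 / (1 + math.exp(-beta * (net[0] - net[1])))


# Fixed-value constraints that define three of the four sub-models.
FIXED = {
    "pure_MB": {"w_MB": 1, "lr_standard": 0, "lr_ghostnom": 0, "lr_ghostrej": 0},
    "pure_MF_action": {"w_MB": 0, "lr_MB": 0},
    "no_learning_for_ghost_rejected": {"lr_ghostrej": 0},
}


def constrain(params, submodel):
    """Return a copy of the full-model parameters with a sub-model's constraints."""
    constrained = dict(params)
    constrained.update(FIXED.get(submodel, {}))
    if submodel == "non_inference":
        # Equality constraint: one shared rate for both chosen-pair objects.
        constrained["lr_ghostrej"] = constrained["lr_ghostnom"]
    return constrained


# Arbitrary illustrative parameter values (not fitted estimates).
full = {"w_MB": 0.6, "lr_MB": 0.4, "lr_standard": 0.3,
        "lr_ghostnom": 0.3, "lr_ghostrej": 0.1}
non_inference = constrain(full, "non_inference")
pure_mb = constrain(full, "pure_MB")
```

Because every sub-model is obtained by restricting parameters of the full model, each comparison is between nested models, which is the setting in which a likelihood ratio test applies.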
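For readers unfamiliar with the two multiple-comparison arguments used here, the following sketch shows the expected number of chance rejections and a plain implementation of the Benjamini–Hochberg step-up procedure. The function name and the example p-values are illustrative; they are not the participants' p-values.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking which hypotheses are rejected
    when controlling the false discovery rate at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        # Step-up rule: find the largest rank k with p_(k) <= k * q / m.
        if p_values[index] <= rank * q / m:
            largest_rank = rank
    rejected = [False] * m
    for index in order[:largest_rank]:
        rejected[index] = True
    return rejected


# Expected number of rejections if the null held for all 40 participants.
alpha, n_participants = 0.05, 40
expected_false_rejections = alpha * n_participants

# Illustrative p-values (not data from the study).
example = benjamini_hochberg([0.01, 0.04, 0.03, 0.2])
```

In the example only the smallest p-value survives the correction, although three of the four values are below 0.05 when taken individually.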
Correlation between MB contribution and learning specificity
As model-basedness increases, the relative contribution of MF to performance, and hence the influence of MF’s learning rates on performance, decreases. Importantly, the full model allowed for an estimation of the extent to which each subject’s MF credit assignment prefers the ghost-nominated over the ghost-rejected object (lr_{ghostnom} − lr_{ghostrej}), controlling for the extent to which that subject relies on MB in their choices (w_{MB}). Our retrospective-inference theory predicts that MB inference of the ghost-nominated object, which relies on knowledge of the task’s transition structure, directs MF learning towards this inferred object. This suggests that the greater the degree of a subject’s model-basedness, the better they can infer the ghost’s nominations and hence, the more specific MF learning might be on uncertainty trials. In support of this hypothesis, we found (Fig. 6b) a positive across-participants correlation (r = 0.29, p = 0.034 based on a one-sided permutation test) between w_{MB} and lr_{ghostnom} − lr_{ghostrej}, as quantified by the maximum-likelihood model parameters (see Supplementary Fig. 7 for a further control analysis, addressing the higher learning-rate measurement noise as model-basedness increases).
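A one-sided permutation test for a positive correlation can be sketched as follows. This is a generic version of the technique, assuming that one variable is shuffled across participants and that the p-value is the proportion of shuffles with a correlation at least as large as the observed one; the exact scheme used in the study may differ, and the data in the example are synthetic.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def permutation_test_one_sided(x, y, n_perm=2000, seed=0):
    """Return (observed r, one-sided p) for the hypothesis r > 0."""
    rng = random.Random(seed)
    observed = pearson_r(x, y)
    shuffled = list(y)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any pairing between x and y
        if pearson_r(x, shuffled) >= observed:
            at_least_as_large += 1
    # The +1 terms count the observed arrangement itself.
    return observed, (at_least_as_large + 1) / (n_perm + 1)


# Synthetic, perfectly correlated data, used only to exercise the functions.
x = list(range(10))
r, p = permutation_test_one_sided(x, x)
```

The test is one-sided because the theory predicts a specific direction: higher w_{MB} should go with a larger lr_{ghostnom} − lr_{ghostrej} difference.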
Object value learning MB system
As noted, all the main analyses in this section are based on a room value learning version of MB reasoning. This was motivated by its superior fit to the data. However, in order to support the robustness of our results we repeated our analyses of the influence of MB inference on MF value updating, but now assuming that an objectvalue learning version was responsible for the MB values (noting that the MB inference about which object had been nominated by the ghost is unaffected). In this approach we obtained the same conclusions (Supplementary Figs. 2–4 and 6).
Discussion
Extensive research in dual systems RL has emphasized a fundamental “temporalorientation” distinction between the MF and MB system. According to this notion, a MF system is retrospective, as it merely caches the “effects” of past actions, whereas a MB system is prospective, in that it plans actions based on evaluating the future consequences of these actions with respect to one’s goals. However, models of the task and the environment can be used for other functions. Here, we introduce a theory of MB retrospective inference. This theory addresses the frequent occurrence in which many of our actions are executed under conditions of state uncertainty, i.e. when important aspects of the state are latent. The theory proposes that we use our model of the world to resolve retrospectively, at least partially, an uncertainty operative at the very time actions were taken with implications for credit assignment. The MB system, therefore, does not focus solely on forming future actionplans, but has a function in shaping the impact of past experiences.
Our findings show that in the context of our task a MF system, which caches rewards obtained from choosing different objects without relying on a transition structure, can use retrospectively inferred knowledge about the ghost’s nomination to selectively assign outcome credit to the relevant object. Indeed, we found that on uncertainty trials, MF learning was spared for the ghostnominated object, i.e. it occurred with a rate that was similar to standard, nouncertainty, trials. For the ghostrejected object, on the other hand, learning was hindered. Note that credit was still assigned to the ghostrejected object, a finding that is expected if some participants do not retrospectively resolve uncertainty, or if participants resolve it only part of the time. A striking aspect of our findings is that MF credit assignment discriminated in favour of the inferred ghostnominated object, not only for an informative, inferencesupporting, second outcome, but also for the noninformative first outcome when observers still maintain a 50–50% belief. An important question pertains to potential mechanisms that might account for these findings. We consider two, which we discuss in turn, namely delayed learning and a DYNA^{17,27} architecture.
A delayed learning account rests on the idea that an MF-teaching prediction-error signal, based on total trial reward, is calculated only at the end of uncertainty trials, after inference has already occurred. Notably, our task offered certainty that this uncertainty would be resolved later (i.e. after observing both outcomes). Thus, the MF system could in principle postpone a first-outcome credit assignment and “wait” for a MB inference triggered by observing a second outcome. Once a MB system infers the relevant object it can presumably increase a MF eligibility trace^{33} for the inferred object and reduce it for the non-selected object, a process manifest as a higher MF learning rate for the inferred object. Many real-life situations, however, impose greater challenges on learners because it is unknown when, and even whether, uncertainty will be resolved. Indeed, there can often be complex chains of causal inference involving successive, partial, revelations from states and rewards. In such circumstances, postponing credit assignment could be self-defeating. Thus, it may be more beneficial to assign credit in real time according to one’s current belief state, and later on, if and when inferences about past states become available, enact “corrective” credit-assignment adjustments. For example, upon taking an action and receiving a reward in an uncertain state, credit could be assigned based on one’s belief state. Later, if the original state is inferred, then one can retrospectively deduct credit from the non-inferred state(s) and boost credit for the inferred state. Our findings pave the way for deeper investigations of these important questions, for which it would be appealing to exploit tasks in which uncertainties arise, and are resolved, in a more naturalistic manner, gaining realism at the expense of experimental precision.
More broadly, we consider that much can be learned from studies that address the neural processes that support MB retrospective inference, how such inferences facilitate efficient adaptation, and how they contribute to MB–MF interactions.
Our findings also speak to a rich literature concerning various forms of MB–MF system interactions. In particular, the findings resonate with a DYNA architecture, according to which the MB system trains a MF system by generating offline (i.e. during inter-trial intervals) hypothetical model-based episodes from which an MF system learns as if they were real. Our findings suggest another form of MB guidance, inference-based guidance, whereby upon resolving uncertainty, an MB system indexes appropriate objects for MF credit assignment. One intriguing possibility arises if one integrates our findings with DYNA-like approaches. Consider a case where the MF system assigns credit equally to inferred and non-inferred objects online (i.e. during reward administration), but that MB inference biases the content of offline-replayed MB episodes. For example, if the MB system is biased to replay choices of the inferred, as opposed to the non-inferred object (i.e. replay the last trial with its uncertainty resolved), then this would account for enhanced MF learning for the inferred relative to the non-inferred object. Future studies can address this possibility by manipulating variables that affect offline replay such as cognitive load^{17} or perhaps by direct examination of replay that exploits neurophysiological measures.
The importance of retrospective inference chimes with a truism that those who learn from history can also improve upon it. Consider, for example, a situation in which the ghost’s nominations are biased towards the upper object rather than being uniform. In this case, inference about the ghost nominations would allow an agent to learn about the ghost’s bias, which could subsequently be deployed in the service of better planning. A critical facet of our task, however, is that the MB system was provided with no incentive to perform retrospective inference about the ghost’s nominee. Indeed, in our task the challenge faced by the MB system was to maintain accurate representations of the values of the various rooms and to calculate, based on the task’s transition structure, prospective expected rewards for offered bandits. Because outcomes were fullyobserved in uncertainty trials, their values could still be updated and the question of which object the ghost actually nominated was inconsequential. Put differently, retrospective inference was irrelevant with respect to either MB learning or future planning. We contend that the fact that an MB system still engaged in this effortdemanding process attests to its ubiquity and importance, and this is likely to be a “lower bound” on its extent. We would expect retrospective inference to be more substantial when it aids forthcoming MB planning, and this is an interesting question for further study. We acknowledge here theories that posit an intrinsic value of information^{34,35,36,37,38,39}, according to which information (which could be obtained by retrospectiveinference) is rewarding in itself even when it lacks instrumental importance.
We conclude by noting that while there has been ample theoretical and empirical work on both dual RL systems, and on state uncertainty RL, these two literatures have mostly developed along distinct trajectories with minimal crosstalk. We believe that an integration of these interesting fields can yield fundamental insights. The current work is but a first step in this project.
Methods
Participants
Forty-seven participants (29 female, 18 male) were recruited from the SONA subject pool (https://uclpsychology.sonasystems.com/Default.aspx?ReturnUrl=/) with the restrictions of having normal or corrected vision and being born after 1971. The study was approved by the University College London Research Ethics Committee (Project ID 4446/001). Subjects gave written informed consent before the experiment.
Experimental procedures
Participants were first familiarised with four pictures of objects and learned which pair of rooms each object opened (the pictures of the four objects were adopted from previous studies^{40,41}). Each room was opened by two different objects and each object opened a unique pair of rooms. The mapping between objects and rooms was created randomly anew for each participant and remained stationary throughout the task. After learning, participants were quizzed about which rooms each object opened and about which object they would choose to open a target room. Participants iterated between learning and quiz phases until they achieved perfect quiz performance (100% accuracy and RT < 3000 ms for each question).
After learning, participants played 16 practice standard bandit trials to verify that the task was understood. These practice trials proceeded as described below, with the sole difference that no time limit was imposed on a choice. They next played a single block of 72 standard bandit trials. On each trial, a pair of the four objects was offered for choice and participants had 2 s to choose one of these objects (left or right). Offered objects always shared one Common outcome. This defined four object-pairs, each presented on 18 trials, in a random order. Following a choice, the room Unique to the chosen object was opened and it was either empty (0 pt.) or included a treasure (1 pt.). Next, the room that was Common to both objects was opened, revealing it to be empty or with treasure. The reward probabilities of the four rooms evolved across trials according to four independent Gaussian-increment random walks with reflecting boundaries at p = 0 and p = 1 and a standard deviation of 0.025 per trial.
On completion of the first block, participants were instructed about the uncertainty trials. On uncertainty trials participants were offered two disjoint object-pairs and were asked to choose the left or right pair. Objects within each pair always shared one room outcome. Participants were instructed that a ghost would toss a fair and transparent coin and choose the vertically upper or lower object in their chosen pair, based on the coin outcome. Once the ghost nominated an object, the trial proceeded as a standard trial, i.e. the two related rooms opened and treasures were earned. Importantly, the room that was Common to the objects in the chosen pair was opened first while the room that was Unique to the ghost-nominated object was opened later. Following instructions, participants played a practice block consisting of 18 trials, with each third trial being an uncertainty trial. During this practice block there was no time limit on choices. Following the practice trials, the test trials started. Participants played seven blocks of 72 trials each and short breaks were enforced between blocks. Choices were limited to 2 s and each third trial was an uncertainty trial. The 168 standard trials 3n + 1 included 42 presentations of each of the four eligible object-pairs, in a random order. The 168 uncertainty trials 3n + 2 included 84 repetitions of each of the two eligible pairings, in a random order. Trials 3n + 3 were defined according to their transition types relative to the choice on the preceding uncertainty trial. These 168 trials included 56 of each of the “repeat”, “switch” and “clash” types, in random order. A repeat trial presented the ghost-nominated object alongside its vertical counterpart, a switch trial presented the ghost-rejected object alongside its vertical counterpart and a clash trial presented the previously selected pair.
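The drift of the room reward probabilities can be illustrated with a minimal Python sketch. It is not the task code: the function names, the uniform starting probabilities and the random seed are assumptions made for illustration, and reflection is implemented by mirroring any excursion beyond a boundary back into [0, 1].

```python
import numpy as np

def reflect(p):
    """Mirror values back into [0, 1] (valid for excursions smaller than 1)."""
    p = np.abs(p)                          # reflect at the lower boundary, p = 0
    return np.where(p > 1.0, 2.0 - p, p)   # reflect at the upper boundary, p = 1

def simulate_reward_walks(n_trials, n_rooms=4, sd=0.025, seed=0):
    """Independent Gaussian-increment random walks, one column per room."""
    rng = np.random.default_rng(seed)
    probs = np.empty((n_trials, n_rooms))
    probs[0] = rng.uniform(0.0, 1.0, size=n_rooms)   # assumed starting point
    for t in range(1, n_trials):
        step = rng.normal(0.0, sd, size=n_rooms)     # per-trial increment
        probs[t] = reflect(probs[t - 1] + step)
    return probs

walks = simulate_reward_walks(72)   # one block of 72 trials
```

Each row of `walks` holds the four reward probabilities in force on one trial; a treasure would then be drawn for each opened room as a Bernoulli sample with the corresponding probability.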
The task lasted between 90–120 min. Participants were paid £7.5 per hour plus a performance based bonus, which was calculated based on the total amount of earned treasure points.
Data analysis
One participant reported not feeling well and retired voluntarily from the experiment. Six other participants failed to pass the quiz within an hour and therefore did not start the task. The remaining 40 participants were the targets for the analysis.
Model agnostic analysis
Our model-agnostic analyses were conducted using logistic mixed-effects models (implemented with MATLAB’s function “fitglme”) with participants serving as random effects with a free covariance matrix. For the MF-contribution analysis (Fig. 2a–e), we analysed only standard trials n + 1 that offered for choice the object chosen on standard trial n. Our regressors C (Common outcome) and U (Unique outcome) coded whether trial-n outcomes were rewarding (coded as +0.5 for reward and −0.5 for non-reward), and the regressed variable REPEAT indicated whether the choice on the focal trial n + 1 was repeated. PART coded the participant contributing each trial. The model, in Wilkinson notation, was: REPEAT ~ C*U + (C*U|PART). For the MB-contribution analysis (Fig. 2f–n), we analysed only standard trials n + 1 that excluded from choice the object chosen on standard trial n. The regressors C, U and PART were coded as in the previous analysis and one additional regressor P coded the reward probability of the Common outcome (we centred this regressor by subtracting 0.5). The regressed variable GENERALIZE indicated whether the choice on the focal trial n + 1 was generalized. The model, in Wilkinson notation, was: GENERALIZE ~ C*P + (C*P|PART). We also tested an extended model GENERALIZE ~ C*U*P + (C*U*P|PART) but found that none of the effects involving U were significant while the C*P effects supported the same conclusions.
The analyses that focused on MF learning on uncertainty trials considered standard trials n + 1 following an uncertainty trial n (Fig. 4). The first analysis, which examined learning for the ghost-nominated object (Fig. 4a), focused on “repeat” follow-ups, that is, trials n + 1 that offered for choice the ghost-nominated object alongside the object from the previously non-selected pair that provided the previously inference-allowing outcome. A choice repetition was defined as a choice of the ghost-nominated object. We used the model REPEAT ~ N*I + (N*I|PART), where N and I coded outcomes (−0.5, 0.5) obtained on trial n from the Non-informative and the Informative room, respectively. The second analysis, which examined learning for the ghost-rejected object (Fig. 4b), focused on “switch” follow-ups, that is, trials n + 1 that offered for choice the ghost-rejected object alongside an object from the previously non-selected pair that shared an outcome with the previously ghost-rejected object. A choice generalization was defined as a choice of the ghost-rejected object. We used the model GENERALIZE ~ N*I + (N*I|PART). Finally, the third analysis, which compared first-outcome learning for the ghost-nominated and ghost-rejected objects (Fig. 4c), focused on “clash” follow-ups, that is, trials n + 1 that offered for choice the ghost-nominated and ghost-rejected objects. A choice repetition was defined as a choice of the ghost-nominated object. We used the model REPEAT ~ N*I + (N*I|PART).
Computational models
We formulated two hybrid RL models to account for the series of choices for each participant. In both models, choices are contributed by both the MB and MF systems and they differed only in how the MB system operates.
In both models, the MF system caches a Q^{MF} value for each object, which is subsequently retrieved when the object is offered for choice. When a pair of objects is offered for choice (on uncertainty trials), the MF pair value is calculated as the average of the constituent objects’ MF values:
On standard trials, the total reward (in points) from outcomes is used to update the Q^{MF}value for the chosen object, based on a prediction error (with a free learning rate parameter lr_{standard}):
On uncertainty trials, the MF system updates the Q^{MF}values for both the ghostnominated and rejected objects with free learning rate parameters lr_{ghostnom} and lr_{ghostrej}, respectively:
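Written out under the assumption of a standard delta-rule update, the three MF computations described above would take the following form. This is a sketch based on the verbal description only; R (the total trial reward in points) and o_1, o_2 (the two objects of a pair) are symbols introduced here for illustration.

```latex
\begin{align}
Q^{MF}(\text{pair}) &= \tfrac{1}{2}\left[ Q^{MF}(o_1) + Q^{MF}(o_2) \right] \\
Q^{MF}(\text{chosen}) &\leftarrow Q^{MF}(\text{chosen}) + lr_{standard}\left[ R - Q^{MF}(\text{chosen}) \right] \\
Q^{MF}(\text{ghost-nominated}) &\leftarrow Q^{MF}(\text{ghost-nominated}) + lr_{ghostnom}\left[ R - Q^{MF}(\text{ghost-nominated}) \right] \\
Q^{MF}(\text{ghost-rejected}) &\leftarrow Q^{MF}(\text{ghost-rejected}) + lr_{ghostrej}\left[ R - Q^{MF}(\text{ghost-rejected}) \right]
\end{align}
```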
Our two models postulated alternative “roomvalue learning” and “objectvalue learning” formulations for an MB system. We describe these models in turn. In the roomvalue learning formulation (the model variant presented in the main text), the MB system maintains Q^{MB}values for the four different rooms. During choice on standard trials the Q^{MB}value for each offered object is calculated based on the transition structure:
During uncertainty trials the Q^{MB}value of each pair is calculated based on the average constituent values:
Following a choice, the MB system updates the Q^{MB} value of each observed room:
where lr_{MB} is an MB learning rate parameter.
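Under the room-value learning formulation, the three computations just described can be sketched as follows. The forms are assumptions inferred from the text: the object value is taken to be the sum of the values of the two rooms the object opens (consistent with rewards being counted in total points), and r_room, room_1, room_2, o_1 and o_2 are symbols introduced here for illustration.

```latex
\begin{align}
Q^{MB}(\text{object}) &= Q^{MB}(\text{room}_1) + Q^{MB}(\text{room}_2) \\
Q^{MB}(\text{pair}) &= \tfrac{1}{2}\left[ Q^{MB}(o_1) + Q^{MB}(o_2) \right] \\
Q^{MB}(\text{room}) &\leftarrow Q^{MB}(\text{room}) + lr_{MB}\left[ r_{\text{room}} - Q^{MB}(\text{room}) \right]
\end{align}
```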
Alternatively, in the objectvalue learning formulation, during the choice phase the MB system retrieves the values of the choiceobjects, much like the MF system. Additionally, MB learns during reward administration the values of objects rather than rooms, but unlike MF, it takes into account the task’s transition structure. For the chosen object, i.e. the object that was chosen by the participant on standard trials or by the ghost on ghost trials, a “full update” is performed:
For each of the two other nonchosen objects, each of which provides only one of the two experienced roomoutcomes, a “half update” was performed based on the relevant room.
For example (see Fig. 1a), when the key was chosen (either by the participant or the ghost) a full update was performed for the key, and half updates based on the brown and green rooms, respectively, were performed for the phone and the stove.
In both models, when a pair of objects are offered for choice on standard trials the net Q value of each object is calculated as
where w_{MB} is a free parameter (between 0 and 1) that quantifies the relative contribution of the MB system (1 − w_{MB} is therefore the relative MF contribution), p is a free perseverance parameter, which quantifies a general tendency to select the object that was last selected, and 1_{lastchosen} indicates whether the focal object was selected on the previous trial. On uncertainty trials the value of each offered pair is calculated similarly as:
where here, 1_{lastchosen} indicates whether the focal pair includes the previously selected object. The Q_{net} values for the two choice options (objects or objectpairs) are then injected into a softmax choice rule with a free inverse temperature parameter β > 0 so that the probability to choose an option is:
MF Q^{MF} values were initialized to 1 for each object and MB Q^{MB} values were initialized to 0.5 for each room.
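To make the choice rule concrete, the following minimal Python sketch walks through one standard trial under the room-value learning formulation. It is an illustration rather than the fitted implementation: the additive form of the net value (weighted MB/MF mixture plus a perseveration bonus), the object-to-room mapping `ROOMS`, the helper names and all parameter values are assumptions introduced here.

```python
import numpy as np

# Hypothetical transition structure: each object opens a unique pair of rooms,
# and each room is opened by two different objects.
ROOMS = {0: (0, 1), 1: (1, 2), 2: (2, 3), 3: (3, 0)}

def mb_object_value(q_room, obj):
    """Assumed MB object value: sum of the values of the two rooms it opens."""
    return sum(q_room[r] for r in ROOMS[obj])

def net_value(q_mb, q_mf, w_mb, pers, last_chosen):
    """Assumed net value: MB/MF mixture plus a perseveration bonus."""
    return w_mb * q_mb + (1.0 - w_mb) * q_mf + pers * float(last_chosen)

def choice_prob(q_a, q_b, beta):
    """Two-option softmax probability of choosing option a."""
    return 1.0 / (1.0 + np.exp(-beta * (q_a - q_b)))

def delta_update(q, target, lr):
    """Move a cached value toward a target by a fraction lr."""
    return q + lr * (target - q)

# Initial values as stated in the text: 1 per object (MF), 0.5 per room (MB).
q_mf = np.ones(4)
q_room = np.full(4, 0.5)

# One standard trial offering objects 0 and 1 (they share room 1).
w_mb, pers, beta = 0.6, 0.1, 3.0   # illustrative parameter values
q0 = net_value(mb_object_value(q_room, 0), q_mf[0], w_mb, pers, last_chosen=False)
q1 = net_value(mb_object_value(q_room, 1), q_mf[1], w_mb, pers, last_chosen=False)
p_choose_0 = choice_prob(q0, q1, beta)   # equal net values, so 0.5

# Suppose object 0 is chosen and both of its rooms contain treasure.
rewards = {0: 1.0, 1: 1.0}               # room index -> points
q_mf[0] = delta_update(q_mf[0], sum(rewards.values()), lr=0.5)   # MF: total reward
for room, r in rewards.items():
    q_room[room] = delta_update(q_room[room], r, lr=0.4)         # MB: per room
```

The sketch highlights the structural difference between the two systems: the MF system caches one value for the chosen object, whereas the MB system updates room values that are shared with other objects through the transition structure.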
We also formulated two pure MF models with either accumulating or replacing eligibility traces^{33} to test whether these mechanisms, rather than MB inference guided learning, could account for our findings. In these models, the MB contribution was silenced by setting w_{MB} = 0 and removing lr_{MB} from the model. These models included a single free learning rate parameter for the MF system, lr_{MF}, and a free eligibility trace (decay) parameter, denoted λ. For each of the four objects we maintained throughout the series of trials, an eligibility trace, e(object). At the beginning of the experimental session, these traces were initialized to 0. At the end of each trial all four eligibility traces decayed according to
Immediately after a choice was made on a standard trial the eligibility trace of the chosen object was updated. In the accumulating-traces model we set
And in the replacing traces model we set
On ghost trials the eligibility traces for both objects in the chosen pair were thus updated. Finally, following reward administration the value of each of the four objects was updated. For accumulating eligibility traces we set
And for replacing traces:
In sum, in the eligibilitytrace models, the sequence of model calculation during a trial consisted of subjecting all eligibility traces to a decay, making a choice, increasing the eligibility trace(s) for the chosen object(s), obtaining outcomes (rewards or nonrewards) and updating the values of all four objects.
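The order of operations summarized above can be sketched in Python as follows. The trace dynamics (multiplicative decay by λ, then incrementing or resetting the chosen objects' traces) follow the verbal description; the value update, in which each object moves toward the trial reward in proportion to its own trace, is an assumption made for illustration and may differ from the update rules used in the fitted models. The function name `run_trial` and the parameter values are likewise hypothetical.

```python
import numpy as np

def run_trial(q, e, chosen, reward, lr, lam, mode="accumulating"):
    """One trial: decay all traces, mark the chosen object(s), update all values."""
    e = lam * e                      # decay every eligibility trace
    for obj in chosen:               # one object (standard) or two (ghost trial)
        if mode == "accumulating":
            e[obj] += 1.0            # accumulating trace
        else:
            e[obj] = 1.0             # replacing trace
    q = q + lr * e * (reward - q)    # assumed trace-weighted delta rule
    return q, e

q = np.ones(4)     # object values
e = np.zeros(4)    # traces start at 0
q, e = run_trial(q, e, chosen=[0], reward=2.0, lr=0.5, lam=0.8)
q_acc, e_acc = run_trial(q, e, chosen=[0], reward=0.0, lr=0.5, lam=0.8)
q_rep, e_rep = run_trial(q, e, chosen=[0], reward=0.0, lr=0.5, lam=0.8, mode="replacing")
```

After the second trial the accumulating trace of object 0 exceeds 1 whereas the replacing trace is capped at 1, which is the only point at which the two variants differ in this sketch.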
Model fitting and model comparison
We fit our models to the data of each individual, maximizing the likelihood (ML) of their choices (we optimized likelihood using MATLAB’s ‘fmincon’, with 200 random starting points per participant). Each of our two full hybrid models, which allowed for contributions from both an MB and an MF system, served as a supermodel in a family of nested submodels: the roomvalue learning and the objectvalue learning families. Each family consisted of four submodels: The first, a pure MB model, constrained the modelbased relative contribution to 1, w_{MB} = 1, and the learning rates of the MF system to 0, lr_{standard} = lr_{ghostnom} = lr_{ghostrej} = 0. The second, a pure MFaction model, constrained the MF relative contribution to choices to 1, w_{MB} = 0, and the MB learning rate to 0, lr_{MB} = 0. Note, however, that the MB system was still able to guide MF learning through inference. The third, ‘noninference’ submodel constrained equal learning rates for the ghostnominated and rejected objects lr_{ghostnom} = lr_{ghostrej}. The fourth, ‘nolearning for ghost rejected object’ constrained the learning rate of the ghostrejected object to 0: lr_{ghostrej} = 0. The bestfitting parameters for the supermodel are reported in Supplementary Table 1.
We next conducted, for each family separately, a bootstrapped generalized likelihood ratio test (BGLRT^{31}) for the supermodel vs. each of the submodels separately (Fig. 5). In a nutshell, this method is based on the classical-statistics hypothesis-testing approach and specifically on the generalized likelihood ratio test (GLRT). However, whereas the GLRT assumes an asymptotic Chi-squared null distribution for the log-likelihood improvement of a supermodel over a submodel, in the BGLRT these distributions are derived empirically based on a parametric bootstrap method. In each of our model comparisons the submodel serves as the H0 null hypothesis whereas the full model serves as the alternative H1 hypothesis.
For each participant, we created 1001 synthetic experimental sessions by simulating the submodel with the ML parameters on novel trial sequences, which were generated as in the actual data. We next fitted both the supermodel and the submodel to each synthetic dataset and calculated the improvement in twice the logarithm of the likelihood for the full model. For each participant, these 1001 likelihood-improvement values served as a null distribution to reject the submodel. The p-value for each participant was calculated as the proportion of synthetic datasets for which twice the log-likelihood improvement was at least as large as the empirical improvement. Additionally, we performed the model comparison at the group level. We repeated the following 10,000 times. For each participant we chose randomly, and uniformly, one of his/her 1000 synthetic twice-log-likelihood supermodel improvements and we summed across participants. These 10,000 obtained values constitute the distribution of the group supermodel likelihood improvement under the null hypothesis. We then calculated the p-value for rejecting the submodel at the group level as the proportion of synthetic datasets for which the supermodel’s twice-log-likelihood improvement was larger than or equal to the empirical supermodel improvement, summed across participants.
Additionally, we compared the room-value learning and the object-value learning full models using the parametric bootstrap cross-fitting method (PBCM^{42}). For each participant, and for each of the two model variants, we generated 100 synthetic experimental sessions by simulating the model using the ML parameters on novel trial sequences (which were generated as in the experiment). We then fitted each of these synthetic datasets with both models. Next we repeated the following 10,000 times, focusing on data that were generated by the room-value learning model. For each participant we chose randomly and uniformly one of his/her 100 synthetic datasets and calculated twice the log-likelihood difference between the fits of the room-value learning model and the object-value learning model. These differences were averaged across participants. Thus, we obtained 10,000 values that represent the distribution of the group twice-log-likelihood difference for data generated by the room-value learning model. Next, we repeated the same steps but this time for synthetic data that were generated by the object-value learning model, to obtain a distribution of the group twice-log-likelihood difference for data generated by the object-value learning model. An examination of these two distributions (Supplementary Fig. 5) showed that each model provided a better fit for the group data in terms of likelihood when it was the generating model. We thus set a log-likelihood difference of 0 as the model-classification criterion, with positive differences supporting the room-value learning model and negative differences supporting the object-value learning model. Finally, we averaged twice the log-likelihood difference for the empirical data across participants, to obtain the empirical group difference. This difference was 4.71, showing that the room-value learning model provides a superior account of the group data.
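The bookkeeping of the bootstrap test, once the null improvements have been obtained from model fits, can be sketched in Python as below. The sketch takes the simulated improvements as given arrays; the function names, the array layout (participants by synthetic sessions) and the seed are assumptions for illustration, and the simulation and fitting steps themselves are not shown.

```python
import numpy as np

def participant_p_value(null_improvements, empirical_improvement):
    """Proportion of null improvements at least as large as the empirical one."""
    null_improvements = np.asarray(null_improvements, dtype=float)
    return float(np.mean(null_improvements >= empirical_improvement))

def group_p_value(null_by_participant, empirical_by_participant,
                  n_resamples=10_000, seed=0):
    """Group-level test: draw one null value per participant, sum, and repeat."""
    rng = np.random.default_rng(seed)
    null = np.asarray(null_by_participant, dtype=float)   # (participants, sessions)
    n_part, n_sess = null.shape
    picks = rng.integers(0, n_sess, size=(n_resamples, n_part))
    group_null = null[np.arange(n_part), picks].sum(axis=1)   # one sum per resample
    empirical_total = float(np.sum(empirical_by_participant))
    return float(np.mean(group_null >= empirical_total))

p_single = participant_p_value([0.0, 1.0, 2.0, 3.0], 2.0)
```

Here `p_single` is the fraction of the four illustrative null values that reach the empirical improvement of 2.0.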
Model simulations
To generate model predictions (Fig. 2, Supplementary Figs. 1–3), we simulated for each participant, 25 synthetic experimental sessions (novel trial sequences were generated as in the actual experiment), based on his or her ML parameters obtained from the corresponding model fits (the models are described above). We then analysed these data in the same way as the original empirical data (but with datasets that were 25 times larger, as compared to the empirical data, per participant).
Comparing MF learning rates
We compared the estimated MF learning rates for standard-chosen, ghost-nominated and ghost-rejected objects, using a linear mixed-effects model (implemented with MATLAB’s function “fitglme”) with participants serving as random effects with a free covariance matrix (Fig. 6a). Regressors GS (ghost-nominated) and GR (ghost-rejected) indicated whether the learning rate corresponded to the ghost-nominated object and to the ghost-rejected object, respectively. The regressed variable LR was the estimated learning rate. PART coded the participant. The model, in Wilkinson notation, was: LR ~ GS + GR + (GS + GR|PART). We followed up with an F-test that rejected the hypothesis that both GS and GR main effects were 0, indicating that the three learning rates are different. We next contrasted all three pairs of learning rates.
Correlation between MB and MF preferential learning
Based on the ML parameters of the full models, we calculated (Fig. 6b) the acrossparticipants correlation between modelbasedness (w_{MB}) and the MF preferential learning for the ghostnominated object (lr_{ghostnom} − lr_{ghostrej}). The significance of this correlation was calculated based on a permutation test in which w_{MB} was shuffled across participants.
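A one-sided permutation test of this kind can be sketched in Python as follows. This is a generic illustration, not the analysis code: the number of permutations, the seed and the function name are assumptions, and `x` and `y` stand in for the per-participant w_{MB} and lr_{ghostnom} − lr_{ghostrej} estimates.

```python
import numpy as np

def permutation_corr_test(x, y, n_perm=10_000, seed=0):
    """One-sided test of a positive Pearson correlation by shuffling x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r_obs = np.corrcoef(x, y)[0, 1]
    # Null distribution: correlations after breaking the pairing of x and y.
    r_null = np.array([np.corrcoef(rng.permutation(x), y)[0, 1]
                       for _ in range(n_perm)])
    p_value = float(np.mean(r_null >= r_obs))
    return float(r_obs), p_value

# Illustrative data with a perfect positive relationship.
r_demo, p_demo = permutation_corr_test(np.arange(10), 2 * np.arange(10) + 1,
                                       n_perm=2000)
```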
We note, however, that the empirical data showed an increase in participant heterogeneity with respect to preferential learning as a function of model-basedness. This is evident in Fig. 6b in both the differences between participants and in the increase of individual error bars as model-basedness increases. This occurs because the MF learning rates exert a weaker influence on performance as model-basedness increases (and the relative MF contribution decreases) and hence, learning-rate estimation noise increases. One caveat pertaining to the above permutation test is that it fails to control for this increasing heterogeneity pattern, as this pattern will vanish in shuffled data. To address the possibility that this pattern generated a spurious positive correlation we conducted a control test (Supplementary Fig. 7). We parametrically bootstrapped (using model simulations) 1000 synthetic experimental sessions for each participant’s data based on the ML parameters from the non-inference model, in which preferential learning for the ghost-nominated object is absent (lr_{ghostnom} = lr_{ghostrej}). We next fitted each of these synthetic datasets with the full model to obtain estimates of model-basedness and preferential learning. Next we repeated the following 100,000 times: we chose for each participant randomly the fitting parameters obtained for one of his/her 1000 synthetic datasets and we calculated the group correlation between model-basedness and preferential learning. Because this correlation is calculated for data that featured no correlation, the 100,000 values comprise a null distribution for the expected correlation. The significance value of the empirical correlation was calculated as the proportion of samples in the null distribution that were larger than the observed correlation.
Error bars for the MF preferential learning effect
We calculated the individual error bars for the difference between MF learning rates for the ghost-nominated and ghost-rejected objects (Fig. 6b) as follows. For each participant, we generated 100 synthetic experimental sessions by bootstrapping his/her data based on the ML parameters of the full room-value learning model. We then fitted the full model to each of these synthetic datasets and calculated the difference between the ghost-nominated and ghost-rejected learning rates. This provided an estimate of the expected distribution of the learning-rate difference, had we been able to test a participant multiple times. Next we found the densest (i.e. narrowest) interval that contained 50% of the mass of this distribution, conditional on the interval including the empirical learning-rate difference.
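One way to compute such an interval from a finite set of bootstrap samples is sketched below in Python. The approach, sliding a window over the sorted samples and stretching each candidate so that it includes the empirical value, is an assumption about how the constraint could be implemented, and the function name is hypothetical.

```python
import numpy as np

def densest_interval(samples, empirical, mass=0.5):
    """Narrowest interval covering `mass` of the samples and the empirical value."""
    s = np.sort(np.asarray(samples, dtype=float))
    k = int(np.ceil(mass * s.size))            # number of samples to cover
    # Candidate intervals: every run of k consecutive sorted samples,
    # extended where necessary so that it contains the empirical value.
    lows = np.minimum(s[: s.size - k + 1], empirical)
    highs = np.maximum(s[k - 1:], empirical)
    best = int(np.argmin(highs - lows))        # narrowest candidate
    return float(lows[best]), float(highs[best])

demo = [0, 1, 2, 3, 10, 20, 30, 40]
inside = densest_interval(demo, 2.0)    # empirical value within the dense region
outside = densest_interval(demo, 5.0)   # interval is stretched to reach 5.0
```

In the illustrative data the four smallest samples form the densest half of the distribution, so the interval is (0, 3) when the empirical value lies inside it and is stretched to (0, 5) when the empirical value is 5.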
Reporting summary
Further information on experimental design is available in the Nature Research Reporting Summary linked to this article.
Data availability
The data that support the findings of this study and data analysis code have been deposited in the Open Science Framework (OSF) and are available in the following link: [https://osf.io/8j7yf/?view_only=8362bdb2672643de98daaa8e509aae30].
References
 1.
Kakade, S. & Dayan, P. Dopamine: generalization and bonuses. Neural Netw. 15, 549–559 (2002).
 2.
Daw, N. D., Courville, A. C. & Touretzky, D. S. Representation and timing in theories of the dopamine system. Neural Comput. 18, 1637–1677 (2006).
 3.
Rao, R. P. N. Decision making under uncertainty: a neural model based on partially observable Markov decision processes. Front. Comput. Neurosci. 4, 1–18 (2010).
 4.
Lak, A., Nomoto, K., Keramati, M., Sakagami, M. & Kepecs, A. Midbrain dopamine neurons signal belief in choice accuracy during a perceptual decision. Curr. Biol. 27, 821–832 (2017).
 5.
Starkweather, C. K., Babayan, B. M., Uchida, N. & Gershman, S. J. Dopamine reward prediction errors reflect hidden-state inference across time. Nat. Neurosci. 20, 581–589 (2017).
 6.
Sarno, S., de Lafuente, V., Romo, R. & Parga, N. Dopamine reward prediction error signal codes the temporal evaluation of a perceptual decision report. Proc. Natl Acad. Sci. USA 114, E10494–E10503 (2017).
 7.
Babayan, B. M., Uchida, N. & Gershman, S. J. Belief state representation in the dopamine system. Nat. Commun. 9, 1891 (2018).
 8.
Dickinson, A. & Balleine, B. in Stevens’ Handbook of Experimental Psychology: Learning, Motivation, and Emotion 3rd edn, Vol. 3 (ed Gallistel, R.) Ch. 12 (Wiley, New York, 2002).
 9.
Daw, N. D., Niv, Y. & Dayan, P. Uncertaintybased competition between prefrontal and dorsolateral striatal systems for behavioral control. Nat. Neurosci. 8, 1704–1711 (2005).
 10.
Balleine, B. W. & O’Doherty, J. P. Human and rodent homologies in action control: corticostriatal determinants of goaldirected and habitual action. Neuropsychopharmacology 35, 48–69 (2010).
 11.
Dolan, R. J. & Dayan, P. Goals and habits in the brain. Neuron 80, 312–325 (2013).
 12.
Doya, K. What are the computations of the cerebellum, the basal ganglia and the cerebral cortex? Neural Netw. 12, 961–974 (1999).
 13.
Adams, C. D. & Dickinson, A. Instrumental responding following reinforcer devaluation. Q. J. Exp. Psychol. 33, 109–121 (1981).
 14.
Yin, H. H., Knowlton, B. J. & Balleine, B. W. Lesions of dorsolateral striatum preserve outcome expectancy but disrupt habit formation in instrumental learning. Eur. J. Neurosci. 19, 181–189 (2004).
 15.
Yin, H. H., Ostlund, S. B., Knowlton, B. J. & Balleine, B. W. The role of the dorsomedial striatum in instrumental conditioning. Eur. J. Neurosci. 22, 513–523 (2005).
 16.
Daw, N. D., Gershman, S. J., Seymour, B., Dayan, P. & Dolan, R. J. Modelbased influences on humans’ choices and striatal prediction errors. Neuron 69, 1204–1215 (2011).
 17.
Gershman, S. J., Markman, A. B. & Otto, A. R. Retrospective revaluation in sequential decision making: a tale of two systems. J. Exp. Psychol. Gen. 143, 182–194 (2014).
 18.
Gläscher, J., Daw, N., Dayan, P. & O’Doherty, J. P. States versus rewards: dissociable neural prediction error signals underlying modelbased and modelfree reinforcement learning. Neuron 66, 585–595 (2010).
 19.
Valentin, V. V., Dickinson, A. & O’Doherty, J. P. Determining the neural substrates of goaldirected learning in the human brain. J. Neurosci. 27, 4019–4026 (2007).
 20.
Smittenaar, P., FitzGerald, T. H. B., Romei, V., Wright, N. D. & Dolan, R. J. Disruption of dorsolateral prefrontal cortex decreases modelbased in favor of modelfree control in humans. Neuron 80, 914–919 (2013).
 21.
Killcross, S. & Coutureau, E. Coordination of actions and habits in the medial prefrontal cortex of rats. Cereb. Cortex 13, 400–408 (2003).
 22.
Cushman, F. & Morris, A. Habitual control of goal selection in humans. Proc. Natl Acad. Sci. USA 112, 13817–13822 (2015).
 23.
Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (MIT Press, Cambridge, 1998).
 24.
Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997).
 25.
Doll, B. B., Duncan, K. D., Simon, D. A., Shohamy, D. & Daw, N. D. Modelbased choices involve prospective neural activity. Nat. Neurosci. 18, 767–772 (2015).
 26.
Keramati, M., Dezfouli, A. & Piray, P. Speed/accuracy tradeoff between the habitual and the goaldirected processes. PLoS Comput. Biol. 7, e1002055 (2011).
 27.
Sutton, R. S. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proc. 7th International Conference on Machine Learning (eds Porter, B. & Mooney, R.) 216–224 (University of Texas, Austin, 1990).
 28.
Wan Lee, S., Shimojo, S. & O’Doherty, J. P. Neural computations underlying arbitration between modelbased and modelfree learning. Neuron 81, 687–699 (2014).
 29.
Keramati, M., Smittenaar, P., Dolan, R. J. & Dayan, P. Adaptive integration of habits into depthlimited planning defines a habitualgoal–directed spectrum. Proc. Natl Acad. Sci. USA 113, 12868–12873 (2016).
 30.
Rescorla, R. A. & Wagner, A. R. A theory of Pavlovian conditioning: variations in the effectiveness of reinforcement and nonreinforcement. Class. Cond. II Curr. Res Theory 21, 64–99 (1972).
 31.
Moran, R. & GoshenGottstein, Y. Old processes, new perspectives: familiarity is correlated with (not independent of) recollection and is more (not equally) variable for targets than for lures. Cogn. Psychol. 79, 40–67 (2015).
 32.
Benjamini Y., Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing on JSTOR. J. R. Stat. Soc. Ser. B 57, 289–300 (1995).
 33.
Singh, S. P. & Sutton, R. S. Reinforcement learning with replacing eligibility traces. Mach. Learn. 22, 123–158 (1996).
 34.
BrombergMartin, E. S. & Hikosaka, O. Lateral habenula neurons signal errors in the prediction of reward information. Nat. Neurosci. 14, 1209–1216 (2011).
 35.
Vasconcelos, M., Monteiro, T. & Kacelnik, A. Irrational choice and the value of information. Sci. Rep. 5, 13874 (2015).
 36.
Zentall, T. R. & Stagner, J. Maladaptive choice behaviour by pigeons: an animal analogue and possible mechanism for gambling (suboptimal human decisionmaking behaviour). Proc. Biol. Sci. 278, 1203–1208 (2011).
 37.
Gipson, C. D., Alessandri, J. J. D., Miller, H. C. & Zentall, T. R. Preference for 50% reinforcement over 75% reinforcement by pigeons. Learn. Behav. 37, 289–298 (2009).
 38.
Bennett, D., Bode, S., Brydevall, M., Warren, H. & Murawski, C. Intrinsic valuation of information in decision making under uncertainty. PLOS Comput. Biol. 12, e1005020 (2016).
 39.
Iigaya, K., Story, G. W., KurthNelson, Z., Dolan, R. J. & Dayan, P. The modulation of savouring by prediction error and its effects on choice. eLife 5, 1–24 (2016).
 40.
Kiani, R., Esteky, H., Mirpour, K. & Tanaka, K. Object category structure in response patterns of neuronal population in monkey inferior temporal cortex. J. Neurophysiol. 97, 4296–4309 (2007).
 41.
Kriegeskorte, N. et al. Matching categorical object representations in inferior temporal cortex of man and monkey. Neuron 60, 1126–1141 (2008).
 42.
Wagenmakers, E. J., Ratcliff, R., Gomez, P. & Iverson, G. J. Assessing model mimicry using the parametric bootstrap. J. Math. Psychol. 48, 28–50 (2004).
Acknowledgements
P.D. was on a leave of absence at Uber Technologies during part of the time that this work was being carried out. We thank Eran Eldar for his helpful comments on the manuscript. R.M., M.K. and R.J.D. were funded by the Max Planck Society, Munich, Germany, URL: https://www.mpg.de/en, Grant number: 647070403019. R.J.D. was also funded by the Wellcome Trust, URL: https://wellcome.ac.uk/home, Grant number/reference: 098362/Z/12/Z. P.D. was funded by the Gatsby Charitable Foundation and by the Max Planck Society.
Author information
Contributions
R.M., P.D. and R.J.D. conceived the study. R.M. designed the experiment. R.M. programmed the experiments. R.M. performed the experiments. R.M. developed the models. R.M. analysed the data. R.M., P.D. and M.K. interpreted the results. R.M. drafted the manuscript. All authors wrote the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Journal peer review information: Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Moran, R., Keramati, M., Dayan, P. et al. Retrospective model-based inference guides model-free credit assignment. Nat Commun 10, 750 (2019). https://doi.org/10.1038/s41467-019-08662-8