Prefrontal cortex as a meta-reinforcement learning system

  • Nature Neuroscience, volume 21, pages 860–868 (2018)
  • doi:10.1038/s41593-018-0147-8


Over the past 20 years, neuroscience research on reward-based learning has converged on a canonical model, under which the neurotransmitter dopamine ‘stamps in’ associations between situations, actions and rewards by modulating the strength of synaptic connections between neurons. However, a growing number of recent findings have placed this standard model under strain. Here we draw on recent advances in artificial intelligence to introduce a new theory of reward-based learning. According to this theory, the dopamine system trains another part of the brain, the prefrontal cortex, to operate as its own free-standing learning system. This new perspective accommodates the findings that motivated the standard model, but also deals gracefully with a wider range of observations, providing a fresh foundation for future research.


Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


References

  1. Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA, USA, 1998).
  2. Montague, P. R., Dayan, P. & Sejnowski, T. J. A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J. Neurosci. 16, 1936–1947 (1996).
  3. Daw, N. D. & Tobler, P. N. Value learning through reinforcement: the basics of dopamine and reinforcement learning. in Neuroeconomics: Decision Making and the Brain 2nd edn. (eds. Glimcher, P. W. & Fehr, E.) 283–298 (Academic, New York, 2014).
  4. Rushworth, M. F. & Behrens, T. E. Choice, uncertainty and value in prefrontal and cingulate cortex. Nat. Neurosci. 11, 389–397 (2008).
  5. Seo, H. & Lee, D. Cortical mechanisms for reinforcement learning in competitive games. Phil. Trans. R. Soc. Lond. B 363, 3845–3857 (2008).
  6. Padoa-Schioppa, C. & Assad, J. A. Neurons in the orbitofrontal cortex encode economic value. Nature 441, 223–226 (2006).
  7. Tsutsui, K., Grabenhorst, F., Kobayashi, S. & Schultz, W. A dynamic code for economic object valuation in prefrontal cortex neurons. Nat. Commun. 7, 12554 (2016).
  8. Kim, J.-N. & Shadlen, M. N. Neural correlates of a decision in the dorsolateral prefrontal cortex of the macaque. Nat. Neurosci. 2, 176–185 (1999).
  9. Seo, M., Lee, E. & Averbeck, B. B. Action selection and action value in frontal-striatal circuits. Neuron 74, 947–960 (2012).
  10. Barraclough, D. J., Conroy, M. L. & Lee, D. Prefrontal cortex and decision making in a mixed-strategy game. Nat. Neurosci. 7, 404–410 (2004).
  11. Daw, N. D., Niv, Y. & Dayan, P. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nat. Neurosci. 8, 1704–1711 (2005).
  12. Bromberg-Martin, E. S., Matsumoto, M., Hong, S. & Hikosaka, O. A pallidus-habenula-dopamine pathway signals inferred stimulus values. J. Neurophysiol. 104, 1068–1076 (2010).
  13. Nakahara, H. & Hikosaka, O. Learning to represent reward structure: a key to adapting to complex environments. Neurosci. Res. 74, 177–183 (2012).
  14. Sadacca, B. F., Jones, J. L. & Schoenbaum, G. Midbrain dopamine neurons compute inferred and cached value prediction errors in a common framework. eLife 5, e13665 (2016).
  15. Daw, N. D., Gershman, S. J., Seymour, B., Dayan, P. & Dolan, R. J. Model-based influences on humans’ choices and striatal prediction errors. Neuron 69, 1204–1215 (2011).
  16. Mante, V., Sussillo, D., Shenoy, K. V. & Newsome, W. T. Context-dependent computation by recurrent dynamics in prefrontal cortex. Nature 503, 78–84 (2013).
  17. O’Reilly, R. C. & Frank, M. J. Making working memory work: a computational model of learning in the prefrontal cortex and basal ganglia. Neural Comput. 18, 283–328 (2006).
  18. Song, H. F., Yang, G. R. & Wang, X.-J. Reward-based training of recurrent neural networks for cognitive and value-based tasks. eLife 6, e21492 (2017).
  19. Redish, A. D., Jensen, S., Johnson, A. & Kurth-Nelson, Z. Reconciling reinforcement learning models with behavioral extinction and renewal: implications for addiction, relapse, and problem gambling. Psychol. Rev. 114, 784–805 (2007).
  20. Haber, S. N. The place of dopamine in the cortico-basal ganglia circuit. Neuroscience 282, 248–257 (2014).
  21. Frank, M. J., Seeberger, L. C. & O’Reilly, R. C. By carrot or by stick: cognitive reinforcement learning in parkinsonism. Science 306, 1940–1943 (2004).
  22. Houk, J. C., Adams, C. M. & Barto, A. G. A model of how the basal ganglia generate and use neural signals that predict reinforcement. in Models of Information Processing in the Basal Ganglia (eds. Houk, J. C. & Davis, D. G.) 249–270 (MIT Press, Cambridge, MA, USA, 1995).
  23. Rougier, N. P., Noelle, D. C., Braver, T. S., Cohen, J. D. & O’Reilly, R. C. Prefrontal cortex and flexible cognitive control: rules without symbols. Proc. Natl. Acad. Sci. USA 102, 7338–7343 (2005).
  24. Acuna, D. E. & Schrater, P. Structure learning in human sequential decision-making. PLoS Comput. Biol. 6, e1001003 (2010).
  25. Collins, A. G. & Frank, M. J. How much of reinforcement learning is working memory, not reinforcement learning? A behavioral, computational, and neurogenetic analysis. Eur. J. Neurosci. 35, 1024–1035 (2012).
  26. Thrun, S. & Pratt, L. Learning to Learn (Springer Science & Business Media, New York, 2012).
  27. Khamassi, M., Enel, P., Dominey, P. F. & Procyk, E. Medial prefrontal cortex and the adaptive regulation of reinforcement learning parameters. Prog. Brain Res. 202, 441–464 (2013).
  28. Ishii, S., Yoshida, W. & Yoshimoto, J. Control of exploitation-exploration meta-parameter in reinforcement learning. Neural Netw. 15, 665–687 (2002).
  29. Schweighofer, N. & Doya, K. Meta-learning in reinforcement learning. Neural Netw. 16, 5–9 (2003).
  30. Schmidhuber, J., Zhao, J. & Wiering, M. Simple principles of metalearning. IDSIA Technical Report 69-96, 1–23 (1996).
  31. Wang, J. X. et al. Learning to reinforcement learn. Preprint at arXiv:1611.05763 (2016).
  32. Duan, Y. et al. RL²: fast reinforcement learning via slow reinforcement learning. Preprint at arXiv:1611.02779 (2016).
  33. Lau, B. & Glimcher, P. W. Dynamic response-by-response models of matching behavior in rhesus monkeys. J. Exp. Anal. Behav. 84, 555–579 (2005).
  34. Behrens, T. E. J., Woolrich, M. W., Walton, M. E. & Rushworth, M. F. S. Learning the value of information in an uncertain world. Nat. Neurosci. 10, 1214–1221 (2007).
  35. Iigaya, K. Adaptive learning and decision-making under uncertainty by metaplastic synapses guided by a surprise detection system. eLife 5, e18073 (2016).
  36. Schuck, N. W., Cai, M. B., Wilson, R. C. & Niv, Y. Human orbitofrontal cortex represents a cognitive map of state space. Neuron 91, 1402–1412 (2016).
  37. Chan, S. C., Niv, Y. & Norman, K. A. A probability distribution over latent causes, in the orbitofrontal cortex. J. Neurosci. 36, 7817–7828 (2016).
  38. Hampton, A. N., Bossaerts, P. & O’Doherty, J. P. The role of the ventromedial prefrontal cortex in abstract state-based inference during decision making in humans. J. Neurosci. 26, 8360–8367 (2006).
  39. Miller, K. J., Botvinick, M. M. & Brody, C. D. Dorsal hippocampus contributes to model-based planning. Nat. Neurosci. 20, 1269–1276 (2017).
  40. Harlow, H. F. The formation of learning sets. Psychol. Rev. 56, 51–65 (1949).
  41. Trujillo-Pisanty, I., Solis, P., Conover, K., Dayan, P. & Shizgal, P. On the forms of learning supported by rewarding optical stimulation of dopamine neurons. Soc. Neurosci. Annu. Meet. 66.06 (2016).
  42. Kim, K. M. et al. Optogenetic mimicry of the transient activation of dopamine neurons by natural reward is sufficient for operant reinforcement. PLoS One 7, e33612 (2012).
  43. Chang, C. Y. et al. Brief optogenetic inhibition of dopamine neurons mimics endogenous negative reward prediction errors. Nat. Neurosci. 19, 111–116 (2016).
  44. Stopper, C. M., Tse, M. T. L., Montes, D. R., Wiedman, C. R. & Floresco, S. B. Overriding phasic dopamine signals redirects action selection during risk/reward decision making. Neuron 84, 177–189 (2014).
  45. Wang, X.-J. Synaptic reverberation underlying mnemonic persistent activity. Trends Neurosci. 24, 455–463 (2001).
  46. Chatham, C. H. & Badre, D. Multiple gates on working memory. Curr. Opin. Behav. Sci. 1, 23–31 (2015).
  47. Kim, H., Lee, D. & Jung, M. W. Signals for previous goal choice persist in the dorsomedial, but not dorsolateral striatum of rats. J. Neurosci. 33, 52–63 (2013).
  48. Takahashi, Y. K. et al. Expectancy-related changes in firing of dopamine neurons depend on orbitofrontal cortex. Nat. Neurosci. 14, 1590–1597 (2011).
  49. Collins, A. G. E. & Frank, M. J. Neural signature of hierarchically structured expectations predicts clustering and transfer of rule sets in reinforcement learning. Cognition 152, 160–169 (2016).
  50. Gershman, S. J. & Daw, N. D. Reinforcement learning and episodic memory in humans and animals: an integrative framework. Annu. Rev. Psychol. 68, 101–128 (2017).
  51. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
  52. Mnih, V. et al. Asynchronous methods for deep reinforcement learning. in Proc. 33rd Intl. Conf. Machine Learning 48, 1928–1937 (JMLR, New York, 2016).
  53. Graves, A., Jaitly, N. & Mohamed, A.-R. Hybrid speech recognition with deep bidirectional LSTM. in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) 2013, 273–278 (IEEE, 2013).
  54. Leibo, J. Z. et al. Psychlab: a psychology laboratory for deep reinforcement learning agents. Preprint at arXiv (2018).
  55. Deng, J. et al. ImageNet: a large-scale hierarchical image database. in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009, 248–255 (IEEE, 2009).



Acknowledgements

We are grateful to K. Miller, F. Grabenhorst, T. Behrens, E. Bromberg-Martin, S. Floresco and P. Glimcher for graciously providing help with and permission for adapting their data. We thank C. Blundell and R. Munos for discussions and comments on an earlier draft.

Author information

Author notes

  1. These authors contributed equally: Jane X. Wang and Zeb Kurth-Nelson.


  1. DeepMind, London, UK

    • Jane X. Wang
    • , Zeb Kurth-Nelson
    • , Dharshan Kumaran
    • , Dhruva Tirumala
    • , Hubert Soyer
    • , Joel Z. Leibo
    • , Demis Hassabis
    •  & Matthew Botvinick
  2. Max Planck-UCL Centre for Computational Psychiatry and Ageing Research, University College London, London, UK

    • Zeb Kurth-Nelson
  3. Institute of Cognitive Neuroscience, University College London, London, UK

    • Dharshan Kumaran
  4. Gatsby Computational Neuroscience Unit, University College London, London, UK

    • Demis Hassabis
    •  & Matthew Botvinick




Contributions

J.X.W., Z.K.-N. and M.B. designed the simulations. J.X.W. and Z.K.-N. performed the simulations and analyzed the data. D.T., H.S. and J.Z.L. contributed and helped with code. All authors wrote the manuscript.

Competing interests

The authors are employed by DeepMind Technologies Limited.

Corresponding author

Correspondence to Matthew Botvinick.

Integrated supplementary information

  1. Supplementary Figure 1

    Pseudo-code for advantage actor-critic algorithm
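
The advantage actor-critic update summarized in the pseudo-code can be sketched in a minimal tabular form. This is a hedged illustration, not the paper's implementation: the paper's agents use LSTM networks trained with the asynchronous variant (A3C, ref. 52), whereas here the policy is a bare softmax over logits, the critic a single scalar value, and the learning rates and entropy coefficient are arbitrary choices.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class A2CBandit:
    """Tabular advantage actor-critic on a stateless bandit (illustrative only)."""
    def __init__(self, n_actions=2, lr=0.1, beta_v=0.5, beta_e=0.01):
        self.logits = np.zeros(n_actions)  # actor parameters
        self.value = 0.0                   # critic: scalar value estimate
        self.lr, self.beta_v, self.beta_e = lr, beta_v, beta_e

    def act(self, rng):
        return rng.choice(len(self.logits), p=softmax(self.logits))

    def update(self, action, reward):
        p = softmax(self.logits)
        advantage = reward - self.value        # one-step advantage: R - V
        grad_logp = -p
        grad_logp[action] += 1.0               # d log pi(a) / d logits
        logp = np.log(p)
        entropy = -(p @ logp)
        grad_ent = -p * (logp + entropy)       # dH / d logits, favors exploration
        self.logits += self.lr * (advantage * grad_logp + self.beta_e * grad_ent)
        self.value += self.lr * self.beta_v * advantage  # value regression step

rng = np.random.default_rng(0)
agent = A2CBandit()
for _ in range(2000):                          # two-armed bandit, p = (0.8, 0.2)
    a = agent.act(rng)
    agent.update(a, float(rng.random() < (0.8, 0.2)[a]))
```

After training, the policy concentrates on the richer arm and the value estimate approaches the mean reward under that policy.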

  2. Supplementary Figure 2 Visualization of RNN activations in illustrative example (bandit task).

    a) Evolution of the RNN activation pattern during individual trials while testing on correlated bandits, after training with independent arm parameters. Format analogous to Fig. 1e in the main text. PL = probability of reward for action left; PR = probability of reward for action right. Scatter points depict the first two principal components of the RNN activation (LSTM output) vector across steps of the bandit task, drawn from single trials across a range of randomly sampled payoff parameter settings. As evidence accrues, the activation pattern follows a coherent path in one of two directions, corresponding to the arms of the bandit. Interestingly, in more difficult discrimination problems (smaller gap between the two arm reward probabilities; middle two columns, also see Fig. 1e), activity sometimes appears to initially move toward one extreme of the manifold in early trials, then later reverses when late-coming observations contradict initial ones, ending up at the other extremity by the end of the episode (see Ito and Doya, PLoS Comput Biol 11, e1004540, 2015, which posits a related mechanism, though without discussing its origins). b) RNN activity patterns from step 100 (episode completion) while testing on correlated bandits, across a range of payoff parameters. Format analogous to Fig. 1f. In contrast to Fig. 1f, in which the network is trained on linked arm parameters and the activity pattern thus traces out a one-dimensional manifold, the activity pattern after training on independent bandits is correspondingly more complex. c) RNN activity patterns when testing on independent bandits using the same network as in a and b, colored according to the left arm's probability of reward PL (red = higher). d) Same data as in c, but colored according to the right arm's probability of reward PR (red = higher). For all plots, analyses were done for 300 evaluation episodes, each consisting of 100 trials, performed by one fully trained network.
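
The projection used in these panels can be sketched briefly: center the matrix of per-step activation vectors and project it onto its first two principal components. Random data stand in here for the LSTM outputs; the sizes (300 steps, 48 units) are only illustrative.

```python
import numpy as np

def first_two_pcs(acts):
    """acts: (n_samples, n_units) activation matrix -> (n_samples, 2) PC scores."""
    centered = acts - acts.mean(axis=0)
    # right singular vectors of the centered data give the principal axes
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
acts = rng.normal(size=(300, 48))   # e.g. 300 task steps x 48 hidden units
scores = first_two_pcs(acts)        # 2-D coordinates for the scatter plot
```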

  3. Supplementary Figure 3 Cortico–basal ganglia–thalamic loops.

    As noted in the main paper, the recurrent neural system to which we refer as the prefrontal network includes not only the PFC itself, including orbitofrontal, ventromedial frontal, cingulate, dorsolateral prefrontal and frontopolar regions, but also the subcortical structures to which it most strongly connects, including ventral striatum, dorsomedial striatum (or its primate homologue), and mediodorsal thalamus. The recurrent connectivity that we assume is thus not only inherent within the prefrontal cortex itself but also spans parallel cortico-basal ganglia-thalamic loops. The figure diagrams a standard model of these loops, based on Glimcher, P. W. & Fehr, E. (eds.), Neuroeconomics: Decision Making and the Brain (Academic Press, 2013), page 378. DPFC: dorsal prefrontal cortex, including dorsolateral prefrontal and cingulate cortices. VPFC: ventral prefrontal cortex, including ventromedial and orbitofrontal prefrontal cortices. SM: (somato-)sensorimotor cortex. VS: ventral striatum. DMS: dorsomedial striatum. DLS: dorsolateral striatum. VTA/SNc: ventral tegmental area and substantia nigra pars compacta, with rounded arrowheads indicating dopaminergic projections. SNr/GPi: substantia nigra pars reticulata/globus pallidus pars interna. MD: mediodorsal thalamus. PO: posterior thalamus. The circuit running through DPFC has been referred to as the “associative loop”; the circuit running through the VPFC as the “limbic loop”; and the circuit running through SM as the “sensorimotor loop” (Haber, S., Neuroscience, 282, 248–257, 2014).

    One way of organizing the functions of the various regions of PFC and associated sectors of the striatum is by reference to the actor-critic architecture introduced in RL research, with ventral regions performing the critic role (computing estimates of the value function) and dorsal regions performing the role of the actor (implementing the policy; see Botvinick, M. M., Niv, Y. & Barto, A. C., Cognition, 113, 262–280, 2009, and Joel, D., Niv, Y. & Ruppin, E., Neural Networks, 15, 535–547, 2002). It is no coincidence that our assumptions concerning the outputs of the prefrontal network include both value estimates and actions. In this sense, we are conceptualizing the prefrontal network in terms of the actor-critic schema, and further elaborations of the paradigm might attain finer-grained architectural differentiation by importing existing ideas about the mapping between neuroanatomy and the actor-critic architecture (see, e.g., Song, H. F., Yang, G. R. & Wang, X.-J., eLife, 6, e21492, 2017). The implementation that we introduced in Simulation 6 takes a step in this direction, by dividing the prefrontal network into two halves, corresponding functionally to actor and critic. The loss function employed in optimizing our networks includes terms for value regression and policy gradient. Actor-critic models suggest that something very much like these two losses is implemented in cortex and basal ganglia (e.g. Joel, Niv & Ruppin, Neural Networks, 15, 535–547, 2002). Value regression is believed to be implemented in the basal ganglia as dopamine prediction errors drive value estimates toward their targets (Montague, Dayan & Sejnowski, Journal of Neuroscience, 16, 1936–1947, 1996). Meanwhile, the policy gradient loss simply increases the strength of connections that participated in generating an action (when that action led to positive reward), which is believed to be the role of dopamine acting as a modulator of plasticity at corticostriatal synapses.

    Under the meta-RL theory, the role of DA in modulating synaptic function should play out only over a relatively long time-scale, serving to sculpt prefrontal dynamics over the course of multiple tasks. This is in contrast to the standard model, which assumes that DA can drive synaptic change sufficiently quickly to affect behavior on the scale of a few seconds. Interestingly, despite decades of intensive research, no direct evidence has yet been reported to indicate that phasic DA signals can drive synaptic change this rapidly. Indeed, we are aware of no experimental evidence that phasic elevations in DA concentrations spanning less than a full minute can significantly impact synaptic efficacy, and in many studies the impact of phasic fluctuations in DA can take minutes (or even tens of minutes) to ramp up (Brzosko, Z., Schultz, W. & Paulsen, O., eLife, 4, e09685, 2015; Otmakhova, N. A. & Lisman, J. E., Journal of Neuroscience, 16, 7478–7486, 1996; Yagishita, S. et al., Science, 345, 1616–1620, 2014). The lack of evidence for faster effects of phasic DA on synaptic plasticity presents another difficulty for the standard model. In contrast, the meta-RL framework directly predicts that such short-term effects of DA should in fact be absent in the prefrontal network, and that the effect of phasic DA signaling on synaptic efficacy in this network should instead operate on a significantly longer time-scale.

  4. Supplementary Figure 4 Detailed behavioral analysis of simulation 1: reinforcement learning in the prefrontal network.

    Left: run lengths from the model. Bars indicate log trial counts for each run length, based on data from the last two-thirds of trials in each episode, and pooling across all reward-probability assignments. The pattern fits with the exponential decay typically reported in matching experiments (Corrado, Sugrue, Seung and Newsome, Journal of the Experimental Analysis of Behavior, 84, 581–617, 2005), paired with a tendency to alternate responses. The latter is typical in tasks not involving a changeover penalty, and was observed in the study by Tsutsui et al. (Fabian Grabenhorst, personal communication). Right: histograms showing log counts across run lengths individually for a range of task parameterizations. p0(ai) denotes the ground-truth reward probability for action i. The pattern appears bimodal toward the right of the figure, suggesting that the model may approximate a fixed alternation between poorer and richer arms, remaining at the poorer arm for only one step and at the richer arm for a fixed number of steps. Houston and McNamara (Journal of the Experimental Analysis of Behavior, 35, 367–396, 1981) have shown that the optimal policy for discrete-trial variable-interval tasks, like the one studied here, takes this form.

  5. Supplementary Figure 5 Additional simulations and analyses for simulation 2: adaptation of prefrontal-based learning to the task environment.

    a) Proportion of LSTM units coding for volatility changes over the course of an episode. Y-axis shows the fraction of units whose activity significantly correlated (across episodes) with the true volatility. Volatility coding emerged shortly after the 25th trial, which is when the first reversal occurred if the episode began with high volatility. Volatility coding was assessed by regressing hidden activations against the true volatility in a single regression (black line), or by multiple regression of hidden activations against true volatility, previous reward, next action, and true reward probability (red line). A unit was labelled as coding volatility if the slope of the regression was significantly different from zero (two-tailed t-test) at a threshold of 0.05/n, where n = 48 is the number of hidden units. b) Because units were not independent, the true correction for multiple comparisons lies somewhere between 1 and 1/n. The significance threshold for declaring a unit as coding volatility varies on the x-axis in this plot, with uncorrected and Bonferroni-corrected levels indicated by dashed lines. The y-axis shows the mean over trials of the number of units that exceeded this threshold. Although changing the threshold had some effect on the number of units significantly coding volatility, there was a population of units that appeared not to code volatility at all, and another population that coded it very strongly. c) Causal role of volatility-coding units in controlling learning rate. After identifying the two units whose activity correlated most strongly (positively) with volatility, we inhibited these units by deducting a fixed quantity (0.3) from their activity level (LSTM cell output), analogous to an optogenetic inhibition. The black curve shows the learning rate (estimated as described in Methods) without this "optogenetic" manipulation. The red curve shows the learning rate when "optogenetic" inhibition was applied starting at the onset of the high-volatility period (arrow). Inhibition of these two units resulted in a slower measured learning rate, establishing a causal connection between the volatility-coding units and learning dynamics. Note that we treated the relevant ‘pretraining’ as having occurred outside the lab, reflecting the statistics of naturalistic task situations. For additional related work on the neural basis of learning rate adjustment, see Soltani et al., Neural Networks, 19, 1075–1090, 2006; and Farashahi et al., Neuron, 94, 401–414, 2017. Like the comparable work cited in the main text, these studies posit special-purpose mechanisms. In contrast, our theory posits the relevant mechanisms as one manifestation of a more general meta-learning phenomenon.
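
The unit-screening step in panel a can be sketched as follows: regress each hidden unit's activation against the true volatility across episodes and flag units whose slope differs from zero at the Bonferroni-corrected level 0.05/n. Synthetic activations stand in here for the LSTM hidden states, and the normal quantile approximates the t distribution, which is reasonable at ~500 samples.

```python
import numpy as np
from math import erf, sqrt

def slope_t(x, y):
    """t statistic for the slope of a simple linear regression of y on x."""
    xc = x - x.mean()
    beta = (xc @ y) / (xc @ xc)
    resid = y - y.mean() - beta * xc
    se = sqrt((resid @ resid) / (len(y) - 2) / (xc @ xc))
    return beta / se

def two_tailed_p(t):
    # normal approximation to the t distribution (fine for ~500 samples)
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(t) / sqrt(2.0))))

rng = np.random.default_rng(1)
n_units, n_episodes = 48, 500
volatility = rng.random(n_episodes)
acts = rng.normal(size=(n_units, n_episodes))
acts[:5] += 3.0 * volatility                 # five units carry a true signal

alpha = 0.05 / n_units                       # Bonferroni-corrected threshold
coding = [u for u in range(n_units)
          if two_tailed_p(slope_t(volatility, acts[u])) < alpha]
```

With this construction, the screen recovers the signal-carrying units while the corrected threshold keeps false positives rare.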

  6. Supplementary Figure 6 Additional analyses for simulation 4: model-based behavior: the two-step task.

    a) Model-based and model-free RPEs, derived from prospective planning and SARSA, respectively, were partially correlated with one another. (This motivated the hierarchical regression in our primary analyses; see Fig. 5f,g in main text.) b) Beta coefficients from multiple regression using both MF and MB RPEs to explain meta-RL’s RPEs. Only MB RPEs significantly explained meta-RL’s RPEs. c) Two-step task performance for meta-RL, model-based RL, and two alternative models considered by Akam et al. (PLoS Comput Biol 11, e1004648, 2015). Akam and colleagues (see also Miller et al., bioRxiv 096339, 2016) have shown that a reward-by-transition interaction is not uniquely diagnostic for model-basedness: it is possible to construct agents that use model-free learning yet display a reward-by-transition interaction effect on the probability of repeating an action. Here we simulated and fit two models from Akam et al. (2015): (1) the "latent-state" model, which does inference on the binary latent state of which stimulus currently has high reward probability and learns separate Q-values for the two different states, and (2) the "reward-as-cue" model, which uses for its state representation the cross-product of which second-stage state occurred on the previous trial and whether or not reward was received. The model-based and latent-state strategies yielded comparable levels of reward, while the reward-as-cue strategy paid less well. Although the model-based agent earned slightly less than the latent-state agent, this difference was eliminated if the former was modified to take account of the anti-correlation between the second-step state payoff probabilities, as in the agent used to predict model-based RPE signals in Simulation 4 (see Methods). The performance of meta-RL was comparable to that of both the model-based and latent-state agents. We note that meta-RL is therefore unlikely to implement the reward-as-cue strategy; this is further ruled out by regression results presented in Fig. 5e, which are inconsistent with the pattern produced by reward-as-cue (see Akam et al., 2015, and Miller et al., 2016). The dashed line indicates performance for random action selection. Model-free Q-learning performed significantly worse than reward-as-cue (data not shown). d) Fitting three models to meta-RL’s behavior. Model-based and latent-state strategies yielded comparable fits, and again the small difference between them was eliminated if the model-based strategy was given access to a full model of the task. The reward-as-cue strategy fit significantly worse. All analyses here are done over 299 evaluation episodes, using one fully trained network out of 8 replicas. Results are highly similar for other replicas.
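
The reward-by-transition "stay" analysis referenced above can be sketched as a simple tabulation: the probability of repeating the previous first-stage choice in each of the four (transition x reward) cells. Synthetic random behavior stands in for agent choices, so all four cells here sit near 0.5; a genuinely model-based agent would instead show the crossover interaction.

```python
import numpy as np

def stay_probabilities(actions, common, rewarded):
    """All inputs are 0/1 arrays over trials; 'common' marks common transitions.

    Returns P(stay) keyed by (transition type, reward) of the previous trial."""
    stay = (actions[1:] == actions[:-1]).astype(float)
    return {(c, r): stay[(common[:-1] == c) & (rewarded[:-1] == r)].mean()
            for c in (0, 1) for r in (0, 1)}

rng = np.random.default_rng(0)
n = 5000
actions = rng.integers(0, 2, n)              # first-stage choices
common = (rng.random(n) < 0.7).astype(int)   # 70% common transitions
rewarded = rng.integers(0, 2, n)
table = stay_probabilities(actions, common, rewarded)
```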

  7. Supplementary Figure 7 New simulation to directly test model-based reasoning in meta-RL.

    The two-step task alone is not sufficient to determine whether meta-RL can learn to implement truly prospective planning. Akam et al. (2015) have identified at least two non-prospective strategies ("reward-as-cue" and "latent-state") that can masquerade as prospective in the two-step task. Although we can rule out meta-RL using the reward-as-cue strategy (see Supplementary Figure 6), we cannot rule out the latent-state strategy. (Akam and colleagues, 2015, acknowledge that the latent-state strategy requires a form of model to infer the current latent state from observed reward outcomes (Kolling et al., Nature Neuroscience, 19, 1280, 2016); however, this is a relatively impoverished definition of model-based compared to fully prospective reasoning using a transition model.) Other studies hint that true prospective planning may arise when recurrent neural networks are trained by RL in a multi-task environment. Recurrent neural networks can be trained to select reward-maximizing actions when given inputs representing Markov decision problem (MDP) parameters (e.g., images of random mazes) (Tamar et al., Neural Information Processing Systems, 2146, 2016; Werbos et al., IEEE International Conference on Systems, Man and Cybernetics, 1764, 1996). A recent study by Duan and colleagues (arXiv, 1611.02779, 2016) also showed that recurrent networks trained on a series of random MDPs quickly adapt to unseen MDPs, outperforming standard model-free RL algorithms. And finally, recurrent networks can learn through RL to perform tree-search-like operations supporting action selection (Silver et al., arXiv 1612.08810, 2016; Hamrick et al., Neural Information Processing Systems, 2016; Graves et al., Nature 538, 471, 2016). We therefore sought to directly test prospective planning in meta-RL. To this end, we designed a novel revaluation task in which the agent was given five steps to act in a 32-state MDP whose transition structure was fixed across training and testing (for experiments using related tasks, see Kurth-Nelson et al., Neuron, 91, 194, 2016; Huys et al., PNAS, 112, 3098, 2015; Keramati et al., PNAS, 113, 12868, 2016; Lee et al., Neuron, 81, 687, 2014). The reward function was randomly permuted from episode to episode, and given as part of the input to the agent. Because planning requires iterative computations, the agent was given a ‘pondering’ (Graves, arXiv:1603.08983, 2016) period of five steps at the start of each episode.

    A) Transition structure of the revaluation task. From each state, two actions were available (red and green arrows). For visualization, nodes are placed to minimize distances between connected nodes. Rewards were sampled on each episode by randomly permuting a 32-element vector containing ten entries of +1, 21 entries of −1, and one entry of +5; node size in the diagram shows a single sampled reward function. There were over a billion possible distinct reward functions: orders of magnitude more than the number of training episodes. Episodes began in a random state. Meta-RL’s network architecture was identical to our other simulations, except the LSTM had 128 units. B) Meta-RL reached near-optimal performance when tested on reward functions not included in training. Significantly, meta-RL outperformed a ‘greedy’ agent, which followed the shortest path toward the largest reward. Meta-RL also outperformed an agent using the successor representation (SR). Q(1): Q-learning with eligibility trace; Rand.: random action selection. Bars indicate standard error. C) Correlation between network value output (‘baseline’) and ground-truth future reward grows during the pondering period (steps 1 through 5). This indicates that the network used the pondering period to perform calculations that steadily improved its accuracy in predicting future reward. D) Canonical correlation between LSTM hidden state and several task variables, performed independently at each time step. Because “last action” was given as part of the input to the network, we orthogonalized the hidden state against this variable before calculating the canonical correlation. Unsurprisingly, we found that a linear code for action appeared most robustly at the step when the action was taken (the first action occurred on step 6, after five steps of ‘pondering’). However, this signal also ramped up prior to the onset of action. (The network cannot have a perfect representation of which action it will take, because the network’s output is passed through a noisy softmax function to determine the actual action given to the environment.) The network also maintained a strong representation of which state it currently occupied.

    We note that the network’s knowledge of the transition probabilities was acquired through RL. However, our theory does not exclude the existence of other neural learning mechanisms capable of identifying transition probabilities and other aspects of task structure. Indeed, there is overwhelming evidence that the brain identifies sequential and causal structure independent of reward (Gläscher et al., Neuron, 66, 585, 2010; Tolman, Psychological Review, 55, 189, 1948). In subsequent work, it will be interesting to consider how such learning mechanisms might interact and synergize with meta-RL (e.g., Hamrick et al., Neural Information Processing Systems, 2016). All analyses here are done over 1000 evaluation episodes, using one fully trained network. Other simulations using slightly different hyperparameters were conducted but not reported, and yielded very similar results.
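
The prospective-planning benchmark for a task of this kind can be sketched with exact dynamic programming: deterministic transitions fixed across episodes, a reward vector permuted each episode, and backward induction over the five-step horizon. The 8-state graph below is a hypothetical stand-in for the paper's 32-state MDP, and the myopic baseline is a simplified analogue (not a reimplementation) of the 'greedy' agent described above.

```python
import numpy as np

n_states, n_actions, horizon = 8, 2, 5
rng = np.random.default_rng(0)
T = rng.integers(0, n_states, size=(n_states, n_actions))  # next-state table

def plan(rewards, start):
    """Backward induction: V[t, s] = max_a (r(T[s, a]) + V[t + 1, T[s, a]])."""
    V = np.zeros((horizon + 1, n_states))
    for t in range(horizon - 1, -1, -1):
        for s in range(n_states):
            V[t, s] = max(rewards[T[s, a]] + V[t + 1, T[s, a]]
                          for a in range(n_actions))
    return V[0, start]

def myopic_return(rewards, start):
    """Baseline that always takes the action with the best immediate reward."""
    s, total = start, 0.0
    for _ in range(horizon):
        s = T[s, int(rewards[T[s, 0]] < rewards[T[s, 1]])]
        total += rewards[s]
    return total

# per-episode revaluation: permute a fixed reward vector (one large reward)
rewards = rng.permutation([5.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0, -1.0])
```

By construction, the dynamic-programming return bounds the myopic return from above for every start state, which is the sense in which full planning can outperform shortsighted strategies here.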

  8. Supplementary Figure 8 Trajectory of hidden activations during revaluation task (see Supplementary Figure 7).

    Each panel shows the evolution, over steps within an episode, of the first two principal components of the hidden activations of the LSTM network. Each oval is one time step. The width and height of each oval illustrates the standard error (across episodes), and the color corresponds to the index of the time step within the episode (blue at the beginning of the episode and red at the end). Each panel summarizes a subset of episodes. Episodes were randomly generated under the constraint that the large (+5) reward could be reached within exactly four or five steps from the initial state. The top row of panels shows all the episodes where the large reward was reached on the ninth time step (five ‘pondering’ steps followed by four overt actions). The bottom row shows the episodes where the large reward was reached on the tenth time step (pondering plus five actions). Together these comprise 97% of all episodes. The remaining 3% of trials, where the large reward was not reached, are not depicted. Within each row, the i-th panel (from left to right) divides the depicted episodes into two groups: those in which the i-th action ultimately executed was 0 (dotted line), and those in which it was 1 (solid line). We found strong interactions between the future reward and the current and future actions. For example, the network’s trajectory along the 2nd principal component began to diverge during the pondering period depending on whether it would take action 0 or 1 on the 3rd action. By the final step of pondering, this effect itself was strongly modulated according to whether reward would be received on step nine or ten of the episode. This appears to reflect a sophisticated but idiosyncratic planning algorithm. As in Supplementary Figure 7, PCA analysis was conducted over 1000 evaluation episodes using 1 fully trained network.

  9. Supplementary Figure 9 Agent architecture for simulation 6.

    Agent architecture employed in Simulation 6. Instead of a single LSTM, we now employ two LSTMs to model an actor (outputting the policy) and a critic (outputting the value estimate). The critic receives the same input as in our original model, while the actor receives the state observation, the last action taken, and the reward prediction error computed from the value estimate output by the critic.
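
The wiring described here can be sketched in code. This is a hedged structural illustration only: simple tanh RNN cells stand in for the paper's LSTMs, the layer sizes and discount are arbitrary, and no learning rule is included; the point is the routing of the critic's reward prediction error into the actor's input.

```python
import numpy as np

class RNNCell:
    """Minimal tanh recurrent cell (stand-in for an LSTM)."""
    def __init__(self, n_in, n_hidden, rng):
        s = 1.0 / np.sqrt(n_in + n_hidden)
        self.W = rng.normal(0, s, size=(n_hidden, n_in + n_hidden))
        self.h = np.zeros(n_hidden)

    def step(self, x):
        self.h = np.tanh(self.W @ np.concatenate([x, self.h]))
        return self.h

class ActorCriticAgent:
    def __init__(self, n_obs, n_actions, n_hidden=16, gamma=0.9, seed=0):
        rng = np.random.default_rng(seed)
        # critic input: observation + last action (one-hot) + last reward
        self.critic = RNNCell(n_obs + n_actions + 1, n_hidden, rng)
        # actor input: observation + last action (one-hot) + critic's RPE
        self.actor = RNNCell(n_obs + n_actions + 1, n_hidden, rng)
        self.w_v = rng.normal(0, 0.1, size=n_hidden)            # value readout
        self.w_pi = rng.normal(0, 0.1, size=(n_actions, n_hidden))
        self.gamma = gamma
        self.prev_value = 0.0

    def step(self, obs, last_action_onehot, last_reward):
        h_c = self.critic.step(np.concatenate([obs, last_action_onehot,
                                               [last_reward]]))
        value = self.w_v @ h_c
        rpe = last_reward + self.gamma * value - self.prev_value  # delta
        self.prev_value = value
        h_a = self.actor.step(np.concatenate([obs, last_action_onehot, [rpe]]))
        logits = self.w_pi @ h_a
        p = np.exp(logits - logits.max())
        return value, rpe, p / p.sum()

agent = ActorCriticAgent(n_obs=4, n_actions=2)
v, rpe, policy = agent.step(np.zeros(4), np.array([1.0, 0.0]), 0.0)
```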

Supplementary information

  1. Supplementary Text and Figures

    Supplementary Figures 1–9

  2. Reporting Summary

  3. Supplementary Video 1

    Video illustrating performance in simulation 5. Four consecutive trial blocks are shown. In the first and second blocks, the rewarded image first appears on the left; in the third and fourth, on the right. Images are shown in full resolution. See Methods for simulation details.