Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

Reinforcement learning in artificial and biological systems

Abstract

There is and has been a fruitful flow of concepts and ideas between studies of learning in biological and artificial systems. Much early work that led to the development of reinforcement learning (RL) algorithms for artificial systems was inspired by learning rules first developed in biology by Bush and Mosteller, and Rescorla and Wagner. More recently, temporal-difference RL, developed for learning in artificial agents, has provided a foundational framework for interpreting the activity of dopamine neurons. In this Review, we describe state-of-the-art work on RL in biological and artificial agents. We focus on points of contact between these disciplines and identify areas where future research can benefit from information flow between these fields. Most work in biological systems has focused on simple learning problems, often embedded in dynamic environments where flexibility and ongoing learning are important, similar to real-world learning problems faced by biological systems. In contrast, most work in artificial agents has focused on learning a single complex problem in a static environment. Moving forward, work in each field will benefit from a flow of ideas that represent the strengths within each discipline.

Access options

Rent or Buy article

Get time limited or full article access on ReadCube.

from$8.99

All prices are NET prices.

Fig. 1: Overview of approaches to learning in biological and artificial agents.
Fig. 2: Anatomy of a model of reinforcement learning shown on a schematic representation of the rhesus monkey striatum.
Fig. 3: Expanded conception of neural circuitry underlying reinforcement learning.
Fig. 4: A model of spike-based deep, continuous local learning (DCLL).

References

  1. 1.

    Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (MIT Press, Cambridge, 1998).

  2. 2.

    Pribram, K. H. A review of theory in physiological psychology. Annu. Rev. Psychol. 11, 1–40 (1960).

    Article  Google Scholar 

  3. 3.

    Janak, P. H. & Tye, K. M. From circuits to behaviour in the amygdala. Nature 517, 284–292 (2015).

    Article  Google Scholar 

  4. 4.

    Namburi, P. et al. A circuit mechanism for differentiating positive and negative associations. Nature 520, 675–678 (2015).

    Article  Google Scholar 

  5. 5.

    Paton, J. J., Belova, M. A., Morrison, S. E. & Salzman, C. D. The primate amygdala represents the positive and negative value of visual stimuli during learning. Nature 439, 865–870 (2006).

    Article  Google Scholar 

  6. 6.

    Hamid, A. A. et al. Mesolimbic dopamine signals the value of work. Nat. Neurosci. 19, 117–126 (2016).

    Article  Google Scholar 

  7. 7.

    Costa, V. D., Dal Monte, O., Lucas, D. R., Murray, E. A. & Averbeck, B. B. Amygdala and ventral striatum make distinct contributions to reinforcement learning. Neuron 92, 505–517 (2016).

    Article  Google Scholar 

  8. 8.

    Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming (Wiley, New York, 1994).

  9. 9.

    Bertsekas, D. P. Dynamic Programming and Optimal Control (Athena Scientific, Belmont, 1995).

  10. 10.

    Vapnik, V. The Nature of Statistical Learning Theory (Springer, New York, 2013).

  11. 11.

    Hessel, M. et al. Multi-task deep reinforcement learning with PopArt. Preprint at https://arxiv.org/abs/1809.04474 (2018).

  12. 12.

    Kirkpatrick, J. et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl Acad. Sci. USA 114, 3521–3526 (2017).

    MathSciNet  MATH  Article  Google Scholar 

  13. 13.

    Banino, A. et al. Vector-based navigation using grid-like representations in artificial agents. Nature 557, 429–433 (2018).

    Article  Google Scholar 

  14. 14.

    Mattar, M. G. & Daw, N. D. Prioritized memory access explains planning and hippocampal replay. Nat. Neurosci. 21, 1609–1617 (2018).

    Article  Google Scholar 

  15. 15.

    Rosenblatt, F. The perceptron: a probabilistic model for information-storage and organization in the brain. Psychol. Rev. 65, 386–408 (1958).

    Article  Google Scholar 

  16. 16.

    Hinton, G. E., Dayan, P., Frey, B. J. & Neal, R. M. The “wake-sleep” algorithm for unsupervised neural networks. Science 268, 1158–1161 (1995).

    Article  Google Scholar 

  17. 17.

    Rescorla, R. A. & Wagner, A. R. in Classical Conditioning II: Current Research and Theory (eds Black, A. H. & Prokasy, W. F.) 64–99 (Appleton-Century-Crofts, New York, 1972).

  18. 18.

    Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997).

    Article  Google Scholar 

  19. 19.

    Montague, P. R., Dayan, P. & Sejnowski, T. J. A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J. Neurosci. 16, 1936–1947 (1996).

    Article  Google Scholar 

  20. 20.

    Houk, J. C., Adamas, J. L. & Barto, A. G. in Models of Information Processing in the Basal Ganglia (eds Houk, J. C., Davis, J. L. & Beiser, D. G.) 249–274 (MIT Press, Cambridge, 1995).

  21. 21.

    Frank, M. J. Dynamic dopamine modulation in the basal ganglia: a neurocomputational account of cognitive deficits in medicated and nonmedicated Parkinsonism. J. Cogn. Neurosci. 17, 51–72 (2005).

    Article  Google Scholar 

  22. 22.

    Haber, S. N., Kim, K. S., Mailly, P. & Calzavara, R. Reward-related cortical inputs define a large striatal region in primates that interface with associative cortical connections, providing a substrate for incentive-based learning. J. Neurosci. 26, 8368–8376 (2006).

    Article  Google Scholar 

  23. 23.

    Mink, J. W. The basal ganglia: focused selection and inhibition of competing motor programs. Prog. Neurobiol. 50, 381–425 (1996).

    Article  Google Scholar 

  24. 24.

    Lau, B. & Glimcher, P. W. Value representations in the primate striatum during matching behavior. Neuron 58, 451–463 (2008).

    Article  Google Scholar 

  25. 25.

    O’Doherty, J. et al. Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science 304, 452–454 (2004).

    Article  Google Scholar 

  26. 26.

    Averbeck, B. B. & Costa, V. D. Motivational neural circuits underlying reinforcement learning. Nat. Neurosci. 20, 505–512 (2017).

    Article  Google Scholar 

  27. 27.

    Sutton, R. S. Learning to predict by the methods of temporal differences. Mach. Learn. 3, 9–44 (1988).

    Google Scholar 

  28. 28.

    Schultz, W. Dopamine reward prediction error coding. Dialog. Clin. Neurosci. 18, 23–32 (2016).

  29. 29.

    Steinberg, E. E. et al. A causal link between prediction errors, dopamine neurons and learning. Nat. Neurosci. 16, 966–973 (2013).

    Article  Google Scholar 

  30. 30.

    Saunders, B. T., Richard, J. M., Margolis, E. B. & Janak, P. H. Dopamine neurons create Pavlovian conditioned stimuli with circuit-defined motivational properties. Nat. Neurosci. 21, 1072–1083 (2018).

    Article  Google Scholar 

  31. 31.

    Sharpe, M. J. et al. Dopamine transients are sufficient and necessary for acquisition of model-based associations. Nat. Neurosci. 20, 735–742 (2017).

    Article  Google Scholar 

  32. 32.

    Averbeck, B. B., Sohn, J. W. & Lee, D. Activity in prefrontal cortex during dynamic selection of action sequences. Nat. Neurosci. 9, 276–282 (2006).

    Article  Google Scholar 

  33. 33.

    Seo, M., Lee, E. & Averbeck, B. B. Action selection and action value in frontal-striatal circuits. Neuron 74, 947–960 (2012).

    Article  Google Scholar 

  34. 34.

    Lee, E., Seo, M., Dal Monte, O. & Averbeck, B. B. Injection of a dopamine type 2 receptor antagonist into the dorsal striatum disrupts choices driven by previous outcomes, but not perceptual inference. J. Neurosci. 35, 6298–6306 (2015).

    Article  Google Scholar 

  35. 35.

    Averbeck, B. B., Lehman, J., Jacobson, M. & Haber, S. N. Estimates of projection overlap and zones of convergence within frontal-striatal circuits. J. Neurosci. 34, 9497–9505 (2014).

    Article  Google Scholar 

  36. 36.

    Rothenhoefer, K. M. et al. Effects of ventral striatum lesions on stimulus versus action based reinforcement learning. J. Neurosci. 37, 6902–6914 (2017).

    Article  Google Scholar 

  37. 37.

    Friedman, D. P., Aggleton, J. P. & Saunders, R. C. Comparison of hippocampal, amygdala, and perirhinal projections to the nucleus accumbens: combined anterograde and retrograde tracing study in the Macaque brain. J. Comp. Neurol. 450, 345–365 (2002).

    Article  Google Scholar 

  38. 38.

    Alexander, G. E., DeLong, M. R. & Strick, P. L. Parallel organization of functionally segregated circuits linking basal ganglia and cortex. Annu. Rev. Neurosci. 9, 357–381 (1986).

    Article  Google Scholar 

  39. 39.

    Averbeck, B. B. Amygdala and ventral striatum population codes implement multiple learning rates for reinforcement learning. In IEEE Symposium Series on Computational Intelligence (IEEE, 2017).

  40. 40.

    Jacobs, R. A., Jordan, M. I., Nowlan, S. J. & Hinton, G. E. Adaptive mixtures of local experts. Neural Comput. 3, 79–87 (1991).

    Article  Google Scholar 

  41. 41.

    Pfister, J. P., Toyoizumi, T., Barber, D. & Gerstner, W. Optimal spike-timing-dependent plasticity for precise action potential firing in supervised learning. Neural Comput. 18, 1318–1348 (2006).

    MathSciNet  MATH  Article  Google Scholar 

  42. 42.

    Benna, M. K. & Fusi, S. Computational principles of biological memory. Preprint at https://arxiv.org/abs/1507.07580 (2015).

  43. 43.

    Lahiri, S. & Ganguli, S. A memory frontier for complex synapses. In Advances in Neural Information Processing Systems Vol. 26 (eds Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z. & Weinberger, K. Q.) 1034–1042 (NIPS, 2013).

  44. 44.

    Koutnik, J., Greff, K., Gomez, F. & Schmidhuber, J. A clockwork RNN. Preprint at https://arxiv.org/abs/1402.3511 (2014).

  45. 45.

    Neil, D., M., P. & Liu, S.-C. Phased LSTM: accelerating recurrent network training for long or event-based sequences. In Advances in Neural Information Processing Systems Vol. 29 (eds Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I. & Garnett, R.) 3882–3890 (NIPS, 2016).

  46. 46.

    Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).

    Article  Google Scholar 

  47. 47.

    O’Reilly, R. C. & Frank, M. J. Making working memory work: a computational model of learning in the prefrontal cortex and basal ganglia. Neural Comput. 18, 283–328 (2006).

    MathSciNet  MATH  Article  Google Scholar 

  48. 48.

    Bishop, C. M. Pattern Recognition and Machine Learning (Springer, New York, 2006).

  49. 49.

    Bottou, L. & LeCun, Y. Large scale online learning. In Advances in Neural Information Processing Systems Vol. 16 (eds Thrun, S., Saul, L. K. & Schölkopf, B.) (NIPS, 2004).

  50. 50.

    McCloskey, M. & Cohen, N. J. in Psychology of Learning and Motivation : Advances in Research and Theory Vol. 24 (ed. Bower, G. H.) 109–165 (1989).

  51. 51.

    McClelland, J. L., McNaughton, B. L. & O’Reilly, R. C. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychol. Rev. 102, 419–457 (1995).

    Article  Google Scholar 

  52. 52.

    Kumaran, D., Hassabis, D. & McClelland, J. L. What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends Cogn. Sci. 20, 512–534 (2016).

    Article  Google Scholar 

  53. 53.

    Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).

    Article  Google Scholar 

  54. 54.

    Lin, L.-J. Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach. Learn. 8, 293–321 (1992).

    Google Scholar 

  55. 55.

    Zenke, F., Poole, B. & Ganguli, S. Continual learning through synaptic intelligence. Preprint at https://arxiv.org/abs/1703.04200 (2017).

  56. 56.

    Aljundi, R., Babiloni, F., Elhoseiny, M., Rohrbach, M. & Tuytelaars, T. Memory aware synapses: learning what (not) to forget. Preprint at https://arxiv.org/abs/1711.09601 (2017).

  57. 57.

    Daw, N. D., Niv, Y. & Dayan, P. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nat. Neurosci. 8, 1704–1711 (2005).

    Article  Google Scholar 

  58. 58.

    Costa, V. D., Tran, V. L., Turchi, J. & Averbeck, B. B. Reversal learning and dopamine: a Bayesian perspective. J. Neurosci. 35, 2407–2416 (2015).

  59. 59.

    Daw, N. D., Gershman, S. J., Seymour, B., Dayan, P. & Dolan, R. J. Model-based influences on humans’ choices and striatal prediction errors. Neuron 69, 1204–1215 (2011).

    Article  Google Scholar 

  60. 60.

    Glascher, J., Daw, N., Dayan, P. & O’Doherty, J. P. States versus rewards: dissociable neural prediction error signals underlying model-based and model-free reinforcement learning. Neuron 66, 585–595 (2010).

    Article  Google Scholar 

  61. 61.

    Doll, B. B., Simon, D. A. & Daw, N. D. The ubiquity of model-based reinforcement learning. Curr. Opin. Neurobiol. 22, 1075–1081 (2012).

    Article  Google Scholar 

  62. 62.

    Wunderlich, K., Smittenaar, P. & Dolan, R. J. Dopamine enhances model-based over model-free choice behavior. Neuron 75, 418–424 (2012).

    Article  Google Scholar 

  63. 63.

    Miller, E. K. The prefrontal cortex and cognitive control. Nat. Rev. Neurosci. 1, 59–65 (2000).

    Article  Google Scholar 

  64. 64.

    Balleine, B. W. & Dickinson, A. Goal-directed instrumental action: contingency and incentive learning and their cortical substrates. Neuropharmacology 37, 407–419 (1998).

    Article  Google Scholar 

  65. 65.

    Deserno, L. et al. Ventral striatal dopamine reflects behavioral and neural signatures of model-based control during sequential decision making. Proc. Natl Acad. Sci. USA 112, 1595–1600 (2015).

    Article  Google Scholar 

  66. 66.

    Harlow, H. F. The formation of learning sets. Psychol. Rev. 56, 51–65 (1949).

    Article  Google Scholar 

  67. 67.

    Iversen, S. D. & Mishkin, M. Perseverative interference in monkeys following selective lesions of the inferior prefrontal convexity. Exp. Brain Res. 11, 376–386 (1970).

    Article  Google Scholar 

  68. 68.

    Jang, A. I. et al. The role of frontal cortical and medial-temporal lobe brain areas in learning a Bayesian prior belief on reversals. J. Neurosci. 35, 11751–11760 (2015).

    Article  Google Scholar 

  69. 69.

    Wang, J. X. et al. Prefrontal cortex as a meta-reinforcement learning system. Nat. Neurosci. 21, 860–868 (2018).

    Article  Google Scholar 

  70. 70.

    Wilson, R. C., Takahashi, Y. K., Schoenbaum, G. & Niv, Y. Orbitofrontal cortex as a cognitive map of task space. Neuron 81, 267–279 (2014).

    Article  Google Scholar 

  71. 71.

    Schuck, N. W., Cai, M. B., Wilson, R. C. & Niv, Y. Human orbitofrontal cortex represents a cognitive map of state space. Neuron 91, 1402–1412 (2016).

    Article  Google Scholar 

  72. 72.

    DeGroot, M. H. Optimal Statistical Decisions (Wiley, Hoboken, 1970).

  73. 73.

    Starkweather, C. K., Babayan, B. M., Uchida, N. & Gershman, S. J. Dopamine reward prediction errors reflect hidden-state inference across time. Nat. Neurosci. 20, 581–589 (2017).

    Article  Google Scholar 

  74. 74.

    Starkweather, C. K., Gershman, S. J. & Uchida, N. The medial prefrontal cortex shapes dopamine reward prediction errors under state uncertainty. Neuron 98, 616–629 (2018).

    Article  Google Scholar 

  75. 75.

    Wang, X. J. Synaptic basis of cortical persistent activity: the importance of NMDA receptors to working memory. J. Neurosci. 19, 9587–9603 (1999).

    Article  Google Scholar 

  76. 76.

    Schöner, G. in The Cambridge Handbook of Computational Psychology (ed. Sun, R.) 101–126 (Cambridge Univ. Press, Cambridge, 2008).

  77. 77.

    Averbeck, B. B. Theory of choice in bandit, information sampling and foraging tasks. PLoS Comput. Biol. 11, e1004164 (2015).

  78. 78.

    Otto, A. R., Raio, C. M., Chiang, A., Phelps, E. A. & Daw, N. D. Working-memory capacity protects model-based learning from stress. Proc. Natl Acad. Sci. USA 110, 20941–20946 (2013).

    Article  Google Scholar 

  79. 79.

    Akam, T., Costa, R. & Dayan, P. Simple plans or sophisticated habits? State, transition and learning interactions in the two-step task. PLoS. Comput. Biol. 11, e1004648 (2015).

    Article  Google Scholar 

  80. 80.

    Riesenhuber, M. & Poggio, T. Hierarchical models of object recognition in cortex. Nat. Neurosci. 2, 1019–1025 (1999).

    Article  Google Scholar 

  81. 81.

    Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8, 229–256 (1992).

    MATH  Google Scholar 

  82. 82.

    Tesauro, G. Temporal difference learning and TD-Gammon. Commun. ACM 38, 58–68 (1995).

    Article  Google Scholar 

  83. 83.

    Pomerleau, D. A. ALVINN: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems Vol. 1 (ed. Touretzky, D. S.) (NIPS, 1988).

  84. 84.

    Levine, S., Finn, C., Darrell, T. & Abbeel, P. End-to-end training of deep visuomotor policies. J. Mach. Learn. Res. 17, 1334–1373 (2016).

    MathSciNet  MATH  Google Scholar 

  85. 85.

    Gu, S., Lillicrap, T., Sutskever, I. & Levine, S. Continuous deep Q-learning with model-based acceleration. In Proc. 33rd International Conference on Machine Learning Vol. 48 2829–2838 (PMLR, 2016).

  86. 86.

    Burda, Y., Edwards, H., Storkey, A. & Klimov, O. Exploration by random network distillation. Preprint at https://arxiv.org/abs/1810.12894 (2018).

  87. 87.

    Vezhnevets, A. S. et al. Feudal networks for hierarchical reinforcement learning. Preprint at https://arxiv.org/abs/1703.01161 (2017).

  88. 88.

    Neftci, E. O. Data and power efficient intelligence with neuromorphic learning machines. iScience 5, 52–68 (2018).

    Article  Google Scholar 

  89. 89.

    Kaiser, J., Mostafa, H. & Neftci, E. O. Synaptic plasticity dynamics for deep continuous local learning. Preprint at https://arxiv.org/abs/1811.10766 (2018).

  90. 90.

    Neftci, E. O., Augustine, C., Paul, S. & Detorakis, G. Event-driven random back-propagation: enabling neuromorphic deep learning machines. Front. Neurosci. 11, 324 (2017).

    Article  Google Scholar 

  91. 91.

    Lillicrap, T. P., Cownden, D., Tweed, D. B. & Akerman, C. J. Random synaptic feedback weights support error backpropagation for deep learning. Nat. Commun. 7, 13276 (2016).

    Article  Google Scholar 

  92. 92.

    Levine, S., Pastor, P., Krizhevsky, A., Ibarz, J. & Quillen, D. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. Int. J. Robot. Res. 37, 421–436 (2018).

  93. 93.

    Blundell, C. et al. Model-free episodic control. Preprint at https://arxiv.org/abs/1606.04460 (2016).

  94. 94.

    Gershman, S. J. & Daw, N. D. Reinforcement learning and episodic memory in humans and animals: an integrative framework. Ann. Rev. Psychol. 68, 101–128 (2017).

    Article  Google Scholar 

  95. 95.

    Finn, C., Abbeel, P. & Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. Preprint at https://arxiv.org/abs/1703.03400 (2017).

  96. 96.

    Ha, D. & Schmidhuber, J. World models. Preprint at https://arxiv.org/abs/1803.10122 (2018).

  97. 97.

    Zambaldi, V. et al. Relational deep reinforcement learning. Preprint at https://arxiv.org/abs/1806.01830 (2018).

  98. 98.

    Daw, N. D., O’Doherty, J. P., Dayan, P., Seymour, B. & Dolan, R. J. Cortical substrates for exploratory decisions in humans. Nature 441, 876–879 (2006).

  99. 99.

    Kriegeskorte, N. Deep neural networks: a new framework for modeling biological vision and brain information processing. Annu. Rev. Vision Sci. 1, 417–446 (2015).

  100. 100.

    Bernacchia, A., Seo, H., Lee, D. & Wang, X. J. A reservoir of time constants for memory traces in cortical neurons. Nat. Neurosci. 14, 366–372 (2011).

  101. 101.

    Walton, M. E., Behrens, T. E., Buckley, M. J., Rudebeck, P. H. & Rushworth, M. F. Separable learning systems in the macaque brain and the role of orbitofrontal cortex in contingent learning. Neuron 65, 927–939 (2010).

  102. 102.

    Iglesias, S. et al. Hierarchical prediction errors in midbrain and basal forebrain during sensory learning. Neuron 80, 519–530 (2013).

    Article  Google Scholar 

  103. 103.

    Badre, D. & Frank, M. J. Mechanisms of hierarchical reinforcement learning in cortico-striatal circuits 2: evidence from fMRI. Cereb. Cortex 22, 527–536 (2012).

    Article  Google Scholar 

  104. 104.

    Frank, M. J. & Badre, D. Mechanisms of hierarchical reinforcement learning in corticostriatal circuits 1: computational analysis. Cereb. Cortex 22, 509–526 (2012).

    Article  Google Scholar 

  105. 105.

    Botvinick, M. M. Hierarchical models of behavior and prefrontal function. Trends Cogn. Sci. 12, 201–208 (2008).

    Article  Google Scholar 

  106. 106.

    Botvinick, M. M., Niv, Y. & Barto, A. C. Hierarchically organized behavior and its neural foundations: a reinforcement learning perspective. Cognition 113, 262–280 (2009).

    Article  Google Scholar 

  107. 107.

    Ribas-Fernandes, J. J. et al. A neural signature of hierarchical reinforcement learning. Neuron 71, 370–379 (2011).

    Article  Google Scholar 

  108. 108.

    Botvinick, M. M. Hierarchical reinforcement learning and decision making. Curr. Opin. Neurobiol. 22, 956–962 (2012).

    Article  Google Scholar 

  109. 109.

    Botvinick, M. & Weinstein, A. Model-based hierarchical reinforcement learning and human action control. Philos. Trans. R. Soc. Lond. B 369, 20130480 (2014).

    Article  Google Scholar 

  110. 110.

    Dayan, P. & Hinton, G. E. Feudal reinforcement learning. In Advances in Neural Information Processing Systems Vol. 5 (eds Hanson, S. J., Cowan, J. D. & Giles, C. L.) 271–278 (NIPS, 1992).

  111. 111.

    Sutton, R. S., Precup, D. & Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artif. Intell. 112, 181–211 (1999).

    MathSciNet  MATH  Article  Google Scholar 

  112. 112.

    Bacon, P. L., Harb, J. & Precup, D. The option-critic architecture. Proc. Thirty-First AAAI Conference on Artificial Intelligence 1726–1734 (AAAI, 2017).

  113. 113.

    Jaderberg, M. et al. Reinforcement learning with unsupervised auxiliary tasks. Preprint at https://arxiv.org/abs/1611.05397 (2016).

  114. 114.

    Friston, K. The free-energy principle: a unified brain theory? Nat. Rev. Neurosci. 11, 127–138 (2010).

  115. 115.

    Ross, S., Gordon, G. J. & Bagnell, J. A. A reduction of imitation learning and structured prediction to no-regret online learning. Preprint at https://arxiv.org/abs/1011.0686 (2010).

  116. 116.

    Le, H. M. et al. Hierarchical imitation and reinforcement learning. Preprint at https://arxiv.org/abs/1803.00590 (2018).

  117. 117.

    Koechlin, E., Ody, C. & Kouneiher, F. The architecture of cognitive control in the human prefrontal cortex. Science 302, 1181–1185 (2003).

    Article  Google Scholar 

  118. 118.

    Badre, D. & D’Esposito, M. Functional magnetic resonance imaging evidence for a hierarchical organization of the prefrontal cortex. J. Cogn. Neurosci. 19, 2082–2099 (2007).

    Article  Google Scholar 

  119. 119.

    Muessgens, D., Thirugnanasambandam, N., Shitara, H., Popa, T. & Hallett, M. Dissociable roles of preSMA in motor sequence chunking and hand switching—a TMS study. J. Neurophysiol. 116, 2637–2646 (2016).

    Article  Google Scholar 

  120. 120.

    Sabour, S., Frosst, N. & Hinton, G. E. Dynamic routing between capsules. In Advances in Neural Information Processing Systems Vol. 30 (eds Guyon, I. et al.) 3856–3866 (2017).

  121. 121.

    Davies, M. et al. Loihi: a neuromorphic manycore processor with on-chip learning. IEEE Micro 38, 82–99 (2018).

    Article  Google Scholar 

  122. 122.

    Friedmann, S. & Schemmel, J. Demonstrating hybrid learning in a flexible neuromorphic hardware system. Preprint at https://arxiv.org/abs/1604.05080 (2016).

  123. 123.

    Qiao, N. et al. A reconfigurable on-line learning spiking neuromorphic processor comprising 256 neurons and 128K synapses. Front. Neurosci. 9, 141 (2015).

  124. 124.

    Neftci, E. et al. Synthesizing cognition in neuromorphic electronic systems. Proc. Natl Acad. Sci. USA 110, 3468–3476 (2013).

    Article  Google Scholar 

  125. 125.

    Friedmann, S., Fremaux, N., Schemmel, J., Gerstner, W. & Meier, K. Reward-based learning under hardware constraints-using a RISC processor embedded in a neuromorphic substrate. Front. Neurosci. 7, 160 (2013).

    Article  Google Scholar 

  126. 126.

    Hassabis, D., Kumaran, D., Summerfield, C. & Botvinick, M. Neuroscience-inspired artificial intelligence. Neuron 95, 245–258 (2017).

    Article  Google Scholar 

  127. 127.

    Courbariaux, M., Bengio, Y. & David, J.-P. Training deep neural networks with low precision multiplications. Preprint at https://arxiv.org/abs/1412.7024 (2014).

  128. 128.

    Detorakis, G. et al. Neural and synaptic array transceiver: a brain-inspired computing framework for embedded learning. Front. Neurosci. 12, 583 (2018).

    Article  Google Scholar 

  129. 129.

    Liu, S. C. & Delbruck, T. Neuromorphic sensory systems. Curr. Opin. Neurobiol. 20, 288–295 (2010).

    Article  Google Scholar 

Download references

Acknowledgements

This work was supported by the Intramural Research Program of the National Institute of Mental Health (ZIA MH002928-01), and by the National Science Foundation under grant 1640081.

Author information

Affiliations

Authors

Corresponding author

Correspondence to Bruno B. Averbeck.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Neftci, E.O., Averbeck, B.B. Reinforcement learning in artificial and biological systems. Nat Mach Intell 1, 133–143 (2019). https://doi.org/10.1038/s42256-019-0025-4

Download citation

Further reading

Search

Quick links

Nature Briefing

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing