Letter

Human-level control through deep reinforcement learning

  • Nature volume 518, pages 529–533 (26 February 2015)
  • doi:10.1038/nature14236

Abstract

The theory of reinforcement learning provides a normative account1, deeply rooted in psychological2 and neuroscientific3 perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems4,5, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms3. While reinforcement learning agents have achieved some successes in a variety of domains6,7,8, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks9,10,11 to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games12. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
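
At the core of the method is the standard Q-learning update: for each observed transition, the network's value estimate for the chosen action is nudged towards the temporal-difference target r + γ max_a′ Q(s′, a′). The sketch below is an illustration only, not the authors' code: it applies that same update in tabular form to a small chain environment invented here for the purpose, whereas the deep Q-network replaces the table with a convolutional network trained on raw pixels and stabilises learning with experience replay and a periodically updated target network.

    # A minimal tabular sketch of the temporal-difference update behind the deep
    # Q-network (DQN). The chain environment below is invented for illustration;
    # the paper replaces the Q table with a deep convolutional network.
    import numpy as np

    np.random.seed(0)

    N_STATES, N_ACTIONS = 5, 2           # toy chain: states 0..4, state 4 is terminal
    GAMMA, ALPHA, EPSILON = 0.99, 0.1, 0.1

    Q = np.zeros((N_STATES, N_ACTIONS))  # action-value estimates Q(s, a)

    def step(state, action):
        """Hypothetical environment: action 1 moves right, action 0 moves left.
        A reward of 1 is given only on reaching the rightmost (terminal) state."""
        next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
        done = next_state == N_STATES - 1
        reward = 1.0 if done else 0.0
        return next_state, reward, done

    for episode in range(500):
        state = 0
        for t in range(50):                        # finite-horizon episodes
            if np.random.rand() < EPSILON:         # epsilon-greedy behaviour policy
                action = np.random.randint(N_ACTIONS)
            else:                                  # greedy action, breaking ties randomly
                best = np.flatnonzero(Q[state] == Q[state].max())
                action = int(np.random.choice(best))
            next_state, reward, done = step(state, action)
            # TD target r + gamma * max_a' Q(s', a'); no bootstrapping at terminal states
            target = reward if done else reward + GAMMA * np.max(Q[next_state])
            Q[state, action] += ALPHA * (target - Q[state, action])
            state = next_state
            if done:
                break

    # Action 1 (move right) should end up with the higher value in every non-terminal state.
    print(np.round(Q, 2))

After a few hundred episodes the table approaches the optimal values, with the right-moving action dominating in every non-terminal state; the paper's contribution lies in making this same update work stably when Q is a deep network and the state is a stream of Atari frames.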


References

  1. Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (MIT Press, 1998)

  2. Thorndike, E. L. Animal Intelligence: Experimental Studies (Macmillan, 1911)

  3. Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997)

  4. Serre, T., Wolf, L. & Poggio, T. Object recognition with features inspired by visual cortex. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 994–1000 (2005)

  5. Fukushima, K. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern. 36, 193–202 (1980)

  6. Tesauro, G. Temporal difference learning and TD-Gammon. Commun. ACM 38, 58–68 (1995)

  7. Riedmiller, M., Gabel, T., Hafner, R. & Lange, S. Reinforcement learning for robot soccer. Auton. Robots 27, 55–73 (2009)

  8. Diuk, C., Cohen, A. & Littman, M. L. An object-oriented representation for efficient reinforcement learning. Proc. Int. Conf. Mach. Learn. 240–247 (2008)

  9. Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2, 1–127 (2009)

  10. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1106–1114 (2012)

  11. Hinton, G. E. & Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science 313, 504–507 (2006)

  12. Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. The arcade learning environment: An evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013)

  13. Legg, S. & Hutter, M. Universal Intelligence: a definition of machine intelligence. Minds Mach. 17, 391–444 (2007)

  14. Genesereth, M., Love, N. & Pell, B. General game playing: overview of the AAAI competition. AI Mag. 26, 62–72 (2005)

  15. Bellemare, M. G., Veness, J. & Bowling, M. Investigating contingency awareness using Atari 2600 games. Proc. Conf. AAAI Artif. Intell. 864–871 (2012)

  16. McClelland, J. L., Rumelhart, D. E. & the PDP Research Group. Parallel Distributed Processing: Explorations in the Microstructure of Cognition (MIT Press, 1986)

  17. LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998)

  18. Hubel, D. H. & Wiesel, T. N. Shape and arrangement of columns in cat’s striate cortex. J. Physiol. 165, 559–568 (1963)

  19. Watkins, C. J. C. H. & Dayan, P. Q-learning. Mach. Learn. 8, 279–292 (1992)

  20. Tsitsiklis, J. N. & Van Roy, B. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Contr. 42, 674–690 (1997)

  21. McClelland, J. L., McNaughton, B. L. & O'Reilly, R. C. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychol. Rev. 102, 419–457 (1995)

  22. O'Neill, J., Pleydell-Bouverie, B., Dupret, D. & Csicsvari, J. Play it again: reactivation of waking experience and memory. Trends Neurosci. 33, 220–229 (2010)

  23. Lin, L.-J. Reinforcement learning for robots using neural networks. Technical Report, DTIC Document (1993)

  24. Riedmiller, M. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. Mach. Learn.: ECML 3720, 317–328 (Springer, 2005)

  25. Van der Maaten, L. & Hinton, G. E. Visualizing high-dimensional data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)

  26. Lange, S. & Riedmiller, M. Deep auto-encoder neural networks in reinforcement learning. Proc. Int. Jt. Conf. Neural Netw. 1–8 (2010)

  27. Law, C.-T. & Gold, J. I. Reinforcement learning can account for associative and perceptual learning on a visual decision task. Nature Neurosci. 12, 655 (2009)

  28. Sigala, N. & Logothetis, N. K. Visual categorization shapes feature selectivity in the primate temporal cortex. Nature 415, 318–320 (2002)

  29. Bendor, D. & Wilson, M. A. Biasing the content of hippocampal replay during sleep. Nature Neurosci. 15, 1439–1444 (2012)

  30. Moore, A. W. & Atkeson, C. G. Prioritized sweeping: reinforcement learning with less data and less real time. Mach. Learn. 13, 103–130 (1993)

  31. Jarrett, K., Kavukcuoglu, K., Ranzato, M. & LeCun, Y. What is the best multi-stage architecture for object recognition? Proc. IEEE Int. Conf. Comput. Vis. 2146–2153 (2009)

  32. Nair, V. & Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. Proc. Int. Conf. Mach. Learn. 807–814 (2010)

  33. Kaelbling, L. P., Littman, M. L. & Cassandra, A. R. Planning and acting in partially observable stochastic domains. Artificial Intelligence 101, 99–134 (1994)


Acknowledgements

We thank G. Hinton, P. Dayan and M. Bowling for discussions, A. Cain and J. Keene for work on the visuals, K. Keller and P. Rogers for help with the visuals, G. Wayne for comments on an earlier version of the manuscript, and the rest of the DeepMind team for their support, ideas and encouragement.

Author information

Author notes

    • Volodymyr Mnih
    • Koray Kavukcuoglu
    • David Silver

    These authors contributed equally to this work.

Affiliations

  1. Google DeepMind, 5 New Street Square, London EC4A 3TW, UK

    • Volodymyr Mnih
    • Koray Kavukcuoglu
    • David Silver
    • Andrei A. Rusu
    • Joel Veness
    • Marc G. Bellemare
    • Alex Graves
    • Martin Riedmiller
    • Andreas K. Fidjeland
    • Georg Ostrovski
    • Stig Petersen
    • Charles Beattie
    • Amir Sadik
    • Ioannis Antonoglou
    • Helen King
    • Dharshan Kumaran
    • Daan Wierstra
    • Shane Legg
    • Demis Hassabis


Contributions

V.M., K.K., D.S., J.V., M.G.B., M.R., A.G., D.W., S.L. and D.H. conceptualized the problem and the technical framework. V.M., K.K., A.A.R. and D.S. developed and tested the algorithms. J.V., S.P., C.B., A.A.R., M.G.B., I.A., A.K.F., G.O. and A.S. created the testing platform. K.K., H.K., S.L. and D.H. managed the project. K.K., D.K., D.H., V.M., D.S., A.G., A.A.R., J.V. and M.G.B. wrote the paper.

Competing interests

The authors declare no competing financial interests.

Corresponding authors

Correspondence to Koray Kavukcuoglu or Demis Hassabis.

Supplementary information

PDF files

  1. Supplementary Information

    This file contains a Supplementary Discussion.

Videos

  1. Performance of DQN in the Game Space Invaders

    This video shows the performance of the DQN agent while playing the game of Space Invaders. The DQN agent successfully clears the enemy ships from the screen as they move down and sideways with gradually increasing speed.

  2. Demonstration of Learning Progress in the Game Breakout

    This video shows the improvement in the performance of DQN over training (i.e. after 100, 200, 400 and 600 episodes). After 600 episodes DQN finds and exploits the optimal strategy in this game, which is to make a tunnel around the side and then allow the ball to hit blocks by bouncing behind the wall. Note: the score is displayed at the top left of the screen (the maximum for clearing one screen is 448 points), the number of lives remaining is shown in the middle (starting with 5 lives), and the “1” at the top right indicates that this is a 1-player game.
