Human-level control through deep reinforcement learning

Mnih, Volodymyr; Kavukcuoglu, Koray; Silver, David; Rusu, Andrei A.; Veness, Joel; Bellemare, Marc G.; Graves, Alex; Riedmiller, Martin; Fidjeland, Andreas K.; Ostrovski, Georg; Petersen, Stig; Beattie, Charles; Sadik, Amir; Antonoglou, Ioannis; King, Helen; Kumaran, Dharshan; Wierstra, Daan; Legg, Shane; Hassabis, Demis

doi:10.1038/nature14236

Letter
Published: 25 February 2015

Human-level control through deep reinforcement learning

Volodymyr Mnih¹^na1,
Koray Kavukcuoglu¹^na1,
David Silver¹^na1,
Andrei A. Rusu¹,
Joel Veness¹,
Marc G. Bellemare¹,
Alex Graves¹,
Martin Riedmiller¹,
Andreas K. Fidjeland¹,
Georg Ostrovski¹,
Stig Petersen¹,
Charles Beattie¹,
Amir Sadik¹,
Ioannis Antonoglou¹,
Helen King¹,
Dharshan Kumaran¹,
Daan Wierstra¹,
Shane Legg¹ &
…
Demis Hassabis¹

Nature volume 518, pages 529–533 (2015)Cite this article

509k Accesses
12k Citations
1545 Altmetric
Metrics details

Subjects

Computer science

Abstract

The theory of reinforcement learning provides a normative account¹, deeply rooted in psychological² and neuroscientific³ perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems^4,5, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms³. While reinforcement learning agents have achieved some successes in a variety of domains^6,7,8, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks^9,10,11 to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games¹². We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.

Access through your institution

Buy or subscribe

This is a preview of subscription content, access via your institution

Access options

Access through your institution

Buy this article

Purchase on Springer Link
Instant access to full article PDF

Buy now

Prices may be subject to local taxes which are calculated during checkout

**Figure 1: Schematic illustration of the convolutional neural network.**

**Figure 2: Training curves tracking the agent’s average score and average predicted action-value.**

**Figure 3: Comparison of the DQN agent with the best reinforcement learning methods¹⁵ in the literature.**

**Figure 4: Two-dimensional t-SNE embedding of the representations in the last hidden layer assigned by DQN to game states experienced while playing Space Invaders.**

Using deep neural networks as a guide for modeling human planning

Article Open access 20 November 2023

Grandmaster level in StarCraft II using multi-agent reinforcement learning

Article 30 October 2019

Mastering Atari, Go, chess and shogi by planning with a learned model

Article 23 December 2020

References

Sutton, R. & Barto, A. Reinforcement Learning: An Introduction (MIT Press, 1998)
MATH Google Scholar
Thorndike, E. L. Animal Intelligence: Experimental studies (Macmillan, 1911)
Book Google Scholar
Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997)
Article CAS Google Scholar
Serre, T., Wolf, L. & Poggio, T. Object recognition with features inspired by visual cortex. Proc. IEEE. Comput. Soc. Conf. Comput. Vis. Pattern. Recognit. 994–1000 (2005)
Google Scholar
Fukushima, K. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern. 36, 193–202 (1980)
Article CAS Google Scholar
Tesauro, G. Temporal difference learning and TD-Gammon. Commun. ACM 38, 58–68 (1995)
Article Google Scholar
Riedmiller, M., Gabel, T., Hafner, R. & Lange, S. Reinforcement learning for robot soccer. Auton. Robots 27, 55–73 (2009)
Article Google Scholar
Diuk, C., Cohen, A. & Littman, M. L. An object-oriented representation for efficient reinforcement learning. Proc. Int. Conf. Mach. Learn. 240–247 (2008)
Google Scholar
Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2, 1–127 (2009)
Article Google Scholar
Krizhevsky, A., Sutskever, I. & Hinton, G. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1106–1114 (2012)
Google Scholar
Hinton, G. E. & Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science 313, 504–507 (2006)
Article ADS MathSciNet CAS Google Scholar
Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. The arcade learning environment: An evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013)
Article Google Scholar
Legg, S. & Hutter, M. Universal Intelligence: a definition of machine intelligence. Minds Mach. 17, 391–444 (2007)
Article Google Scholar
Genesereth, M., Love, N. & Pell, B. General game playing: overview of the AAAI competition. AI Mag. 26, 62–72 (2005)
Google Scholar
Bellemare, M. G., Veness, J. & Bowling, M. Investigating contingency awareness using Atari 2600 games. Proc. Conf. AAAI. Artif. Intell. 864–871 (2012)
Google Scholar
McClelland, J. L., Rumelhart, D. E. & Group, T. P. R. Parallel Distributed Processing: Explorations in the Microstructure of Cognition (MIT Press, 1986)
Google Scholar
LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998)
Article Google Scholar
Hubel, D. H. & Wiesel, T. N. Shape and arrangement of columns in cat’s striate cortex. J. Physiol. 165, 559–568 (1963)
Article CAS Google Scholar
Watkins, C. J. & Dayan, P. Q-learning. Mach. Learn. 8, 279–292 (1992)
MATH Google Scholar
Tsitsiklis, J. & Roy, B. V. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Contr. 42, 674–690 (1997)
Article MathSciNet Google Scholar
McClelland, J. L., McNaughton, B. L. & O’Reilly, R. C. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychol. Rev. 102, 419–457 (1995)
Article Google Scholar
O’Neill, J., Pleydell-Bouverie, B., Dupret, D. & Csicsvari, J. Play it again: reactivation of waking experience and memory. Trends Neurosci. 33, 220–229 (2010)
Article Google Scholar
Lin, L.-J. Reinforcement learning for robots using neural networks. Technical Report, DTIC Document. (1993)
Riedmiller, M. Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. Mach. Learn.: ECML 3720, 317–328 (Springer, 2005)
Google Scholar
Van der Maaten, L. J. P. & Hinton, G. E. Visualizing high-dimensional data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)
MATH Google Scholar
Lange, S. & Riedmiller, M. Deep auto-encoder neural networks in reinforcement learning. Proc. Int. Jt. Conf. Neural. Netw. 1–8 (2010)
Google Scholar
Law, C.-T. & Gold, J. I. Reinforcement learning can account for associative and perceptual learning on a visual decision task. Nature Neurosci. 12, 655 (2009)
Article CAS Google Scholar
Sigala, N. & Logothetis, N. K. Visual categorization shapes feature selectivity in the primate temporal cortex. Nature 415, 318–320 (2002)
Article ADS CAS Google Scholar
Bendor, D. & Wilson, M. A. Biasing the content of hippocampal replay during sleep. Nature Neurosci. 15, 1439–1444 (2012)
Article CAS Google Scholar
Moore, A. & Atkeson, C. Prioritized sweeping: reinforcement learning with less data and less real time. Mach. Learn. 13, 103–130 (1993)
Google Scholar
Jarrett, K., Kavukcuoglu, K., Ranzato, M. A. & LeCun, Y. What is the best multi-stage architecture for object recognition? Proc. IEEE. Int. Conf. Comput. Vis. 2146–2153 (2009)
Google Scholar
Nair, V. & Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. Proc. Int. Conf. Mach. Learn. 807–814 (2010)
Kaelbling, L. P., Littman, M. L. & Cassandra, A. R. Planning and acting in partially observable stochastic domains. Artificial Intelligence 101, 99–134 (1994)
Article MathSciNet Google Scholar

Download references

Acknowledgements

We thank G. Hinton, P. Dayan and M. Bowling for discussions, A. Cain and J. Keene for work on the visuals, K. Keller and P. Rogers for help with the visuals, G. Wayne for comments on an earlier version of the manuscript, and the rest of the DeepMind team for their support, ideas and encouragement.

Author information

Volodymyr Mnih, Koray Kavukcuoglu and David Silver: These authors contributed equally to this work.

Authors and Affiliations

Google DeepMind, 5 New Street Square, London EC4A 3TW, UK,
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg & Demis Hassabis

Authors

Volodymyr Mnih
View author publications
You can also search for this author in PubMed Google Scholar
Koray Kavukcuoglu
View author publications
You can also search for this author in PubMed Google Scholar
David Silver
View author publications
You can also search for this author in PubMed Google Scholar
Andrei A. Rusu
View author publications
You can also search for this author in PubMed Google Scholar
Joel Veness
View author publications
You can also search for this author in PubMed Google Scholar
Marc G. Bellemare
View author publications
You can also search for this author in PubMed Google Scholar
Alex Graves
View author publications
You can also search for this author in PubMed Google Scholar
Martin Riedmiller
View author publications
You can also search for this author in PubMed Google Scholar
Andreas K. Fidjeland
View author publications
You can also search for this author in PubMed Google Scholar
Georg Ostrovski
View author publications
You can also search for this author in PubMed Google Scholar
Stig Petersen
View author publications
You can also search for this author in PubMed Google Scholar
Charles Beattie
View author publications
You can also search for this author in PubMed Google Scholar
Amir Sadik
View author publications
You can also search for this author in PubMed Google Scholar
Ioannis Antonoglou
View author publications
You can also search for this author in PubMed Google Scholar
Helen King
View author publications
You can also search for this author in PubMed Google Scholar
Dharshan Kumaran
View author publications
You can also search for this author in PubMed Google Scholar
Daan Wierstra
View author publications
You can also search for this author in PubMed Google Scholar
Shane Legg
View author publications
You can also search for this author in PubMed Google Scholar
Demis Hassabis
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

V.M., K.K., D.S., J.V., M.G.B., M.R., A.G., D.W., S.L. and D.H. conceptualized the problem and the technical framework. V.M., K.K., A.A.R. and D.S. developed and tested the algorithms. J.V., S.P., C.B., A.A.R., M.G.B., I.A., A.K.F., G.O. and A.S. created the testing platform. K.K., H.K., S.L. and D.H. managed the project. K.K., D.K., D.H., V.M., D.S., A.G., A.A.R., J.V. and M.G.B. wrote the paper.

Corresponding authors

Correspondence to Koray Kavukcuoglu or Demis Hassabis.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Extended data figures and tables

Extended Data Figure 1 Two-dimensional t-SNE embedding of the representations in the last hidden layer assigned by DQN to game states experienced during a combination of human and agent play in Space Invaders.

The plot was generated by running the t-SNE algorithm²⁵ on the last hidden layer representation assigned by DQN to game states experienced during a combination of human (30 min) and agent (2 h) play. The fact that there is similar structure in the two-dimensional embeddings corresponding to the DQN representation of states experienced during human play (orange points) and DQN play (blue points) suggests that the representations learned by DQN do indeed generalize to data generated from policies other than its own. The presence in the t-SNE embedding of overlapping clusters of points corresponding to the network representation of states experienced during human and agent play shows that the DQN agent also follows sequences of states similar to those found in human play. Screenshots corresponding to selected states are shown (human: orange border; DQN: blue border).

Extended Data Figure 2 Visualization of learned value functions on two games, Breakout and Pong.

a, A visualization of the learned value function on the game Breakout. At time points 1 and 2, the state value is predicted to be ∼17 and the agent is clearing the bricks at the lowest level. Each of the peaks in the value function curve corresponds to a reward obtained by clearing a brick. At time point 3, the agent is about to break through to the top level of bricks and the value increases to ∼21 in anticipation of breaking out and clearing a large set of bricks. At point 4, the value is above 23 and the agent has broken through. After this point, the ball will bounce at the upper part of the bricks clearing many of them by itself. b, A visualization of the learned action-value function on the game Pong. At time point 1, the ball is moving towards the paddle controlled by the agent on the right side of the screen and the values of all actions are around 0.7, reflecting the expected value of this state based on previous experience. At time point 2, the agent starts moving the paddle towards the ball and the value of the ‘up’ action stays high while the value of the ‘down’ action falls to −0.9. This reflects the fact that pressing ‘down’ would lead to the agent losing the ball and incurring a reward of −1. At time point 3, the agent hits the ball by pressing ‘up’ and the expected reward keeps increasing until time point 4, when the ball reaches the left edge of the screen and the value of all actions reflects that the agent is about to receive a reward of 1. Note, the dashed line shows the past trajectory of the ball purely for illustrative purposes (that is, not shown during the game). With permission from Atari Interactive, Inc.

Extended Data Table 1 List of hyperparameters and their values

Full size table

Extended Data Table 2 Comparison of games scores obtained by DQN agents with methods from the literature^12,15 and a professional human games tester

Full size table

Extended Data Table 3 The effects of replay and separating the target Q-network

Full size table

Extended Data Table 4 Comparison of DQN performance with linear function approximator

Full size table

Supplementary information

Supplementary Information

This file contains a Supplementary Discussion. (PDF 110 kb)

Performance of DQN in the Game Space Invaders

This video shows the performance of the DQN agent while playing the game of Space Invaders. The DQN agent successfully clears the enemy ships on the screen while the enemy ships move down and sideways with gradually increasing speed. (MOV 5106 kb)

Demonstration of Learning Progress in the Game Breakout

This video shows the improvement in the performance of DQN over training (i.e. after 100, 200, 400 and 600 episodes). After 600 episodes DQN finds and exploits the optimal strategy in this game, which is to make a tunnel around the side, and then allow the ball to hit blocks by bouncing behind the wall. Note: the score is displayed at the top left of the screen (maximum for clearing one screen is 448 points), number of lives remaining is shown in the middle (starting with 5 lives), and the “1” on the top right indicates this is a 1-player game. (MOV 1500 kb)

PowerPoint slides

PowerPoint slide for Fig. 1

PowerPoint slide for Fig. 2

PowerPoint slide for Fig. 3

PowerPoint slide for Fig. 4

Rights and permissions

Reprints and permissions

About this article

Cite this article

Mnih, V., Kavukcuoglu, K., Silver, D. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015). https://doi.org/10.1038/nature14236

Download citation

Received: 10 July 2014
Accepted: 16 January 2015
Published: 25 February 2015
Issue Date: 26 February 2015
DOI: https://doi.org/10.1038/nature14236

This article is cited by

Deep learning models to predict the editing efficiencies and outcomes of diverse base editors
- Nahye Kim
- Sungchul Choi
- Hyongbum Henry Kim
Nature Biotechnology (2024)
Optimizing warfarin dosing for patients with atrial fibrillation using machine learning
- Jeremy Petch
- Walter Nelson
- Stuart J. Connolly
Scientific Reports (2024)
Spectrum-efficient user grouping and resource allocation based on deep reinforcement learning for mmWave massive MIMO-NOMA systems
- Minghao Wang
- Xin Liu
- Minglu Jin
Scientific Reports (2024)
Assuring Efficient Path Selection in an Intent-Based Networking System: A Graph Neural Networks and Deep Reinforcement Learning Approach
- Sajid Alam
- Javier Jose Diaz Rivera
- Wang-Cheol Song
Journal of Network and Systems Management (2024)
Reward Function Design Method for Long Episode Pursuit Tasks Under Polar Coordinate in Multi-Agent Reinforcement Learning
- Yubo Dong
- Tao Cui
- Peng Dong
Journal of Shanghai Jiaotong University (Science) (2024)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Subjects

Abstract

Access options

Similar content being viewed by others

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Competing interests

Extended data figures and tables

Supplementary information

PowerPoint slides

Rights and permissions

About this article

Cite this article

Share this article

This article is cited by

Comments

Search

Quick links