Extended Data Table 1: List of hyperparameters and their values

From Human-level control through deep reinforcement learning

  • Volodymyr Mnih1, n1
  • Koray Kavukcuoglu1, n1
  • David Silver1, n1
  • Andrei A. Rusu1
  • Joel Veness1
  • Marc G. Bellemare1
  • Alex Graves1
  • Martin Riedmiller1
  • Andreas K. Fidjeland1
  • Georg Ostrovski1
  • Stig Petersen1
  • Charles Beattie1
  • Amir Sadik1
  • Ioannis Antonoglou1
  • Helen King1
  • Dharshan Kumaran1
  • Daan Wierstra1
  • Shane Legg1
  • Demis Hassabis1
Journal name: Nature
Volume: 518
Pages: 529–533
Date published: 26 February 2015
DOI: 10.1038/nature14236


The values of all the hyperparameters were selected by performing an informal search on the games Pong, Breakout, Seaquest, Space Invaders and Beam Rider. We did not perform a systematic grid search owing to the high computational cost, although it is conceivable that even better results could be obtained by systematically tuning the hyperparameter values.
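
As a point of contrast with the informal search described above, the sketch below shows what a systematic grid search over such hyperparameters might look like. It is a minimal illustration only: the grid values and the train_and_evaluate helper are hypothetical stand-ins, not part of the published method, and a real sweep of this kind is exactly what the authors describe as computationally prohibitive.

from itertools import product
import random

# Hypothetical grid: these values are illustrative, not taken from the paper.
grid = {
    "learning_rate": [0.00025, 0.0005, 0.001],
    "discount_factor": [0.95, 0.99],
    "minibatch_size": [16, 32, 64],
}

validation_games = ["Pong", "Breakout", "Seaquest", "Space Invaders", "Beam Rider"]

def train_and_evaluate(params, games):
    # Stand-in for a full DQN training run followed by evaluation on each
    # game; returns a mean score. Real runs at every grid point are what
    # would make this search computationally costly.
    return random.random()

best_params, best_score = None, float("-inf")
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    score = train_and_evaluate(params, validation_games)
    if score > best_score:
        best_params, best_score = params, score

print(best_params, best_score)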


Additional data

Author footnotes

  1. These authors contributed equally to this work.

    • Volodymyr Mnih,
    • Koray Kavukcuoglu &
    • David Silver

Affiliations

  1. Google DeepMind, 5 New Street Square, London EC4A 3TW, UK

    • Volodymyr Mnih,
    • Koray Kavukcuoglu,
    • David Silver,
    • Andrei A. Rusu,
    • Joel Veness,
    • Marc G. Bellemare,
    • Alex Graves,
    • Martin Riedmiller,
    • Andreas K. Fidjeland,
    • Georg Ostrovski,
    • Stig Petersen,
    • Charles Beattie,
    • Amir Sadik,
    • Ioannis Antonoglou,
    • Helen King,
    • Dharshan Kumaran,
    • Daan Wierstra,
    • Shane Legg &
    • Demis Hassabis

Contributions

V.M., K.K., D.S., J.V., M.G.B., M.R., A.G., D.W., S.L. and D.H. conceptualized the problem and the technical framework. V.M., K.K., A.A.R. and D.S. developed and tested the algorithms. J.V., S.P., C.B., A.A.R., M.G.B., I.A., A.K.F., G.O. and A.S. created the testing platform. K.K., H.K., S.L. and D.H. managed the project. K.K., D.K., D.H., V.M., D.S., A.G., A.A.R., J.V. and M.G.B. wrote the paper.

Competing financial interests

The authors declare no competing financial interests.

Corresponding authors

Correspondence to:

  • Koray Kavukcuoglu or
  • Demis Hassabis

  • Extended Data Figure 1: Two-dimensional t-SNE embedding of the representations in the last hidden layer assigned by DQN to game states experienced during a combination of human and agent play in Space Invaders.

    The plot was generated by running the t-SNE algorithm (ref. 25) on the last hidden layer representation assigned by DQN to game states experienced during a combination of human (30 min) and agent (2 h) play. The fact that there is similar structure in the two-dimensional embeddings corresponding to the DQN representation of states experienced during human play (orange points) and DQN play (blue points) suggests that the representations learned by DQN do indeed generalize to data generated from policies other than its own. The presence in the t-SNE embedding of overlapping clusters of points corresponding to the network representation of states experienced during human and agent play shows that the DQN agent also follows sequences of states similar to those found in human play. Screenshots corresponding to selected states are shown (human: orange border; DQN: blue border). A minimal code sketch of this procedure is given after this list.


  • Extended Data Figure 2: Visualization of learned value functions on two games, Breakout and Pong.

    a, A visualization of the learned value function on the game Breakout. At time points 1 and 2, the state value is predicted to be ~17 and the agent is clearing the bricks at the lowest level. Each of the peaks in the value function curve corresponds to a reward obtained by clearing a brick. At time point 3, the agent is about to break through to the top level of bricks and the value increases to ~21 in anticipation of breaking out and clearing a large set of bricks. At point 4, the value is above 23 and the agent has broken through. After this point, the ball will bounce at the upper part of the bricks, clearing many of them by itself. b, A visualization of the learned action-value function on the game Pong. At time point 1, the ball is moving towards the paddle controlled by the agent on the right side of the screen and the values of all actions are around 0.7, reflecting the expected value of this state based on previous experience. At time point 2, the agent starts moving the paddle towards the ball and the value of the ‘up’ action stays high while the value of the ‘down’ action falls to −0.9. This reflects the fact that pressing ‘down’ would lead to the agent losing the ball and incurring a reward of −1. At time point 3, the agent hits the ball by pressing ‘up’ and the expected reward keeps increasing until time point 4, when the ball reaches the left edge of the screen and the values of all actions reflect that the agent is about to receive a reward of 1. Note that the dashed line shows the past trajectory of the ball purely for illustrative purposes (that is, it is not shown during the game). A sketch of how such value traces can be recorded is given after this list. With permission from Atari Interactive, Inc.


  • Video 1: Performance of DQN in the Game Space Invaders
    This video shows the performance of the DQN agent while playing the game of Space Invaders. The agent successfully clears the enemy ships on the screen while they move down and sideways with gradually increasing speed.


  • Video 2: Demonstration of Learning Progress in the Game Breakout
    This video shows the improvement in the performance of DQN over training (that is, after 100, 200, 400 and 600 episodes). After 600 episodes DQN finds and exploits the optimal strategy in this game, which is to make a tunnel around the side and then allow the ball to hit blocks by bouncing behind the wall. Note that the score is displayed at the top left of the screen (the maximum for clearing one screen is 448 points), the number of lives remaining is shown in the middle (starting with 5 lives), and the “1” at the top right indicates that this is a 1-player game.


  • Extended Data Table 1: List of hyperparameters and their values

  • Extended Data Table 2: Comparison of game scores obtained by DQN agents with methods from the literature (refs 12, 15) and a professional human games tester

    Best Linear Learner is the best result obtained by a linear function approximator on different types of hand-designed features (ref. 12). Contingency (SARSA) agent figures are the results obtained in ref. 15. Note that the figures in the last column indicate the performance of DQN relative to the human games tester, expressed as a percentage: 100 × (DQN score − random play score)/(human score − random play score). A small helper computing this normalized score is given after this list.


  • Extended Data Table 3: The effects of replay and separating the target Q-network

    DQN agents were trained for 10 million frames using standard hyperparameters for all possible combinations of turning replay on or off, using or not using a separate target Q-network, and three different learning rates. Each agent was evaluated every 250,000 training frames for 135,000 validation frames, and the highest average episode score is reported. Note that these evaluation episodes were not truncated at 5 min, leading to higher scores on Enduro than those reported in Extended Data Table 2. Note also that the number of training frames (10 million) was smaller than in the main results presented in Extended Data Table 2 (50 million frames). A sketch of this ablation grid is given after this list.


  • Extended Data Table 4: Comparison of DQN performance with linear function approximator

    The performance of the DQN agent is compared with that of a linear function approximator on the five validation games (that is, a single linear layer was used instead of the convolutional network, in combination with replay and a separate target network). Agents were trained for 10 million frames using standard hyperparameters and three different learning rates. Each agent was evaluated every 250,000 training frames for 135,000 validation frames, and the highest average episode score is reported. Note that these evaluation episodes were not truncated at 5 min, leading to higher scores on Enduro than those reported in Extended Data Table 2. Note also that the number of training frames (10 million) was smaller than in the main results presented in Extended Data Table 2 (50 million frames). A sketch of this linear baseline is given after this list.

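The following is a minimal sketch of the visualization procedure behind Extended Data Figure 1, assuming the last-hidden-layer activations and the human/agent origin labels are already available. The random stand-in data and array shapes are assumptions for illustration; only the overall recipe (t-SNE on DQN's final hidden layer representations, points coloured by play origin) comes from the figure legend.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-in data: in the real analysis these would be DQN's last-hidden-layer
# activations for states from 30 min of human play and 2 h of agent play.
rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 512))  # hypothetical feature matrix
is_human = rng.random(1000) < 0.2           # hypothetical origin labels

# Two-dimensional t-SNE embedding of the representations
embedding = TSNE(n_components=2, random_state=0).fit_transform(activations)

plt.scatter(*embedding[~is_human].T, s=4, c="tab:blue", label="DQN play")
plt.scatter(*embedding[is_human].T, s=4, c="tab:orange", label="human play")
plt.legend()
plt.show()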
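The value curves in Extended Data Figure 2 can be thought of as a per-frame log of the network's action values while the emulator is stepped with the greedy policy. The sketch below uses trivial stand-in QNetwork and Env classes so that it runs; neither is the paper's implementation.

import random
import matplotlib.pyplot as plt

class QNetwork:
    """Stand-in: returns fake action values for a 4-action game."""
    def __call__(self, state):
        return [random.uniform(-1, 1) for _ in range(4)]

class Env:
    """Stand-in emulator with a (state, reward, done) step interface."""
    def reset(self):
        return 0
    def step(self, action):
        return 0, 0.0, random.random() < 0.001

q, env = QNetwork(), Env()
state, values, done = env.reset(), [], False
while not done and len(values) < 2000:
    action_values = q(state)
    values.append(max(action_values))                 # state value, max_a Q(s, a)
    greedy = action_values.index(max(action_values))  # greedy action
    state, _, done = env.step(greedy)

plt.plot(values)
plt.xlabel("frame")
plt.ylabel("max_a Q(s, a)")
plt.show()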
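The last column of Extended Data Table 2 uses the human-normalized score defined in its legend, 100 × (DQN score − random play score)/(human score − random play score). A small helper, with made-up example numbers:

def human_normalized_score(agent, human, random_play):
    """Percentage of the random-to-human score range achieved by the agent:
    0 corresponds to random play, 100 to human level, and above 100 to
    better-than-human performance."""
    return 100.0 * (agent - random_play) / (human - random_play)

# Illustrative values only, not scores from the paper:
print(human_normalized_score(agent=400.0, human=300.0, random_play=100.0))  # 150.0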
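The ablation in Extended Data Table 3 can be read as a small experiment grid: every combination of replay on/off and a separate target Q-network on/off, each at three learning rates. The sketch below enumerates that grid; train_dqn is a hypothetical stand-in and the specific learning-rate values are assumptions, not taken from the paper.

from itertools import product

def train_dqn(replay, target_network, learning_rate,
              train_frames=10_000_000, eval_every=250_000, eval_frames=135_000):
    # Stand-in for a run that trains for train_frames, evaluates every
    # eval_every training frames for eval_frames frames, and returns the
    # highest average episode score observed.
    return 0.0

results = {}
for replay, target_net, lr in product([True, False], [True, False],
                                      [0.00025, 0.0005, 0.001]):
    results[(replay, target_net, lr)] = train_dqn(replay, target_net, lr)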
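The linear baseline in Extended Data Table 4 replaces the convolutional network with a single linear layer from the preprocessed input stack to the action values, keeping replay and the separate target network. A minimal sketch, where the 4 × 84 × 84 input shape follows the paper's preprocessing but the initialization details are assumptions:

import numpy as np

class LinearQ:
    """A single linear layer mapping the stacked frames to Q-values."""
    def __init__(self, n_actions, input_dim=4 * 84 * 84, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.01, size=(n_actions, input_dim))
        self.b = np.zeros(n_actions)

    def __call__(self, state_stack):
        # state_stack: the last four preprocessed 84 x 84 frames
        return self.W @ state_stack.reshape(-1) + self.b  # one Q-value per action

q = LinearQ(n_actions=4)
print(q(np.zeros((4, 84, 84))))  # four Q-values, all zero for this input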
