
# First return, then explore

## Abstract

Reinforcement learning promises to solve complex sequential-decision problems autonomously by specifying a high-level reward function only. However, reinforcement learning algorithms struggle when, as is often the case, simple and intuitive rewards provide sparse[1] and deceptive[2] feedback. Avoiding these pitfalls requires a thorough exploration of the environment, but creating algorithms that can do so remains one of the central challenges of the field. Here we hypothesize that the main impediment to effective exploration originates from algorithms forgetting how to reach previously visited states (detachment) and failing to first return to a state before exploring from it (derailment). We introduce Go-Explore, a family of algorithms that addresses these two challenges directly through the simple principles of explicitly ‘remembering’ promising states and returning to such states before intentionally exploring. Go-Explore solves all previously unsolved Atari games and surpasses the state of the art on all hard-exploration games[1], with orders-of-magnitude improvements on the grand challenges of Montezuma’s Revenge and Pitfall. We also demonstrate the practical potential of Go-Explore on a sparse-reward pick-and-place robotics task. Additionally, we show that adding a goal-conditioned policy can further improve Go-Explore’s exploration efficiency and enable it to handle stochasticity throughout training. The substantial performance gains from Go-Explore suggest that the simple principles of remembering states, returning to them, and exploring from them are a powerful and general approach to exploration—an insight that may prove critical to the creation of truly intelligent learning agents.
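As a concrete illustration of these principles, the following is a minimal sketch of the exploration loop: remember cells in an archive, first return to a cell by replaying its trajectory in a deterministic toy environment, then explore from it with random actions. The toy environment, trivial cell representation and all names here are illustrative assumptions, not the released Go-Explore code.

```python
import random

# Toy deterministic 1-D environment: the state is an integer position and
# the reward of interest sits only at position 10 (a sparse goal).
# Illustrative only -- not the paper's implementation.
GOAL = 10

def step(state, action):                  # action is -1 or +1
    return max(0, state + action)

def cell_of(state):                       # trivial cell representation
    return state

archive = {cell_of(0): []}                # cell -> shortest trajectory to it

random.seed(0)
for _ in range(200):
    # 'Select' a cell from the archive, then *first return* to it by
    # replaying its trajectory (possible because the environment is
    # deterministic and resettable).
    cell = random.choice(list(archive))
    state, trajectory = 0, list(archive[cell])
    for action in trajectory:
        state = step(state, action)
    # *Then explore* from the returned-to state with random actions,
    # remembering any newly discovered cell (or a shorter route to one).
    for _ in range(5):
        action = random.choice([-1, 1])
        state = step(state, action)
        trajectory.append(action)
        c = cell_of(state)
        if c not in archive or len(trajectory) < len(archive[c]):
            archive[c] = list(trajectory)

assert GOAL in archive                    # the sparse goal state is found
```

Because returning is done by replay rather than by random walking from the start, the archive's frontier advances steadily — exactly the behaviour that detachment and derailment prevent in conventional intrinsic-motivation methods.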


## Data availability

The data that support the findings of this study (including the raw data for all figures and tables in the manuscript, Extended Data, Supplementary Information, as well as the demonstration trajectories used in robustification) are available from the corresponding authors upon reasonable request.

## Code availability

The Go-Explore code is available at https://github.com/uber-research/go-explore.

## References

1. Bellemare, M. et al. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems 29 (NIPS 2016) (eds Lee, D. et al.) 1471–1479 (2016).

2. Lehman, J. & Stanley, K. O. Novelty search and the problem with objectives. In Genetic Programming Theory and Practice IX (eds Riolo, R. et al.) 37–56 (2011).

3. Silver, D. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017).

4. Vinyals, O. et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 350–354 (2019).

5. OpenAI. Dota 2 with large-scale deep reinforcement learning. Preprint at https://arxiv.org/abs/1912.06680 (2019).

6. Merel, J. et al. Hierarchical visuomotor control of humanoids. In Int. Conf. Learning Representations https://openreview.net/forum?id=BJfYvo09Y7 (2019).

7. OpenAI. Learning dexterous in-hand manipulation. Int. J. Robot. Res. 39, 3–20 (2020).

8. Lehman, J. et al. The surprising creativity of digital evolution: a collection of anecdotes from the evolutionary computation and artificial life research communities. Artif. Life 26, 274–306 (2020).

9. Amodei, D. et al. Concrete problems in AI safety. Preprint at https://arxiv.org/abs/1606.06565 (2016).

10. Smart, W. D. & Kaelbling, L. P. Effective reinforcement learning for mobile robots. In Proc. 2002 IEEE Int. Conf. Robotics and Automation 3404–3410 (IEEE, 2002).

11. Lehman, J. & Stanley, K. O. Abandoning objectives: evolution through the search for novelty alone. Evol. Comput. 19, 189–223 (2011).

12. Conti, E. et al. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018) (eds Bengio S. et al.) 5027–5038 (2018).

13. Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. The Arcade Learning Environment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013).

14. Puigdomènech Badia, A. et al. Agent57: outperforming the Atari human benchmark. In Int. Conf. Machine Learning 507–517 (PMLR, 2020).

15. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).

16. Aytar, Y. et al. Playing hard exploration games by watching YouTube. In 32nd Conference on Neural Information Processing Systems (NeurIPS 2018) (eds Bengio, S. et al.) 2930–2941 (2018).

17. Machado, M. C. et al. Revisiting the Arcade Learning Environment: evaluation protocols and open problems for general agents. J. Artif. Intell. Res. 61, 523–562 (2018).

18. Lipovetzky, N., Ramirez, M. & Geffner, H. Classical planning with simulators: results on the Atari video games. In IJCAI’15 Proc. 24th Int. Conf. Artificial Intelligence (eds Yang, Q. & Wooldridge, M.) 1610–1616 (2015).

19. Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (Bradford, 1998).

20. Mnih, V. et al. Asynchronous methods for deep reinforcement learning. In Proc. 33rd Int. Conf. Machine Learning (eds Balcan, M. F. & Weinberger, K. Q.) 1928–1937 (2016).

21. Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. Proximal policy optimization algorithms. Preprint at https://arxiv.org/abs/1707.06347 (2017).

22. Cully, A., Clune, J., Tarapore, D. & Mouret, J.-B. Robots that can adapt like animals. Nature 521, 503–507 (2015).

23. Peng, X. B., Andrychowicz, M., Zaremba, W. & Abbeel, P. Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE Int. Conf. Robotics and Automation (ICRA) (ed. Lynch, K.) 3803–3817 (IEEE, 2018).

24. Tan, J. et al. Sim-to-real: learning agile locomotion for quadruped robots. In Proc. Robotics: Science and Systems (eds Kress-Gazit, H. et al.) https://doi.org/10.15607/RSS.2018.XIV.010 (2018).

25. Hester, T. et al. Deep Q-learning from demonstrations. In Thirty-Second AAAI Conf. Artificial Intelligence 3223–3230 (2018).

26. Guo, X., Singh, S. P., Lee, H., Lewis, R. L. & Wang, X. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In Advances in Neural Information Processing Systems 27 (NIPS 2014) (eds Ghahramani, Z. et al.) 3338–3346 (2014).

27. Horgan, D. et al. Distributed prioritized experience replay. In Int. Conf. Learning Representations https://openreview.net/forum?id=H1Dy---0Z (2018).

28. Espeholt, L. et al. IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In Proc. 35th Int. Conf. Machine Learning (eds Dy, J. & Krause, A.) 1407–1416 (2018).

29. Salimans, T. & Chen, R. Learning Montezuma’s Revenge from a single demonstration. Preprint at https://arxiv.org/abs/1812.03381 (2018).

30. Van Hasselt, H. P., Guez, A., Hessel, M., Mnih, V. & Silver, D. Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems 29 (NIPS 2016) (eds Lee, D. et al.) 4287–4295 (2016).

31. Puigdomènech Badia, A. et al. Never give up: learning directed exploration strategies. In Int. Conf. Learning Representations https://openreview.net/forum?id=Sye57xStvB (2020).

32. Brockman, G. et al. OpenAI gym. Preprint at https://arxiv.org/abs/1606.01540 (2016).

33. ATARI VCS/2600 Scoreboard. Atari Compendium http://www.ataricompendium.com/game_library/high_scores/high_scores.html (accessed 6 January 2020).

34. Guo, Y. et al. Efficient exploration with self-imitation learning via trajectory-conditioned policy. Preprint at https://arxiv.org/abs/1907.10247 (2019).

35. Wise, M., Ferguson, M., King, D., Diehr, E. & Dymesich, D. Fetch and freight: standard platforms for service robot applications. In Workshop on Autonomous Mobile Service Robots of the Intl Joint Conf. Artificial Intelligence (2016).

36. Eysenbach, B., Salakhutdinov, R. R. & Levine, S. Search on the replay buffer: bridging planning and reinforcement learning. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019) (eds Wallach, H. et al.) 15220–15231 (2019).

37. Oh, J., Guo, Y., Singh, S. & Lee, H. Self-imitation learning. In Proc. 35th Int. Conf. Machine Learning (eds Dy, J. & Krause, A.) 3878–3887 (2018).

38. Madotto, A. et al. Exploration-based language learning for text-based games. Preprint at https://arxiv.org/abs/2001.08868 (2020).

39. Popova, M., Isayev, O. & Tropsha, A. Deep reinforcement learning for de novo drug design. Sci. Adv. 4, eaap7885 (2018).

40. Alvernaz, S. & Togelius, J. Autoencoder-augmented neuroevolution for visual Doom playing. In 2017 IEEE Conf. Computational Intelligence and Games (CIG) 1–8 (IEEE, 2017).

41. Cuccu, G., Togelius, J. & Cudré-Mauroux, P. Playing Atari with six neurons. In Proc. 18th Intl Conf. Autonomous Agents and MultiAgent Systems 998–1006 (2019).

42. van den Oord, A., Li, Y. & Vinyals, O. Representation learning with contrastive predictive coding. Preprint at https://arxiv.org/abs/1807.03748 (2018).

43. Jaderberg, M. et al. Reinforcement learning with unsupervised auxiliary tasks. In Int. Conf. Learning Representations https://openreview.net/forum?id=SJ6yPD5xg (2017).

44. Chaslot, G., Bakkes, S., Szita, I. & Spronck, P. Monte-Carlo tree search: a new framework for game AI. In AIIDE'08: Proc. Fourth AAAI Conf. Artificial Intelligence and Interactive Digital Entertainment (eds Darken, C. & Mateas, M.) 216–217 (2008).

45. LaValle, S. M. Rapidly-Exploring Random Trees: A New Tool for Path Planning. Technical Report No. 98-11 (Iowa State Univ., 1998).

46. Hart, P. E., Nilsson, N. J. & Raphael, B. A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cybern. 4, 100–107 (1968).

47. Smith, D. E. & Weld, D. S. Conformant Graphplan. In AAAI '98/IAAI '98: Proc. 15th Natl/10th Conf. Artificial Intelligence/Innovative Applications of Artificial Intelligence (eds Mostow, J. et al.) 889–896 (1998).

48. Castro, P. S., Moitra, S., Gelada, C., Kumar, S. & Bellemare, M. G. Dopamine: a research framework for deep reinforcement learning. Preprint at https://arxiv.org/abs/1812.06110 (2018).

49. Toromanoff, M., Wirbel, E. & Moutarde, F. Is deep reinforcement learning really superhuman on Atari? In Deep Reinforcement Learning Workshop of 39th Conf. Neural Information Processing Systems (NeurIPS 2019) (2019).

50. Burda, Y., Edwards, H., Storkey, A. & Klimov, O. Exploration by random network distillation. In Int. Conf. Learning Representations https://openreview.net/forum?id=H1lJJnR5Ym (2019).

51. Choi, J. et al. Contingency-aware exploration in reinforcement learning. In Int. Conf. Learning Representations https://openreview.net/forum?id=HyxGB2AcY7 (2019).

52. Fedus, W., Gelada, C., Bengio, Y., Bellemare, M. G. & Larochelle, H. Hyperbolic discounting and learning over multiple horizons. Preprint at https://arxiv.org/abs/1902.06865 (2019).

53. Taiga, A. A., Fedus, W., Machado, M. C., Courville, A. & Bellemare, M. G. On bonus based exploration methods in the Arcade Learning Environment. In Int. Conf. Learning Representations https://openreview.net/forum?id=BJewlyStDr (2020).

54. Tang, Y., Valko, M. & Munos, R. Taylor expansion policy optimization. In Proc. 37th Int. Conf. Machine Learning (eds Daumé III, H. & Singh, A.) 9397–9406 (2020).

55. Ostrovski, G., Bellemare, M. G., van den Oord, A. & Munos, R. Count-based exploration with neural density models. In Proc. 34th Int. Conf. Machine Learning (eds Precup, D. & Teh, Y. W.) 2721–2730 (2017).

56. Martin, J., Sasikumar, S. N., Everitt, T. & Hutter, M. Count-based exploration in feature space for reinforcement learning. In IJCAI’17: Proc. 26th Int. Joint Conf. Artificial Intelligence (ed. Sierra, C.) 2471–2478 (2017).

57. O’Donoghue, B., Osband, I., Munos, R. & Mnih, V. The uncertainty Bellman equation and exploration. In Proc. 35th Int. Conf. Machine Learning (eds Dy, J. & Krause, A.) 3839–3848 (2018).

58. Goldenberg, A., Benhabib, B. & Fenton, R. A complete generalized solution to the inverse kinematics of robots. IEEE J. Robot. Autom. 1, 14–20 (1985).

59. Spong, M. W., Hutchinson, S. & Vidyasagar, M. Robot Modeling and Control (Wiley, 2006).

60. Zhao, Z.-Q., Zheng, P., Xu, S.-t. & Wu, X. Object detection with deep learning: a review. IEEE Trans. Neural Netw. Learn. Syst. 30, 3212–3232 (2019).

61. Todorov, E., Erez, T. & Tassa, Y. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ Int. Conf. Intelligent Robots and Systems 5026–5033 (IEEE, 2012).

62. Kocsis, L. & Szepesvári, C. Bandit-based Monte Carlo planning. In European Conf. Machine Learning ECML 2006 (eds Fürnkranz, J. et al.) 282–293 (Springer, 2006).

63. Strehl, A. L. & Littman, M. L. An analysis of model-based interval estimation for Markov decision processes. J. Comput. Syst. Sci. 74, 1309–1331 (2008).

64. Tang, H. et al. #Exploration: a study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems 30 (NIPS 2017) (eds Guyon, I. et al.) 2750–2759 (2017).

65. Ng, A. Y., Harada, D. & Russell, S. Policy invariance under reward transformations: theory and application to reward shaping. In Proc. 16th Int. Conf. Machine Learning (eds Bratko, I. & Džeroski, S.) 278–287 (1999).

66. Hussein, A., Gaber, M. M., Elyan, E. & Jayne, C. Imitation learning: a survey of learning methods. ACM Comput. Surv. 50, 21 (2017).

67. Plappert, M. et al. Multi-goal reinforcement learning: challenging robotics environments and request for research. Preprint at https://arxiv.org/abs/1802.09464 (2018).

68. Cho, K., Van Merriënboer, B., Bahdanau, D. & Bengio, Y. On the properties of neural machine translation: encoder-decoder approaches. In Proc. SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation 103–111 (Association for Computational Linguistics, 2014).

## Acknowledgements

We thank A. Edwards, S. Kapoor, F. Petroski Such and J. Zhi for their ideas, feedback, technical support and work on aspects of Go-Explore not presented in this work. We are grateful to the Colorado Data Center and OpusStack Teams at Uber for providing our computing platform. We thank V. Kumar for creating the MuJoCo files that served as the basis for our robotics environment (https://github.com/vikashplus/fetch).

## Author information

Authors

### Contributions

A.E. and J.H. contributed equally and are responsible for the technical work (J.H. focused primarily on policy-based Go-Explore and A.E. on most other technical contributions) as well as the initial draft of the paper. J.C. and K.O.S. led the team. All authors (A.E., J.H., J.L., K.O.S. and J.C.) significantly contributed to ideation, experimental design, analysing data, strategic decisions, developing the philosophical motivation for the algorithm and editing the paper.

### Corresponding authors

Correspondence to Adrien Ecoffet, Joost Huizinga or Jeff Clune.

## Ethics declarations

### Competing interests

Uber Technologies, Inc. has filed a publicly available provisional patent application 16/696,893 about some Go-Explore variants featuring a deep reinforcement learning model, with all authors (A.E., J.H., J.L., K.O.S. and J.C.) listed as inventors.

Peer review information Nature thanks Julian Togelius and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Extended data figures and tables

### Extended Data Fig. 1 Neural network architectures.

a, The Atari architecture is based on the architecture provided with the backward algorithm implementation. The input consists of the RGB channels of the last four frames (rescaled to 80 by 105 pixels) concatenated, resulting in 12 input channels. The network consists of three convolutional layers (C), two fully connected layers (FC), and a layer of gated recurrent units (GRUs)[68]. The network has a policy head π(a_t | s_t) and a value head V(s_t). b, For the robotics problem, the architecture consists of two separate networks, each with two fully connected layers and a GRU layer. One network specifies the policy π(a_t | s_t) by returning a mean μ_t and variance σ_t for the actuator torques of the arm and the desired position of each of the two fingers of the gripper (gripper fingers are implemented as MuJoCo position actuators[61] with kp = 10^4 and a control range of [0, 0.05]). The other network implements the value function V(s_t). c, The architecture for policy-based Go-Explore is identical to the Atari architecture, except that the goal representation g_t is concatenated with the input of the first fully connected layer. Activation functions (Act.) are: the rectified linear unit (ReLU), the exponential function (Exp) and the softmax function (Softmax). Layers can also include layer normalization (Layer norm), which transforms the output of the layer by subtracting the mean and dividing by the standard deviation of the layer.
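The channel arithmetic for the Atari input described in a is easy to verify: the RGB channels of the last four rescaled frames are concatenated channel-wise. This throwaway helper (names are illustrative, not from the paper's code) just confirms the count:

```python
# Stacking the RGB channels of four 80x105 frames gives 3 x 4 = 12
# input channels. Illustrative helper only.
FRAME_SHAPE = (3, 105, 80)     # (RGB channels, height, width) after rescaling
HISTORY = 4                    # number of stacked frames

def stacked_input_shape(frame_shape, history):
    channels, height, width = frame_shape
    return (channels * history, height, width)

assert stacked_input_shape(FRAME_SHAPE, HISTORY) == (12, 105, 80)
```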

### Extended Data Fig. 2 Maximum end-of-episode score found by the exploration phase on Atari.

a, Exploration phase without domain knowledge. b, Exploration phase with domain knowledge, compared to downscaled. Because only scores achieved at the episode end are reported, the plots for some games (for example, Solaris) begin after the start of the run, when the episode end is first reached. In a, averaging is over 50 runs for the 11 focus games and five runs for other games. In b, averaging is over 100 runs. Shaded areas show 95% bootstrap CIs of the mean with 1,000 samples. Avg. Human, average human performance; SOTA, state-of-the-art performance; M, ×10^6; K, ×10^3.
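The "95% bootstrap CIs of the mean with 1,000 samples" used throughout these figures follow a standard percentile-bootstrap procedure. A generic stdlib-only sketch of that procedure (the function name and data are illustrative; this is not the authors' analysis code):

```python
import random

def bootstrap_ci_mean(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI of the mean: resample with replacement,
    compute the mean of each resample, and take the alpha/2 and
    1 - alpha/2 quantiles of the resulting distribution."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

scores = [12.0, 15.0, 11.0, 14.0, 13.0, 16.0, 10.0, 12.5]
lo, hi = bootstrap_ci_mean(scores)
assert lo <= sum(scores) / len(scores) <= hi   # interval brackets the mean
```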

### Extended Data Fig. 3 Number of cells in archive during the exploration phase on Atari.

a, Exploration phase without domain knowledge. b, Exploration phase with domain knowledge. In a, archive size can decrease when the representation is recomputed. Previous archives are converted to the new format when the representation is recomputed, possibly leading to an archive with a size larger than 50,000. In this case, one iteration of the exploration phase runs and the representation is recomputed again. In a, averaging is over 50 runs for the 11 focus games and five runs for other games. In b, averaging is over 100 runs. Shaded areas show 95% bootstrap CIs of the mean with 1,000 samples.
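The archive cells counted above come from a coarsened frame representation, so that many similar states map to the same cell. The following is a minimal sketch of that idea; the block size and number of brightness levels here are illustrative placeholders, not the paper's tuned values:

```python
# Map a 2-D grayscale frame (list of lists, pixel values 0-255) to a
# coarse, hashable cell identifier by block-averaging and quantizing.
# Parameters are illustrative, not the paper's tuned values.
def cell_key(frame, factor=8, depth=32):
    h, w = len(frame), len(frame[0])
    key = []
    for y in range(0, h, factor):
        for x in range(0, w, factor):
            block = [frame[j][i]
                     for j in range(y, min(y + factor, h))
                     for i in range(x, min(x + factor, w))]
            mean = sum(block) / len(block)
            key.append(int(mean * depth / 256))   # quantize to `depth` levels
    return tuple(key)

# Two nearly identical frames land in the same cell ...
a = [[100] * 16 for _ in range(16)]
b = [[101] * 16 for _ in range(16)]
# ... while a clearly different frame does not.
c = [[200] * 16 for _ in range(16)]
assert cell_key(a) == cell_key(b)
assert cell_key(a) != cell_key(c)
```

Recomputing the representation (changing `factor` or `depth`) changes which states share a key, which is why the archive must be rebuilt, and can shrink or grow, whenever the representation changes.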

### Extended Data Fig. 4 Progress of robustification phase on Atari.

a, Exploration phase without domain knowledge. b, Exploration phase with domain knowledge. Shown are the scores achieved by robustifying agents across training time for the exploration phase without domain-knowledge representations (a) and with representations informed by domain knowledge (b). In particular, the rolling mean is shown for performance across the past 100 episodes when starting from the virtual demonstration (which corresponds to the domain’s traditional starting state). Note that in a, averaging is over five independent runs, whereas in b, averaging is over 10 runs. Because the final performance is obtained by testing the highest-performing network checkpoint for each run over 1,000 additional episodes, rather than directly extracted from the curves above, the performance reported in Fig. 2b does not necessarily match any particular point along these curves (Methods). Shaded areas show 95% bootstrap CIs of the mean with 1,000 samples.

### Extended Data Fig. 5 Progress of the exploration phase in the robotics environment.

a, Runs with successful trajectories. b, Length of the shortest successful trajectory. In a, the exploration phase quickly achieves 100% success rate for all shelves in the robotics environment. However, b shows that although success is achieved quickly it is useful to keep the exploration phase running longer to reduce the length of the successful trajectories, thus making robustification easier. Lines show the mean over 50 runs. Shaded areas show 95% bootstrap CIs of the mean with 1,000 samples.

### Extended Data Fig. 6 Policy-based Go-Explore overview.

In practical terms, the main implementation difference between policy-based Go-Explore and Go-Explore with simulator-state restoration is that policy-based Go-Explore uses separate actors, each with an internal loop switching between the ‘select’, ‘go’ and ‘explore’ steps, rather than one outer loop in which these steps are executed in synchronized batches. This structure allows policy-based Go-Explore to be easily combined with popular reinforcement learning algorithms such as A3C[20], PPO[21] or DQN[15], which already divide data-gathering over many actors.
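The actor structure described above can be sketched as a single internal loop. The toy environment, greedy goal-conditioned policy and archive below are stand-ins for illustration, not the paper's implementation:

```python
import random

# One policy-based Go-Explore actor: an internal loop cycling through
# 'select', 'go' and 'explore'. All components are illustrative stand-ins.
def env_step(state, action):
    next_state = max(0, state + action)
    return next_state, next_state >= 25    # episode ends far from the start

def greedy_policy(state, goal):            # stand-in for the trained
    return 1 if goal > state else -1       # goal-conditioned network

def actor_loop(archive, episodes):
    for _ in range(episodes):
        goal = random.choice(list(archive))            # 'select' a cell
        state, done = 0, False
        while state != goal and not done:              # 'go': the policy
            state, done = env_step(state, greedy_policy(state, goal))
        for _ in range(20):                            # 'explore' from there
            if done:
                break
            state, done = env_step(state, random.choice([-1, 1]))
            archive.setdefault(state, None)            # remember new cells

random.seed(0)
archive = {0: None}
actor_loop(archive, episodes=50)
assert len(archive) > 1    # exploration discovers cells beyond the start
```

Because each actor runs this loop independently, many such actors can feed trajectories to a shared learner, matching the actor/learner split of the listed algorithms.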

### Extended Data Fig. 7 Method by which cells are found.

a, b, In both Montezuma’s Revenge (a) and Pitfall (b), sampling from the goal-conditioned policy results in the discovery of roughly four times more cells than taking random actions. At the start of training there is effectively no difference between random actions and sampling from the policy, supporting the intuition that sampling from the policy only becomes more efficient than random actions after the policy has acquired the basic skills for moving towards the indicated goal. Lastly, about twice as many cells are discovered while returning as are discovered by random actions after returning, indicating that the frames spent returning to a previously visited cell are not merely overhead for reaching the frontier of yet-undiscovered states and training the policy network, but contribute substantially to exploration in their own right. Lines show the mean over 10 runs. Shaded areas show 95% bootstrap CIs of the mean with 1,000 samples.

## Supplementary information

### Supplementary Information

The Supplementary Information is made up of a single PDF file containing 13 Supplementary Figures, 2 Supplementary Tables, and additional sections.

## Rights and permissions


Ecoffet, A., Huizinga, J., Lehman, J. et al. First return, then explore. Nature 590, 580–586 (2021). https://doi.org/10.1038/s41586-020-03157-9

