
Review Article

Reinforcement learning improves behaviour from evaluative feedback

Abstract

Reinforcement learning is a branch of machine learning concerned with using experience gained through interacting with the world and evaluative feedback to improve a system's ability to make behavioural decisions. It has been called the artificial intelligence problem in a microcosm because learning algorithms must act autonomously to perform well and achieve their goals. Partly driven by the increasing availability of rich data, recent years have seen exciting advances in the theory and practice of reinforcement learning, including developments in fundamental technical areas such as generalization, planning, exploration and empirical methodology, leading to increasing applicability to real-life problems.
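
The loop the abstract describes, acting, receiving evaluative feedback, and improving future decisions, can be made concrete with a minimal sketch of tabular Q-learning, a classic reinforcement-learning algorithm. The toy corridor environment, hyperparameters and optimistic initialization below are illustrative assumptions, not details from the article.

```python
# Minimal sketch of learning from evaluative feedback: tabular Q-learning on a
# toy five-state corridor. Environment, hyperparameters and initialization are
# illustrative assumptions, not taken from the article.
import random

N_STATES = 5            # states 0..4; state 4 is the goal
ACTIONS = (-1, +1)      # step left or right
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1

def step(state, action):
    """Environment dynamics: move along the corridor; only the goal pays off."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1

# Optimistic initial values encourage systematic exploration of untried actions.
Q = {(s, a): 1.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(500):
    state, done = 0, False
    for _ in range(100):                      # cap episode length
        if done:
            break
        # Epsilon-greedy: mostly exploit current estimates, occasionally explore.
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt, reward, done = step(state, action)
        # Temporal-difference update toward the one-step bootstrapped target.
        target = reward + (0.0 if done else GAMMA * max(Q[(nxt, a)] for a in ACTIONS))
        Q[(state, action)] += ALPHA * (target - Q[(state, action)])
        state = nxt

# The learned greedy policy should move right (+1) from every non-goal state.
print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)])
```

The agent is never told which action is correct; it only observes the scalar rewards its own choices produce, which is exactly the evaluative (rather than instructive) feedback setting the abstract contrasts with supervised learning.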


Figure 1: Decisions of machine-learning feedback.
Figure 2: A series of boards in noughts and crosses leading to a loss for nought.
Figure 3: Schematic comparison of evaluation-function and Monte Carlo methods for planning.
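
To make the contrast in Figure 3 concrete, the sketch below rates noughts-and-crosses moves by averaging the outcomes of uniformly random play-outs, the simplest form of Monte Carlo planning; an evaluation-function planner would instead score the resulting positions directly with a learned function. The board encoding, helper names and roll-out budget are illustrative assumptions, not taken from the article.

```python
# Sketch of Monte Carlo planning for noughts and crosses: rate each legal move
# by the average result of random play-outs. Encoding and budget are assumptions.
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def legal_moves(board):
    return [i for i, cell in enumerate(board) if cell == ' ']

def rollout(board, to_move, me):
    """Finish the game with uniformly random moves; score from `me`'s viewpoint."""
    board = list(board)
    while True:
        win = winner(board)
        if win is not None:
            return 1.0 if win == me else -1.0
        moves = legal_moves(board)
        if not moves:
            return 0.0                        # draw
        board[random.choice(moves)] = to_move
        to_move = 'O' if to_move == 'X' else 'X'

def monte_carlo_move(board, player, n_rollouts=200):
    """Pick the move whose random play-outs score best on average."""
    opponent = 'O' if player == 'X' else 'X'
    def value(move):
        child = list(board)
        child[move] = player
        return sum(rollout(child, opponent, player) for _ in range(n_rollouts)) / n_rollouts
    return max(legal_moves(board), key=value)

# Example position: O threatens the middle column, so X is expected to block at 7.
board = ['X', 'O', 'X',
         ' ', 'O', ' ',
         ' ', ' ', ' ']
print(monte_carlo_move(board, 'X'))
```

The appeal of the Monte Carlo approach is that it needs no hand-built or learned position evaluator, only a simulator of the game; its cost is the simulation budget spent at decision time.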


Acknowledgements

The author thanks his colleagues for the discussions that led to this synthesis of current work.

Author information


Corresponding author

Correspondence to Michael L. Littman.

Ethics declarations

Competing interests

The author declares no competing financial interests.

Additional information

Reprints and permissions information is available at www.nature.com/reprints.


About this article


Cite this article

Littman, M. Reinforcement learning improves behaviour from evaluative feedback. Nature 521, 445–451 (2015). https://doi.org/10.1038/nature14540

