
Review Article

Reinforcement learning improves behaviour from evaluative feedback

Abstract

Reinforcement learning is a branch of machine learning concerned with using experience gained through interacting with the world and evaluative feedback to improve a system's ability to make behavioural decisions. It has been called the artificial intelligence problem in a microcosm because learning algorithms must act autonomously to perform well and achieve their goals. Partly driven by the increasing availability of rich data, recent years have seen exciting advances in the theory and practice of reinforcement learning, including developments in fundamental technical areas such as generalization, planning, exploration and empirical methodology, leading to increasing applicability to real-life problems.
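
The loop the abstract describes, acting, receiving evaluative feedback, and improving future decisions, can be made concrete with a minimal sketch of tabular Q-learning, a classic reinforcement-learning algorithm. The toy corridor environment, hyperparameters and optimistic initialization below are illustrative assumptions, not details from the article.

```python
# Minimal sketch of learning from evaluative feedback: tabular Q-learning on a
# toy five-state corridor. Environment, hyperparameters and initialization are
# illustrative assumptions, not taken from the article.
import random

N_STATES = 5            # states 0..4; state 4 is the goal
ACTIONS = (-1, +1)      # step left or right
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1

def step(state, action):
    """Environment dynamics: move along the corridor; only the goal pays off."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1

# Optimistic initial values encourage systematic exploration of untried actions.
Q = {(s, a): 1.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(500):
    state, done = 0, False
    for _ in range(100):                      # cap episode length
        if done:
            break
        # Epsilon-greedy: mostly exploit current estimates, occasionally explore.
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt, reward, done = step(state, action)
        # Temporal-difference update toward the one-step bootstrapped target.
        target = reward + (0.0 if done else GAMMA * max(Q[(nxt, a)] for a in ACTIONS))
        Q[(state, action)] += ALPHA * (target - Q[(state, action)])
        state = nxt

# The learned greedy policy should move right (+1) from every non-goal state.
print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)])
```

The agent is never told which action is correct; it only observes the scalar rewards its own choices produce, which is exactly the evaluative (rather than instructive) feedback setting the abstract contrasts with supervised learning.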


Figure 1: Decisions of machine-learning feedback.
Figure 2: A series of boards in noughts and crosses leading to a loss for nought.
Figure 3: Schematic comparison of evaluation-function and Monte Carlo methods for planning.
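
To make the contrast in Figure 3 concrete, the sketch below rates noughts-and-crosses moves by averaging the outcomes of uniformly random play-outs, the simplest form of Monte Carlo planning; an evaluation-function planner would instead score the resulting positions directly with a learned function. The board encoding, helper names and roll-out budget are illustrative assumptions, not taken from the article.

```python
# Sketch of Monte Carlo planning for noughts and crosses: rate each legal move
# by the average result of random play-outs. Encoding and budget are assumptions.
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def legal_moves(board):
    return [i for i, cell in enumerate(board) if cell == ' ']

def rollout(board, to_move, me):
    """Finish the game with uniformly random moves; score from `me`'s viewpoint."""
    board = list(board)
    while True:
        win = winner(board)
        if win is not None:
            return 1.0 if win == me else -1.0
        moves = legal_moves(board)
        if not moves:
            return 0.0                        # draw
        board[random.choice(moves)] = to_move
        to_move = 'O' if to_move == 'X' else 'X'

def monte_carlo_move(board, player, n_rollouts=200):
    """Pick the move whose random play-outs score best on average."""
    opponent = 'O' if player == 'X' else 'X'
    def value(move):
        child = list(board)
        child[move] = player
        return sum(rollout(child, opponent, player) for _ in range(n_rollouts)) / n_rollouts
    return max(legal_moves(board), key=value)

# Example position: O threatens the middle column, so X is expected to block at 7.
board = ['X', 'O', 'X',
         ' ', 'O', ' ',
         ' ', ' ', ' ']
print(monte_carlo_move(board, 'X'))
```

The appeal of the Monte Carlo approach is that it needs no hand-built or learned position evaluator, only a simulator of the game; its cost is the simulation budget spent at decision time.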


Acknowledgements

The author thanks his colleagues for the discussions that led to this synthesis of current work.

Author information


Corresponding author

Correspondence to Michael L. Littman.

Ethics declarations

Competing interests

The author declares no competing financial interests.

Additional information

Reprints and permissions information is available at www.nature.com/reprints.


About this article


Cite this article

Littman, M. Reinforcement learning improves behaviour from evaluative feedback. Nature 521, 445–451 (2015). https://doi.org/10.1038/nature14540

