
Review Article

Probabilistic machine learning and artificial intelligence

Abstract

How can a machine learn from experience? Probabilistic modelling provides a framework for understanding what learning is, and has therefore emerged as one of the principal theoretical and practical approaches for designing machines that learn from data acquired through experience. The probabilistic framework, which describes how to represent and manipulate uncertainty about models and predictions, has a central role in scientific data analysis, machine learning, robotics, cognitive science and artificial intelligence. This Review provides an introduction to this framework, and discusses some of the state-of-the-art advances in the field, namely, probabilistic programming, Bayesian optimization, data compression and automatic model discovery.


Figure 1: Bayesian inference.
Figure 2: Probabilistic programming.
Figure 3: Bayesian optimization.
Figure 4: The Automatic Statistician.



Acknowledgements

I acknowledge an EPSRC grant EP/I036575/1, the DARPA PPAML programme, a Google Focused Research Award for the Automatic Statistician and support from Microsoft Research. I am also grateful for valuable input from D. Duvenaud, H. Ge, M. W. Hoffman, J. R. Lloyd, A. Ścibior, M. Welling and D. Wolpert.

Author information

Correspondence to Zoubin Ghahramani.

Ethics declarations

Competing interests

I serve as an advisor to a number of companies working on machine learning and artificial intelligence.


Cite this article

Ghahramani, Z. Probabilistic machine learning and artificial intelligence. Nature 521, 452–459 (2015). https://doi.org/10.1038/nature14541
