A free-energy principle has been proposed recently that accounts for action, perception and learning. This Review looks at some key brain theories in the biological (for example, neural Darwinism) and physical (for example, information theory and optimal control theory) sciences from the free-energy perspective. Crucially, one key theme runs through each of these theories — optimization. Furthermore, if we look closely at what is optimized, the same quantity keeps emerging, namely value (expected reward, expected utility) or its complement, surprise (prediction error, expected cost). This is the quantity that is optimized under the free-energy principle, which suggests that several global brain theories might be unified within a free-energy framework.
Adaptive agents must occupy a limited repertoire of states and therefore minimize the long-term average of surprise associated with sensory exchanges with the world. Minimizing surprise enables them to resist a natural tendency to disorder.
Surprise rests on predictions about sensations, which depend on an internal generative model of the world. Although surprise cannot be measured directly, a free-energy bound on surprise can be, suggesting that agents minimize free energy by changing their predictions (perception) or by changing the predicted sensory inputs (action).
Perception optimizes predictions by minimizing free energy with respect to synaptic activity (perceptual inference), efficacy (learning and memory) and gain (attention and salience). This furnishes Bayes-optimal (probabilistic) representations of what caused sensations (providing a link to the Bayesian brain hypothesis).
Bayes-optimal perception is mathematically equivalent to predictive coding and maximizing the mutual information between sensations and the representations of their causes. This is a probabilistic generalization of the principle of efficient coding (the infomax principle) or the minimum-redundancy principle.
Learning under the free-energy principle can be formulated in terms of optimizing the connection strengths in hierarchical models of the sensorium. This rests on associative plasticity to encode causal regularities and appeals to the same synaptic mechanisms as those underlying cell assembly formation.
Action under the free-energy principle reduces to suppressing sensory prediction errors that depend on predicted (expected or desired) movement trajectories. This provides a simple account of motor control, in which action is enslaved by perceptual (proprioceptive) predictions.
Perceptual predictions rest on prior expectations about the trajectory or movement through the agent's state space. These priors can be acquired (as empirical priors during hierarchical inference) or they can be innate (epigenetic) and therefore subject to selective pressure.
Predicted motion or state transitions realized by action correspond to policies in optimal control theory and reinforcement learning. In this context, value is inversely proportional to surprise (and implicitly free energy), and rewards correspond to innate priors that constrain policies.
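The circular causality in these key points can be caricatured in a few lines of code. The sketch below is not from the Review: it assumes a trivially simple linear Gaussian generative model with unit variances, a fixed prior that doubles as a desired state, and an action that perturbs the world directly. Perception and action both descend the same free-energy gradient.

```python
# Minimal active-inference sketch (illustrative; not the Review's scheme).
# Generative model: s = x + noise, x ~ N(x_prior, 1), unit variances.
# The prior plays the role of a desired (innate) state; action perturbs
# the world so that sensations come to match the prediction.
x_prior = 1.0    # innate prior: 'I expect to find myself at x = 1'
x_true = -0.5    # actual state of the world
mu = 0.0         # internal estimate (a sufficient statistic of the recognition density)
a = 0.0          # action, added to the world's state
lr = 0.1

for _ in range(500):
    s = x_true + a                    # sensory input
    eps_s = s - mu                    # sensory prediction error
    eps_p = mu - x_prior              # prior prediction error
    F = 0.5 * (eps_s**2 + eps_p**2)   # free energy, up to constants
    mu += lr * (eps_s - eps_p)        # perception: descend dF/d(mu)
    a -= lr * eps_s                   # action: descend dF/d(a); here ds/da = 1

# At convergence mu -> x_prior and s = x_true + a -> x_prior:
# action has fulfilled the perceptual prediction.
```

Note that action never sees the prior directly; it only suppresses sensory prediction error, exactly as in the key point above.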
This work was funded by the Wellcome Trust. I would like to thank my colleagues at the Wellcome Trust Centre for Neuroimaging, the Institute of Cognitive Neuroscience and the Gatsby Computational Neuroscience Unit for collaborations and discussions.
The entropy of sensory states and their causes
Variational free energy
The free-energy principle and infomax
Value and surprise
Policies and cost
- Free energy
An information theory measure that bounds or limits (by being greater than) the surprise on sampling some data, given a generative model.
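The bound in this definition can be checked numerically. A minimal sketch with invented numbers, for a model with two hidden causes and one observed datum: free energy equals surprise plus the Kullback-Leibler divergence from the recognition density to the posterior, so it can never fall below surprise, and touches it exactly when the recognition density equals the posterior.

```python
import numpy as np

# Two hidden causes, one observed datum s; all numbers are invented.
p_theta = np.array([0.7, 0.3])      # prior over causes
p_s_given = np.array([0.2, 0.9])    # likelihood of the observed s under each cause
p_joint = p_theta * p_s_given       # p(s, theta)
p_s = p_joint.sum()                 # model evidence p(s)
surprise = -np.log(p_s)             # the quantity that cannot be evaluated directly
posterior = p_joint / p_s

def free_energy(q):
    # F = E_q[log q(theta) - log p(s, theta)] = surprise + KL(q || posterior)
    return np.sum(q * (np.log(q) - np.log(p_joint)))

F_arb = free_energy(np.array([0.5, 0.5]))   # an arbitrary recognition density
F_opt = free_energy(posterior)              # the optimal recognition density
```

Minimizing free energy over `q` therefore implicitly minimizes surprise, which is the trick the principle turns on.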
- Homeostasis
The process whereby an open or closed system regulates its internal environment to maintain its states within bounds.
- Entropy
The average surprise of outcomes sampled from a probability distribution or density. A density with low entropy means that, on average, the outcome is relatively predictable. Entropy is therefore a measure of uncertainty.
- Surprise
(Surprisal or self-information.) The negative log-probability of an outcome. An improbable outcome (for example, water flowing uphill) is therefore surprising.
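Both definitions are one-liners; the densities below are invented for illustration, and show that entropy is just expected surprise.

```python
import numpy as np

fair = np.array([0.25, 0.25, 0.25, 0.25])   # maximally uncertain over 4 outcomes
peaked = np.array([0.97, 0.01, 0.01, 0.01]) # highly predictable

def entropy(p):
    return -np.sum(p * np.log(p))   # average of the surprise -log p

H_fair, H_peaked = entropy(fair), entropy(peaked)

# An improbable outcome is more surprising than a probable one
surprise_rare = -np.log(peaked[1])
surprise_common = -np.log(peaked[0])
```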
- Fluctuation theorem
(A term from statistical mechanics.) Deals with the probability that the entropy of a system that is far from thermodynamic equilibrium will increase or decrease over a given amount of time. It states that the probability of the entropy decreasing becomes exponentially smaller with time.
- Attractor
A set to which a dynamical system evolves after a long enough time. Points that get close to the attractor remain close, even under small perturbations.
- Kullback-Leibler divergence
(Or information divergence, information gain or cross entropy.) A non-commutative, non-negative measure of the difference between two probability distributions.
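For discrete densities the divergence is a one-line sum; the two invented distributions below exhibit both properties named in the definition.

```python
import numpy as np

def kl(p, q):
    # Kullback-Leibler divergence KL(p || q) for discrete densities
    return np.sum(p * np.log(p / q))

p = np.array([0.8, 0.2])
q = np.array([0.4, 0.6])
# Non-negative, zero only when the densities coincide,
# and non-commutative: KL(p||q) != KL(q||p) in general.
```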
- Recognition density
(Or 'approximating conditional density'.) An approximate probability distribution of the causes of data (for example, sensory input). It is the product of inference or inverting a generative model.
- Generative model
A probabilistic model (joint density) of the dependencies between causes and consequences (data), from which samples can be generated. It is usually specified in terms of the likelihood of data, given their causes (parameters of a model) and priors on the causes.
- Conditional density
(Or posterior density.) The probability distribution of causes or model parameters, given some data; that is, a probabilistic mapping from observed data to causes.
- Prior
The probability distribution or density of the causes of data that encodes beliefs about those causes before observing the data.
- Bayesian surprise
A measure of salience based on the Kullback-Leibler divergence between the recognition density (which encodes posterior beliefs) and the prior density. It measures the information that can be recognized in the data.
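A small worked example with invented numbers: an informative datum moves the posterior away from the prior and so carries positive Bayesian surprise, whereas a datum that leaves beliefs unchanged carries none.

```python
import numpy as np

prior = np.array([0.5, 0.5])          # beliefs before the datum
likelihood = np.array([0.9, 0.1])     # how well each cause explains the datum
posterior = prior * likelihood
posterior /= posterior.sum()

# Bayesian surprise: KL(posterior || prior)
bayesian_surprise = np.sum(posterior * np.log(posterior / prior))

# An uninformative datum yields zero Bayesian surprise
flat = np.array([0.5, 0.5])
post2 = prior * flat
post2 /= post2.sum()
bs_uninformative = np.sum(post2 * np.log(post2 / prior))
```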
- Bayesian brain hypothesis
The idea that the brain uses internal probabilistic (generative) models to update posterior beliefs, using sensory information, in an (approximately) Bayes-optimal fashion.
- Analysis by synthesis
Any strategy (in speech coding) in which the parameters of a signal coder are evaluated by decoding (synthesizing) the signal and comparing it with the original input signal.
- Epistemological automata
Possibly the first theory for why top-down influences (mediated by backward connections in the brain) might be important in perception and cognition.
- Empirical prior
A prior induced by hierarchical models; empirical priors provide constraints on the recognition density in the usual way but depend on the data.
- Sufficient statistics
Quantities that are sufficient to parameterize a probability density (for example, mean and covariance of a Gaussian density).
- Laplace assumption
(Or Laplace approximation or method.) A saddle-point approximation of the integral of an exponential function that uses a second-order Taylor expansion. When the function is a probability density, the implicit assumption is that the density is approximately Gaussian.
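A quick numerical illustration (not from the Review): Laplace-approximate the normalizing integral of a slightly non-Gaussian log-density and compare it with brute-force quadrature. Around the mode x0, the approximation is Z ≈ exp(f(x0)) · sqrt(2π / |f''(x0)|).

```python
import numpy as np

def f(x):
    # An invented, nearly quadratic log-density with its mode at 0
    return -0.5 * x**2 - 0.05 * x**4

# 'Exact' normalizer by Riemann sum
x = np.linspace(-6.0, 6.0, 24001)
dx = x[1] - x[0]
Z_numeric = np.sum(np.exp(f(x))) * dx

# Laplace approximation: second-order Taylor expansion at the mode
f2_at_mode = -1.0   # f''(0) for this f
Z_laplace = np.exp(f(0.0)) * np.sqrt(2 * np.pi / abs(f2_at_mode))

rel_err = abs(Z_laplace - Z_numeric) / Z_numeric
```

The quartic term thins the tails relative to a Gaussian, so the Laplace estimate overshoots slightly; for a density closer to Gaussian the error shrinks further.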
- Predictive coding
A tool used in signal processing for representing a signal using a linear predictive (generative) model. It is a powerful speech analysis technique and was first considered in vision to explain lateral interactions in the retina.
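A toy linear predictive coder, in the signal-processing sense used here (an illustrative sketch, not the Review's neuronal scheme): only prediction errors are transmitted, and a decoder running the same predictor reconstructs the signal.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = np.cumsum(rng.normal(size=50))   # an invented smooth-ish signal

# Encoder: first-order linear predictor s_hat[n] = a * s[n-1];
# transmit only the residuals (prediction errors).
a = 1.0
residuals = np.empty_like(signal)
residuals[0] = signal[0]
residuals[1:] = signal[1:] - a * signal[:-1]

# Decoder: run the same predictor and add back the residuals
recon = np.empty_like(signal)
recon[0] = residuals[0]
for n in range(1, len(signal)):
    recon[n] = a * recon[n - 1] + residuals[n]
```

When predictions are good, the residuals have much lower variance than the raw signal, which is what makes the scheme efficient and links it to the minimum-redundancy principle above.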
- Infomax principle
An optimization principle for neural networks (or functions) that map inputs to outputs. It says that the mapping should maximize the Shannon mutual information between the inputs and outputs, subject to constraints and/or noise processes.
- Stochastic
Governed by random effects.
- Biased competition
An attentional effect mediated by competitive interactions among neurons representing visual stimuli; these interactions can be biased in favour of behaviourally relevant stimuli by both spatial and non-spatial and both bottom-up and top-down processes.
- Reentrant signalling
Reciprocal message passing among neuronal groups.
- Reinforcement learning
An area of machine learning concerned with how an agent maximizes long-term reward. Reinforcement learning algorithms attempt to find a policy that maps states of the world to actions performed by the agent.
- Optimal control theory
An optimization method (based on the calculus of variations) for deriving an optimal control law in a dynamical system. A control problem includes a cost function that is a function of state and control variables.
- Bellman equation
(Or dynamic programming equation.) Named after Richard Bellman, it is a necessary condition for optimality associated with dynamic programming in optimal control theory.
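A minimal value-iteration sketch on an invented three-state chain treats the Bellman equation, V(s) = max_a [r(s, a) + γ·V(s')], as a fixed point; the recursion also embodies the principle of optimality.

```python
import numpy as np

gamma = 0.9  # discount factor

# States 0, 1, 2; state 2 is rewarding and absorbing.
# Actions: 0 = stay, 1 = move right (capped at state 2).
def step(s, a):
    return s if a == 0 else min(s + 1, 2)

def reward(s, a):
    return 1.0 if step(s, a) == 2 else 0.0

# Iterate the Bellman backup until (numerical) convergence
V = np.zeros(3)
for _ in range(100):
    V = np.array([max(reward(s, a) + gamma * V[step(s, a)] for a in (0, 1))
                  for s in range(3)])
# Fixed point: V = [9, 10, 10] (state 2 earns 1/(1 - gamma) = 10)
```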
- Optimal decision theory
(Or game theory.) An area of applied mathematics concerned with identifying the values, uncertainties and other constraints that determine an optimal decision.
- Gradient ascent
(Or method of steepest ascent.) A first-order optimization scheme that finds a maximum of a function by changing its arguments in proportion to the gradient of the function at the current value. In short, a hill-climbing scheme. The opposite scheme is a gradient descent.
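The hill-climbing scheme in one variable, on an invented concave function f(x) = −(x − 3)²:

```python
def grad(x):
    return -2.0 * (x - 3.0)   # f'(x) for f(x) = -(x - 3)^2

x = 0.0
lr = 0.1
for _ in range(200):
    x += lr * grad(x)   # step uphill, in proportion to the local gradient
# x converges to the maximizer x* = 3; flipping the sign of the
# update gives gradient descent.
```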
- Principle of optimality
An optimal policy has the property that whatever the initial state and initial decision, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
- Exploration–exploitation trade-off
Involves a balance between exploration (of uncharted territory) and exploitation (of current knowledge). In reinforcement learning, it has been studied mainly through the multi-armed bandit problem.
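An ε-greedy agent on an invented two-armed Bernoulli bandit is the standard minimal illustration of the trade-off: with probability ε the agent explores at random, otherwise it exploits its current best estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
true_p = np.array([0.2, 0.8])   # payoff probabilities, unknown to the agent
counts = np.zeros(2)
values = np.zeros(2)            # running estimate of each arm's payoff
eps = 0.1

for t in range(5000):
    if rng.random() < eps:
        arm = int(rng.integers(2))        # explore uncharted territory
    else:
        arm = int(np.argmax(values))      # exploit current knowledge
    r = float(rng.random() < true_p[arm]) # Bernoulli reward
    counts[arm] += 1
    values[arm] += (r - values[arm]) / counts[arm]  # incremental mean
```

With ε = 0, the agent can lock onto the inferior arm forever; a little exploration lets the estimates recover the better arm.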
- Dynamical systems theory
An area of applied mathematics that describes the behaviour of complex (possibly chaotic) dynamical systems as described by differential or difference equations.
- Synergetics
Concerns the self-organization of patterns and structures in open systems far from thermodynamic equilibrium. It rests on the order parameter concept, which was generalized by Haken to the enslaving principle: that is, the dynamics of fast-relaxing (stable) modes are completely determined by the 'slow' dynamics of order parameters (the amplitudes of unstable modes).
Referring to the fundamental dialectic between structure and function.
- Helmholtz machine
Refers to a device or scheme that uses a generative model to furnish a recognition density and learns hidden structures in data by optimizing the parameters of generative models.