Canonical neural networks perform active inference

This work considers a class of canonical neural networks comprising rate coding models, wherein neural activity and plasticity minimise a common cost function, and plasticity is modulated with a certain delay. We show that such neural networks implicitly perform active inference and learning to minimise the risk associated with future outcomes. Mathematical analyses demonstrate that this biological optimisation can be cast as maximisation of model evidence, or equivalently minimisation of variational free energy, under the well-known form of a partially observed Markov decision process model. This equivalence indicates that the delayed modulation of Hebbian plasticity, accompanied by adaptation of firing thresholds, is a sufficient neuronal substrate to attain Bayes optimal inference and control. We corroborated this proposition using numerical analyses of maze tasks. This theory offers a universal characterisation of canonical neural networks in terms of Bayesian belief updating and provides insight into the neuronal mechanisms underlying planning and adaptive behavioural control.

The sentient behaviour of biological organisms is characterised by optimisation. Biological organisms recognise the state of their environment by optimising internal representations of the external (i.e. environmental) dynamics that generate sensory inputs. In addition, they optimise their behaviour to adapt to the environment, thereby increasing their probability of survival and reproduction. This biological self-organisation is typically formulated as the minimisation of cost functions [1][2][3] , wherein a gradient descent on a cost function furnishes neural dynamics and synaptic plasticity. However, two fundamental issues remain to be established: the characterisation of the dynamics of an arbitrary neural network as a generic optimisation process, and the correspondence between such neural dynamics and the statistical inference 4 found in applied mathematics and machine learning. The present work addresses these issues by demonstrating that a class of canonical neural networks of rate coding models functions as, and is thus universally characterised in terms of, variational Bayesian inference under a particular but generic form of generative model.
Variational Bayesian inference offers a unified explanation for inference, learning, prediction, decision making, and the evolution of biological form 5,6 . This kind of inference rests upon a generative model that expresses a hypothesis about the generation of sensory inputs. Perception and behaviour can then be read as optimising the evidence for a 'generative model', inherent in sensory exchanges with the environment. The ensuing evidence lower bound (ELBO) 7 -or equivalently variational free energy, which is the negative of the ELBO-then plays the role of a cost function. Variational free energy is the standard cost function in variational Bayes-and provides an upper bound on surprise (i.e., improbability) of sensory inputs. Minimisation of variational free energy, with respect to internal representations, then yields approximate posterior beliefs about external states. Similarly, the minimisation of variational free energy with respect to action on external states maximises the evidence or marginal likelihood of resulting sensory samples. This framework integrates perceptual (unsupervised), reward-based (reinforcement), and motor (supervised) learning in a unified formulation that shares many commitments with neuronal implementations of approximate Bayesian inference using spiking models [8][9][10][11][12] . In short, internal states of an autonomous system under a (possibly nonequilibrium) steady state can be viewed as parameterising posterior beliefs of external states [13][14][15] . In particular, active inference aims to optimise behaviours of a biological organism to minimise a certain kind of risk in the future [16][17][18] , wherein risk is typically expressed in a form of expected free energy (i.e., the variational free energy expected under posterior predictive beliefs about the outcomes of a given course of action).
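To make the bound property concrete, the following sketch computes variational free energy for a two-state example and checks that it upper-bounds the surprise of an observation, with equality at the exact posterior. The prior and likelihood numbers are illustrative assumptions of ours, not values from this work.

```python
import numpy as np

# Two-state example: F(Q) = E_Q[ln Q(s) - ln P(o, s)] upper-bounds
# surprise -ln P(o), with equality when Q equals the exact posterior.
p_s = np.array([0.7, 0.3])          # prior P(s) (assumed)
p_o_given_s = np.array([0.9, 0.2])  # likelihood P(o=1 | s) (assumed)

joint = p_o_given_s * p_s           # P(o=1, s)
p_o = joint.sum()                   # model evidence P(o=1)
surprise = -np.log(p_o)

def free_energy(q):
    """Variational free energy for the observation o=1 under belief q."""
    return np.sum(q * (np.log(q) - np.log(joint)))

posterior = joint / p_o             # exact posterior P(s | o=1)

assert free_energy(posterior) >= surprise - 1e-12   # never below surprise
assert abs(free_energy(posterior) - surprise) < 1e-12  # tight at posterior
assert free_energy(np.array([0.5, 0.5])) > surprise    # loose elsewhere
```

The gap between F and surprise is the KL divergence from the approximate to the exact posterior, which is why minimising F with respect to the belief yields approximate posterior beliefs.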
Crucially, as a corollary of the complete class theorem [19][20][21] , any neural network minimising a cost function can be viewed as performing variational Bayesian inference, under some prior beliefs. We have previously introduced a reverse-engineering approach that identifies a class of biologically plausible cost functions for neural networks 22 . This foundational work identified a class of cost functions for single-layer feedforward neural networks of rate coding models with a sigmoid (or logistic) activation function, based on the assumption that the dynamics of neurons and synapses follow a gradient descent on a common cost function. We subsequently demonstrated the mathematical equivalence between the class of cost functions for such neural networks and variational free energy under a particular form of the generative model. This equivalence licences variational Bayesian inference as a fundamental optimisation process that underlies both the dynamics and function of such neural networks. Moreover, it enables one to characterise any variables and constants in the network in terms of quantities (e.g. priors) that underwrite variational Bayesian inference 22 . However, it remains to be established whether active inference is an apt explanation for any given neural network that actively exchanges with its environment. In this paper, we address this enactive or control aspect to complete the formal equivalence between neural network optimisation and the free-energy principle.
In most formulations, active inference goes further than simply assuming that action and perception minimise variational free energy: it also considers the consequences of action as minimising expected free energy, i.e. planning (and control) as inference, as a foundational approach to sentient behaviour 12,[23][24][25][26][27] . Thus, to evince active inference in neural networks, it is necessary to demonstrate that they can plan to minimise future risks.
To address this issue, this work identifies a class of biologically plausible cost functions for two-layer recurrent neural networks, under the assumption that neural activity and plasticity minimise a common cost function (referred to as assumption 1). Then, we analytically and numerically demonstrate the implicit ability of neural networks to plan and minimise future risk, when viewed through the lens of active inference. Namely, we suppose a network of rate coding neurons with a sigmoid activation function, wherein the middle layer involves recurrent connections, and the output layer provides feedback responses to the environment (assumption 2). In this work, we will call such architectures canonical neural networks (Table 1). Then, we demonstrate that the class of cost functions describing their dynamics can be cast as variational free energy under an implicit generative model, in the well-known form of a partially observable Markov decision process (POMDP). The gradient descent on the ensuing cost function naturally yields Hebbian plasticity [28][29][30] with an activity-dependent homoeostatic term.
In particular, we consider the case where an arbitrary modulator [31][32][33] regulates synaptic plasticity with a certain delay (assumption 3) and demonstrate that such modulation is identical to the update of a policy through a post hoc evaluation of past decisions. The modulator renders the implicit cost function a risk function, which in turn renders behavioural control Bayes optimal-to minimise future risk. The proposed analysis affirms that active inference is an inherent property of canonical neural networks exhibiting delayed modulation of Hebbian plasticity. We discuss possible neuronal substrates that realise this modulation.

Results
Overview of equivalence between neural networks and variational Bayes. First, we summarise the formal correspondence between neural networks and variational Bayes. A biological agent is formulated here as an autonomous system comprising a network of rate coding neurons (Fig. 1a). We presume that neural activity, action (decision), synaptic plasticity, and changes in any other free parameters minimise a common cost function L := L(o_1:t, φ) (c.f., assumption 1 specified in Introduction). Here, o_1:t := {o_1, …, o_t} is a sequence of observations and φ := {x_1:t, y_1:t, W, ϕ} is a set of internal states comprising the middle-layer (x_τ) and output-layer (y_τ) neural activity, synaptic strengths (W), and other free parameters (ϕ) that characterise L (e.g. firing thresholds). Output-layer activity y_t determines the network's actions or decisions δ_t. Based on assumption 1 and the continuous updating of φ, the update rule for the i-th component of φ is derived as a gradient descent on the cost function, φ̇_i ∝ −∂L/∂φ_i. This determines the dynamics of neural networks, including their activity and plasticity.
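As a minimal illustration of assumption 1, the sketch below runs the gradient-descent update φ̇_i ∝ −∂L/∂φ_i on a toy quadratic cost of our own choosing (the target vector is an arbitrary assumption) and confirms that the cost decreases to its minimum.

```python
import numpy as np

# Toy cost L(phi) with an assumed minimiser; every internal state follows
# a gradient descent on this single shared cost function.
target = np.array([1.0, -2.0])

def L(phi):
    return 0.5 * np.sum((phi - target) ** 2)

def grad_L(phi):
    return phi - target

phi = np.zeros(2)   # internal states (activity, weights, thresholds, ...)
lr = 0.1            # learning rate (assumed)
costs = []
for _ in range(200):
    phi -= lr * grad_L(phi)     # discretised phi_dot ∝ -∂L/∂phi
    costs.append(L(phi))

assert costs[-1] < costs[0]                       # cost decreases
assert np.allclose(phi, target, atol=1e-3)        # reaches the minimum
```

The point of the construction is not the particular L but that activity and plasticity share one cost, so the same descent step applies to every component of φ.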
In contrast, variational Bayesian inference depicts a process of updating the prior distribution of external states P(ϑ) to the corresponding posterior distribution Q(ϑ) based on a sequence of observations. Here, Q(ϑ) approximates (or possibly exactly equals) P(ϑ|o_1:t). This process is formulated as a minimisation of the surprise of past-to-present observations, or equivalently a maximisation of the model evidence, which is attained by minimising variational free energy as a tractable proxy. We suppose that the generative model P(o_1:t, ϑ) is characterised by a set of external states, ϑ := {s_1:t, δ_1:t, θ, λ}, comprising hidden states (s_τ), decisions (δ_τ), model parameters (θ) and hyperparameters (λ) (Fig. 1b). Based on the given generative model, variational free energy is defined as a functional of Q(ϑ) as follows: F(o_1:t, Q(ϑ)) := E_Q(ϑ)[−ln P(o_1:t, ϑ) + ln Q(ϑ)]. Here, E_Q(ϑ)[·] := ∫ · Q(ϑ) dϑ denotes the expectation over Q(ϑ). In particular, we assume that Q(ϑ) is an exponential family (as considered in previous works 10 ) and that the posterior expectation of ϑ, ϑ := E_Q(ϑ)[ϑ], or its counterpart, constitutes the sufficient statistics that parameterise (i.e. uniquely determine) Q(ϑ). Under this condition, F reduces to a function of ϑ, F = F(o_1:t, ϑ). The variational update rule for the i-th component of ϑ is given as a gradient descent on variational free energy, ϑ̇_i ∝ −∂F/∂ϑ_i.
Crucially, according to the complete class theorem, a dynamical system that minimises its cost function can be viewed as performing Bayesian inference under some generative model and prior beliefs. The complete class theorem [19][20][21] states that for any pair of admissible decision rules and cost functions, there is some generative model with prior beliefs that renders the decisions Bayes optimal (refer to Supplementary Methods 1 for technical details).
Thus, this theorem ensures the presence of a generative model that formally corresponds to the above-defined neural network characterised by L. Hence, it speaks to the equivalence between the class of neural network cost functions and variational free energy under such a generative model,

L(o_1:t, φ) = F(o_1:t, ϑ) + const.,    (1)

wherein the internal states of a network φ encode or parameterise the posterior expectation ϑ. This mathematical equivalence means that an arbitrary neural network, in the class under consideration, is implicitly performing active inference through variational free energy minimisation. Minimisation of F is achieved when and only when the posterior beliefs best match the true conditional probability of the external states. Thus, the dynamics that minimise L must induce a recapitulation of the external states in the internal states of the neural network. This is a fundamental aspect of optimisation in neural networks. This notion is essential to understand the functional meaning of the dynamics evinced by an arbitrary neural network, which is otherwise unclear from simply observing the network dynamics. Note that being able to characterise the neural network in terms of maximising model evidence lends it an 'explainability', in the sense that the internal (neural network) states and parameters encode Bayesian beliefs or expectations about the causes of observations. In other words, the generative model explains how outcomes were generated. However, the complete class theorem does not specify the form of a generative model for any given neural network. To address this issue, in the remainder of the paper, we formulate active inference using a particular form of POMDP models, whose states take binary values. This facilitates the identification of a class of generative models that corresponds to a class of canonical neural networks comprising rate coding models with the sigmoid activation function.

Fig. 1 Schematic of the external milieu and neural network, and the corresponding Bayesian formulation. a Interaction between the external milieu and an autonomous system comprising a two-layer neural network. On receiving sensory inputs or observations o(t) that are generated from hidden states s(t), the network activity x(t) generates outputs y(t). The gradient descent on a neural network cost function L determines the dynamics of neural activity and plasticity; thus, L is sufficient to characterise the neural network. The proposed theory affirms that the ensuing neural dynamics are self-organised to encode the posterior beliefs about hidden states and decisions. b Corresponding variational Bayesian formulation. The interaction depicted in a is formulated in terms of a POMDP model, which is parameterised by A, B, C ∈ θ and D, E ∈ λ. Variational free energy minimisation allows an agent to self-organise to encode the hidden states of the external milieu, and to make decisions minimising future risk. Here, variational free energy F is sufficient to characterise the inferences and behaviours of the agent.

Table 1 Glossary of expressions.

Expression | Description
Canonical neural network | In this work, a canonical neural network is defined by differential equations of neural activity derived as a reduction of realistic neuron models through some approximations, which yield a network of rate coding neurons with a sigmoid activation function. In particular, we consider networks comprising a middle layer that involves recurrent connections and an output layer that provides feedback responses to the environment.
o_t, s_t, δ_t | Observations o_t, hidden states s_t and decisions (actions) δ_t are random variables that follow categorical distributions. Each of their elements takes 0 or 1.
Γ_t | Risk function. Γ_t parameterises a categorical distribution over γ_t: P(γ_t) = Cat((1 − Γ_t, Γ_t)^T). This can be read as an arbitrary neuromodulator that regulates synaptic plasticity through Eq. (9), which becomes a risk function under the variational Bayes formulation.
A, B, C | Parameter matrices A, B, and C are random variables with Dirichlet distributions P(A) = Dir(a), P(B) = Dir(b) and P(C) = Dir(c).
Cat(A) | Categorical distribution. In this expression, P(o_τ|s_τ, A) = Cat(A), where matrix A is the likelihood mapping that maps s_τ to o_τ; for any pair of i and j, A_ij is the probability that o_τ = i occurs given s_τ = j. Although one may prefer to denote it as Cat(A s_τ) to emphasise that the distribution changes depending on s_τ, in this paper we use Cat(A) following the notation in previous work. Note that, for factorised states, A^(i)_kl denotes the probability that o^(i)_τ = k occurs given s^(i)_τ = l.
Active inference formulated using a postdiction of past decisions. In this section, we define a generative model and the ensuing variational free energy that correspond to a class of canonical neural networks that will be considered in the subsequent section. The external milieu is expressed as a discrete state space in the form of a POMDP (Fig. 2). The probability of an observation o_τ given a hidden state s_τ is expressed in the form of a categorical distribution, P(o_τ|s_τ, A) = Cat(A), where matrix A is also known as the likelihood mapping (Table 1). Here, A_ij means the probability that o_τ = i is realised given s_τ = j, and 1 ≤ τ ≤ t denotes an arbitrary time in the past or the present. Our agent receives o_τ, infers latent variables (hidden states) s_τ, and provides a feedback decision δ_τ := (δ^(1)_τ, …, δ^(N_δ)_τ)^T to the external milieu. Thus, the state transition at time τ depends on the previous decision δ_{τ−1}, characterised by the state transition matrix B_δ: P(s_τ|s_{τ−1}, δ_{τ−1}, B) = Cat(B_δ). Moreover, the decision at time τ is conditioned on the previous state s_{τ−1} and the current risk γ_t, characterised by the policy mapping C, P(δ_τ|s_{τ−1}, γ_t, C), where γ_t contextualises P(δ_τ|s_{τ−1}, γ_t, C) as described below. Each element of s_τ, o_τ, and δ_τ adopts a binary value, which is suitable for characterising generative models implicit in canonical neural networks. Note that when dealing with external states that factorise (e.g. what and where), block matrices A, B and C are the outer products of submatrices (please refer to the Methods section 'Generative model' for further details; see also ref. 22 ). Hence, we define the generative model as follows:

P(o_1:t, δ_1:t, s_1:t, γ_t, θ) = P(γ_t) P(θ) ∏_{τ=1}^{t} P(o_τ|s_τ, A) P(s_τ|s_{τ−1}, δ_{τ−1}, B) P(δ_τ|s_{τ−1}, γ_t, C),    (2)

where θ := {A, B, C} constitutes the set of parameters, and P(s_1|s_0, δ_0, B) = P(s_1) and P(δ_1|s_0, γ_t, C) = P(δ_1) denote the probabilities at τ = 1. P(θ) and P(γ_t) are the prior distributions of the parameters and the risk, which implies that θ and γ_t are treated as random variables in this work (Table 1). Initial states and decisions are characterised by prior distributions P(s_1) = Cat(D) and P(δ_1) = Cat(E), where D and E are block vectors. The agent makes decisions to minimise a risk function Γ_t := Γ(o_1:t, s_1:t, δ_1:t−1, θ) that it employs (where 0 ≤ Γ_t ≤ 1; see Table 1). Because the current risk Γ_t is a consequence of past decisions, the agent needs to select decisions that minimise the future risk. In this sense, Γ_t is associated with the expected free energy and precision in the usual formulation of active inference 17,18 (see the Methods section 'Generative model' and Supplementary Methods 2 for details).

Fig. 2 The generative model, expressed as a POMDP (see ref. 71 ). The arrows from the present risk γ_t, sampled from Γ_t, to past decisions δ_τ optimise the policy in a post hoc manner, to minimise future risk. In reality, the current error γ_t is determined based on past decisions (top). In contrast, decision making to minimise the future risk implies a fictive causality from γ_t to δ_τ (bottom). Inference and learning correspond to the inversion of this generative model. Postdiction of past decisions is formulated as the learning of the policy mapping, conditioned on γ_t. Here, A, B and C indicate matrices of conditional probabilities, and bold case variables are the corresponding posterior beliefs. Moreover, D* and E* indicate the true prior beliefs about hidden states and decisions, while D and E indicate the priors that the network operates under. When and only when D = D* and E = E*, inferences and behaviours are optimal for a given task or set of environmental contingencies, and are biased otherwise.
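A generative process of this general form (likelihood A, decision-dependent transitions B, and a policy mapping C from the previous state) can be sketched as a forward sampler. The small two-state matrices and the prior D below are illustrative assumptions of ours, not parameters from this work.

```python
import numpy as np

rng = np.random.default_rng(1)

A = np.array([[0.9, 0.1],                 # P(o | s): columns sum to 1
              [0.1, 0.9]])
B = np.array([[[0.8, 0.3], [0.2, 0.7]],   # P(s' | s, delta=0)
              [[0.3, 0.8], [0.7, 0.2]]])  # P(s' | s, delta=1)
C = np.array([[0.6, 0.2],                 # P(delta | previous state)
              [0.4, 0.8]])
D = np.array([0.5, 0.5])                  # prior over the initial state

def cat(p):
    """Draw one sample from a categorical distribution p."""
    return rng.choice(len(p), p=p)

s = cat(D)
delta = cat(C[:, s])
traj = []
for _ in range(10):
    o = cat(A[:, s])                      # observation emitted by s
    traj.append((s, o, delta))
    s = cat(B[delta][:, s])               # transition depends on decision
    delta = cat(C[:, s])                  # next decision from current state

assert len(traj) == 10
assert all(v in (0, 1) for step in traj for v in step)
```

Inference inverts exactly this process: given the o-sequence, the agent's posterior beliefs over s, δ and the matrices are what the variational scheme below updates.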
To characterise optimal decisions as minimising expected risk, our POMDP model uses a fictive mapping from the current risk Γ_t to past decisions δ_1, …, δ_{t−1} (Fig. 2). Although this is not the true causality in the real generative process that generates sensory data, it models the manner in which an agent subjectively evaluates its previous decisions after experiencing their consequences. This fictive causality is expressed in the form of a categorical distribution, wherein the policy mapping C is switched by a binarised risk γ_t ∈ {0, 1}, sampled from P(γ_t) = Cat((1 − Γ_t, Γ_t)^T), in the form of a mixture model (Eq. (3)). We select this form of generative model because it speaks to the neuromodulation of synaptic plasticity, as shown in the next section. Equation (3) says that the probability of selecting a decision δ_τ after receiving s_{τ−1} is determined by matrix C when γ_t = 0, whereas it is inversely proportional to C (in the element-wise sense) when γ_t = 1. We note that matrix C° denotes a normalisation factor that can be dropped from the subsequent formulations, and ⊘ indicates the element-wise division operator. Throughout the manuscript, an overline indicates one minus the variable; e.g. γ̄_t = 1 − γ_t. Importantly, the agent needs to keep selecting 'good' decisions while avoiding 'bad' decisions. To this end, Eq. (3) supposes that the agent learns from the failure of decisions, by assuming that bad decisions were sampled from the opposite of the optimal policy mapping. In other words, the agent is assumed to hold the prior belief that a decision sampled from Cat(C) should result in γ_t = 0, while sampling from Cat(C° ⊘ C) should yield γ_t = 1. This construction enables the agent to conduct a postdiction of its past decisions, and thereby to update the policy mapping to minimise future risk, by associating the past decision rule (policy) with the current risk. Further details are provided in the Methods section 'Generative model'.
In the next section, we will explain the biological plausibility of this form of adaptive behavioural control, wherein the update of the policy mapping turns out to be identical to a delayed modulation of Hebbian plasticity.
In short, Eq. (3) presumes that a past decision δ_τ (1 ≤ τ ≤ t − 1) is determined based on a past state s_{τ−1} and the current risk γ_t. In contrast, the current decision δ_t is determined to minimise the future risk, P(δ_t|s_{t−1}, C) = Cat(C), because the agent has not yet observed the consequences of the current decision. We note that although, by convention, active inference uses C to denote the prior preference, this work uses C to denote a mapping that determines a decision depending on the previous state; herein, the prior preference is implicit in the risk function Γ_t. By construction, C° does not explicitly appear in the inference; thus, it is omitted in the following formulations.
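The switch between the policy mapping and its 'opposite' can be sketched numerically. C below is an assumed 2×2 policy mapping, and element-wise inversion followed by column renormalisation stands in for the distribution inversely proportional to C described above.

```python
import numpy as np

C = np.array([[0.7, 0.25],
              [0.3, 0.75]])        # P(delta | s_prev) when gamma = 0

inv = 1.0 / C                      # element-wise inversion
C_bad = inv / inv.sum(axis=0)      # renormalise columns: "opposite" policy

def policy(gamma):
    """Policy mapping in effect under binarised risk gamma."""
    return C if gamma == 0 else C_bad

# Low risk keeps the preferred decision likely; high risk inverts it.
assert policy(0)[0, 0] > 0.5
assert policy(1)[0, 0] < 0.5
assert np.allclose(policy(1).sum(axis=0), 1.0)   # columns still normalised
```

Treating a 'bad' decision as a draw from the inverted mapping is what lets the agent learn from failures: credit assigned to C is withdrawn rather than merely left unreinforced.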
Variational Bayesian inference refers to the process that optimises the posterior belief Q(ϑ). Based on the mean-field approximation, Q(ϑ) is expressed as

Q(ϑ) = Q(A) Q(B) Q(C) ∏_{τ=1}^{t} Q(s_τ) Q(δ_τ).    (4)

Here, the posterior beliefs about states and decisions are categorical distributions, Q(s_τ) = Cat(s_τ) and Q(δ_τ) = Cat(δ_τ), whereas those about parameters are Dirichlet distributions, Q(A) = Dir(a), Q(B) = Dir(b), and Q(C) = Dir(c). Throughout the manuscript, bold case variables (e.g. s_τ) denote the posterior expectations of the corresponding italic case random variables (e.g. s_τ). Thus, s_τ forms a block vector that represents the posterior probabilities of the elements of s_τ taking 1 or 0. The agent samples a decision δ_t at time t from the posterior distribution Q(δ_t). In this paper, the posterior belief about the transition mapping is averaged over all possible decisions, B = E_Q(δ)[B_δ], to ensure the exact correspondence to canonical neural networks. We use θ := {a, b, c} to denote the parameter posteriors. For simplicity, we suppose here that the state and decision priors (D, E) are fixed.
Under the above-defined generative model and posterior beliefs, the ensuing variational free energy is analytically expressed as Eq. (5); the derivation details are provided in the Methods section 'Variational free energy'. Note that Γ_{t,τ} = 0 for τ = t; otherwise, Γ_{t,τ} = Γ_t. The order-ln t term indicates the complexity of the parameters, which is negligible when the leading-order term is large. The gradient descent on variational free energy updates the posterior beliefs about hidden states (s_t), decisions (δ_t) and parameters (θ). The optimal posterior beliefs that minimise variational free energy are obtained as the fixed point of the implicit gradient descent, which ensures that ∂F/∂s_t = 0, ∂F/∂δ_t = 0 and ∂F/∂θ = O. The explicit forms of the posterior beliefs are provided in the Methods section 'Inference and learning'.
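The fixed-point property can be illustrated for a single categorical belief: gradient descent on F = E_Q[ln Q − ln P(o, s)], with a toy joint distribution of our own choosing, converges to the exact posterior, i.e. the point where the gradient of F with respect to the belief vanishes.

```python
import numpy as np

joint = np.array([0.63, 0.06])     # assumed P(o, s) for one fixed o
exact = joint / joint.sum()        # exact posterior P(s | o)

# Parameterise the belief q by unconstrained logits z (softmax keeps q on
# the simplex), and descend the free-energy gradient.
z = np.zeros(2)
for _ in range(5000):
    q = np.exp(z) / np.exp(z).sum()
    grad_q = np.log(q) - np.log(joint) + 1.0     # ∂F/∂q
    grad_z = q * (grad_q - q @ grad_q)           # chain rule via softmax
    z -= 0.1 * grad_z

q = np.exp(z) / np.exp(z).sum()
assert np.allclose(q, exact, atol=1e-3)   # fixed point = exact posterior
```

The same logic, applied factor by factor under the mean-field approximation, is what yields the explicit posterior updates referenced above.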
To explicitly demonstrate the formal correspondence with the cost functions for neural networks considered in the next section, we further transform the variational free energy as follows. Based on Bayes' theorem, P(s_τ|s_{τ−1}, B_δ) ∝ P(s_{τ−1}|s_τ, B_δ) P(s_τ), the inverse transition mapping is expressed as B† = B^T diag[D]^{−1} using the state prior P(s_τ) = Cat(D) (where P(s_{τ−1}) is supposed to be a flat prior belief). Moreover, from Bayes' theorem, P(δ_τ|s_{τ−1}, γ_t, C) ∝ P(s_{τ−1}|δ_τ, γ_t, C) P(δ_τ), the inverse policy mapping is expressed as C† = C^T diag[E]^{−1} using the decision prior P(δ_t) = Cat(E). Using these relationships, Eq. (5) is transformed into the form shown in Fig. 3 (top). Please see the Methods section 'Variational free energy' for further details. This specific form of variational free energy constitutes a class of cost functions for canonical neural networks, as we will see below.
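The inverse-mapping identity can be checked numerically; the transition matrix B and state prior D below are assumptions for illustration.

```python
import numpy as np

B = np.array([[0.8, 0.4],          # B[i, j] = P(s_next = i | s_prev = j)
              [0.2, 0.6]])
D = np.array([0.6, 0.4])           # state prior Cat(D) over s_next

B_dagger = B.T @ np.diag(1.0 / D)  # inverse mapping B^T diag[D]^{-1}

# Under a flat prior over s_prev, Bayes gives
# P(s_prev = j | s_next = i) ∝ B[i, j], so column i of B_dagger is
# proportional to the backward posterior over s_prev.
assert np.allclose(B_dagger[:, 0], B[0] / D[0])
assert np.allclose(B_dagger[:, 0] / B_dagger[:, 0].sum(),
                   B[0] / B[0].sum())
```

The diag[D]^{−1} factor carries the prior term that reappears as a firing-threshold perturbation in the network cost function below.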
In summary, variational free energy minimisation underwrites the optimisation of posterior beliefs. In neurobiological formulations, it is usually assumed that neurons encode s_t and δ_t, while synaptic strengths encode θ 17,18 . In what follows, we demonstrate that the internal states of canonical neural networks encode posterior beliefs.
Canonical neural networks perform active inference. In this section, we identify the neuronal substrates that correspond to components of the active inference scheme defined above. We consider a class of two-layer neural networks with recurrent connections in the middle layer (Fig. 1a). The modelling of these networks (referred to as canonical neural networks) rests on the following three assumptions, which reflect physiological knowledge: (1) gradient descent on a cost function L determines the updates of neural activity and synaptic weights (see the Methods section 'Neural networks' for details); (2) neural activity is updated by the weighted sum of inputs, and its fixed point is expressed in the form of a sigmoid (or logistic) function; and (3) a modulatory factor mediates synaptic plasticity in a post hoc manner.
Based on assumption 2, we formulate neural activity in the middle layer (x) and output layer (y) as follows:

ẋ(t) ∝ −sig^{−1}(x(t)) + (W_1 − W_0) o(t) + (K_1 − K_0) x(t − Δt) − h
ẏ(t) ∝ −sig^{−1}(y(t)) + (V_1 − V_0) x(t) − m.    (6)

Here, x(t) := (x_1(t), …, x_{N_x}(t))^T and y(t) := (y_1(t), …, y_{N_y}(t))^T denote column vectors of firing intensities; o(t) := (o_1(t), …, o_{N_o}(t))^T is a column vector of binary sensory inputs; W_1, W_0, K_1, K_0, V_1, V_0 are synaptic strength matrices; and h and m are firing thresholds (comprising the components h_1, h_0 and m_1, m_0 introduced below). One may think of W_1, K_1 and V_1 as excitatory synapses, whereas W_0, K_0 and V_0 can be regarded as inhibitory synapses.
Here, (W_1 − W_0) o(t) represents the total synaptic input from the sensory layer, and (K_1 − K_0) x(t − Δt) forms a recurrent circuit with a time delay Δt > 0. Receiving inputs from the middle layer x(t), the output-layer neural activity y(t) determines the decision δ(t) := (δ_1(t), …, δ_{N_δ}(t))^T, that is, Prob[δ_i(t) = 1] = y_i(t). We select the inverse sigmoid (i.e. logit) leak current to ensure that the fixed point of Eq. (6) (i.e. the x and y that ensure ẋ = 0 and ẏ = 0) has the form of a sigmoid activation function (c.f., assumption 2). The sigmoid activation function is also known as the neurometric function 34 .
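The role of the logit leak can be verified in a one-neuron sketch: integrating ẋ ∝ −sig^{−1}(x) + u drives x to the sigmoid fixed point sig(u). The input u and the integration step are illustrative assumptions.

```python
import numpy as np

def sig(u):
    return 1.0 / (1.0 + np.exp(-u))

def logit(x):
    return np.log(x / (1.0 - x))   # inverse sigmoid, sig^{-1}

u = 1.3        # total synaptic input minus threshold (assumed constant)
x = 0.5        # initial firing intensity
for _ in range(500):
    x += 0.05 * (-logit(x) + u)    # Euler step of x_dot ∝ -sig^{-1}(x) + u

assert abs(x - sig(u)) < 1e-6      # fixed point is the sigmoid of the input
```

Because −1/(x(1 − x)) < 0, the logit leak makes the fixed point stable for any constant input, which is why the steady-state transfer function is exactly sigmoidal.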
Without loss of generality, Eq. (6) can be cast as a gradient descent on a cost function L. Such a cost function can be identified by simply integrating the right-hand side of Eq. (6) with respect to x and y, consistent with previous treatments 22 . Moreover, we presume that the output-layer synapses (V_1, V_0) are updated through synaptic plasticity mediated by the modulator Γ(t) (c.f., assumption 3; 0 ≤ Γ(t) ≤ 1), as a model of the plasticity modulations that are empirically observed [31][32][33] . Because neural activity and synaptic plasticity minimise the same cost function L, the derivatives of L must generate the modulated synaptic plasticity. Under these constraints reflecting assumptions 1-3, a class of cost functions is identified (Eq. (7)), where x̄(τ) := 1⃗ − x(τ) with a vector of ones 1⃗ = (1, …, 1)^T.
Here, the O(1) term, a function of synaptic strengths, is of a smaller order than the other terms, which are of order t; thus, it is negligible when t is large. We suppose Γ(t, τ) = 0 for t − Δt < τ ≤ t and Γ(t, τ) = Γ(t) for 0 ≤ τ ≤ t − Δt, to satisfy assumptions 1-3. This means that the optimisation of L by associative plasticity is mediated by Γ(t). We note that a gradient descent on L, i.e. ẋ ∝ −d/dt · ∂L/∂x and ẏ ∝ −d/dt · ∂L/∂y, has the same functional form (and solution) as Eq. (6) (see the Methods section 'Neural networks' and Supplementary Methods 4 for further details).
Synaptic plasticity rules conjugate to the above rate coding model can now be expressed as a gradient descent on the same cost function L, according to assumption 1. To simplify notation, we define the synaptic strength matrices as ω_i ∈ {W_1, W_0, K_1, K_0, V_1, V_0}, presynaptic activity as pre_i(t) ∈ {o(t), o(t), x(t − Δt), x(t − Δt), x(t − Δt), x(t − Δt)}, postsynaptic activity as post_i(t) ∈ {x(t), x(t), x(t), x(t), y(t), y(t)}, and firing thresholds as n_i ∈ {h_1, h_0, h_1, h_0, m_1, m_0}. Note that some variables (e.g. x(t)) appear several times because some synapses connect to the same pre- or postsynaptic neurons as other synapses. Thus, synaptic plasticity in the middle layer (i = 1, …, 4) is derived as Eq. (8), and synaptic plasticity in the output layer (i = 5, 6) is derived as Eq. (9). Here, ⟨post_i(t) pre_i(t)^T⟩ := (1/t) ∫_0^t post_i(τ) pre_i(τ)^T dτ indicates the average over time, ⊙ indicates the element-wise product operator, and t ≫ Δt.

Fig. 3 Mathematical equivalence between variational free energy and neural network cost functions, depicted by a one-to-one correspondence of their components. Top: variational free energy transformed from Eq. (5) using Bayes' theorem. Here, B† = B^T diag[D]^{−1} and C† = C^T diag[E]^{−1} indicate the inverse mappings, and D and E are the state and decision priors. Bottom: the neural network cost function that is a counterpart to the aforementioned variational free energy. In this equation, Ŵ_l := sig(W_l), K̂_l := sig(K_l), and V̂_l := sig(V_l) (for l = 0, 1) indicate the sigmoid functions of synaptic strengths. Moreover, ϕ_l and ψ_l are perturbation terms that characterise the bias in firing thresholds.
These synaptic update rules are biologically plausible as they comprise Hebbian plasticity, determined by the outer product of pre- and postsynaptic activity, accompanied by an activity-dependent homoeostatic term. In Eq. (9), the neuromodulator Γ(t), which encodes an arbitrary risk, alters the form of Hebbian plasticity in a post hoc manner. This can facilitate the association between past decisions and the current risk, thus leading to the optimisation of the decision rule to minimise future risk. In short, Γ(t) < 0.5 yields Hebbian plasticity, whereas Γ(t) > 0.5 yields anti-Hebbian plasticity. Empirical observations suggest that some modulators [31][32][33] , such as dopamine neurons [35][36][37] , are a possible neuronal substrate of Γ(t); please see the Discussion for further details.
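The sign-switching effect of the modulator can be sketched with an illustrative rule (not Eq. (9) itself) in which Γ(t) linearly mixes a Hebbian outer product with its negative, so that Γ < 0.5 potentiates co-active pairs and Γ > 0.5 depresses them.

```python
import numpy as np

rng = np.random.default_rng(2)

pre = rng.random((4, 100))          # presynaptic activity traces (assumed)
post = rng.random((3, 100))         # postsynaptic activity traces (assumed)
hebb = post @ pre.T / 100           # time-averaged <post pre^T>

def dV(gamma):
    """Illustrative modulated plasticity: sign set by the risk gamma."""
    return (1.0 - 2.0 * gamma) * hebb

assert np.all(dV(0.1) > 0)          # Gamma < 0.5: Hebbian (potentiation)
assert np.all(dV(0.9) < 0)          # Gamma > 0.5: anti-Hebbian
assert np.allclose(dV(0.5), 0)      # neutral at Gamma = 0.5
```

Because Γ(t) is applied with a delay to earlier pre/post coincidences, it tags past decision-related activity with the risk those decisions eventually produced, which is the substrate of the postdiction described above.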
Based on the above considerations, we now establish the formal correspondence between the neural network cost function and variational free energy. Under the aforementioned three minimal assumptions, we identified the neural network cost function as Eq. (7). Equation (7) can be transformed into the form shown in Fig. 3 (bottom) using sigmoid functions of synaptic strengths (e.g. Ŵ_l := sig(W_l) for l = 0, 1). Here, the firing thresholds (h_l, m_l) are replaced with the perturbation terms in the thresholds, ϕ_l and ψ_l. Figure 3 depicts the formal equivalence between the neural network cost function (Fig. 3, bottom) and variational free energy (Fig. 3, top), visualised by a one-to-one correspondence between their components. The components of variational free energy, including the log-likelihood function and the complexities of states and decisions, re-emerge in the neural network cost function.
This means that when (x(τ)^T, y(τ)^T)^T encodes the posterior expectations (s_τ^T, δ_τ^T)^T, the neural network cost function is identical to variational free energy, up to the negligible ln t residual. This further endorses the asymptotic equivalence of Eqs. (5) and (7).
The neural network cost function is characterised by the perturbation terms implicit in the firing thresholds, ϕ and ψ. These terms correspond to the state and decision priors, ln P(s_t) = ln D = ϕ and ln P(δ_t) = ln E = ψ, respectively. Further, an arbitrary neuromodulator that regulates synaptic plasticity as depicted in Eq. (9) plays the role of the risk function in the POMDP model defined in Eq. (2), where the equivalence can be confirmed by comparing Eqs. (9) and (18). Hence, this class of cost functions for canonical neural networks is formally homologous to variational free energy under the particular form of the POMDP generative model defined in the previous section. In other words, Eqs. (2) and (5) express the class of generative models, and ensuing variational free energy, that ensure Eq. (1) is apt for the class of canonical neural networks considered. We obtained this result based on analytic derivations, without reference to the complete class theorem, thereby confirming the proposition in Eq. (1). This in turn suggests that any canonical neural network in this class is implicitly performing active inference. Table 2 summarises the correspondence between the quantities of the neural network and their homologues in variational Bayes.
In summary, when a neural network minimises the cost function with respect to its activity and plasticity, the network self-organises to furnish responses that minimise the risk implicit in Γ(t) and the cost function. This biological optimisation is identical to variational free energy minimisation under a particular form of the POMDP model. Hence, this equivalence indicates that minimising the expected risk through variational free energy minimisation is an inherent property of canonical neural networks featuring a delayed modulation of Hebbian plasticity.

[Table 2 note: Bold-case variables (e.g. s_τ) denote the posterior expectations of the corresponding italic-case random variables (e.g. s_τ). Note that W_l^init, K_l^init and V_l^init are the initial values of W_l, K_l and V_l (for l = 0, 1), and λ_{W_l}, λ_{K_l} and λ_{V_l} are inverse learning rate factors that express the insensitivity of synaptic strengths to plasticity. Please refer to the previous paper 22 for details.]
Numerical simulations. Here, we demonstrate the performance of canonical neural networks using maze tasks, as an example of a delayed reward task. The agent comprised the aforementioned canonical neural networks (Fig. 4a). Thus, it implicitly performs active inference by minimising variational free energy. The maze affords a discrete state space (Fig. 4b). The agent received the states of the neighbouring cells as sensory inputs, and its neural activity represented the hidden states (Fig. 4a, panels on the right; see Methods section 'Simulations' for further details). Although we denoted s as hidden states, the likelihood mapping A was a simple identity mapping in these simulations. When solving Eqs. (6), (8) and (9), the agent's neural network implicitly updates posterior beliefs about its behaviour based on the policy mapping. It then selects an appropriate action to move towards a neighbouring cell according to the inferred policy. The action was accepted if the selected movement was allowed. Before training, the agent moved in a random direction at each step, resulting in a failure to reach the goal position (right end) within the time limit. During training, the neural network updated synaptic strengths depending on its neural activity and ensuing outcomes (i.e. risk). The training comprised a cycle of action and learning phases. In the action phase, the agent enacted a sequence of decisions until it reached the goal or T = 2 × 10^4 time steps passed (Fig. 4c). In the learning phase, the agent evaluated the risk associated with past decisions after a certain period: the risk was minimum (i.e. Γ(t) = 0) if the agent moved rightwards by a certain distance during the period; otherwise, Γ(t) = 0.45 if the agent moved rightwards during the period, or Γ(t) = 0.55 if it did not. The synaptic strengths V (i.e. the policy mapping) were then potentiated if the risk was low, or suppressed otherwise, based on Eq. (9).
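The risk schedule used in the learning phase can be sketched as follows. The function name and the explicit distance-threshold parameter are illustrative, and the handling of exactly zero rightward displacement is an assumption:

```python
def evaluate_risk(dx, distance_threshold):
    """Risk assignment for the maze learning phase, as described:
    Gamma(t) = 0    if rightward displacement dx reaches the required distance,
    Gamma(t) = 0.45 if the agent moved rightwards but fell short,
    Gamma(t) = 0.55 if it made no rightward progress.
    """
    if dx >= distance_threshold:
        return 0.0
    return 0.45 if dx > 0 else 0.55
```

Since 0.45 < 0.5 < 0.55, falling short while still moving rightwards yields weak potentiation of the policy mapping, whereas no rightward progress yields weak depression, consistent with the Hebbian/anti-Hebbian switch at Γ(t) = 0.5.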
This mechanism made it possible to optimise decision making. Other synapses (W, K) were also updated based on Eq. (8), although we assumed a small learning rate to focus on the implicit policy learning. Through training, the neural network of the agent self-organised its behaviour to efficiently secure its goal (Fig. 4d). We also observed that modulations of Hebbian plasticity without delay did not lead to optimal behaviour, resulting in a considerably higher probability of failing to reach the goal. These results indicate that delayed modulation is essential to enable canonical neural networks to solve delayed reward tasks.
With this setup in place, we numerically validated the dependency of performance on the threshold factors (ϕ, ψ). Consistent with our theoretical prediction that ϕ and ψ encode prior beliefs about hidden states (D) and decisions (E), alterations of ψ = ln E from the optimum to a suboptimal value changed the landscape of the cost function (i.e. variational free energy), thereby yielding suboptimal inferences and decisions (in relation to the environment). Subsequently, the suboptimal firing thresholds led to a suboptimal behavioural strategy, taking a longer time or failing to reach the goal (Fig. 4e). Thus, we could attribute the agent's impaired performance to its suboptimal priors. This treatment renders the neural activity and adaptive behaviours of the agent highly explainable and manipulable in terms of the appropriate prior beliefs, implicit in firing thresholds, for a given task or environment. In other words, these results suggest that firing thresholds are the neuronal substrates that encode state and decision priors, as predicted mathematically.
Furthermore, when the updating of ϕ and ψ is slow in relation to experimental observations, ϕ and ψ can be estimated through Bayesian inference based on empirically observed neuronal responses (see Methods section 'Data analysis' for details). Using this approach, we estimated the implicit prior E, which is encoded by ψ, from sequences of neural activity generated by the synthetic neural networks used in the simulations reported in Fig. 4. We confirmed that the estimator was a good approximation to the true E (Fig. 5a). The estimation of ϕ and ψ based on empirical observations enabled the reconstruction of the cost function (i.e. variational free energy) that an agent employs. The resulting cost function could predict subsequent learning of behaviours within previously unexperienced, randomly generated mazes, without observing neural activity and behaviour (Fig. 5b). This is because, given the canonical neural network at hand, the learning self-organisation is based exclusively on the state and decision priors implicit in ϕ and ψ. Therefore, the identification of these implicit priors is sufficient to asymptotically determine the fixed point of synaptic strengths when t is large (see Methods section 'Neural networks' for further details; see also ref. 22). These results highlight the utility of the proposed equivalence for understanding the neuronal mechanisms underlying the adaptation of neural activity and behaviour through the accumulation of past experiences and ensuing outcomes.

Discussion
Biological organisms formulate plans to minimise future risks. In this work, we captured this characteristic in biologically plausible terms under minimal assumptions. We derived simple differential equations that can be plausibly interpreted in terms of a neural network architecture that entails degrees of freedom with respect to certain free parameters (e.g. firing thresholds). These free parameters play the role of prior beliefs in the variational Bayesian formulation. Thus, the accuracies of inferences and decisions depend upon the prior beliefs implicit in neural networks. Consequently, synaptic plasticity with false prior beliefs leads to suboptimal inferences and decisions for any task under consideration.
[Fig. 5 caption (partial): a … Fig. 4). Here, ψ was estimated through Bayesian inference based on sequences of neural activity, obtained with ten distinct mazes. Then, E_right was computed by ln E_1 = ψ_1 for each of 64 elements. The other 192 elements of E_1 (i.e. E_left, E_up, E_down) were also estimated. The sum of all the elements of E_1 was normalised to 1. b Prediction of the learning process within previously unexperienced, randomly generated mazes. Using the estimated E, we reconstructed the computational architecture (i.e. neural network) of the agent. Then, we simulated the adaptation process of the agent's behaviour using the reconstructed neural network and computed the trajectory of the probability of failure to reach the goal within T = 2 × 10^4 time steps. The resulting learning trajectories (solid lines) predict the learning trajectories of the original agent (dashed lines) under three different prior conditions, in the absence of observed neural responses and behaviours. Lines and shaded areas indicate the mean and standard error, respectively. Inset panels depict comparisons between the failure probability of the original and reconstructed agents after learning (average over sessions 51-100), within ten previously unexperienced mazes. Refer to Methods section 'Data analysis' for further details.]

Based on the view of the brain as an agent that performs Bayesian inference, neuronal implementations of Bayesian belief updating have been proposed, which enable neural networks to store and recall spiking sequences 8, learn temporal dynamics and
causal hierarchy 9 , extract hidden causes 10 , solve maze tasks 11 and make plans to control robots 12 . In these approaches, the update rules are generally derived from Bayesian cost functions (e.g. variational free energy). However, the precise relationship between these update rules and the neural activity and plasticity of canonical neural networks has yet to be fully established.
We identified a one-to-one correspondence between neural network architecture and a specific POMDP implicit in that network. Equation (2) speaks to a unique POMDP model consistent with the neural network architecture defined in Eq. (6), where the correspondences are summarised in Table 2. This means that our scheme can be used to identify the form of the POMDP, given an observable circuit structure. Moreover, the free parameters that parameterise Eq. (6) can be estimated using Eq. (24). This means that the generative model and ensuing variational free energy can, in principle, be reconstructed from empirical data. This offers a formal characterisation of implicit Bayesian models entailed by neural circuits, thereby enabling a prediction of subsequent learning. Our numerical simulations, accompanied by previous work 22, show that canonical neural networks behave systematically, and in a distinct way, depending on the implicit POMDP (Fig. 4). This kind of self-organisation depends on implicit prior beliefs, which can therefore be characterised empirically (Fig. 5).
Simple Hebbian plasticity strengthens synaptic wiring when pre- and postsynaptic neurons fire together, which enhances the association between (presynaptic) causes and (postsynaptic) consequences 28. Hebbian plasticity depends on the activity levels 29,30, spike timings 38,39 or burst timings 40 of pre- and postsynaptic neurons. Furthermore, modulatory factors can regulate the magnitude and parity of Hebbian plasticity, possibly with some time delay, leading to the emergence of various associative functions 31-33. This means that neuromodulators can be read as encoding a precision that regulates inference and learning 41. These modulations have been observed empirically with various neuromodulators and neurotransmitters, such as dopamine 35-37,42,43, noradrenaline 44,45, muscarine 46 and GABA 47,48, as well as glial factors 49.
In particular, a delayed modulation of synaptic plasticity is well documented for dopamine neurons 35-37. This speaks to a learning scheme that is conceptually distinct from standard reinforcement learning algorithms, such as temporal-difference learning with actor-critic models based on state-action value functions 3. Please see previous work 50 for a detailed comparison between active inference and reinforcement learning. Delayed modulations are also observed with noradrenaline and serotonin 51. We mathematically demonstrated that such plasticity enhances the association between the pre-post mapping and the future value of the modulatory factor, where the latter is cast as a risk function. This means that postsynaptic neurons self-organise to react in a manner that minimises future risk. Crucially, this computation corresponds formally to variational Bayesian inference under a particular form of POMDP generative models, suggesting that the delayed modulation of Hebbian plasticity is a realisation of active inference. Regionally specific projections of neuromodulators may allow each brain region to optimise its activity to minimise risk and leverage a hierarchical generative model implicit in cortical and subcortical hierarchies. This is reminiscent of theories of neuromodulation and (meta-)learning developed previously 52. Our work may prove useful when casting these theories in terms of generative models and variational free energy minimisation.
The complete class theorem [19][20][21] ensures that any neural network, whose activity and plasticity minimise the same cost function, can be cast as performing Bayesian inference. However, identifying the implicit generative model that underwrites any canonical neural network is a more delicate problem because the theorem does not specify a form of a generative model for a given canonical neural network. The posterior beliefs are largely shaped by prior beliefs, making it challenging to identify the generative model by simply observing systemic dynamics. To this end, it is necessary to commit to a particular form of the generative model and elucidate how the posterior beliefs are encoded or parameterised by the neural network states. This work addresses these issues by establishing a reverse-engineering approach to identify a generative model implicit in a canonical neural network, thereby establishing one-to-one correspondences between their components. Remarkably, a network of rate coding models with a sigmoid activation function formally corresponds to a class of POMDP models, which provide an analytically tractable example of the present equivalence (please refer to the previous paper 22 for further discussion).
It is remarkable that the proposed equivalence can be leveraged to identify the generative model that an arbitrary neural network implicitly employs. This contrasts with naive neural network models that address only the dynamics of neural activity and plasticity. If the generative model differs from the true generative process that generates the sensory input, inferences and decisions are biased (i.e. suboptimal) relative to Bayes optimal inferences and decisions based on the right sort of prior beliefs. In general, the implicit priors may or may not be equal to the true priors; thus, a generic neural network is typically suboptimal. Nevertheless, these implicit priors can be optimised by updating the free parameters (e.g. threshold factors ϕ, ψ) based on a gradient descent on the cost function L. By updating the free parameters, the network will, in principle, eventually become Bayes optimal for any given task. In essence, when the cost function is minimised with respect to neural activity, synaptic strengths and any other constants that characterise the cost function, the cost function becomes equivalent to variational free energy with the optimal prior beliefs. Simultaneously, the expected risk is minimised, because variational free energy is minimised only when the precision of the risk (γ_t) is maximised (see Methods section 'Generative model' for further details).
When the rate coding activation function differs from the sigmoid function, it can be assumed that neurons encode state posteriors under a generative model that differs from a typical POMDP model considered in the main text (see Supplementary Methods 4 for details; see also ref. 22 ). Nevertheless, the complete class theorem guarantees the existence of some pair of a generative model (i.e. priors) and cost function that corresponds to an arbitrary activation function. The form or time window of empirically observed plasticity rules can also be used to identify the implicit cost and risk functions-and further to reverse engineer the task or problem that the neural network is solving or learning: c.f., inverse reinforcement learning 53 . In short, neural activity and plasticity can be interpreted, universally, in terms of Bayesian belief updating.
The class of neural networks we consider can be viewed as a class of reservoir networks 54,55 . The proposed equivalence could render such reservoir networks explainable-and may provide the optimal plasticity rules for these networks to minimise future risk -by using the formal analogy to variational free energy minimisation (under the particular form of POMDP models). A clear interpretation of reservoir networks remains an important open issue in computational neuroscience and machine learning.
Assumption 1 places constraints on the relationship between neural activity and plasticity. The ensuing synaptic plasticity rules (Eqs. (8) and (9)) represent a key prediction of the proposed scheme. We have shown that these plasticity rules enable neural circuits to assimilate sensory inputs-as Bayes optimal encoders of external states-and generate responses, as Bayes optimal controllers that minimise risk. Tracking changes in synaptic strengths is empirically difficult in large networks, and thus the functional form of synaptic plasticity is more difficult to establish, compared to that of synaptic responses. The current scheme enables one to predict the nature of plasticity, given observed neural (and behavioural) responses, under ideal Bayesian assumptions. This is sometimes referred to as meta-Bayesian inference 56 . The validity of these predictions can, therefore, in principle, be verified using empirical measures of neural activity and behaviour. In short, the current scheme predicts specific plasticity rules for canonical neural networks, when learning a given task.
The equivalence between neural network dynamics and gradient flows on variational free energy is empirically testable using electrophysiological recordings or functional imaging of brain activity. For instance, neuronal populations in layers 2/3 and 5 of the parietal lobe of rodents have been shown to encode posterior expectations about hidden sound cues, which are used to reach a goal in uncertain environments 57. According to the current formulation, synaptic afferents to these expectation-coding populations, from neurons encoding sensory information, should obey the plasticity rule in Eq. (8) and self-organise to express the likelihood mapping (A). Related predictions are summarised in Table 2. We have previously shown that the self-organisation of in vitro neural networks minimises empirically computed variational free energy, in a manner consistent with variational free energy minimisation under a POMDP generative model 58,59. Our analyses in the present work speak to the predictive validity of the proposed formulation: when the threshold factors (ϕ, ψ) can be treated as constants during a short experiment, we obtain the analytical form of the fixed points of the synaptic update rules (Methods section 'Neural networks'). Furthermore, ϕ and ψ can be estimated using empirical data (Methods section 'Data analysis'). This approach enables the reconstruction of the cost function and prediction of the subsequent learning process, as demonstrated in Fig. 5 using in silico data. Hence, it is possible to examine the predictive validity of the proposed theory by comparing the predicted synaptic trajectory with the actual trajectory. In future work, we hope to address these issues using in vitro and in vivo data.
Crucially, the proposed equivalence guarantees that an arbitrary neural network that minimises its cost function (possibly implemented in biological organisms or neuromorphic hardware 60,61) can be cast as performing variational Bayesian inference. Thus, in addition to contributions to neurobiology, this notion can dramatically reduce the complexity of designing self-learning neuromorphic hardware for various types of tasks, offering a simple architecture and low computational cost. This leads to a unified design principle for biologically inspired machines, such as neuromorphic hardware, that perform statistically optimal inference, learning, prediction, planning and decision making.
In summary, a class of biologically plausible cost functions for canonical neural networks can be cast as variational free energy. Formal correspondences exist between priors, posteriors and cost functions. This means that canonical neural networks that optimise their cost functions implicitly perform active inference. This approach enables identification of the implicit generative model and reconstruction of variational free energy that neural networks employ. This means that, in principle, neural activity, behaviour and learning through plasticity can be predicted under Bayes optimality assumptions.

Methods
Generative model. The proposed POMDP model comprises N_s-dimensional hidden states s_t ∈ {0, 1}^{N_s} that depend on the previous states s_{t−1} through a transition probability B^δ, and a process of generating N_o-dimensional observations o_t ∈ {0, 1}^{N_o} from those states through a likelihood mapping A (Fig. 2). Here, the transition probability B^δ is a function of the N_δ-dimensional decisions of an agent, δ_t ∈ {0, 1}^{N_δ}, indicating that the agent's behaviour changes the subsequent states of the external milieu. Each state, observation and decision takes the value 1 or 0. We use o_{1:t} = {o_1, …, o_t} to denote a sequence of observations. Hereafter, i indicates the i-th observation, j the j-th hidden state and k the k-th decision.
Due to the multidimensional (i.e. factorial) nature of the states, A, B and C are usually the outer products of submatrices (i.e. tensors); see also ref. 22. The probability of an observation is determined by the likelihood mapping from s_t to o_t^{(i)}, in terms of a categorical distribution: P(o_t^{(i)} | s_t, A^{(i)}) = Cat(A^{(i)}) (see also Table 1). This encodes the probability that o_t^{(i)} takes 1 or 0 when s_t = (l_1, …, l_{N_s})^T ∈ {0, 1}^{N_s}. The hidden states are determined by the transition probability from s_{t−1} to s_t, depending on a given decision, in terms of a categorical distribution: P(s_t^{(j)} | s_{t−1}, B^{δ(j)}) = Cat(B^{δ(j)}). As defined in Eq. (3), the generative model of decisions δ_τ is conditioned on the current risk γ_t ∈ {0, 1}, which obeys P(γ_t = 1) = Γ_t. This can be viewed as a postdiction of past decisions. For 1 ≤ τ ≤ t − 1, the conditional probability of δ_τ^{(k)} has the form of a mixture model: P(δ_τ^{(k)} | s_{τ−1}, γ_t, C^{(k)}) = Cat(C^{(k)})^{1−γ_t} Cat(C′^{(k)})^{γ_t}, where C′^{(k)} denotes the complementary (unpreferred) policy mapping.
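To make the generative process concrete, the following toy sketch samples from categorical likelihood and transition mappings. The dimensions, the decision schedule and the rolled transition matrices are illustrative assumptions; the identity likelihood mirrors the maze simulations described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_categorical(p):
    """Draw an index from a categorical distribution p (a column of A or B)."""
    return rng.choice(len(p), p=p)

# Toy generative process: 3 state levels, identity likelihood, and
# decision-dependent transitions (increment vs decrement, mod 3).
A = np.eye(3)                                    # P(o_t | s_t)
B = np.stack([np.roll(np.eye(3), 1, axis=0),     # transitions under decision 0 (assumed)
              np.roll(np.eye(3), -1, axis=0)])   # transitions under decision 1 (assumed)

s, trajectory = 0, []
for t in range(4):
    o = sample_categorical(A[:, s])          # observation generated from the hidden state
    delta = t % 2                            # fixed decision schedule, for illustration
    s = sample_categorical(B[delta][:, s])   # next state depends on the decision
    trajectory.append((int(o), delta, int(s)))
```

Because every column here is one-hot, the trajectory is deterministic; with graded probabilities the same code samples stochastically.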
Conversely, for τ = t, δ_t^{(k)} is sampled from P(δ_t^{(k)} | s_{t−1}, γ_t, C^{(k)}) = Cat(C^{(k)}) to minimise the future risk, because the agent has not yet observed the consequence of the current decision.
Equation (3) represents a mechanism by which the agent learns to select preferred decisions. This is achieved by updating the policy mapping C depending on the consequences of previous decisions. Following Eq. (3), when observing γ_t = 0, the agent regards the past decisions as having been sampled from the preferable policy mapping C that minimises Γ_t, and thereby updates the posterior belief about C to facilitate the association between δ_τ and s_{τ−1}. In contrast, when observing γ_t = 1, the agent hypothesises that the past decisions were sampled from the unpreferable policy mapping, and thereby updates the posterior belief about C to reduce (forget) the association. In short, this postdiction evaluates past decisions after observing their consequences. The posterior belief about C is then updated by associating the past decision rule (policy) with the current risk, leading to the optimisation of decisions to minimise future risk. Interestingly, this behavioural optimisation mechanism, derived from Bayesian inference, turns out to be identical to a post hoc modulation of Hebbian plasticity (see Eqs. (7) and (9), and Methods section 'Neural networks').
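This postdictive update can be sketched on the Bayesian side as an update of the Dirichlet concentration parameters of the policy mapping. This is an illustrative reading: the (1 − 2γ) weighting mirrors the modulated Hebbian rule, and the positivity clipping is an implementation assumption, not part of the model:

```python
import numpy as np

def update_policy_concentration(c, decisions, prev_states, gamma):
    """Postdictive update of the policy-mapping concentration c (sketch).

    decisions / prev_states: lists of one-hot vectors delta_tau and
    s_{tau-1}. gamma = 0 strengthens the association between past
    decisions and preceding states; gamma = 1 weakens (forgets) it.
    """
    dc = sum(np.outer(d, s) for d, s in zip(decisions, prev_states))
    # Keep the concentration parameters strictly positive (assumption).
    return np.maximum(c + (1.0 - 2.0 * gamma) * dc, 1e-8)
```

The formal identity of this credit-assignment step with the post hoc modulation of Hebbian plasticity is the point made in the text.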
An advantage of this generative model-based on counterfactual causality-is that the agent does not need to explicitly compute the expected future risk based on the current states, because it instead updates the policy mapping C, by associating the current risk with past decisions. Note that this construction of risk corresponds to a simplification of expected free energy, that would normally include risk and ambiguity, where risk corresponds to the Kullback-Leibler divergence between the posterior predictive and prior distribution over outcomes 17,18 . However, by using a precise likelihood mapping, ambiguity can be discounted and expected free energy reduces to the sort of decision risk considered in this work. In other words, one can consider that the expected free energy is implicit in the policy mapping C. From this perspective, the risk Γ t is associated with precision that regulates the magnitude of expected free energy; refer to Supplementary Methods 2 for further details.
We can now define the generative model as Eq. (2), where P(s_1 | s_0, δ_0, B) = P(s_1) = Cat(D) and P(δ_1 | s_0, γ_t, C) = P(δ_1) = Cat(E) are assumed. We further suppose that δ_τ, given s_{τ−1}, is conditionally independent of {o_τ, s_τ}, and that only the generation of δ_τ depends on γ_t, as visualised in the factor graph (Fig. 2). Equation (2) can be further expanded in terms of the columns of the parameter matrices (e.g. c^{(k)}_{·l}). As described in the Results section, this form of the generative model is suitable to characterise the class of canonical neural networks defined by Eq. (6). This means that none of the aforementioned assumptions regarding the generative model limit the scope of the proposed equivalence between neural networks and variational Bayesian inference, as long as neural networks satisfy assumptions 1-3.
Variational free energy. The agent aims to minimise surprise, or equivalently maximise the marginal likelihood of outcomes, by minimising variational free energy as a tractable proxy. Thereby, it performs approximate or variational Bayesian inference. From the above-defined generative model, we motivate a mean-field approximation to the posterior distribution, Q(s_{1:t}, δ_{1:t}, θ) = ∏_{τ=1}^t Q(s_τ) Q(δ_τ) · Q(A) Q(B) Q(C). Here, the posterior beliefs about s_τ and δ_τ are categorical distributions, Q(s_τ) = Cat(s_τ) and Q(δ_τ) = Cat(δ_τ), respectively, whereas the posterior beliefs about A, B and C are Dirichlet distributions, Q(A) = Dir(a), Q(B) = Dir(b) and Q(C) = Dir(c), respectively. In this expression, s_τ and δ_τ represent the expectations between 0 and 1, and a, b and c express the (positive) concentration parameters.
In this paper, the posterior transition mapping is averaged over all possible decisions, B = E_{Q(δ)}[B^δ], to ensure an exact correspondence with canonical neural networks. Moreover, we suppose that A comprises the outer product of submatrices A^{(i,j)} ∈ R^{2×2}, to simplify the calculation of the posterior beliefs.
Variational free energy is defined as a functional of the posterior beliefs:

F(o_{1:t}, Q(s_{1:t}, δ_{1:t}, θ)) := E_{Q(s_{1:t}, δ_{1:t}, θ) P(γ_t)}[−ln P(o_{1:t}, δ_{1:t}, s_{1:t}, γ_t, θ) + ln Q(s_{1:t}, δ_{1:t}, θ)]
= ∑_{τ=1}^t E_{Q(s_τ) Q(s_{τ−1}) Q(A) Q(B)}[ln Q(s_τ) − ln P(o_τ | s_τ, A) − ln P(s_τ | s_{τ−1}, δ_{τ−1}, B^δ)]
+ ∑_{τ=1}^t E_{Q(δ_τ) Q(s_{τ−1}) Q(C) P(γ_t)}[ln Q(δ_τ) − ln P(δ_τ | s_{τ−1}, γ_t, C)]
+ D_KL[Q(θ) || P(θ)] + H[γ_t].

This provides an upper bound on the sensory surprise −ln P(o_{1:t}). Here, D_KL[·||·] is the complexity of the parameters, scored by the Kullback-Leibler divergence. Minimisation of variational free energy is attained when the entropy of the risk, H[γ_t] := E_{P(γ_t)}[−ln P(γ_t)] = −Γ_t ln Γ_t − (1 − Γ_t) ln(1 − Γ_t), is minimised. This is achieved when Γ_t shifts towards 0, meaning that risk minimisation is a corollary of variational free energy minimisation (the case where Γ_t shifts towards 1 is negligible). Under the POMDP scheme, F is expressed as a function of the posterior expectations, F = F(o_{1:t}, s_{1:t}, δ_{1:t}, θ).
Thus, using the vector expression, variational free energy under our POMDP model is provided as follows:

F = \sum_{\tau=1}^{t} \mathbf{s}_\tau \cdot (\ln \mathbf{s}_\tau - \ln \mathbf{A} \cdot o_\tau - \ln \mathbf{B} \mathbf{s}_{\tau-1})   [accuracy and state complexity]
+ \sum_{\tau=1}^{t} \boldsymbol{\delta}_\tau \cdot (\ln \boldsymbol{\delta}_\tau - (1 - 2\Gamma_{t,\tau}) \ln \mathbf{C} \mathbf{s}_{\tau-1})   [decision complexity]
+ (\mathbf{a} - a) \cdot \ln \mathbf{A} - \ln \mathrm{B}(\mathbf{a}) + (\mathbf{b} - b) \cdot \ln \mathbf{B} - \ln \mathrm{B}(\mathbf{b}) + (\mathbf{c} - c) \cdot \ln \mathbf{C} - \ln \mathrm{B}(\mathbf{c})   [parameter complexity]

Here, B(a_{·l}) := Γ(a_{1l}) Γ(a_{0l}) / Γ(a_{1l} + a_{0l}) is the beta function (where Γ(·) is the gamma function), and ln A · o_τ indicates the inner product of ln A and the one-hot-expressed o_τ, which is a shorthand for the sum over i of the product of (ln A^{(i,j)})^T ∈ R^{2×2} and (o_τ^{(i)}, 1 − o_τ^{(i)})^T ∈ {0, 1}^2. The first and second terms of Eq. (14), comprising the accuracy and the complexities of states and decisions, increase in proportion to time t. Conversely, the remaining terms, the complexity of parameters, increase on the order of ln t, which is thus negligible when t is large. Hence, we drop the parameter complexity by assuming that the scheme has experienced a sufficient number of observations. Please see the previous work 22 for further details. The entropy of the risk is omitted, as it does not explicitly appear in the inference.
Inference and learning. Inference optimises the posterior beliefs about the hidden states and decisions by minimising variational free energy. The posterior beliefs are updated by a gradient descent on F, ṡ_t ∝ −∂F/∂s_t and δ̇_t ∝ −∂F/∂δ_t. The fixed points of these updates furnish the posterior beliefs, which are computed analytically by solving ∂F/∂s_t = 0 and ∂F/∂δ_t = 0. Thus, from Eq. (15), the posterior belief about the hidden states is provided by Eq. (16), and the posterior belief about the decisions by Eq. (17). Here, σ(·) denotes the softmax function, and D and E denote the prior beliefs about hidden states and decisions, respectively, which we assume are fixed in this paper. Note that Eqs. (16) and (17) are equivalent to s_t = σ(ln A · o_t + ln B s_{t−1}) and δ_t = σ(ln C s_{t−1}), respectively, where s_t indicates a block column vector of the state posterior under the mean-field assumption, and s^{(i)}_{t1} is the posterior belief that s^{(i)}_t takes a value of 1. Because s^{(i)}_t takes a binary value, s^{(i)}_{t1} has the form of a sigmoid function. Here, we assume that only the state posterior s_t at the latest time is updated at each time t; thus, no message passing occurs from s_{t+1} to s_t, and the state posterior at time t − 1 retains its previous value. This treatment corresponds to a Bayesian filter, as opposed to the smoother usually considered in active inference schemes.
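The simplified forms of Eqs. (16) and (17) can be sketched directly. This is a minimal sketch: the inclusion of the log priors ln D and ln E as additive threshold-like terms follows the correspondence described in the text, and the matrix shapes are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def update_posteriors(lnA, lnB, lnC, lnD, lnE, o_t, s_prev):
    """Fixed-point posterior updates (cf. Eqs. (16) and (17)).

    s_t is the state posterior and delta_t the decision posterior.
    lnD and lnE are the log state/decision priors, which correspond to
    the firing-threshold terms; with flat priors these reduce to
    s_t = sigma(lnA @ o_t + lnB @ s_prev) and delta_t = sigma(lnC @ s_prev).
    """
    s_t = softmax(lnA @ o_t + lnB @ s_prev + lnD)
    delta_t = softmax(lnC @ s_prev + lnE)
    return s_t, delta_t
```

In the two-layer network analogy, lnA @ o_t and lnB @ s_prev are the synaptic inputs to the middle layer, lnC @ s_prev the input to the output layer, and the softmax implements the normalised sigmoid response.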
Equations (16) and (17) are analogous to a two-layer neural network that entails recurrent connections in the middle layer. In this analogy, s_{t1} and δ_{t1} are viewed as the middle- and output-layer neural activity, respectively. Moreover, ln A · o_t, ln B† · s_{t−1} and ln C† · s_{t−1} correspond to synaptic inputs, and ln D and ln E relate to firing thresholds. These priors and posteriors turn out to be identical to the components of canonical neural networks, as described in the Results and Methods section 'Neural networks'.
Furthermore, learning optimises the posterior beliefs about the parameters $\theta = \{a, b, c\}$ by minimising variational free energy. The posterior beliefs are updated by a gradient descent on $F$, $\dot{\theta} \propto -\partial F/\partial \theta$. By solving the fixed point $\partial F/\partial \theta = O$ of Eq. (14), the posterior beliefs about the parameters are provided as follows:

$$a = \bar{a} + t\,\langle o_\tau \otimes s_\tau \rangle,\quad b = \bar{b} + t\,\langle s_\tau \otimes s_{\tau-1} \rangle,\quad c = \bar{c} + t\,\langle (1 - 2\Gamma_{t,\tau})\, \delta_\tau \otimes s_{\tau-1} \rangle \qquad (18)$$

Note that $\otimes$ denotes the outer product operator, and $\langle \cdot \rangle$ indicates the average over time, e.g. $\langle o_\tau \otimes s_\tau \rangle \coloneqq \frac{1}{t} \sum_{\tau=1}^{t} o_\tau \otimes s_\tau$. These posterior beliefs are usually obtained as an average over multiple sessions to ensure convergence. Here, $\bar{a}, \bar{b}, \bar{c}$ are the prior beliefs, which are of order 1 and thus negligibly small relative to the leading-order term when $t$ is large. Thus, the posterior expectations of any parameter $\Theta$ (i.e. $\langle \Theta \rangle$ and $\langle \ln \Theta \rangle$) are obtained using Eqs. (12) and (18). These parameter posteriors turn out to correspond formally to synaptic strengths $(W, K, V)$, owing to the equivalence of variational free energy and the neural network cost function.
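Parameter learning thus amounts to accumulating Dirichlet counts from outer products of posteriors. A minimal sketch for $a$ (with $b$ and $c$ updated analogously, $c$ carrying the additional $(1 - 2\Gamma_{t,\tau})$ factor); the prior and the sequences are illustrative:

```python
import numpy as np

def dirichlet_update(prior_a, obs_seq, state_seq):
    """Accumulate posterior Dirichlet counts a = a_prior + sum_tau o_tau (x) s_tau,
    as in the Eq. (18)-style update for the likelihood mapping A."""
    a = prior_a.copy()
    for o, s in zip(obs_seq, state_seq):
        a += np.outer(o, s)  # Hebbian-like coincidence count
    return a

prior = np.ones((2, 2))                          # flat prior counts (assumption)
obs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
states = [np.array([0.8, 0.2]), np.array([0.3, 0.7])]
a_post = dirichlet_update(prior, obs, states)
A_mean = a_post / a_post.sum(axis=0)             # posterior expectation of A
```

Normalising the counts column-wise gives the posterior expectation of the likelihood matrix, each column summing to one.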
Neural networks. In this section, we elaborate on the one-to-one correspondences between components of canonical neural networks and variational Bayes, via an analytically tractable model. Neurons respond quickly to a continuous stimulus stream, with a timescale faster than typical changes in sensory input. For instance, a peak of spiking responses in the visual cortex (V1) appears between 50 and 80 ms after a visual stimulation 62,63, which is substantially faster than the temporal autocorrelation timescale of natural image sequences (~500 ms) 64,65. Thus, we consider the case where the neural activity converges to a fixed point, given a sensory stimulus. We note that the present equivalence is derived from the differential equation (Eq. (6)), not from its fixed point; thus, the equivalence holds true irrespective of the time constant of the neurons. In other words, neural networks with a large time constant formally correspond to Bayesian belief updating with a large time constant, which implies an insufficient convergence of the posterior beliefs.
Updates of neural activity are defined by Eq. (6). When the neural activity asymptotically converges to a steady state, the fixed point of Eq. (6), i.e. the $x$ and $y$ that give $\dot{x} = 0$ and $\dot{y} = 0$, is provided as follows:

$$x_t = \mathrm{sig}\!\left(W_1 o_t - W_0 \bar{o}_t + K_1 x_{t-1} - K_0 \bar{x}_{t-1} + h_1\right),\quad y_t = \mathrm{sig}\!\left(V_1 x_{t-1} - V_0 \bar{x}_{t-1} + h_2\right) \qquad (19)$$

where $\mathrm{sig}(\cdot)$ is the sigmoid function and $\bar{o}_t \coloneqq \vec{1} - o_t$. The adaptive firing thresholds $h_1$ and $h_2$ are given as functions of the synaptic strengths and the threshold factors $(\phi, \psi)$. In this sense, the form of Eq. (19) is identical to the state and decision posteriors in Eqs. (16) and (17) of the variational Bayesian formulation. Further details are provided in Supplementary Methods 5.
Considering that neural activity corresponds to the posterior beliefs about states and decisions, one might ask how synaptic strengths relate to the parameter posteriors. As we mathematically demonstrated in the Results section, owing to the equivalence between variational free energy $F$ and the neural network cost function $L$, i.e. Eq. (5) versus Eq. (7), synaptic strengths correspond formally to parameter posteriors. The ensuing synaptic update rules, derived as a gradient descent on $L$, are expressed in Eqs. (8) and (9). They have a biologically plausible form, comprising Hebbian plasticity accompanied by an activity-dependent homoeostatic term. The product of $\Gamma(t)$ and the associated term modulates plasticity depending on the quality of past decisions, after observing their consequences, leading to minimisation of the future risk. In other words, the modulation by $\Gamma(t)$ represents a postdiction that the agent implicitly conducts: the agent regards its past decisions as preferable when $\Gamma(t)$ is low and memorises the strategy; conversely, it regards them as unpreferable when $\Gamma(t)$ is high and forgets the strategy.
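The risk-modulated plasticity described above can be sketched as a simple Euler step. This is an illustrative sketch only: the learning rate and the exact homoeostatic form in Eqs. (8) and (9) are assumptions, not reproduced from the paper.

```python
import numpy as np

def sig(w):
    return 1.0 / (1.0 + np.exp(-w))

def hebb_update(omega, pre, post, gamma, lr=0.1):
    """Delayed-modulation Hebbian plasticity (sketch):
    d(omega) ∝ (1 - 2*Gamma(t)) * post (x) pre - post * sig(omega),
    i.e. Hebbian co-activation gated by the delayed risk signal Gamma(t),
    minus an activity-dependent homoeostatic decay."""
    dw = (1 - 2 * gamma) * np.outer(post, pre) - post[:, None] * sig(omega)
    return omega + lr * dw

omega = np.zeros((2, 2))
pre, post = np.array([1.0, 0.0]), np.array([0.0, 1.0])
w_good = hebb_update(omega, pre, post, gamma=0.0)  # low risk: strengthen
w_bad  = hebb_update(omega, pre, post, gamma=1.0)  # high risk: weaken
```

The sign flip at $\Gamma(t) = 1/2$ captures the postdiction: the same pre-post coincidence is memorised when past decisions proved preferable and forgotten when they did not.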
In particular, when $\phi$ and $\psi$ are constants, the fixed point of the synaptic strengths that minimises $L$ can be expressed analytically. For simplicity, we employ the notation using $\omega_i$, $\mathrm{pre}_i$, $\mathrm{post}_i$ and $n_i$, as described in the Results section. The derivative of the firing threshold $n_i$ with respect to the synaptic strength matrix $\omega_i$ yields the sigmoid function of $\omega_i$, i.e. $\partial n_i/\partial \omega_i = -\mathrm{sig}(\omega_i)$. The fixed point of the synaptic strengths ensures $\partial L/\partial \omega_i = O$. Thus, from Eqs. (8) and (9), it is analytically expressed as

$$\mathrm{sig}(\omega_i) = \langle \mathrm{post}_i \otimes \mathrm{pre}_i \rangle \oslash \langle \mathrm{post}_i \rangle \qquad (20)$$

for $i = 1, \ldots, 4$, and

$$\mathrm{sig}(\omega_i) = \langle (1 - 2\Gamma(t))\, \mathrm{post}_i \otimes \mathrm{pre}_i \rangle \oslash \langle \mathrm{post}_i \rangle \qquad (21)$$

for $i = 5, 6$, using the element-wise division $\oslash$. These fixed points are obtained as an average over multiple sessions. Equations (20) and (21) correspond to the posterior beliefs about the parameters $A, B, C$, which are shown in Eq. (18). We note that when neural activity and plasticity follow differential equations, we can, without loss of generality, consider a common cost function for a sufficiently long training time $t$. When neural activity and plasticity are expressed as $\dot{x}(t) = f(x, W, o)$ and $\dot{W} = g(x, W, o)$, one can define a cost function from which gradient descent rules for both neural activity and plasticity are derived in the limit of large $t$. This owes to the different timescales of the updates of $x(t)$ and $W$, known as an adiabatic approximation. Therefore, for a sufficiently large $t$, the existence of cost functions that satisfy assumption 1 is a mathematical truism. It should be emphasised that unless the cost function implicit in $f$ matches that implicit in $g$, the neural and synaptic dynamics result in biased inference and learning; therefore, the cost function defined in Eq. (7) is apt.
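The averaged fixed point of Eq. (20) can be checked numerically: each weight saturates where its sigmoid equals the time-averaged pre-post coincidence normalised by postsynaptic activity. A minimal sketch; the function name and the toy traces are illustrative:

```python
import numpy as np

def sig_inv(p):
    """Inverse sigmoid (logit)."""
    return np.log(p / (1 - p))

def weight_fixed_point(posts, pres):
    """Fixed point sig(omega) = <post (x) pre> ⊘ <post> (element-wise division),
    evaluated from sequences of pre- and postsynaptic activity."""
    num = np.mean([np.outer(post, pre) for post, pre in zip(posts, pres)], axis=0)
    den = np.mean(posts, axis=0)[:, None]   # broadcast <post> over columns
    return sig_inv(num / den)

posts = [np.array([0.8]), np.array([0.4])]            # one postsynaptic unit
pres  = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # two presynaptic units
omega = weight_fixed_point(posts, pres)
```

Here the first presynaptic unit co-occurs with stronger postsynaptic activity, so its converged weight is positive while the other is negative, mirroring how the Dirichlet counts of Eq. (18) shape the parameter posterior.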
Simulations. In Fig. 4, the neural network of the agent was characterised by a set of internal states $\varphi = \{x, y, W, K, V, \phi, \psi\}$, where $W$ denotes the set of $W_1$ and $W_0$. Neural activity $(x, y)$ was updated following Eq. (6), while synaptic strengths $(W, K, V)$ were updated following Eqs. (8) and (9). Here, we supposed that neural activity converges quickly to the steady state relative to the change of observations. This treatment allowed us to compute the network dynamics based on Eqs. (19)-(21), which reduced the computational cost of the numerical simulations. Synaptic strengths $W$ were initialised as a matrix sufficiently close to the identity matrix, whereas synaptic strengths $K$ and $V$ were initialised as matrices with uniform values. This treatment served to focus on the policy learning implicit in the update of $V$. The threshold factors $(\phi, \psi)$, which encoded the prior beliefs about hidden states ($D$) and decisions ($E$), were predefined and fixed over the sessions. In Fig. 4e, we varied $E$ to show how performance depends on these priors. Namely, $E = (E_{\mathrm{right}}, \bar{E}_{\mathrm{right}}; E_{\mathrm{left}}, \bar{E}_{\mathrm{left}}; E_{\mathrm{up}}, \bar{E}_{\mathrm{up}}; E_{\mathrm{down}}, \bar{E}_{\mathrm{down}}; \ldots)^\top \in [0,1]^{256}$ was characterised by four values $E_{\mathrm{right}}, E_{\mathrm{left}}, E_{\mathrm{up}}, E_{\mathrm{down}} \in [0,1]$, where $E_{\mathrm{right}}$ indicates the prior probability of selecting a decision involving rightward motion in the next step.
Data analysis. When the belief updating of the implicit priors $(D, E)$ is slow relative to experimental observations, $D$ and $E$ (which are encoded by $\phi$ and $\psi$) can be viewed as fixed over a short period of time, in analogy to homoeostatic plasticity over longer timescales 66. Thus, $\phi$ and $\psi$ can be statistically estimated by conventional Bayesian inference (or maximum likelihood estimation, given a flat prior) based on empirically observed neuronal responses. In this case, the estimators of $\phi$ and $\psi$ are obtained as follows:

$$\hat{\phi} = \left(\ln \langle x_t \rangle,\ \ln(\vec{1} - \langle x_t \rangle)\right),\quad \hat{\psi} = \left(\ln \langle y_t \rangle,\ \ln(\vec{1} - \langle y_t \rangle)\right) \qquad (24)$$

Note that we suppose the constraints $\exp(\phi_1) + \exp(\phi_0) = \vec{1}$ and $\exp(\psi_1) + \exp(\psi_0) = \vec{1}$. This assumption finesses the estimation of the implicit priors (Fig. 5a), owing to the equivalences $\phi = \ln D$ and $\psi = \ln E$ (see Fig. 3 and Table 2), and the ensuing reconstruction of variational free energy, which was conducted by substituting Eq. (24) into Eq. (15). Equation (24) indicates that the average activity encodes prior beliefs, which is consistent with experimental observations, in that the activity of sensory areas reflects prior beliefs 67.
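The estimator in Eq. (24) reduces to taking logarithms of time-averaged activity. A minimal numpy sketch with hypothetical activity traces:

```python
import numpy as np

def estimate_priors(x_hist):
    """Estimate implicit log-priors from observed activity:
    phi_1 = ln <x_t>, phi_0 = ln(1 - <x_t>), so that
    exp(phi_1) + exp(phi_0) = 1 holds by construction."""
    m = np.mean(x_hist, axis=0)   # time-averaged activity per unit
    return np.log(m), np.log(1 - m)

# hypothetical traces for two units over three time steps
x_hist = np.array([[0.9, 0.2], [0.7, 0.4], [0.8, 0.3]])
phi1, phi0 = estimate_priors(x_hist)
```

Exponentiating the estimate recovers the implicit prior $D$ directly as the mean firing rate, consistent with the observation that average activity encodes prior beliefs.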
Furthermore, given a generative model, the posterior beliefs about external states can be updated from a sequence of sensory inputs by minimising variational free energy, without reference to neural activity or behaviour. Thus, the reconstructed variational free energy enabled us to predict the subsequent inference and learning of the agent, without observing its neural activity or behaviour (Fig. 5b). In Fig. 5, for simplicity, the form of the risk function was supposed to be known when reconstructing the cost function. Although we did not estimate $\phi$ in Fig. 5, previous work showed that our approach can estimate $\phi$ from simulated neural activity data 22.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
All relevant data are within the paper. Figures 4 and 5 were generated using the authors' scripts (see Code Availability).