Habits and goals in synergy: a variational Bayesian framework for behavior

How to behave efficiently and flexibly is a central problem for understanding biological agents and creating intelligent embodied AI. It is well known that behavior can be classified into two types: reward-maximizing habitual behavior, which is fast but inflexible; and goal-directed behavior, which is flexible but slow. Conventionally, habitual and goal-directed behaviors are considered to be handled by two distinct systems in the brain. Here, we propose to bridge the gap between the two behaviors, drawing on the principles of variational Bayesian theory. We incorporate both behaviors into one framework by introducing a Bayesian latent variable called "intention". Habitual behavior is generated from the prior distribution of intention, which is goal-less; goal-directed behavior is generated from the posterior distribution of intention, which is conditioned on the goal. Building on this idea, we present a novel Bayesian framework for modeling behaviors. Our proposed framework enables skill sharing between the two kinds of behaviors, and, by leveraging the idea of predictive coding, it enables an agent to seamlessly generalize from habitual to goal-directed behavior without requiring additional training. The proposed framework suggests a fresh perspective for cognitive science and embodied AI, highlighting the potential for greater integration between habitual and goal-directed behaviors.


Introduction
In cognitive science, intelligent agents such as humans and mammals are thought to engage in two types of behavior: habitual and goal-directed (Dickinson & Balleine, 1994; Redgrave et al., 2010; Balleine & O'Doherty, 2010; Dolan & Dayan, 2013). Habitual behavior refers to actions that are performed automatically, without conscious thought or intention, in order to maximize the agent's benefits (rewards), such as seeking food and avoiding danger. Habitual behavior is typically model-free (MF), meaning that it does not require the agent to consider the detailed consequences of its actions. On the other hand, goal-directed behavior refers to actions that are performed with the aim of achieving a specific goal, such as going to a certain place. Goal-directed behavior is model-based (MB), and it is typically more flexible and responsive to changes in the environment, as it involves conscious decision-making and planning using an environment model (Gläscher et al., 2010; Lee et al., 2014).
The two types of behaviors have also been extensively studied in machine learning and deep learning, especially in decision-making and control problems (Bellman, 1957). Reinforcement learning (RL) (Sutton & Barto, 1998) is a computational paradigm concerned with learning a policy (i.e., the strategy to choose actions) that maximizes rewards. Model-free RL (MFRL), which does not involve an environmental model, aligns particularly well with acquiring habitual behavior (Botvinick et al., 2020). On the other hand, active inference (AIf) theory (Friston et al., 2010) appears as a computational framework explaining goal-directed behavior, since both minimize the divergence between the desired goal and the model's prediction conditioned on actions.
Conventionally, habitual (MF) and goal-directed (MB) behavior have been treated as two independent problems in both cognitive science and machine learning. Although there are behavioral studies considering a hybrid scheme to explain animal or human behavior, the most common approach is to simply model behavior as a linear combination of habitual and goal-directed components (Gläscher et al., 2010; Smittenaar et al., 2013; Lee et al., 2014). In machine learning, one practical reason underlying such a separation is that the inputs differ: in goal-conditioned control or decision making (Liu et al., 2022), the goal is usually an input to the model. Thus, the model for goal-directed behavior has an additional goal input compared to that for habitual behavior. Therefore, it is common that two separate models are designed for these two behaviors when both are considered (Chebotar et al., 2017; Mendonca et al., 2021).
However, we argue that these two systems should not be isolated from each other. Although it is not fully understood how interactions between these two systems occur in the brain, both habitual and goal-directed behaviors share the same downstream neural pathways, such as the brainstem (Redgrave et al., 2010). The conjecture is that habitual and goal-directed behavior share low-level motor skills, so each system may leverage the well-developed actions learned by the other. How, then, can such skill sharing be realized while respecting the differences between the two behaviors?
In this work, we reframe the scheme of behavior from a variational Bayesian (Fox & Roberts, 2012) perspective: we introduce a novel theoretical framework, referred to as Bayesian behavior. The proposed framework centers around a probabilistic latent variable z, to which we intuitively refer as the "intention" of an agent (Figure 1). We describe habitual and goal-directed behavior as the prior and posterior distributions of z, respectively. In other words, both behaviors are drawn from the intention z (and contextual information, such as other brain states); the difference is that goal-directed behavior is additionally conditioned on the goal:

habitual action ← z_prior (and contextual information)
goal-directed action ← z_post (and contextual information)

where ← denotes the function (neural pathway) that generates an action from z and contextual information. This function is shared by both habitual and goal-directed actions. The prior distribution of z can be any fixed distribution, as it does not depend on the goal. In contrast, the posterior distribution incorporates hindsight information about the agent's future, reflecting the intuitive notion that the current goal-directed action is relevant to a goal to achieve in the future. This additional conditioning differentiates the posterior distribution from the prior distribution, allowing for goal-directed behavior (Figure 1a).
In this work, we demonstrate the aptness of our Bayesian behavior framework by conducting simulated experiments with an embodied robot agent, addressing the following critical questions in cognitive neuroscience: (1) How does an agent acquire diverse yet effective habitual behavior? (2) How can the gap between habitual and goal-directed behavior be bridged? (3) How does an agent generate actions to reach a goal that it has not been trained to accomplish?
We propose that the key neural substrates to address these questions involve predictive coding (Huang & Rao, 2011), with the latent Bayesian variable z serving as a compact representation of current and future sensations. In particular, this is realized by minimizing the variational free energy (Friston et al., 2006):

free energy = (observation prediction error) + (complexity of the posterior of z).

The first term reflects the basic idea of predictive coding by learning an internal model of the environment (Figure 1b). The environment model predicts future sensory observations given the agent's intention z. A key insight from predictive coding (Huang & Rao, 2011) is that z should be much more compact than the original sensory observation, encoding only the information that varies with the agent's own intention, since the fixed patterns of the environmental observations should be acquired by the internal model. The compactness of z is crucial for efficient goal-directed planning because it makes plausible an internal search for a proper z that leads to a certain goal (Figure 1c).
The second term is the Kullback-Leibler divergence (KL-divergence) (Kullback & Leibler, 1951) between the posterior and prior distributions of z, which provides a theoretical foundation for linking the two behaviors by bounding their difference (see Section 3.4 for a mathematical explanation). Intuitively, the KL-divergence term balances the fit of the model to the data against the complexity of the latent representation z. More discussion can be found in Section 6.6.
While predictive coding (Huang & Rao, 2011) and the Bayesian brain (Doya et al., 2007) have long been discussed in a variety of studies, they have not been used to address the relationship between habitual and goal-directed behavior with detailed examples. Here, we perform proof-of-concept simulated experiments to demonstrate that neuronal stochasticity does not harm behavioral efficiency while enhancing diversity. Furthermore, predictive coding enables a highly flexible goal-planning capacity of the agent.
The rest of this article is arranged as follows. Section 2 introduces basic knowledge about reinforcement learning (RL), predictive coding (PC), the free energy principle (FEP), and active inference (AIf) theory. Next, Section 3 details the computational methods of the proposed framework, followed by the simulated experiment results in Section 4. Then, we briefly clarify how our work relates to and is distinguished from existing methods in machine learning in Section 5. Finally, we provide extended discussion and conclude this work in Section 6.

Reinforcement learning
Reinforcement learning (RL) (Sutton & Barto, 1998) centers around an agent learning to take actions in an environment to maximize its cumulative rewards. Typically, an RL problem is modeled using a Markov decision process (MDP) (Bellman, 1957), which is characterized by a set of states, actions, transition probabilities, and rewards. At each time step, the agent takes an action based on its current state and the MDP's transition probabilities, receiving a reward for the action taken. The ultimate objective of reinforcement learning is to identify the optimal policy for taking actions in the MDP that maximizes the expected cumulative reward over time.
There are two main categories of learning methods: model-based (MB) and model-free (MF) (Gläscher et al., 2010). Model-based RL involves learning a model of the environment's dynamics and rewards, which is then used to plan actions and compute the optimal policy. Essentially, the agent makes predictions about how the environment will respond to actions and plans accordingly with a computational model (Ha & Schmidhuber, 2018). On the other hand, model-free RL does not involve explicit modeling of the environment. Instead, the agent learns a direct mapping from states to actions through trial and error, updating its policy based on the rewards received from its interactions with the environment. Popular model-free algorithms include the deep Q-network (Mnih et al., 2015) and soft actor-critic (Haarnoja et al., 2018a). In both approaches, the agent must balance exploration and exploitation to discover the optimal policy.
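For readers unfamiliar with the model-free setting, a minimal tabular Q-learning update is sketched below. This is illustrative only: the agent in this paper uses soft actor-critic with continuous actions, and all sizes and values here are stand-ins.

```python
import numpy as np

# Minimal model-free tabular Q-learning sketch (stand-in sizes; the paper's
# agent uses soft actor-critic, not tabular Q-learning).
n_states, n_actions = 5, 2
alpha, gamma = 0.5, 0.9          # learning rate, discount factor

Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next, done):
    """One temporal-difference update: Q(s,a) += alpha * TD-error."""
    target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

# Example transition: action 1 in state 0 yields reward 1.0 and ends the episode.
q_update(0, 1, 1.0, 0, done=True)
```

Repeating such updates over many trial-and-error interactions is how a model-free agent gradually shapes its direct state-to-action mapping.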

Predictive coding
Predictive coding (PC) (Rao & Ballard, 1999) is a theoretical framework suggesting that the brain employs an internal model, known as a generative model, to generate predictions about incoming sensory information (Shipp, 2016). According to this theory, neural circuits in the brain learn the statistical patterns present in the natural world, reducing unnecessary information by extracting predictable elements from the input and only transmitting what cannot be predicted (Huang & Rao, 2011). In a hierarchical manner, top-down predictions are generated by higher-level cortical areas, while bottom-up processing sends prediction errors upwards to improve the internal model. The brain uses Bayesian inference to compare its predictions with incoming sensory data, adjusting its internal representation of the world to minimize prediction error.
This process allows the brain to be highly efficient and to adapt rapidly to changing environmental conditions. For example, if you have experience with a particular object, your brain will use that experience to generate predictions about what the object should look like, and these predictions will rapidly adjust if the object changes in some way (e.g., if it moves or changes color). By rapidly updating its internal models of the world in this way, the brain can maintain a stable and accurate representation of the environment (Ahmadi & Tani, 2019; Wirkuttis et al., 2023).
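The error-minimization loop at the heart of predictive coding can be sketched with a toy linear generative model. This is a stand-in, not a model of cortical circuitry: a latent estimate z is iteratively adjusted so that the model's prediction W @ z explains the sensory observation.

```python
import numpy as np

# Toy predictive-coding inference loop with a hypothetical linear generative
# model x_hat = W @ z.  The latent estimate z is adjusted to reduce the
# bottom-up prediction error, mirroring the error-minimization idea above.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 2))          # fixed generative weights
x = W @ np.array([1.0, -0.5])        # sensory observation to be explained

z = np.zeros(2)                      # initial latent estimate
lr = 0.05
initial_error = np.linalg.norm(x - W @ z)
for _ in range(200):
    error = x - W @ z                # prediction error (bottom-up signal)
    z += lr * W.T @ error            # adjust the latent to reduce the error
final_error = np.linalg.norm(x - W @ z)
```

After the loop, the prediction error has shrunk by orders of magnitude: the internal estimate now "explains away" the input, leaving little residual to transmit.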

Free energy principle and active inference
The free energy principle (FEP) (Friston et al., 2006; Friston, 2010) is a Bayesian framework for understanding how biological systems, such as the brain, function. It suggests that biological systems are driven to minimize their free energy, which is a measure of the uncertainty or surprise the system experiences when it tries to predict the future. The principle is based on the idea that living systems are self-organizing and self-sustaining, and that they use internal models of the world to make predictions about future events. The free energy principle is closely related to predictive coding, which states that the brain seeks to minimize prediction error, the difference between its predictions and incoming sensory data (Friston & Kiebel, 2009; Apps & Tsakiris, 2014). The free energy principle can be seen as a unifying principle for predictive coding, as it provides a framework for understanding how the brain adapts to its environment and generates adaptive behavior.
Active inference (AIf) (Friston et al., 2010, 2017) is a form of Bayesian decision-making based on the free energy principle. It involves using the brain's internal models to make predictions about the future, and then acting in a way that reduces the prediction error, i.e., the difference between predicted and actual outcomes. This process allows the brain to actively seek out new information and experiences that reduce uncertainty and increase the precision of its internal models.

Methods
We consider the case in which the agent first learns habitual behavior and then conducts goal-directed planning (see Section 6.5 for more discussion of this).

Model details
Our model leverages the variational Bayesian methods commonly used in deep learning (Kingma & Welling, 2014; Chung et al., 2015; Hafner et al., 2018). The core of our model is a 2-dimensional latent variable z_t, also referred to as the intention in this paper. A visualized diagram is shown in Figure 2. The main recurrent neural network (RNN) is a 1-layer gated recurrent unit (GRU) (Cho et al., 2014) which, at step t, takes z_t as input and predicts the current and subsequent observations (x_t and x_{t+1}). We denote the RNN state of the GRU as h_t. The decoder mapping h_t to x_t is a de-convolutional neural network specified in Table 2. Another network with the same structure is used for predicting x_{t+1} from h_t. For generating motor actions, a policy network is used and trained.
The input to the main RNN, z_t, deserves particular attention. z is a Bayesian variable that can be sampled from either its prior or posterior distribution. Correspondingly, we also have a prior h^p_t and a posterior h^q_t of the RNN state, obtained by feeding in the prior and posterior z, respectively. The policy network may take the prior h^p_t as input and output the habitual action (Figure 2a), or take the posterior h^q_t as input and output the goal-directed action (Figure 2c). In other words, habitual and goal-directed behavior share the same policy network. In our implementation, the policy network is a 1-layer GRU followed by a 2-layer multi-layer perceptron (MLP).
As in the variational auto-encoder (Kingma & Welling, 2014), the prior distribution of z is simply a diagonal unit Gaussian N(0, I), since the agent should have no information about the goal in habitual behavior. The posterior distribution of z is also modeled as a diagonal Gaussian, whose mean µ_q and square-root variance σ_q are computed with hindsight information depending on the situation, as specified in the following sections. Unless specified otherwise, the width of each hidden layer is 256 and the activation function in fully-connected layers is the ReLU function.
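The two ways of drawing the intention z can be sketched with the standard reparameterization trick. The posterior mean and standard deviation below are stand-in values; in the model they would be produced by the MLP described later.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_z(mu, sigma):
    """Reparameterized sampling: z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(np.shape(mu))
    return mu + sigma * eps

z_prior = sample_z(np.zeros(2), np.ones(2))       # habitual: prior N(0, I)
z_post = sample_z(np.array([0.3, -0.1]),          # goal-directed: posterior
                  np.array([0.2, 0.2]))           # (stand-in mu_q, sigma_q)
```

Both samples are 2-dimensional intentions fed to the same downstream networks; only the distribution they are drawn from differs.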

Behaving
We consider a typical episodic RL setting (Mnih et al., 2015) for learning habitual behavior, which is a reasonable description of how an animal explores a new environment and learns by trial and error. As Figure 2a shows, at environment step t, the agent interacts with the environment by computing an action a_t using h^p_t. More specifically, the policy network computes a stochastic policy parameterized by µ^a_t and σ^a_t, and a_t is given by a_t = tanh(µ^a_t + ε_t ⊙ σ^a_t), where ⊙ is the Hadamard (element-wise) product and ε_t follows a diagonal unit Gaussian distribution. For better exploration, ε_t is given by pink noise, as suggested by Eberhard et al. (2023). Then, the agent perceives the new observation x_{t+1} after executing the action a_t and computes z^q_t to update its RNN state h^q_t (Figure 2a). In particular, µ^q_t and σ^q_t are computed by a 2-layer MLP with hyperbolic tangent activation, taking the CNN-encoded observation as input, where φ denotes the CNN encoder (Table 1). The posterior latent variable z^q_t is sampled from the diagonal Gaussian distribution N(µ^q_t, σ^q_t). The agent also receives a scalar reward r_t at each step. The environment also provides a termination signal done_t, and the episode (trial) is reset when done_t = True.
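The action computation above can be sketched as follows. For brevity, plain white noise stands in for the pink noise of Eberhard et al. (2023), and the policy-network outputs µ^a_t and σ^a_t are stand-in values.

```python
import numpy as np

# Stochastic action sketch: a_t = tanh(mu_a + eps * sigma_a).
rng = np.random.default_rng(2)
mu_a = np.array([0.5, -0.2])       # stand-in policy mean
sigma_a = np.array([0.3, 0.3])     # stand-in policy standard deviation

eps = rng.standard_normal(2)       # white-noise stand-in for pink noise
a_t = np.tanh(mu_a + eps * sigma_a)
```

The tanh squashing guarantees each action component stays within (-1, 1), which maps naturally onto the agent's bounded horizontal movement speed.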
After each step, the agent stores its experience (x_t, x_{t+1}, a_t, r_t, done_t) in a replay buffer for experience replay during training (Kapturowski et al., 2018). The replay buffer can store up to 2^12 sequences of length up to 60. The oldest experience is replaced by a new one when the replay buffer is full.

Training
As in typical deep RL with experience replay (Mnih et al., 2015), the neural network models of the agent are updated using stochastic gradient descent every N environment steps, where N = 10 in our case. At each update (Figure 2b), a batch (40 sequences of length 60) of experience is randomly sampled from the replay buffer, and all the networks are trained in one gradient step in an end-to-end manner using the following loss function (here t denotes the step in the recorded sequence, and the loss is averaged over the whole batch):

loss = β_x · (posterior prediction error) + β_z · D_KL[q(z_t | x_{1:t+1}) ‖ p(z_t)] (complexity) + β_a · E_q(z)[L_policy] + L_value(x_{1:t}),   (Equation 3)

where the first two terms constitute the variational free energy, and β_x, β_z, β_a are coefficients that determine the balance among the terms. The term E_q(z)[L_policy] is the loss function of policy learning using any RL algorithm conditioned on the posterior z, where the posterior z^q_t is computed in the same way as in behaving (Equation 2). Note that although the policy loss is expected over the posterior distribution of z in this term, it indeed enhances the performance of habitual actions (using the prior distribution of z) when considered together with the complexity term (see Section 3.4). We use soft actor-critic (SAC) (Haarnoja et al., 2018a,b) as the base RL algorithm. In actor-critic algorithms, the value functions, which estimate long-term cumulative rewards under a policy, also need to be learned. We use value networks independent from the main model to learn the Q-function of SAC (the last term, L_value(x_{1:t})). Each value network is a 1-layer GRU followed by a 2-layer MLP. Note that the input to each value network is the original observation encoded by a convolutional neural network (CNN); thus the value network is independent from the main RNN and is only used in training (Figure 2b) (Pinto et al., 2018). The hyper-parameters of SAC are selected following Haarnoja et al. (2018b), except that we change the temperature coefficient to 1.2 to adapt to our environment.
The log-likelihood terms in Equation 3 are the posterior prediction errors of the current observation x_t and the subsequent one x_{t+1}, respectively (Figure 2b). In our work, as image observations are considered, we model each pixel value (ranging in [0, 1]) as the probability of a Bernoulli distribution independent from other pixels, as in Kingma & Welling (2014). Therefore, the expectation of the log-likelihood E_q(z)[ln p(x_t = x̂_t | z_{1:t})] can be computed in analytic form.
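A minimal sketch of this pixel-wise Bernoulli log-likelihood (equivalently, the negative binary cross-entropy) is shown below; image shapes and values are stand-ins.

```python
import numpy as np

def bernoulli_log_likelihood(x_target, x_pred, eps=1e-7):
    """Sum over pixels of the Bernoulli log-likelihood ln p(x_target | x_pred),
    where each predicted pixel value in [0, 1] is a Bernoulli probability."""
    x_pred = np.clip(x_pred, eps, 1.0 - eps)   # avoid log(0)
    return np.sum(x_target * np.log(x_pred)
                  + (1.0 - x_target) * np.log(1.0 - x_pred))

# A confident, correct prediction gives a (near-)zero reconstruction penalty.
x = np.array([[0.0, 1.0], [1.0, 0.0]])
ll_perfect = bernoulli_log_likelihood(x, x)
```

Maximizing this quantity over the batch is what the prediction-error terms of the training loss implement for binary-modeled pixels.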
This also applies to the prediction error for the subsequent observation, E_q(z)[ln p_{+1}(x_{t+1} = x̂_{t+1} | z_{1:t−1})]. The KL-divergence term D_KL[q(z_t | x_{1:t+1}) ‖ p(z_t)] can be given analytically, since both the prior and posterior follow Gaussian distributions (Kingma & Welling, 2014). In particular, as the prior follows N(0, I), the KL-divergence is computed per dimension i as 0.5 (µ_i² + σ_i² − ln σ_i² − 1). We choose the hyper-parameters β_x = 0.1, β_z = 100 and β_a = 30, obtained by grid search.
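The closed-form KL term for a diagonal Gaussian posterior against the N(0, I) prior can be sketched as:

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ),
    summed over latent dimensions."""
    return 0.5 * np.sum(mu**2 + sigma**2 - np.log(sigma**2) - 1.0)

# Sanity check: the divergence vanishes when the posterior equals the prior.
kl_zero = kl_to_standard_normal(np.zeros(2), np.ones(2))
```

This is the complexity term weighted by β_z in the training loss; it is zero exactly when the posterior collapses onto the prior and grows as the two distributions separate.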

Goal-directed behavior
Supposing the agent has learned diverse habitual behavior (e.g., moving along different routes), our framework allows it to perform zero-shot goal-directed planning w.r.t. a given goal. Here, "zero-shot" (Wang et al., 2019) means the agent needs no additional experience (beyond the existing experience of habitual behavior) to perform goal-directed behavior. That is to say, although the agent has not been trained to accomplish a specific goal during habitual learning, it can generate goal-directed actions if a goal is given by the programmer (how the agent could autonomously select a goal is beyond our scope; see Section 6.4 for more discussion).
Predictive coding and active inference make this kind of zero-shot goal-directed planning possible, as our recurrent model predicts future observations from z_t, z_{t+1}, ... (Figure 2c; suppose the current step is t). Since the intention z is a low-dimensional vector, it is not hard for the agent to infer the goal-directed intentions (z^AIf_t, z^AIf_{t+1}, ...) that lead the future toward the goal (Figure 2c), which reflects the idea of active inference. More specifically, we fix the model weights and biases while treating the intentions z^AIf_t, z^AIf_{t+1}, ... as free variables, and optimize them to minimize the variational free energy loss function w.r.t. the goal (Equation 5; the current step is denoted by t and the actual current observation is x̂_t), where the planning horizon is N = 16 in our implementation. Since it is unknown how many steps the agent needs to reach the goal, there are also trainable parameters c_1, ..., c_N, where c_τ is a real number in [0, 1] denoting the probability of reaching the goal after τ steps from now (step t), with the c_τ summing to 1. In practice, we optimize real numbers ς_1, ..., ς_N, and c_τ = e^{ς_τ} / Σ_{i=1}^{N} e^{ς_i} (softmax). Note that since the current observation is known, it is used to constrain z^AIf_t. One may notice that this loss function involves no RL loss, in contrast to that used during habitual learning (Equation 3). This is reasonable under the assumption that the goal can be achieved by habitual behavior, so there is no need to re-train the existing motor skills. In practice, we use a batch (32) of planning sequences and optimize them in parallel, using an RMSProp optimizer (Hinton et al., 2012) with decay rate 0.9 and learning rate 0.3. The planning sequence in the batch with the lowest loss after 100 optimization steps is used. To avoid the random bias introduced by sampling z^AIf_t from N(µ^AIf_t, σ^AIf_t), we use the mean µ^AIf_t to approximate z^q_t when rolling the RNN forward and in the prediction error terms; that is, the practical AIf loss function (Equation 6) uses µ^AIf_t in place of the sampled z^AIf_t.
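As a loose illustration of this inference-over-intention idea (not the paper's full procedure, which optimizes a sequence of intentions and the timing weights c_τ with RMSProp), the sketch below freezes a hypothetical linear predictor and gradient-descends on a single latent z so that the predicted observation matches a goal, with a small penalty keeping z near the N(0, I) prior.

```python
import numpy as np

# Toy active-inference-style planning with a hypothetical linear predictor
# x_hat = W @ z.  The "model" W is frozen; only the intention z is optimized
# to minimize (prediction - goal)^2 plus a prior-complexity penalty, a
# stand-in for the free-energy objective above.
rng = np.random.default_rng(3)
W = rng.normal(size=(4, 2))
goal = W @ np.array([0.8, -0.3])   # a goal the model can in fact predict

z = np.zeros(2)
lr, beta = 0.1, 0.01               # step size, prior-penalty weight
for _ in range(300):
    # gradient of 0.5*||W z - goal||^2 + 0.5*beta*||z||^2 w.r.t. z
    grad = W.T @ (W @ z - goal) + beta * z
    z -= lr * grad

residual = np.linalg.norm(W @ z - goal)
```

The analogy to the paper's procedure: the weights stay fixed, only the compact intention is searched, and the prior penalty plays the role of the KL complexity term.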
After the optimization steps explained above, we obtain the goal-directed intention z^AIf_t with the lowest free energy w.r.t. the goal. Then the RNN state h^q_t is computed as h^q_t = GRU(h^q_{t-1}, z^AIf_t), and the action is given by a_t = tanh(µ^a_t(h^q_t)). The active inference process is conducted at each environment step, except that the first step of each episode (t = 1) is handled the same as in habitual behavior, for warming up.
Notably, the goal here can be highly flexible. In the most basic case, the agent is provided with a visual image as the goal. In this case, the observation prediction error term can be computed in the same way as in training (Equation 4). The goal can also be a part of the image, obtained by masking out the other parts in the prediction error. Furthermore, the goal may be a specific color, in which case the prediction error is computed from the difference between the predicted future observation and an image filled with that color. In practice, supposing the goal color is G (an RGB value), we replace the prediction error (for both p and p_{+1} in Equation 6) with Σ_{i,j,c} exp[−(x^pred_{i,j,c} − G_c)² / 0.5], where x^pred_{i,j,c} is the predicted pixel value at the i-th row, j-th column and color channel c. Conversely, we may minimize the negative of such a difference so that the goal is to observe less of this color, replacing the prediction error with Σ_{i,j,c} exp[−(x^pred_{i,j,c} − G_c)² / 0.005] / 10. In these two cases, since the goal concerns all future steps, we set c_τ = 1 for all future steps. In sum, any goal that can be expressed as a loss function on the observation prediction can be used.
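A sketch of the color-goal score described above is given below; the image shape and goal color are stand-ins, and the scale parameter corresponds to the 0.5 in the expression above.

```python
import numpy as np

def color_goal_score(x_pred, G, scale=0.5):
    """Sum over pixels and channels of exp(-(x - G_c)^2 / scale): a
    Gaussian-kernel similarity between predicted pixels and goal color G."""
    return np.sum(np.exp(-((x_pred - G) ** 2) / scale))

x_pred = np.full((16, 64, 3), 0.5)     # stand-in predicted image (all gray)
G = np.array([1.0, 0.0, 0.0])          # stand-in goal color: red
score = color_goal_score(x_pred, G)
```

An image closer to the goal color scores higher, so driving predicted observations toward (or away from) a color reduces to optimizing this one scalar.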

On the complexity term in learning habitual behaviors
Here we show mathematically that the KL-divergence between the posterior and prior z bridges the gap between the actions based on the posterior and prior distributions of z. Consider that we want to maximize the logarithm of the probability that the action computed from the prior z^p_t equals the optimal action a*_t (assuming the optimal action is known or can be estimated using the learned value function). The loss function can be written as:

L_habitual = −ln E_{p(z_t)}[P(a_t = a*_t | z_t)].

By Jensen's inequality, we have

L_habitual = −ln E_{q(z_t|x_{1:t+1})}[ (p(z_t) / q(z_t|x_{1:t+1})) · P(a_t = a*_t | z_t) ]
≤ −E_{q(z_t|x_{1:t+1})}[ln P(a_t = a*_t | z_t)] (posterior policy loss) + D_KL[q(z_t|x_{1:t+1}) ‖ p(z_t)] (complexity).   (Equation 7)

Thus, minimizing the right-hand side of Equation 7 lowers the habitual-behavior policy loss. This derivation is similar to that of the variational lower bound (Kingma & Welling, 2014). The process can be intuitively understood as learning with the posterior z as oracle information to guide habitual behaviors (Han et al., 2022).
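The inequality can also be checked numerically. Below is a Monte-Carlo sanity check with 1-D Gaussian stand-ins for the prior p, the posterior q, and the action likelihood; all distributions and values are illustrative, not taken from the paper.

```python
import numpy as np

# Monte-Carlo check of:  -ln E_p[P]  <=  E_q[-ln P] + KL(q || p)
rng = np.random.default_rng(4)
mu_q, sigma_q = 0.7, 0.5                       # stand-in posterior q over z
lik = lambda z: np.exp(-(z - 1.0) ** 2)        # stand-in likelihood P(a*|z)

z_p = rng.standard_normal(100000)                    # z ~ p = N(0, 1)
z_q = mu_q + sigma_q * rng.standard_normal(100000)   # z ~ q
kl = 0.5 * (mu_q**2 + sigma_q**2 - np.log(sigma_q**2) - 1.0)

lhs = -np.log(np.mean(lik(z_p)))               # habitual (prior) policy loss
rhs = np.mean(-np.log(lik(z_q))) + kl          # posterior policy loss + KL
```

The left-hand side (the prior-based loss) indeed stays below the posterior policy loss plus the complexity term, which is exactly why training on the right-hand side improves habitual behavior.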
We can see that the posterior policy loss and complexity terms in Equation 7 are contained in the total loss used in habitual learning (Equation 3); interestingly, the complexity term jointly constitutes the free energy. Equation 7 explains why Equation 3 actually optimizes habitual behavior (given by the prior z^p_t) even though the policy loss term is expected over the posterior z^q_t.

Environment
We focus on a relatively simple yet important navigation environment in a T-shaped maze, or simply T-maze (Figure 3a).
The T-maze environment is a common behavioral paradigm used in cognitive science to study learning, memory, and decision-making processes in animals (O'Keefe & Dostrovsky, 1971; Olton, 1979). Here we consider a variant of the T-maze in which the objective of habitual behavior is to escape from the maze as soon as possible, assuming that an enemy is chasing the agent. There are two exits, in the top-left and top-right corners (Figure 3a). In each trial (episode), the agent starts from a fixed initial position at the bottom of the maze (Figure 3a). If the agent reaches an exit, it receives a reward of 100 and the trial is finished. Hitting the wall brings a negative reward (−10) to the agent. We consider a discount factor of 0.9 (Sutton & Barto, 1998) for RL, so that escaping with fewer steps is of higher value and is thereby encouraged.
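The reward rule can be summarized in a short hypothetical sketch (the actual simulator code is not specified in the text; the function name and signature are stand-ins):

```python
# Hypothetical sketch of the T-maze reward rule described above.
def tmaze_reward(reached_exit: bool, hit_wall: bool) -> float:
    reward = 0.0
    if reached_exit:
        reward += 100.0    # escaping through either exit ends the trial
    if hit_wall:
        reward -= 10.0     # penalty for hitting the wall
    return reward

# With discount factor 0.9, an escape after k steps is worth about
# 0.9 ** (k - 1) * 100 at the start state, so faster escapes are preferred.
```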
To make the environment more realistic for a biological agent, the agent's observation is visual perception: a 360° RGB camera centered at the agent, with a resolution of 16 by 64 (Figure 3b). We consider continuous-valued motor actions: the agent decides its horizontal movement at each step (represented by a 2-dimensional action vector), subject to a speed limit.
We focus on this environment for two primary reasons. First, the T-maze environment serves as a straightforward and widely adopted paradigm in cognitive science, enabling researchers to investigate various cognitive processes such as spatial learning and memory, working memory, and decision-making (O'Keefe & Dostrovsky, 1971; Olton, 1979). Its simplicity and versatility make it a popular choice for many researchers. Second, we are interested in understanding how an agent can autonomously develop diverse and effective behaviors through trial and error. The environment offers a clear illustration of various habitual behavior strategies, such as choosing between the top-left or top-right exit, which requires a decision branching at some point. Since the agent's actions are continuous and it must balance the exploration-exploitation trade-off in a simple yet meaningful context, the autonomous development of this decision branching presents a significant challenge that is under-explored in previous studies. Our framework aims to address this problem with the Bayesian (stochastic) latent intention z, which enables diversity in the abstracted action space while keeping the low-level actions efficient.

Learning diverse and effective habitual behavior
The neural network model of the agent is trained by RL and free energy minimization during habitual learning (detailed in Section 3.2). After abundant training (400,000 environment steps), the agent acquires diverse (randomly escaping from the top-left or top-right exit) and effective (using few steps without hitting the wall) habitual motor actions. Figure 3c shows the aerial view of one example agent's habitual behavior in six different trials after training, with no randomness in the motor action. The choice of exit is implicitly decided by the neuronal noise of the intention z (Equation 2) in the first several steps of each trial.
It is also interesting to look into the internal representations of the neural network model. In particular, we visualize the main RNN state h^q_t using its principal components (PCs) (Pearson, 1901), based on multiple trials of habitual behavior of the example agent. Figure 3d shows the movement trajectories of the agent in these trials, where red and blue trajectories correspond to escaping from the left and right, respectively. Figure 3e shows the first 3 PCs of h^q_t, with colors consistent with Figure 3d. It can be seen that a branching of the h^q_t representation occurs after the first several steps. This branching is introduced by the randomness of the 2-dimensional latent variable z. We also visualize z^q_t, z^p_t, and h^q_t (first 2 PCs) at each respective step in Figure 3f, where color indicates the final escape exit. Interestingly, the branching does not happen at a single step, as can be seen from the fact that there is no clear separation of red and blue z^p at any single step (Figure 3f). Nonetheless, the branching is fully decided by z, which will be verified in Section 4.3. We also plot the development of the agent's habitual behavior over the course of learning in Figure 3g, which demonstrates that the agent gradually develops diverse and efficient habitual behavior using RL (trial and error).
As our framework involves a number of mechanisms, we conducted an ablation study to understand the roles of three crucial parts of our framework in learning habitual behaviors. We computed diversity and efficiency metrics to quantify performance (Figure 4). The results show that these components are crucial for developing more diverse and efficient habitual behavior.

Flexible goal-directed behavior immediately transferred from habitual behavior
During habitual learning, the agent acquired the behavior of randomly going to the left or right exit. The experience of habitual learning also helped the agent to form an internal predictive model of visual observations in the T-maze environment. An intelligent embodied AI should also be able to perform goal-directed behavior without extra training, which is indeed enabled by our framework using the method detailed in Section 3.3.
Conventional goal-conditioned RL (Liu et al., 2022) (Section 5.4) treats the goal as an input to the model and outputs goal-directed actions. Such a treatment has two crucial limitations. First, the goal needs to be explicitly involved in training, in which the agent is rewarded when it achieves the given goal (Section 5.4). Second, the full state of the goal needs to be given at each trial. For example, if the desired goal is to "observe more red colors", it is unknown how to provide the input; simply using an all-red image may not be proper if there is no such state. In contrast, our framework, which is based on PC, considers the goal by using the variational free energy as a loss function to minimize (Equation 5). The free energy w.r.t. the goal is much more flexible. Considering again the case where the desired goal is observing more red colors, we can simply replace the prediction error term in the free energy (Equation 5) with a loss function reflecting how red the predicted image is. Furthermore, our framework enables the agent to achieve a goal that was not explicitly involved during RL, i.e., zero-shot transfer from reward-maximizing, habitual control to reward-free, goal-directed planning. The key mechanism from which this ability emerges is a compact representation of behaving strategies using the latent intention z, which is also bound to future observations. Here, goal-directed behavior is generated by active inference (Friston et al., 2016): inferring the z that minimizes the free energy w.r.t. the goal (Equation 5), i.e., reducing the gap between the predicted future outcome and the goal while annealing the complexity of z (this z is denoted as z^AIf). Then, z^AIf_t can be used to update the main RNN state h^q_t and to compute the action from h^q_t using the policy network. Note that the policy network is the same one used in habitual behavior, which means it tends to output actions that lead to more rewards; in this environment, such actions correspond to reaching an exit faster and avoiding hitting the wall. While going to the left or the right exit from the initial position takes the same amount of effort, the goal decides which way to go. Also, the constraint imposed by the KL-divergence between the prior and posterior in the free energy (Equation 5) can be intuitively understood as ensuring that the inferred z^AIf does not deviate too much from the prior distribution (corresponding to the reward-maximizing habitual behavior), so that the agent does not execute inappropriate actions (e.g., hitting the wall).
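The inference loop described above can be sketched as gradient descent on a free-energy-like objective with the prediction model held fixed. In this toy NumPy version, a fixed linear map `W` stands in for the learned predictive network and z is treated as a point estimate; all names and values are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 2))           # fixed prediction model: z -> observation
goal = np.array([1.0, -1.0, 0.5, 0.0])
beta = 0.1                            # weight of the complexity (KL) term

def free_energy(z):
    # prediction error w.r.t. the goal plus KL to the unit-Gaussian prior
    # (for a point estimate of z, the KL term reduces to 0.5 * ||z||^2)
    err = W @ z - goal
    return 0.5 * err @ err + beta * 0.5 * z @ z

def grad(z):
    return W.T @ (W @ z - goal) + beta * z

z = np.zeros(2)                       # start from the prior mean
for _ in range(200):                  # iterative inference; model stays frozen
    z -= 0.05 * grad(z)

print(free_energy(z) < free_energy(np.zeros(2)))  # True
```

The frozen `W` mirrors the fixed "brain" in Figure 1c: only the intention z is updated, pulling predicted outcomes toward the goal while the `beta` term keeps z close to the habitual prior.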
To demonstrate the flexibility of goal-directed behavior using our framework, we performed experiments providing three kinds of goals to the agent: (1) going to a place such that the agent's visual observation is the provided image (Figure 5a); (2) observing a color as much as possible (Figure 5b, c); (3) avoiding observing a color (Figure 5d, e). Figure 5 shows the goal-directed behaving trajectories of the same agent as in Figure 3, where each panel shows one kind of goal (detailed methods are explained in Section 3.3). The agent performs goal-directed behavior with a high success rate (a: 97.9%, b: 97.4%, c: 100%, d: 99.5%, e: 100%, tested using 32 agents and 6 episodes per agent), where a goal-directed trial is considered successful if the agent reaches the place near the goal in (a), goes to the top-left in (b, d), or goes to the top-right in (c, e).

Figure 6: Left: the moving trajectory of the example agent (the same one as in Figure 5), plotted in the same way as in Figure 5, and the goal (or disliked) observation image (Section 4.3). Right: the first row shows the post-hoc future observations from the current step t; the second row shows the predicted visual observations.
Figure 6 provides more details about the inferred z^AIf for the three kinds of goals (see Section 3.3 for detailed methods). First, Figure 6a shows the future observations predicted with the inferred goal-directed intention (Section 3.3) at the second step of the trial. The agent makes accurate predictions more than 10 steps into the future, up to reaching the place where it can observe the goal image. It can also be seen that, in the case where the goal is to observe more red colors (Figure 6b), the agent makes reasonably realistic future predictions containing more red colors. Figure 6c demonstrates another interesting case: the goal is to avoid observing blue colors. The agent also succeeds in going to the exit according to this goal.
The predictive ability demonstrated in Figure 6 is not specific to a few agents, but common to most of them. It can be seen that the goal-directed behavior in these cases is approximately covered by the habitual behavior, given that the learned habitual behavior contains the diverse options of escaping from either the left exit or the right one. This is one of the most important concepts in our framework: goal-directed planning is constrained by habits in terms of low-level motor skills. For example, if a person wants to move a mug from one place to another, the person's habitual gesture for holding the mug will usually be used instead of other gestures that may also hold the mug. We consider this an essential property for efficient goal-directed planning, because the already-shaped habits largely reduce the search space of possible actions for the goal (Parr & Russell, 1997; Konidaris & Barto, 2009). Thus, active inference in our framework conducts a search over only "good" (reward-maximizing) actions that may lead to future observations close to the goal. Although there is a limitation that the agent may not be able to achieve a goal that is unreachable by its habitual behavior, this property is crucial to address the efficiency-flexibility dilemma.

Related work in AI
Looking into each part of our framework, many, but not all, are well formulated in the literature or share similar ideas. However, our framework is not an incremental study on any existing one, nor a straightforward combination of them. This section clarifies the novelty of our methodologies (problem definition, optimization algorithm, and model architecture) given existing ones.

Embodied artificial general intelligence
Let us first look at position papers about how to create a general, intelligent embodied artificial agent (also known as a foundation model (Bommasani et al., 2021), which refers to AI models that can perform a wide range of tasks and adapt to new challenges). While no work has achieved this goal yet, there are countless articles addressing the problem. Here, we discuss two particularly interesting and related ideas.
The one big net framework proposed by Schmidhuber (Schmidhuber, 2018) stems from the observation that humans and other animals have one large neural network (i.e., their brain) that can efficiently learn and perform a wide range of tasks. Schmidhuber envisions that such a network would be able to continually learn and adapt to new tasks by reusing and transferring knowledge across tasks, making the learning process more efficient. A key idea shared between our framework and the one big net is that the central part of the model is an RNN. The RNN models the physical dynamics of the world, which are intrinsically invariant w.r.t. time, and maintains internal states that can theoretically encode information from an infinitely long history (Hochreiter & Schmidhuber, 1997). However, that work does not explicitly point out how prediction plays a role in goal-directed planning.
Another recent perspective, from LeCun, is the so-called autonomous machine intelligence framework (LeCun, 2022), which shares some high-level ideas with ours. LeCun emphasizes that the world model, which plays the two-fold role of planning for the future and estimating missing observations, should be an energy-based model. In the context of goal-directed behavior, the model takes the current state, the goal, and the action to take as input, and outputs a scalar energy value describing their "consistency". Similar to our ideas, the model and internal states are optimized for energy minimization with gradient methods. A key difference is that our model makes explicit predictions about sensation, and the uncertainty of sensation is handled by the stochastic latent variable z with variational Bayesian methods.

Model-based RL
Model-based RL (MBRL) approaches train the agent(s) based on a mathematical model that can predict the upcoming state or observation given current and previous observations and actions. Usually, the model is used either for dreaming, i.e., generating imaginary experiences to train the agent (Deisenroth & Rasmussen, 2011; Ha & Schmidhuber, 2018; Kaiser et al., 2019; Hafner et al., 2019), or for planning, i.e., inferring the policy that leads to maximum returns in the future (Hafner et al., 2018; Ke et al., 2019; Schrittwieser et al., 2019). Notably, there are also methods that use the model to extract information from the environment to serve model-free RL (Igl et al., 2018; Han et al., 2020; Lee et al., 2020), which share a similar idea with our framework when learning habitual behavior. While Igl et al. (2018), Han et al. (2020), and Lee et al. (2020) also used variational RNNs, they focused on single-goal tasks. By contrast, the stochasticity in our model reaches its potential by enabling the agent to randomly pursue one of multiple goals. The planning phase of our framework uses active inference (Friston et al., 2010), which infers the policy using the model as in MBRL, though not to maximize returns but to minimize free energy w.r.t. the goal (see Sec. 5.5).

Variational Bayes in deep learning and RL
Variational Bayesian (VB) approaches in deep learning have been popular since the introduction of the variational auto-encoder (VAE) (Kingma & Welling, 2014; Sohn et al., 2015). The core idea is to maximize a variational lower bound of an objective function of a probabilistic variable so that the original distribution can be replaced with a variational approximation (Alemi et al., 2017). Chung et al. (2015) complemented the VAE with recurrent connections by proposing variational RNNs. The variational RNN and its variants were later used in deep RL, e.g., as the world model in MBRL (Hafner et al., 2018, 2019; Han et al., 2020; Lee et al., 2020) and for approximating unobservable environmental states (Igl et al., 2018; Han et al., 2022). One critical reason to use VB models in RL is that they are believed to extract useful representations of the environmental state from raw observations (Alemi et al., 2017) and to be robust in training. The acquired representation is then incorporated into the original RL task, i.e., maximizing rewards. While our framework can be considered a new member of the VB family that handles decision-making/control tasks, our idea of modeling habits and goals using variational Bayes has not been discussed in previous deep learning studies.
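The complexity term shared by these VB models, the KL divergence between a diagonal-Gaussian posterior and a unit-Gaussian prior, has a simple closed form; a minimal NumPy sketch:

```python
import numpy as np

def kl_to_unit_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ): the complexity term
    in the variational lower bound (Kingma & Welling, 2014)."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

# When the posterior equals the unit-Gaussian prior, the KL is zero
print(kl_to_unit_gaussian(np.zeros(2), np.zeros(2)))  # 0.0
```

Any deviation of the posterior mean or variance from the prior makes this term strictly positive, which is what penalizes intentions far from the habitual prior.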

Goal-conditioned RL
Goal-conditioned RL (GCRL) (Liu et al., 2022) addresses scenarios in which a goal is given at each episode and needs to be achieved. The goal can be a property or feature (Florensa et al., 2018), an observation or state (Andrychowicz et al., 2017), or a language description (Luketina et al., 2019). The main difference between GCRL and our framework is that in GCRL, goals are given during learning and agents are only rewarded when the given goal is achieved. In contrast, our framework does not assign any specific goals during training; it only maximizes rewards for habitual behavior and minimizes the expected free energy. In simple terms, the training and testing problems are consistent in GCRL, but different in the proposed framework.
Deep active inference (AIf) shares a similar idea with MBRL in terms of inferring actions to achieve a desired outcome using an environment model; the main difference is that AIf maximizes the likelihood of achieving a certain state, while MBRL maximizes expected rewards. Another notable difference is that AIf is a probabilistic framework, while MBRL does not have to be.
Goal-directed planning in our framework employs the idea of active inference. However, our model does not directly infer actions to achieve the goal, but rather the latent state in the model that encodes the intention. This latent state can also be understood as a high-level action from the perspective of hierarchical RL (Sutton, 1984; Eppe et al., 2022).

Control as probabilistic inference
Control as probabilistic inference (CPI) (Levine, 2018) proposes using probabilistic inference to compute the optimal control action instead of designing a deterministic control policy, by casting the control task as a probabilistic inference problem over latent variables that describe the state of the system. Although the basic idea shares insights with model-based learning, CPI does not consider detailed outcomes of actions but only maximizes rewards. Therefore, practical implementations of CPI turned out to be model-free algorithms, such as soft actor-critic (SAC) (Haarnoja et al., 2018a,b). In our implementation, SAC is used as the base RL algorithm to learn the habitual behavior. Readers are also encouraged to refer to Millidge et al. (2020) and Hafner et al. (2020) for in-depth discussions of the relationship between probabilistic inference and decision-making/control.

Generalization in RL
A number of studies address a problem known as generalization in RL (Kirk et al., 2021). They consider cases where training and testing tasks differ in terms of state distributions (Cobbe et al., 2019), dynamics (Ni et al., 2022), observations (Cobbe et al., 2019), or reward functions (Yu et al., 2020). Our framework generalizes habitual control to goal-directed planning. As far as we are aware, however, there has not been any study that handles learning without goals but testing with goals; i.e., our work may be considered a novel setting of generalization in RL.

Self-discovery of goals or skills
A class of methods known as variational skill discovery (VSD) (Achiam et al., 2018) aims to discover action primitives in reinforcement learning (RL) by optimizing an unsupervised or self-supervised objective function based on information theory. These methods use a latent variable, z, to label action primitives, which can be either discrete (Gregor et al., 2016; Achiam et al., 2018; Eysenbach et al., 2019) or continuous (Sharma et al., 2020; Xu et al., 2020). The policy model, π(a|z, s), where a is action and s is state, is then trained using pseudo (intrinsic) rewards and RL. These pseudo rewards reflect an information-theoretic objective that encourages the skills to be diverse and predictable by states, such as the variational lower bound of the mutual information between the set of skills and skill termination states.
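The pseudo-reward idea can be sketched as follows, in the spirit of discrete-skill methods such as DIAYN (Eysenbach et al., 2019), where the reward is log q(z|s) − log p(z). The discriminator here is a fixed softmax over hand-made state features, purely an illustrative assumption rather than a learned network:

```python
import numpy as np

n_skills = 4

def discriminator(state):
    # stand-in skill discriminator q(z|s): assume each skill "owns"
    # one of the first n_skills state dimensions
    logits = state[:n_skills]
    e = np.exp(logits - logits.max())
    return e / e.sum()

def pseudo_reward(state, skill):
    """Variational lower bound on skill-state mutual information:
    reward skills that are identifiable from the visited state."""
    log_q = np.log(discriminator(state)[skill])
    log_p = np.log(1.0 / n_skills)    # uniform skill prior p(z)
    return log_q - log_p

# A state that strongly identifies skill 2 yields a positive reward
state = np.array([0.0, 0.0, 5.0, 0.0, 0.3])
print(pseudo_reward(state, 2) > 0)    # True
```

In a full VSD pipeline this pseudo reward would replace the environment reward when training π(a|z, s); states that fail to reveal which skill produced them yield rewards near or below zero.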
A particularly related work is Mendonca et al. (2021), which, like ours, considers control both without and with a goal. However, they explicitly train a goal-achieving agent using RL by designing a "goal achievement reward", in addition to training an agent for exploration; in our framework, by contrast, there is no training for goal achievement.

Discussion
Computing the Bayesian latent state z
Some readers may still be confused about the Bayesian variable z: how is z computed in different cases? In this work, the prior of z always follows a diagonal unit Gaussian distribution, which corresponds to a diverse and unconditioned choice of goals at the current step. In contrast, the posterior of z is aligned with future states. To be clear, there are two ways to compute the posterior distribution, respectively used in learning habitual behaviors (optimizing the model weights and biases) and in goal-directed planning (optimizing z^AIf).
During execution and learning of habitual behaviors (Figure 2a, b), the posterior z^q_t is computed using the previous RNN state h^q_{t-1}, the current observation x_t, and the post-hoc, next-step observation x_{t+1} as input (a forward process, Equation 2). This is a crucial mechanism to better bind the intention z to the transition of the environment state without a goal being assigned, since achieving a goal in the future depends on a chain of environment state transitions. Thus, the main RNN, using the intention z as input, gives rise to the capacity for later goal-directed planning. The forward computation also ensures computational efficiency in habitual behaviors.
The other way, used in goal-directed behavior (Figure 2c), is active inference, i.e., searching for the posterior z^AIf that minimizes the free energy w.r.t. the goal (a backward process, Equation 5). This way of inferring z^AIf is opposite to conventional algorithms for goal-directed control (Liu et al., 2022), in which the goal is encoded as an input (a forward process). One key difference is that active inference allows the training (habitual, here) and testing (goal-directed) problems to be different, since it can utilize the learned world model with the compact intention representation z to do imaginary future planning (Figure 2c). In contrast, the conventional goal-as-input approach needs to keep assigning the goal during training. Another significant advantage of our approach is that various properties of the observation can flexibly be set as the goal, such as seeing more red colors (Section 4.3), in contrast to the goal-as-input approach, which requires a full goal to be involved in training. Meanwhile, active inference needs much more computation to conduct goal-directed planning (optimizing z^AIf), which is unavoidable and also well known in animal behaviors (Redgrave et al., 2010).

Toward understanding behavior
Recalling the three questions raised in the Introduction, we now provide our answers after demonstrating our experimental results. How does an agent acquire diverse yet effective habitual behavior? We need stochasticity in the hidden layer (z) of the policy network rather than stochasticity in motor actions. With a proper learning process that regularizes the stochasticity, the neural network can self-develop diverse intentions thanks to the randomness of z, while executing effective actions with the well-formed low-level motor skills (the mapping from z to motor actions).
How can the gap between habitual and goal-directed behavior be bridged? We propose to consider habitual and goal-directed behavior as the unconditioned and goal-conditioned distributions of the latent variable z (intuitively called intention in this paper). The habitual side corresponds to the prior distribution of z, not conditioned on any goal, and the other side includes hindsight goal-related information to determine the posterior of z. The KL-divergence term between the prior and posterior of z, contained in the variational free energy, acts as the bridge between habitual control and goal-directed planning. Variational Bayesian methods provide the theoretical foundation that enables the sharing of motor skills (Sec. 3.4).
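Schematically, and consistent with this decomposition, the variational free energy splits into an accuracy (prediction-error) term and a complexity term; the exact conditioning follows the paper's Equations 3 and 5:

```latex
\mathcal{F} \;=\; \underbrace{\mathbb{E}_{q(z)}\!\left[-\log p(x \mid z)\right]}_{\text{prediction error (accuracy)}}
\;+\; \underbrace{D_{\mathrm{KL}}\!\left[\,q(z)\,\|\,p(z)\,\right]}_{\text{complexity}}
```

where q(z) is the goal-conditioned posterior (the goal-directed side) and p(z) is the unconditioned prior (the habitual side); the KL term is the bridge discussed above.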
How does an agent generate actions to achieve a goal that it has not been trained to accomplish? The agent should have an internal predictive model of the environment, so that it can perform a "mental search" for motor patterns that lead to the desired outcome. Importantly, it is too costly to search over all possible motor actions. Instead, an intention or abstracted action (z here) should be inferred: the motor actions generated from the intention, regularized by a prior distribution (i.e., by AIf), are usually effective, since well-developed low-level motor skills have already been formed. This explains why there are infinitely many hand gestures to hold a cup, yet we typically use the same gesture repeatedly. This shares the same idea with the pre-training paradigm of recent advances in AI such as GPT (Brown et al., 2020) and CLIP (Radford et al., 2021). In particular, designing a learning objective function for a given goal is like designing a prompt for language models. However, a key feature of our framework is that its training process does not need to involve goals, which is different from the training of GPT, where prompts are used in training.

The frame problem of AI
The frame problem in AI (McCarthy & Hayes, 1981) refers to the challenge of determining which aspects of an environment are relevant when making decisions or solving problems. It arises due to the vast number of potential factors an AI system must consider, making it computationally infeasible to model all possibilities. More practically, computational models that take the relevant information as input are vulnerable to the frame problem. Goal-directed behavior is especially susceptible to it, as goals can be highly diverse and complex even within a single sensory modality like vision, not to mention that biological agents possess multiple sensory inputs. Rather than treating the goal as a direct input to the model (Andrychowicz et al., 2017), our framework employs a backward process to infer goal-directed intentions using predictive coding. This approach presents a potential solution to the frame problem in complex environments.

Where does the goal come from?
While the proposed framework answers the question of how goal-directed behavior arises from habitual skills, there remains a more fundamental question: where does the goal come from? In our simulations and many other machine learning studies (Andrychowicz et al., 2017; Mendonca et al., 2021), the goal is assigned by the programmer according to the task. But what about humans and animals? It would be interesting to consider modeling the intrinsic mechanisms of goal selection (Reinke et al., 2020) in future research. A potential mechanism is to learn or evolve a "meta"-habitual behavior of goal selection that enhances the fitness of the agent and the population.

Mutual conversion between habitual and goal-directed behaviors
A prevalent concept in developmental psychology is that individuals initially exhibit goal-directed behavior when adapting to new tasks or situations, which eventually evolves into more habitual and automatic responses (Dolan & Dayan, 2013). For example, consider a person who has just moved to a new house: initially, they need to consciously navigate using a map or directions to find their way home. Over time, as they become familiar with the route, getting home becomes a habitual and automatic behavior requiring little conscious thought (Yin & Knowlton, 2006; Tricomi et al., 2009). In this paper, we consider a reversed schema. Although seemingly controversial, our schema is indeed also common; consider the scenario of a sports player trying to win a game by a certain point difference (the habitual behavior being trying to win every point).
Nonetheless, it is straightforward to address how a repetitive goal converts into habits within our framework. The idea is similar to amortized inference (Gershman & Goodman, 2014), a technique in machine learning that speeds up inference by spreading the computational cost. Amortized inference involves training a simpler model to approximate cost-heavy inference calculations (e.g., inferring the goal-directed intention), resulting in faster and more efficient predictions compared to the original inference methods. In our case, we can train a feedforward neural network z_amortized(h) to predict the goal-directed intention from the current RNN state of the agent. The network is trained using the agent's experience during goal-directed behaviors with a repetitive goal. Our experiment shows that using z_amortized(h) to replace the original prior and posterior intentions in habitual behaving leads to fast-computing habitual behavior that pursues only the corresponding goal (see Appendix A.2).
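The amortization step can be sketched in a few lines. Here a linear least-squares map stands in for the feedforward network z_amortized(h), and the (h, z^AIf) training pairs are synthetic placeholders; both are illustrative assumptions, not the paper's architecture or data:

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(500, 16))                       # recorded RNN states h
true_map = rng.normal(size=(16, 2))
Z = H @ true_map + 0.01 * rng.normal(size=(500, 2))  # inferred z^AIf targets

# Training phase: least-squares fit of the amortized predictor,
# replacing many expensive per-trial active-inference optimizations
W, *_ = np.linalg.lstsq(H, Z, rcond=None)

# Test time: one cheap forward pass instead of iterative inference
z_fast = H[0] @ W
print(np.allclose(z_fast, Z[0], atol=0.1))  # True
```

The forward pass costs a single matrix product, which is the sense in which the repeated goal-directed computation is "amortized" into habitual-speed behavior.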

Predictive coding and information bottleneck
The theory of predictive coding (PC) suggests that the brain learns to identify patterns and reduce unneeded information by removing from the input whatever can be predicted based on these patterns in the natural world (Huang & Rao, 2011). In our experiments, information about the environment and goal is encoded in a 2-dimensional vector z, while the model can still make reasonable predictions of future observations, which reflects the key idea of PC. This compact encoding, together with the complexity constraint (the KL-divergence between posterior and prior), enables effective active inference to compute the optimal value of the goal-directed intention within a constrained, small search space.
Another perspective, from machine learning theory, is that given that the prior of z is a unit Gaussian in our case, minimizing the expected free energy (in Equation 3), or the negative variational lower bound (Kingma & Welling, 2014), can be considered a special case of the information bottleneck (IB) objective (Tishby & Zaslavsky, 2015; Alemi et al., 2017). The IB objective tends to minimize the mutual information between the input (vision here) and the latent encoding z, and to maximize the mutual information between z and the model's prediction (Alemi et al., 2017). This idea is consistent with PC and provides a mathematical understanding of how minimizing the free energy in training relates to PC.

Hierarchy
Hierarchy is a crucial property in PC because it enables efficient processing of sensory information by allowing the brain to make predictions at different levels of abstraction. E.g., neurons in the primary visual cortex have simple and complex receptive fields, while higher-level visual areas have increasingly complex receptive fields that allow sophisticated processing of visual information, ultimately leading to object recognition (Rao & Ballard, 1999). In our case, the RNN state h and the intention z can be considered two levels of task representation, where z is of a higher-level abstraction (with only 2 dimensions compared to the 256 dimensions of h). Nonetheless, further work should consider the intrinsic hierarchy of the model by borrowing ideas from brain science, such as the multiple-timescale property found in cortical layers (Murray et al., 2014), or from deep learning models, such as the Swin Transformer (Liu et al., 2021).

Arbitration between habitual and goal-directed behaviors
A handful of studies consider modeling the agent's actual behavior as a mixture of goal-directed (model-based) and habitual (model-free) components (Gläscher et al., 2010; Smittenaar et al., 2013; Lee et al., 2014). The most common way is to use a linear combination of actions computed by two separate systems handling goals and habits, respectively. However, this is not plausible if the habitual behavior is going left and the goal-directed behavior is going right in our T-maze task: a linear combination of motor actions will lead to hitting the wall in the middle. Our Bayesian behavior framework suggests a natural way of taking advantage of both behaviors; for example, the mixed behavior can be computed using the intention z given by Equation 8. Looking at Equation 8, it is easy to see that the z_t (prior or goal-directed) with smaller variance (σ²_t) is more dominant. The Bayesian property of z elegantly connects uncertainty with σ_t; thus, the intention with lower uncertainty will be preferred. Interestingly, state-of-the-art variational autoencoder models (Sønderby et al., 2016; Child, 2021) also use a similar method to compute the latent variable z.
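The precision-weighted mixing can be sketched as a product-of-Gaussians update in the spirit of Equation 8 (the exact form is the paper's; this scalar version is illustrative):

```python
import numpy as np

def fuse(mu_p, var_p, mu_g, var_g):
    """Combine two Gaussian intentions by precision (inverse-variance)
    weighting; the lower-variance (more certain) one dominates."""
    precision = 1.0 / var_p + 1.0 / var_g
    mu = (mu_p / var_p + mu_g / var_g) / precision
    return mu, 1.0 / precision

# Prior (habitual) says "left" (-1) with high uncertainty; goal-directed
# intention says "right" (+1) with low uncertainty
mu, var = fuse(mu_p=-1.0, var_p=1.0, mu_g=1.0, var_g=0.1)
print(mu > 0)  # True: the more certain, goal-directed intention dominates
```

Because the mixing happens in intention space rather than motor-action space, the fused z still maps to a coherent trajectory through the policy network, avoiding the wall-hitting average of the two raw actions.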

Insights for neurological diseases
Our framework poses a novel variational Bayesian understanding of goal-directed and habitual behaviors, which may also provide valuable insights for understanding and treating neurological diseases like Parkinson's disease (PD) (Dolan & Dayan, 2013) and autism spectrum disorder (ASD) (Van de Cruys et al., 2014).
For PD, previous research has suggested that patients have difficulty with goal-directed behavior, tending to rely more heavily on habitual than goal-directed actions as their goal-directed planning ability is impaired (Wunderlich et al., 2012b,a). As we have discussed the arbitration between the two kinds of behaviors in Section 6.8, such impairment may be explained by a large uncertainty of the goal-directed intention. It might be worth investigating how to reduce this uncertainty through medication/deep brain stimulation (changing internal states) (Cotzias et al., 1969; Perlmutter & Mink, 2006) or sensory stimuli (changing brain inputs) (Azulay et al., 1999; Muñoz-Hellín et al., 2013) to improve the motor control ability of PD patients.
It is well known that repetitive behavior is a key characteristic of ASD, and abnormal predictive coding in people with ASD is a popular explanation (Pellicano & Burr, 2012; Wild et al., 2012; Van de Cruys et al., 2014; Palmer et al., 2017). In particular, the pathology of ASD can be computationally explained by over-weighting of the complexity term in the free energy (Equation 1), which impairs cognitive-behavioral flexibility when adapting to changing environments (Soda et al., 2023). Our framework can be used as a computational tool to help understand how the stochasticity of z affects behavioral diversity (Appendix A.3).

Conclusion
In this article, we proposed the Bayesian behavior framework, suggesting a novel paradigm for considering habitual and goal-directed behavior. Our contributions are two-fold: technical and conceptual.
Technically, we propose a novel Bayesian framework that enables seamless transfer between fast, inflexible habitual behavior and slow, flexible goal-directed behavior. The proposed Bayesian behavior framework is based on two core ideas: modeling habits and goals with the prior and posterior distributions of a Bayesian variable, respectively; and utilizing the predictive internal model for goal-directed planning. Based on these two ideas, we employed model-free deep RL to learn motor skills for habitual behavior, since deep RL captures many key features of the embodied learning of biological agents (Botvinick et al., 2020). We then apply AIf, which enables flexible goal-directed behavior (Friston et al., 2010). Despite the common belief that AIf and RL are contradictory (Friston et al., 2009), we argue that combining the two methods results in more efficient learning and reuse of motor skills. Our framework provides a concrete methodology that elegantly unites all of these.
For habitual behavior, we have shown the emergence of smooth bifurcating behavior of an embodied AI through trial and error (i.e., online deep RL) using the proposed framework (Section 4.2). Simultaneously, the agent learns an internal model to predict future observations conditioned on the intention z. We have then demonstrated the flexibility of goal-directed behavior with the framework by leveraging the predictive internal model (Section 4.3). Although never trained with goals, the agent is able to plan for a given goal observation or partial goal properties.
Conceptually, our work proposes potential underlying neural mechanisms that are essential to the flexible and efficient behavior of intelligent biological agents. To briefly summarize the take-away messages of this work for cognitive science and neuroscience researchers:
• Stochasticity in neural activities, with proper regularization, enables diverse and effective motor actions.
• Habitual and goal-directed behavior, though having different conditions, can share low-level motor skills using variational Bayesian methods.
• Predictive coding enables flexible goal-directed behavior by inferring which values of the neural state z result in the desired goal properties.
• The variational free energy's complexity term bridges the gap between habitual and goal-directed behaviors.
Our framework has certain limitations. We mainly focus on a proof of concept using a fundamental T-maze task, which is nevertheless challenging due to the high-dimensional nature of the first-person vision used for observation. We have yet to tackle more complex motor control tasks, which are commonly seen in real animal behavior (Mattar & Lengyel, 2022).
Another limitation is that the agent may not be able to reach a state or place that is not covered by its habitual behavior. This scenario is relatively rare, but it may occur (e.g., an intentional loss to a weaker opponent in a sports game). To overcome this limitation, it may be necessary to conduct additional learning or to search over raw actions rather than relying on learned motor skills, which can be more time-consuming and resource-intensive.
As for future research, intrinsic generation of goals (Section 6.4) appears to be an important direction for answering the ultimate question of how autonomous agents can self-develop (LeCun, 2022). Another important direction lies in hierarchical model structures (Section 6.7). Moreover, significant improvement may come from integrating more modalities such as natural language and sound. The format of the goal could be made much more flexible by introducing pretrained models like GPT (Brown et al., 2020) and CLIP (Radford et al., 2021), which provide extensive possibilities for incorporating different modalities into the goal-directed intention.
A larger β_z more strongly regularizes the intention toward its prior, and vice versa (Higgins et al., 2017). We sweep a range of β_z values (1, 10, 100 (used in the paper), 1,000, and 10,000) to check its impact on the learned habitual behavior.
It can be seen in Figure 8 that agents with β_z = 100 have the best overall behavioral efficiency and diversity. In particular, when β_z is very large, the agent tends to acquire habitual behavior with very low diversity.
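The role of β_z can be made concrete with a small sketch of a β-weighted variational objective, in the spirit of β-VAE (Higgins et al., 2017). The function names and the diagonal-Gaussian parameterization are illustrative assumptions, not the paper's exact implementation: the free energy is written as accuracy (prediction error) plus β_z times the complexity term, the KL divergence between the intention posterior and a standard-normal prior.

```python
import math

def kl_diag_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, logvar))

def free_energy(pred_error, mu, logvar, beta_z):
    """Accuracy (prediction error) + beta_z * complexity (KL to the prior)."""
    return pred_error + beta_z * kl_diag_gaussian(mu, logvar)

# A larger beta_z makes deviating from the prior costlier, so the posterior
# collapses toward the prior; a tiny beta_z lets the posterior ignore it.
loss_small_beta = free_energy(1.0, mu=[1.0], logvar=[0.0], beta_z=1)     # 1.5
loss_large_beta = free_energy(1.0, mu=[1.0], logvar=[0.0], beta_z=100)   # 51.0
```

This matches the observation above: a very large β_z over-penalizes posterior deviation from the goal-less prior, which plausibly yields low-diversity habitual behavior.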

Figure 1: The idea of our Bayesian behavior framework. (a) Behaving using either the habitual or the goal-directed intention z to generate actions. (b) Minimizing the free energy w.r.t. the desired observation leads to the goal-directed intention z^post. (c) More details on using the goal-directed intention by predicting future observations, in which the goal-directed intention z^post is continuously inferred to minimize the free energy w.r.t. the goal while the prediction model (the "brain") is fixed.

Figure 2: Detailed workflows of our model in (a) habitual behaving, (b) training, and (c) goal-directed behaving. See Section 3.3 for the explanation of c_1, c_2, ... in (c).

Figure 3: Habitual behaviors in the T-maze environment. (a) Rendering of the environment in the PyBullet physics engine. The agent is abstracted as a black ball-shaped robot and is rewarded if it escapes from the maze via the top-left or top-right exit. (b) Visual observation of the agent at the initial position, i.e., the robot's first-person view of the environment using a 360° RGB camera. (c) Moving trajectories of an example agent's habitual behavior in six different trials, without noise in motor actions (aerial view). The gray square denotes the initial position of the agent and each arrow denotes one step. (d) The example agent's habitual moving trajectories over multiple trials. The color indicates the final escaping exit (red or blue) and the step from the start (lightness). (e) Visualization of the agent's internal representations using PCA of the RNN state h^q. The first 3 PCs are plotted; data and colors are consistent with (d). (f) Visualization of z^q, z^p, and h^q at each step. For h^q, PCA is conducted on the data of individual steps, and the first 2 PCs are plotted. The markers correspond to the final exit (red triangles: left; blue squares: right). (g) Change of habitual behaviors from the beginning of learning to convergence. The dark purple and light green curves indicate the trajectories of deterministic and stochastic motor actions in multiple trials, respectively.

Figure 4: Ablation studies of habitual behavior. (a) Diversity and efficiency. Efficiency is defined as 1/mean(steps to reach exit), and diversity is computed as min(n_left/n_right, n_right/n_left), where n_left and n_right are the numbers of trials successfully exiting from the top-left and top-right corners, respectively, in 60 testing episodes. The models are explained as follows. No Bayes: the same as the original model except that z is a fully deterministic variable; there is no prior distribution of z, so the complexity term is not involved. No prediction: this model does not perform observation prediction, but is otherwise the same as the original. No next obs.: differs from the original model in that the posterior z^q_t does not predict or depend on the subsequent observation x_{t+1}. (b) Number of trials exiting from the left or right for each agent (random seed). For (a) and (b), 32 random seeds are used and the motor actions are deterministic. (c) and (d) show the results using stochastic motor actions.
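The two metrics in the caption can be written out directly. This is a small sketch with illustrative function names (not the paper's code), computing efficiency and left/right diversity over a set of test episodes.

```python
def efficiency(steps_per_trial):
    """Efficiency = 1 / mean(steps to reach exit), over successful trials."""
    return len(steps_per_trial) / sum(steps_per_trial)

def diversity(n_left, n_right):
    """min(n_left/n_right, n_right/n_left): 1 = balanced exits, 0 = one-sided."""
    if n_left == 0 or n_right == 0:
        return 0.0
    return min(n_left / n_right, n_right / n_left)

# Example: 60 test episodes split 30/30 gives maximal diversity (1.0);
# an agent always averaging 10 steps per trial has efficiency 0.1.
```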

Figure 5: Moving trajectories of the example agent's goal-directed behavior in different trials, without noise in motor actions (aerial view), plotted in the same way as in Figure 3c. The light gray square in (a) indicates the goal area (the goal image is the visual observation at the center of the goal area).

Figure 8: Diversity and efficiency of the learned habitual behavior using different β_z, plotted in the same way as Figure 4a,c.