Introduction

The brain couples its structure with the outside world via sensorimotor experiences1. A posteriori development of neural processing gradually forms our perceptual phenomena, with the a priori nature determined by genes. This development yields well-defined self-experience and helps us confidently situate the self in relation to others. In general, we face two primary types of sensorimotor experience: self-movements and sensory events in the outside world, which may or may not be correlated. Recognition of the difference is thought to underlie self-other distinctions, the sense of self, and the subsequent attribution of agency2,3, but the difference cannot be known a priori. The brain may instead acquire the capacity to modulate neural responses via sensorimotor learning, depending on the context.

A phenomenon called sensory attenuation is recognized as one of the bases of the sense of self, especially the sense of agency2,4. Sensory attenuation refers to an experience in which an exteroception produced by a self-movement is less salient than one produced externally5. A classic example is the difficulty of tickling ourselves. In addition, the ability to ignore visual changes during eye or head movements is thought to help maintain stability of the visual scene. At the neural level, sensory attenuation is observed as reduced brain responses, such as blood oxygen level-dependent (BOLD) responses, especially in sensory areas6,7,8. A line of research shows that sensory attenuation is associated with predicted sensorimotor correlation and is diminished by temporal or spatial mismatch between a movement and the resultant exteroception (e.g., a temporal delay or a trajectory rotation)9,10,11. Several underlying neural functions have been suggested, such as an efference copy of the motor command9,12 and neuromodulation (e.g., dopaminergic transmission)13. However, despite decades of intensive work, a fundamental question remains unanswered: Is sensory attenuation enabled innately, or is it acquired through learning?

Here, we provide a computational explanation, suggesting that a mechanism for sensory attenuation can be self-organized through learning. We focus on the following factors. If a self-movement and a sensory event in the outside world are not correlated, the resultant proprioception and exteroception occur separately. The brain may efficiently use individual sensory areas to represent these individual sensations. On the other hand, if proprioception and exteroception are correlated, they can be best explained as being generated by the same latent cause, i.e., self, rather than being generated by two independent latent states, i.e., self and other. In that case, if the brain can develop a predictive model about the spatio-temporal sensorimotor correlation, it is reasonable to represent the sensorimotor coupling using a multimodal association area, and neural responses in individual sensory areas can be attenuated. That is, through sensorimotor learning, the brain may develop the capacity to predictively modulate sensory-information flow and processing inside hierarchical neural networks, depending on the context, i.e., self generated versus other generated. To test this hypothesis, we conducted a robotic simulation experiment using a variational recurrent neural network (RNN) model based on Bayesian brain theory or free-energy minimization1,14,15,16,17.

Results

Computational model

Our daily sensorimotor behavior generally has some sort of regularity, e.g., a set-point of body posture and dynamic movement patterns, and randomness, e.g., free movement with fluctuations18. Therefore, we considered how a robot agent controlled by an RNN develops sensory attenuation through such spontaneous sensorimotor experiences (Fig. 1). A robot repeatedly generated a random behavior (imprecise movement) and then returned to the set-posture (precise movement) within five seconds (20 time steps for the robot), where the robot’s motion and the movement of an external red object were correlated (self-produced context) or uncorrelated (externally produced context). The received sensations were 3-dimensional proprioception for joint angles and 2-dimensional exteroception for the position of the object. Previous brain imaging studies show that sensory attenuation involves hierarchical interactions among multiple brain regions, including sensory areas, an association area (parietal area), and a higher cognitive area (prefrontal or supplementary motor area)19,20,21. The RNN inside the robot has a corresponding hierarchical structure, referred to as sensory (exteroceptive and proprioceptive), association, and executive areas (Fig. 1a). The hierarchical RNN is a generative model that predicts sensations and infers hidden causes of sensations via variational free-energy minimization16,17. Each area has individual latent variables \(\varvec{z}_{\varvec{t}}\) representing each level of belief about the hidden cause of sensation in the form of Gaussian distributions. Each latent variable has prior \(\varvec{z}^{\varvec{p}}_{\varvec{t}}\) and posterior \(\varvec{z}^{\varvec{q}}_{\varvec{t}}\) distributions that correspond to estimated hidden causes before and after observing sensations, respectively. The priors and posteriors of latent variables are time-varying in sensory and association areas, while the executive area maintains a constant posterior state during sequential prediction generation within a time window (Supplementary Figs. S1–S2). We assumed that the executive area controls (and switches) sequential patterns of lower-level network behavior, like the prefrontal or supplementary motor area in a biological brain22,23. At each time step, the RNN generates predictions \(\varvec{\hat{x}}_{\varvec{t}}\) about the next exteroceptive and proprioceptive sensations \(\varvec{x}_{\varvec{t}}\) from latent variables via recurrent units \(\varvec{d}_{\varvec{t}}\), representing dynamics of the environment. Variational free-energy \(F_{t}\) can be calculated as the sum of the prediction error and the Kullback-Leibler divergence between the posterior and prior at each lth network level.

$$\displaystyle F_{t} =\underbrace{ \frac{1}{2} \left( \varvec{x_{t}}-\varvec{\hat{x}_{t}} \right) ^{2}}_\mathrm{{Prediction \, error}} + \sum _{l=1}^{3}W^{(l)}\underbrace{D_{KL}[q(\varvec{z_{t}^{(l)}}|\varvec{e_{t:T}})||p(\varvec{z_{t}^{(l)}}|\varvec{d_{t-1}^{(l)}})]}_\mathrm{{Influence \, of \, prior \, beliefs}}.$$
(1)

This describes integration of sensory information, i.e., prediction error \(\varvec{e}_{\varvec{t}}\), and prior knowledge, i.e., prior belief \(\varvec{z}^{\varvec{p}}_{\varvec{t}}\), upon a posterior \(\varvec{z}^{\varvec{q}}_{\varvec{t}}\) update, although synaptic weights are also updated during the learning process. The decomposition of variational free-energy in Eq. (1) can also be thought of in terms of accuracy (prediction error) and complexity (the divergence between posterior and prior beliefs). A posterior is only adjusted to minimize prediction errors under the influence of the prior. A weak prior (with low precision) leads to a sensitive posterior update in response to a prediction error, while a strong prior (with high precision) leads to discounting of the prediction error. In other words, prediction errors have a greater effect on posterior beliefs in the presence of a weak, i.e., imprecise, prior. The strength of the prior in each area is parameterized by the standard deviation of a Gaussian prior that is dynamically modulated to minimize the path integral of free-energy. In other words, we equip the model with the capacity to infer differences in prior precision in a way that lends the influence of prior beliefs a context sensitivity. From the available physiological evidence, this precision control may engage local GABAergic mechanisms or neuromodulators, e.g., acetylcholine or dopamine14,24,25. In addition, we introduce a hyper-parameter called the meta-prior \(W^{(l)}\), which controls the meta-level balance between the prediction error and the divergence between the posterior and the prior at each network level. Previous studies have shown that a large meta-prior causes an intrinsically strong prior, whereas a small meta-prior causes an intrinsically weak prior16,17. The inclusion of meta-priors may seem superfluous, given that the precision, i.e., inverse variance, of various beliefs is inferred adaptively. However, meta-priors offer an opportunity to model a failure to infer context-sensitive changes in precision. This particular failure of sensory attenuation may be an important explanation for several neuropsychiatric conditions (see below). We assumed that the meta-prior represents an innately equipped characteristic in the biological brain. In the baseline model, we set the meta-prior to the same value for all network levels (\(W^{(1)}=W^{(2)}=W^{(3)}=0.005\)) to avoid bias in the prediction-error flow.
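For illustration, the per-step objective in Eq. (1) can be sketched as follows, assuming diagonal Gaussian priors and posteriors at each level; the function and variable names (gaussian_kl, free_energy_t, meta_priors) are illustrative and simplify the implementation used in the experiments.

```python
# Minimal numerical sketch of Eq. (1), assuming diagonal Gaussian priors and
# posteriors at each network level; names are illustrative only.
import numpy as np

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL divergence between diagonal Gaussians, summed over dimensions."""
    return np.sum(np.log(sigma_p / sigma_q)
                  + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
                  - 0.5)

def free_energy_t(x, x_hat, posteriors, priors, meta_priors):
    """Per-step free-energy: prediction error plus meta-prior-weighted
    complexity terms, one per level (sensory, association, executive)."""
    prediction_error = 0.5 * np.sum((x - x_hat) ** 2)
    complexity = sum(w * gaussian_kl(mu_q, sigma_q, mu_p, sigma_p)
                     for w, (mu_q, sigma_q), (mu_p, sigma_p)
                     in zip(meta_priors, posteriors, priors))
    return prediction_error + complexity
```

With the baseline setting \(W^{(1)}=W^{(2)}=W^{(3)}=0.005\), the complexity terms of the three levels are weighted equally.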

Emergence of sensory attenuation

This experiment comprises learning and test phases. In the learning phase, the robot learned to reproduce the two types of sensorimotor patterns (Fig. 1b). Target sensorimotor sequences were prepared in advance from human spontaneous behaviors through manual manipulations of a physical robot. We prepared 24 training sequences each for the self-produced and externally produced contexts. We ensured that each of the 24 movement patterns of the robot was the same for both contexts (see Fig. 1b and “Learning” section in the “Methods”). The RNN updated posteriors and synaptic weights to minimize the path integral of free-energy (Supplementary Fig. S1).

In the test phase, the trained robot was required to generate actions by itself and to recognize an environmental context by updating only posteriors to minimize the free-energy, with fixed synaptic weights (Fig. 2a; Supplementary Fig. S2). Action generation of the robot was performed by proportional-integral-derivative (PID) control. A PID controller receives proprioceptive predictions as target joint angles and it changes the joint angles (proprioception) to minimize the error between the current state and target. In this regard, the PID controller implements active inference under the free-energy principle by realizing posterior predictions about movement trajectories15. Figure 2b–e shows an example of a test trial (Supplementary Video S1). The test trial consists of a self-produced context during time steps 0–100 and an externally produced context during time steps 100–200.
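The following minimal sketch illustrates how such a PID controller can realize proprioceptive predictions as target joint angles; the gains, the control step, and the output convention are assumptions for illustration, not the values used in the experiment.

```python
# Illustrative PID controller realizing proprioceptive predictions as target
# joint angles; gains, time step, and output convention are assumptions.
import numpy as np

class PIDController:
    def __init__(self, kp=1.0, ki=0.1, kd=0.05, dt=0.25):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def step(self, target_angles, current_angles):
        """Return updated joint angles that move toward the predicted targets."""
        error = target_angles - current_angles
        self.integral = self.integral + error * self.dt
        derivative = (np.zeros_like(error) if self.prev_error is None
                      else (error - self.prev_error) / self.dt)
        self.prev_error = error
        correction = self.kp * error + self.ki * self.integral + self.kd * derivative
        return current_angles + correction
```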

In the self-produced context, the robot moved the external object by itself, and the object position matched the robot’s hand position (Fig. 2e). The trained RNN successfully reproduced the stochastic property of learned sensorimotor sequences by modulating the association-level latent mean \(\varvec{\mu }^{\varvec{(2)}}\), such that the estimated sigma \(\varvec{\sigma }^{\varvec{(2)}}\) was high during random behavior generation, but low when returning to the set-posture (Fig. 2c). Here, \(\varvec{\mu }\) and \(\varvec{\sigma }\) correspond to the mean and standard deviation of a Gaussian posterior or prior. In other words, they play the role of sufficient statistics of approximate posterior beliefs and prior beliefs that are optimized during learning. Modulation of the latent mean in exteroceptive and proprioceptive areas \(\varvec{\mu }^{\varvec{(1)}}\) was small, with low estimated sigma \(\varvec{\sigma }^{\varvec{(1)}}\) (Fig. 2d), suggesting reduced prediction-error flow into sensory areas with high precision in its prior. These results indicate that the RNN minimized prediction errors produced by self-movements mainly by adjusting the posterior in the association area, since the posterior in sensory areas cannot be easily adjusted because of the strong prior.

Then, at time step 100, the environment was shifted to externally produced contexts, where the object position was given from test data that were not used in the learning phase. The object position and hand position became uncorrelated (Fig. 2e). The environmental change caused a stepwise change in the executive-level latent state \(\varvec{\mu }^{\varvec{(3)}}\) (Fig. 2b). At the same time, in sensory areas, modulation in the latent mean \(\varvec{\mu }^{\varvec{(1)}}\) was amplified and sigma \(\varvec{\sigma }^{\varvec{(1)}}\) increased, showing periodic modulation (Fig. 2d), like the association-level latent state. This shows that the RNN minimized prediction errors from externally produced sensations by adjusting the posterior in sensory areas, as well as in the association area.

Collectively, the hierarchical RNN attenuated neural responses in sensory areas for the self-produced context and amplified them for the externally produced context by proactively controlling precision of the prior at each network level, with the posterior in the executive area working as the information hub that switched the lower-level precision structure and prediction-error flow (Fig. 3). Furthermore, free-energy converged into different states for distinct sensorimotor contexts (Fig. 3). This suggests that a particular free-energy minimum was developed for each sensorimotor context and that the transition of the free-energy state in the network, induced by the abrupt context shift, underlay the qualitative change in free-energy minimization.

For quantitative analysis, we prepared 10 trained networks with different initial synaptic weights and conducted 8 test trials for each trained network. Figure 4a shows the change in the sensory-level posterior mean \(\varvec{\mu }^{\varvec{(1),q}}\) per time step for the two contexts. In this study, it is referred to as a sensory-level posterior response. A paired t-test reported that the posterior response was significantly smaller in the self-produced context than in the externally produced context (\(t(9)=-3.38, p=0.0082\)). In addition, a paired t-test showed that the sensory-level prior sigma \(\varvec{\sigma }^{\varvec{(1),p}}\) in the self-produced context was significantly lower than that in the externally produced context (\(t(9)=-3.32, p=0.0089\)) (Fig. 4b).

Furthermore, we analyzed how attenuation of neural responses in sensory areas developed during the learning process. The RNN first increased sensory-level posterior responses to reconstruct target sensorimotor sequences (Fig. 4c). Then, sensory-level posterior responses in the self-produced context were gradually attenuated in both exteroceptive and proprioceptive areas. The sensory-level prior sigma diminished more in the self-produced context than in the externally produced context through the learning process (Fig. 4d). We confirmed that posterior responses in the association area were similar in self-produced and externally produced contexts (Supplementary Fig. S3a,b), indicating reduced total neural response in the self-produced context. This sensory attenuation was accompanied by recognition of both contexts in the executive area (Supplementary Fig. S3c,d). These results demonstrate emergence of sensory attenuation through a learning process via free-energy minimization. Additional analyses indicated that development of sensory attenuation was diminished by removing neurons in the association or executive area (Supplementary Fig. S4), suggesting the importance of a higher-level representation of sensorimotor correlation.

In addition, we investigated the effects of modulating the meta-prior at each network level (Supplementary Figs. S5–S7). In particular, a small meta-prior in sensory areas or a large meta-prior in the association area led to a deficit in attenuation of the sensory-level posterior response, as well as the sensory-level prior sigma, in the self-produced context. This suggests that innately decreased prediction-error flow into the association area compared to sensory areas disrupted development of sensory attenuation.

Discussion

The current model study, using a hierarchically organized variational RNN, illustrated the possibility that a sensory attenuation mechanism can develop through learning. In the learning task, the robot alternately repeated imprecise movement (random behavior) and precise movement (returning to the set-posture) in both self-produced and externally produced contexts. The RNN developed a hierarchical generative model about how proprioceptive and exteroceptive inputs are generated from the latent causes and also represented the stochastic property by dynamically modulating precisions of hierarchical latent variables, which are individually allocated to each area of the RNN. We found that for dealing with two distinct types of sensorimotor contexts, namely self-produced and externally produced contexts, the network developed two distinct free-energy states (minima) through learning, wherein each free-energy state corresponds to each sensorimotor context (Figs. 3, 4c,d). In the developed network, the top-down and bottom-up pathways functioned as follows. In the top-down pathway, the posterior in the executive area predicted the prior precision of the association area and sensory areas. In the bottom-up pathway, prediction error determined the posterior of sensory areas, the association area, and the executive area under the constraint of the prior precision expected at each area by the top-down pathway. Here, we see that the top-down pathway and the bottom-up pathway created a closed circuit in which the posterior in the executive area served as the information hub. In the closed circuit, a particular free-energy state corresponded to the characteristic top-down precision control and bottom-up prediction-error flow inside the hierarchical RNN.

In the self-produced context, sensory attenuation was achieved by minimizing the free-energy to the corresponding free-energy state. In this context, the posterior in the executive area developed to a particular value and induced high prior precision in sensory areas and low prior precision in the association area, which resulted in less adaptation of the posterior in sensory areas and more adaptation in the association area using the bottom-up error signal. Less adaptation of the posterior in sensory areas corresponds to sensory attenuation. The detailed mechanism is as follows. The hierarchical RNN recognized that proprioceptive and exteroceptive inputs were generated from the same latent cause, i.e., self, and the latent cause, including its stochastic property, was represented in the higher-level areas (association area and executive area). In this case, sensory areas did not need to represent the latent cause at all. Indeed, Fig. 2 shows that the RNN can reconstruct sensory inputs based on the dynamic modulation of latent variables only in the association area. Then, high precision (nearly 0 sigma) of latent variables in the sensory areas is thought to have developed to minimize variational free-energy by reducing the prediction error due to random sampling and reducing the KL divergence between the posterior and the prior.

In an externally produced context, the process was the other way around, wherein the posterior in sensory areas adapted greatly because of the low precision in its prior regulated by the executive area. This corresponds to sensory amplification. In this context, proprioceptive and exteroceptive inputs were generated from individual causes, i.e., self and other. The hierarchical RNN needed to represent not only the higher-level context, i.e., the externally produced context, using the association and executive areas, but also unresolved lower-level information due to individual causes using individual sensory areas. That is why precisions of latent variables in the sensory areas, as well as the association area, were modulated dynamically.

Furthermore, the error induced by the change of the sensorimotor context flowed bottom-up to the executive area and determined the posterior in the executive area with its prior set to a neutral value. This triggered the transition from one free-energy state to another, i.e., from sensory attenuation to sensory amplification and vice versa. In short, precision structures for sensory attenuation in self-produced contexts and sensory amplification in externally produced contexts were self-organized in one hierarchical RNN and were switched via executive control. This suggests that the hierarchical RNN developed the functionality of switching between quasi-deterministic and highly stochastic dynamics in each local area, in which sensory attenuation was characterized by quasi-deterministic processing (nearly 0 sigma) in the sensory areas and highly stochastic processing in the association area, while sensory amplification was characterized by highly stochastic processing in both the sensory areas and the association area. This sort of development and transition of distinct free-energy states provides insights into how perceptual phenomena emerge from dynamic brain-body-environment interaction in the face of uncertainty.

The sensory attenuation observed in our model is an emergent property of variational free-energy minimization, rather than a consequence guaranteed by the equations used in the proposed model. Indeed, a small meta-prior in sensory areas led to a deficit in development of sensory attenuation, in which prediction error and free-energy for training data were smaller than those of the baseline model (see Supplementary Figs. S5 and S8). This shows that sensory attenuation did not necessarily develop even when proprioceptive and exteroceptive sensory inputs were correlated and the model precisely reconstructed the sensory inputs based on variational free-energy minimization. Instead, development of sensory attenuation depends on a generative model of the world self-organized by the hierarchical RNN. In particular, the following factors suggest that development of sensory attenuation required a sort of abstract-level recognition that proprioceptive and exteroceptive inputs are generated from the same latent cause, i.e., sensorimotor coupling, which was achieved by a balanced interaction between the top-down prediction with precision and the bottom-up prediction error. First, reduced numbers of latent variables in the association area or executive area disrupted development of sensory attenuation (Supplementary Fig. S4). Second, a large meta-prior in the association area, as well as a small meta-prior in the sensory areas, i.e., reduced prediction-error flow into the association area, led to a deficit in development of sensory attenuation (Supplementary Figs. S5–S6). These findings show the importance of higher-level structure learning for developing sensory attenuation. Furthermore, the difference in the dimensions of proprioception (3-dimensional joint angles) and exteroception (2-dimensional object position) required an abstract-level representation for recognizing the sensorimotor coupling. Such an abstraction of sensorimotor information may be a reasonable consequence of efficient coding, but it is not an obvious outcome of self-organization by the hierarchical RNN through learning. This kind of emergent phenomenon self-organized through learning has not been thoroughly investigated in previous studies based on the free-energy principle. Thus, we think that this complex property emphasizes the significance of our neurorobotic approach.

There have been prior model proposals to account for sensory attenuation. One proposal13, following the free-energy principle, postulates that sensory attenuation is caused by reducing the precision of the bottom-up prediction error to the sensory area during movement. This model, however, does not explain the involvement of the higher executive area, as evidenced by previous studies19,20. We confirmed that removal of the executive area diminished the development of sensory attenuation (Supplementary Fig. S4b), emphasizing the contribution of the frontal function. This is consistent with biological studies suggesting that signals from the frontal area, such as the supplementary motor area, predictively control the relative precision or intensity of sensations, and that their disruption diminishes sensory attenuation19,26. Furthermore, the previous model intermixes two phenomena: sensory attenuation and sensory gating. Sensory attenuation concerns the comparison of the intensity of self-produced sensations with that of externally produced sensations, i.e., the distinction between self and other. On the other hand, sensory gating refers to a suppression process in which exteroceptions feel weaker during movement than at rest27. In our learning experiment, movements of the robot in self-produced contexts and externally produced contexts were the same, avoiding confusion with sensory gating. In this sense, our model considered only sensory attenuation (self-other distinction).

According to another leading hypothesis, the pathway of an efference copy of the motor command is thought to originate from top-down signals19,26. In contrast, our model suggests that signals from the executive (frontal) area may represent predictive signals for controlling prediction-error flow inside the hierarchical network, rather than a copy of the motor command. We showed that this functionality, in which top-down signals hierarchically control the bottom-up prediction-error flow that in turn modulates the top-down signals, can be self-organized through learning.

The perspective that the sensory attenuation mechanism is a consequence of learning instead of an innate function may be indirectly supported by a recent study suggesting that a target stimulus of sensory attenuation can be adaptively changed through rapid learning28. In addition, our result (Fig. 4c) may explain the increase in sensory attenuation with age in adults20. Furthermore, our model suggests that proprioception, as well as exteroception, can be attenuated when a self-movement produces exteroception. In fact, a neuroimaging study observed less cerebellar activity when a movement produced a tactile stimulus than when it did not6.

There have been some recent advances in our understanding of movement-related suppression processes, including the finding of a difference between sensory attenuation and sensory gating. One ongoing discussion concerns the difference between “physiological sensory attenuation” and “perceptual sensory attenuation”. A recent study suggests that “physiological sensory attenuation” and “perceptual sensory attenuation” have different neurophysiological correlates29. In that study, “physiological sensory attenuation” was measured as a decrease in the amplitude of primary and secondary components of the somatosensory evoked potential (SEP) by comparing it during movement and at rest; thus, it may be related to sensory gating. On the other hand, “perceptual sensory attenuation” was measured in a force-matching paradigm and was suggested to be related to a decrease in prediction-error-related neural activity, such as gamma-oscillatory activity. Importantly, “perceptual sensory attenuation” negatively correlated with scores of delusional ideation (a measure of schizotypy), while no significant correlations were found between attenuation of SEP components and scores of delusional ideation. In our experiment, we focused on attenuation of the prediction-error-related posterior response; thus, our model may explain “perceptual sensory attenuation” and its neurocomputational mechanism. In addition, our model of sensory attenuation suggests that a decrease in prediction-error-related activity and “perceptual sensory attenuation” may represent self-other distinction, which has implications for mechanisms underlying delusional ideation.

In addition, a recent study demonstrated an enhancement, not suppression, of the intensity of predicted action outcomes by investigating the effects of action on the intensity of tactile stimuli reported by participants in a force-judgment paradigm30. First, the authors replicated typical findings that self-produced tactile stimuli are rated as less intense than externally produced stimuli when active contact with a button generates the tactile stimulus. However, this effect reversed when there was no active finger contact with a button. In additional experiments, they controlled the predictability of tactile action outcomes and found that expected events were perceived more, not less, intensely than unexpected events. Note that this additional experiment compared cases with more predictable and less predictable outcomes produced by one’s own action. Since this comparison did not contrast stimuli produced by one’s own action with those produced by another’s action, this additional experimental result does not conflict with our claim. These results in the previous study30 may show that when sensory attenuation does not appear, an action can enhance expected touch. This enhancement effect is consistent with the basic Bayesian idea that a predictive brain focuses on precise (predictable) stimuli to ignore uninformative (uncertain) noise. In this sense, the previous study suggests that a theory of sensory attenuation requires additional considerations beyond the basic Bayesian or active inference framework. Here, our model provides a particular mechanism for sensory attenuation beyond the basic Bayesian theory. Specifically, our results suggest that sensory attenuation requires an abstract-level recognition that proprioceptive and exteroceptive inputs are generated from the same latent cause, i.e., self, which is developed through context-sensitive optimization of hierarchical latent variables. This emphasizes the importance of developmental aspects of hierarchical predictive processing.

Alterations in development of sensory attenuation were induced by modulation of the meta-prior, a hyper-parameter determining the intrinsic strength of the prior compared to the prediction error at each network level. In the basic Bayesian model, the practitioner must manually set the prior, including its precision, to compute the posterior. However, in our model, the prior in each network area is epigenetically self-organized through learning, with the influence of the higher meta-level parameter, i.e., the meta-prior. We assumed that the meta-prior is an innate characteristic determining developmental features of the prior and prediction-error flow in the biological brain, although its neural substrate remains unspecified. In our experiment, a large meta-prior in the association area, i.e., an intrinsically strong prior in the association area, led to reduced sensory attenuation, i.e., a weak prior in sensory areas, while a small meta-prior in the association area, i.e., an intrinsically weak prior in the association area, did not impair development of sensory attenuation (Supplementary Fig. S6). These results may provide insights into relationships between strong prior hypotheses and reduced sensory attenuation in schizophrenia9,13,31,32 and between weak prior hypotheses and normal sensory attenuation in autism spectrum disorder33,34,35,36,37. It will be important to seek corresponding neurological evidence of the proposed computational mechanism for development of sensory attenuation and to explore its relevance to psychiatric disorders in future studies.

In summary, we have shown that sensory attenuation is an emergent property of free-energy minimization in active inference. Sensory attenuation is a ubiquitous phenomenon in psychology and psychophysics. Furthermore, it is receiving increasing attention in computational psychiatry, in which a failure of sensory attenuation may explain several neuropsychiatric conditions, ranging from Parkinson’s disease to schizophrenia and autism13,34. Technically, we have simulated behavior in terms of active inference, which can be thought of as a generalization of control and planning as inference. Crucially, we trained our robots to perform active inference by optimizing connection weights (\(\varvec{w}\)) and adaptive variables (\(\varvec{a}\)) in hierarchically organized RNNs, which played the role of a hierarchical generative model. The implicit learning of connection strengths can be thought of as “amortization” or “learning to infer”. In other words, by minimizing (the path or time integral of) variational free-energy, robots were able to learn the mapping from sensory inputs to posterior beliefs about the causes of their sensory inputs. These posterior beliefs recognize the context in which they are operating and generate proprioceptive predictions that are realized by a PID controller. Because certain beliefs concern precision of inferred sensorimotor trajectories, this enables a context sensitive optimization of certain precisions that underwrite sensory attenuation. Put simply, when robots recognized that the proprioceptive and exteroceptive inputs are best explained by externally generated movement, they reduced the precision afforded representations of self generated sensations. Conversely, when proprioceptive and exteroceptive sensory inputs were best explained by self-generated sensations, precision of these explanations increased, thereby reducing the influence of sensory prediction errors. This is the basis of sensory attenuation. Note that this context-sensitive optimization of precision, i.e., encoding of uncertainty, is an emergent property of minimizing the path integral of free-energy. In other words, sensory attenuation emerges from a generative model that provides the best explanation for proprioceptive and exteroceptive inputs, when accumulating free-energy or model evidence (a.k.a., marginal likelihood) over time.

Methods

Neural network model

To simulate development of sensory attenuation, we utilized a predictive-coding-inspired variational recurrent neural network (PV-RNN), which represents a generative process of sensation from a hidden cause in the environment, based on the free-energy principle (FEP)16,17. It consists of sensory (exteroceptive and proprioceptive), association, and executive areas in which there are deterministic neurons and latent (stochastic) neurons. Latent neurons represent belief about the cause of sensation as Gaussian distributions (for simplicity). Each latent state has prior and posterior probability distributions that correspond to estimated hidden causes before and after observing sensations, respectively. Based on the latent states, the PV-RNN generates predictions about the next sensations in a top-down way. Here, deterministic neurons transform latent states into sensory predictions via synaptic connections that represent relationships between sensations and their causes. The PV-RNN uses a multiple timescale RNN (MTRNN)38 as the transformation function. An RNN represents temporal processing of the brain in that neural activity is determined by the past history of neural states. Owing to their capacity to learn to reproduce complex dynamic behaviors, RNNs have been used in computational modeling and developmental neurorobotic studies to understand cortical processing and cognitive functions, including psychiatric symptoms38,39,40,41,42,43. In addition, MTRNNs have a multiple timescale property in neural activation, which enables them to represent a temporal hierarchy in the environment, as observed in the biological brain44. By using a PV-RNN, we can learn complex sensorimotor behaviors that have both temporal regularity and stochasticity16,17. To perform spontaneous behaviors like daily human behaviors, the neural network controlling the robot was required to process short-term random movements and a long-term periodic pattern, as in our experiment, in which the multiple timescale property is thought to facilitate sensorimotor learning. In the following sections, we describe the mathematical details of top-down prediction generation and bottom-up parameter updates by the PV-RNN (Supplementary Figs. S1–S2).

Prediction generation

Prediction generation is performed in a top-down way through the network hierarchy. The internal state \(h_{t,i}^{(s)}\) and output \(d_{t,i}^{(s)}\) of the ith deterministic neuron of the sth target sequence at time step t \(\left( t \ge 1\right)\) are calculated as

$$\begin{aligned} h_{t,i}^{(s)} = \left\{ \begin{array}{ll} \displaystyle \frac{1}{\tau }\Biggl (\sum _{j\in I_\mathrm{Ad}}w_{ij}d_{t-1,j}^{(s)} +\sum _{j\in I_\mathrm{Az}}w_{ij}z_{t,j}^{(s)}+\sum _{j\in I_{\mathrm{Cz}}}w_{ij} z_{t,j}^{(s)}+b_{i}\Biggr )+\Biggl (1-\frac{1}{\tau }\Biggr )h_{t-1,i}^{(s)} \;\left( i \in I_{\mathrm{Ad}}\right) , \vspace{10truept}\\ \\ \displaystyle \frac{1}{\tau } \;\Biggl (\sum _{j\in I_\mathrm{Ed}}w_{ij}d_{t-1,j}^{(s)} +\sum _{j\in I_\mathrm{Ez}}w_{ij}z_{t,j}^{(s)}+\sum _{j\in I_\mathrm{Ad}}w_{ij}d_{t,j}^{(s)}+b_{i}\Biggr )+\Biggl (1-\frac{1}{\tau }\Biggr )h_{t-1,i}^{(s)} \; \left( i \in I_{\mathrm{Ed}}\right) , \vspace{10truept}\\ \\ \displaystyle \frac{1}{\tau }\Biggl (\sum _{j\in I_\mathrm{Pd}}w_{ij}d_{t-1,j}^{(s)} +\sum _{j\in I_\mathrm{Pz}}w_{ij}z_{t,j}^{(s)}+\sum _{j\in I_\mathrm{Ad}}w_{ij}d_{t,j}^{(s)}+b_{i}\Biggr )+\Biggl (1-\frac{1}{\tau }\Biggr )h_{t-1,i}^{(s)} \; \left( i \in I_{\mathrm{Pd}}\right) . \end{array} \right. \end{aligned}$$
(2)
$$d_{t,i}^{(s)}= {} \tanh {\left( h_{t,i}^{(s)}\right) } \left( i \in I_{\mathrm{Ed}},I_{\mathrm{Pd}},I_{\mathrm{Ad}}\right) .$$
(3)

Here, \(I_{\mathrm{Ed}}\), \(I_{\mathrm{Pd}}\), and \(I_{\mathrm{Ad}}\) are index sets of deterministic neurons in the exteroceptive area, proprioceptive area, and association area, respectively. \(I_{\mathrm{Ez}}\), \(I_\mathrm{Pz}\), \(I_{\mathrm{Az}}\), and \(I_{\mathrm{Cz}}\) are index sets of latent neurons in the exteroceptive, proprioceptive, association, and executive areas, respectively. \(w_{ij}\) is the weight of the synaptic connection from the jth neuron to the ith neuron; \(z_{t,j}^{(s)}\) is the output of jth latent (posterior) neuron at time step t; \(\tau\) is the time constant of the neuron; and \(b_{i}\) is the bias of the ith neuron. A deterministic neuron with a small time constant \(\tau\) has a tendency to change its activity rapidly, while that with a large time constant has a tendency to change its activity slowly. We set the initial internal states of the deterministic neurons \(h_{0,i}^{(s)} \left( i \in I_\mathrm{Ed},I_{\mathrm{Pd}},I_{\mathrm{Ad}}\right)\) to 0 (\(d_{0,i}^{(s)}\) is also 0).
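For illustration, the leaky-integrator update in Eqs. (2) and (3) can be sketched for a single area as follows; the weight matrices, biases, and per-neuron time constants are placeholders, and the third input (d_topdown) stands in for the association-level deterministic state (for sensory areas) or the executive-level latent state (for the association area).

```python
# Simplified sketch of Eqs. (2)-(3) for one area; parameters are placeholders,
# not trained values.
import numpy as np

def mtrnn_step(h_prev, d_prev, z, d_topdown, W_dd, W_dz, W_dtop, b, tau):
    """One deterministic update: leaky integration of recurrent input, latent
    input, and the top-down input, followed by tanh."""
    synaptic_input = W_dd @ d_prev + W_dz @ z + W_dtop @ d_topdown + b
    h = synaptic_input / tau + (1.0 - 1.0 / tau) * h_prev   # tau may be a per-neuron vector
    return h, np.tanh(h)
```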

The latent variable \(\varvec{z}\) in each area is assumed to follow a multivariate Gaussian distribution with a diagonal covariance matrix, meaning \(z_{t,i}^{(s)}\) and \(z_{t,j}^{(s)}\) are independent (\(i,j \in I_{\mathrm{Ez}},I_{\mathrm{Pz}},I_{\mathrm{Az}} \wedge i \ne j\)). Here, the mean \(\mu _{t,i}^{(s),p}\) and sigma (standard deviation) \(\sigma _{t,i}^{(s),p}\) of the prior distribution \(p(z_{t,i}^{(s)})\) in the exteroceptive, proprioceptive, and association areas are calculated from the previous deterministic state (prior experience) of the same area.

$$\displaystyle p(z_{t,i}^{(s)})= {} p\left(z_{t,i}^{(s)}|d_{t-1,j}^{(s)}\right)= \mathcal {N}\left(z_{t,i}^{(s)};\mu _{t,i}^{(s),p},\sigma _{t,i}^{(s),p}\right).$$
(4)
$$\mu _{t,i}^{(s),p}= {} \tanh {\left( \sum _{j}w_{ij}d_{t-1,j}^{(s)} \right) }, \vspace{10truept}$$
(5)
$$\sigma _{t,i}^{(s),p}= {} \exp {\left( \sum _{j}w_{ij}d_{t-1,j}^{(s)} \right) }.$$
(6)

Here, \(\left( i \in I_{\mathrm{Ez}} \wedge j \in I_{\mathrm{Ed}}\right) \vee \left( i \in I_{\mathrm{Pz}} \wedge j \in I_{\mathrm{Pd}}\right) \vee \left( i \in I_{\mathrm{Az}} \wedge j \in I_{\mathrm{Ad}}\right)\). Note that by optimizing the weights \(\varvec{w}\) with respect to the path integral of free-energy, we are effectively optimizing prior beliefs about sensorimotor contingencies and contexts. This can be regarded as a form of structure learning through experience. The executive area has a prior distribution of \(\mathcal {N}(0,1)\) only at the initial step (\(t=1\)) because it has a constant posterior state during sequential prediction generation, with the objective of assigning a specific executive-level posterior to each target sequence.
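A minimal sketch of the prior computation in Eqs. (5) and (6) is given below; separate weight matrices for the mean and sigma are assumed for clarity, and the exponential mapping keeps sigma positive.

```python
# Sketch of Eqs. (5)-(6): the prior statistics of each area are read out from
# the previous deterministic state of the same area; weights are placeholders.
import numpy as np

def prior_from_deterministic(d_prev, W_mu, W_sigma):
    mu_p = np.tanh(W_mu @ d_prev)       # Eq. (5)
    sigma_p = np.exp(W_sigma @ d_prev)  # Eq. (6); exp keeps sigma positive
    return mu_p, sigma_p
```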

The posterior distribution in each area is calculated as,

$$\begin{aligned} q(z_{t,i}^{(s)}|\varvec{e_{t:T^{(s)}}^{(s)}}) = \left\{ \begin{array}{ll} \displaystyle q(z_{t-1,i}^{(s)}|\varvec{e_{t-1:T^{(s)}}^{(s)}}) = q(z_{1,i}^{(s)}|\varvec{e_{1:T^{(s)}}^{(s)}})=\mathcal {N}(z_{1,i}^{(s)};\mu _{1,i}^{(s),q},\sigma _{1,i}^{(s),q}) \; \left( i \in I_\mathrm{Cz}\right) , \\ \\ \displaystyle \mathcal {N}(z_{t,i}^{(s)};\mu _{t,i}^{(s),q},\sigma _{t,i}^{(s),q}) \; \left( i \in I_\mathrm{Ez}, I_\mathrm{Pz},I_\mathrm{Az}\right) . \end{array} \right. \end{aligned}$$
(7)
$$\mu _{t,i}^{(s),q} = \tanh {\left( a_{t,i}^{(s),\mu } \right) }, \vspace{10truept}$$
(8)
$$\sigma _{t,i}^{(s),q} = \exp {\left( a_{t,i}^{(s),\sigma } \right) }, \vspace{10truept}$$
(9)
$$z_{t,i}^{(s)} = \mu _{t,i}^{(s),q} + \sigma _{t,i}^{(s),q} \times \epsilon .$$
(10)

Here, \(T^{(s)}\) represents the length of the sth target sequence. \(\varvec{a}\) is the adaptive internal state of neurons representing posterior distributions and it is updated at each time step and for each target sequence during the learning process (or each time step through online inference). Note that Eqs. (8) and (9) mean that the adaptive variables implicitly encode both posterior expectations and precision, such that optimizing the adaptive variables with respect to variational free-energy implicitly optimizes posterior expectations and the posterior confidence or precision afforded those expectations. Adaptive variables \(\varvec{a}_{\varvec{t}}\) are determined by prediction error signals \(\varvec{e}_{\varvec{t:T}}\) propagated by a back-propagation-through-time (BPTT) algorithm, meaning that the posterior of the latent state can be considered a prediction-error-related neural state. Adaptive variables \(\varvec{a}\) are initialized by the corresponding initial internal states of the neurons representing prior distributions before the learning or inference process. Based on the posterior calculation, the latent state \(z_{t,i}^{(s)}\) is obtained by sampling \(\epsilon\) from \(\mathcal {N}(0,1)\).
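The posterior computation in Eqs. (8)–(10) can be sketched as follows; because the noise \(\epsilon\) is sampled independently of the adaptive variables, gradients of the free-energy can flow through the posterior mean and sigma (the reparameterization trick). Names are illustrative.

```python
# Sketch of Eqs. (8)-(10): posterior statistics are read out from the adaptive
# variables a, and a latent value is drawn with the reparameterization trick.
import numpy as np

def posterior_sample(a_mu, a_sigma, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    mu_q = np.tanh(a_mu)            # Eq. (8)
    sigma_q = np.exp(a_sigma)       # Eq. (9)
    epsilon = rng.standard_normal(np.shape(mu_q))
    z = mu_q + sigma_q * epsilon    # Eq. (10)
    return z, mu_q, sigma_q
```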

Finally, predictions about exteroceptive and proprioceptive sensations are individually generated from exteroceptive and proprioceptive areas, respectively.

$$\begin{aligned} \hat{x}_{t,i}^{(s)} = \left\{ \begin{array}{ll} \displaystyle \tanh {\left( \sum _{j\in I_{\mathrm{Ed}}}w_{ij}d_{t,j}^{(s)} \right) } \left( i \in I_{\mathrm{Eo}} \right) , \vspace{10truept}\\ \\ \displaystyle \tanh {\left( \sum _{j\in I_{\mathrm{Pd}}}w_{ij}d_{t,j}^{(s)} \right) } \left( i \in I_{\mathrm{Po}}\right) . \end{array} \right. \end{aligned}$$
(11)

Here, \(I_{\mathrm{Eo}}\) and \(I_{\mathrm{Po}}\) are index sets of output neurons for exteroceptive and proprioceptive predictions, respectively.

Parameter updates via free-energy minimization

The concept of FEP14 derives from the fundamental fact that self-organizing biological agents must maintain a limited repertoire of sensory states to remain alive, e.g., a human stays on the ground, not in the sea. Based on information theory, this notion can be formulated as suppression of the surprise (or the negative log-evidence) for sensations \(\varvec{x}\) over time. In the PV-RNN, the surprise over all time steps and target sequences can be written as:

$$\begin{aligned} \displaystyle \sum _{s \in I_{\mathrm{S}}}-\log p(\varvec{x_{1:T^{(s)}}^{(s)}})= & {} \sum _{s \in I_{\mathrm{S}}}\left[ -\log \int p(\varvec{x_{1:T^{(s)}}^{(s)}},\varvec{z_{1:T^{(s)}}^{(s)}})d\varvec{z_{1:T^{(s)}}^{(s)}}\right] \end{aligned}$$
(12)
$$\begin{aligned}= & {} \sum _{s \in I_{\mathrm{S}}}\left[ -\log \int \prod _{t=1}^{T^{(s)}} \left[ p(\varvec{x_{t}^{(s)}}|\varvec{d_{t}^{(s)}})p(\varvec{z_{t}^{(s)}}|\varvec{d_{t-1}^{(s)}})\right] d\varvec{z_{1:T^{(s)}}^{(s)}}\right] . \end{aligned}$$
(13)

Here, \(I_{\mathrm{S}}\) denotes the index set of target sequences. However, surprise cannot be directly evaluated by the agent because it needs to know all hidden states \(\varvec{z}\) of the environment that cause sensations, as described on the right side of Eq. (12). Here, FEP introduces a tractable quantity, free-energy, that bounds the surprise, and minimization of the surprise is replaced by minimization of the free-energy. The bound of the surprise in the PV-RNN can be derived by utilizing Jensen’s inequality for a concave function f: \(f(E[x])\ge E[f(x)]\). For clarity, summation over target sequences is temporarily omitted in the following equations. Then, Eq. (13) can be rewritten as follows by introducing a dummy distribution \(q(\varvec{z}_{\varvec{1}})\) for \(\varvec{z}_{\varvec{1}}\).

$$\begin{aligned} \displaystyle -\log p(\varvec{x_{1:T}})= & {} -\log \int _{\varvec{z_{1}}}d\varvec{z_{1}}q(\varvec{z_{1}}) \int _{\varvec{z_{2:T}}}d\varvec{z_{2:T}} \frac{1}{q(\varvec{z_{1}}) }\prod _{t=1}^{T} \left[ p(\varvec{x_{t}}|\varvec{d_{t}})p(\varvec{z_{t}}|\varvec{d_{t-1}})\right] \end{aligned}$$
(14)
$$\le {} - \int _{\varvec{z_{1}}}d\varvec{z_{1}}q(\varvec{z_{1}}) \log \int _{\varvec{z_{2:T}}}d\varvec{z_{2:T}} \frac{1}{q(\varvec{z_{1}}) }\prod _{t=1}^{T} \left[ p(\varvec{x_{t}}|\varvec{d_{t}})p(\varvec{z_{t}}|\varvec{d_{t-1}})\right] .$$
(15)

In Eq. (15), we use Jensen’s inequality for a logarithmic function. The same procedure is done for \(t=2:T\).

$$\begin{aligned} \displaystyle -\log p(\varvec{x_{1:T}}) \le - \int _{\varvec{z_{1}}}d\varvec{z_{1}}q(\varvec{z_{1}}) \cdots \int _{\varvec{z_{T}}}d\varvec{z_{T}}q(\varvec{z_{T}}) \log \prod _{t=1}^{T} \left[ \frac{p(\varvec{x_{t}}|\varvec{d_{t}})p(\varvec{z_{t}}|\varvec{d_{t-1}})}{q(\varvec{z_{t}})} \right] \end{aligned}$$
(16)
$$= - \int _{\varvec{z_{1}}}d\varvec{z_{1}}q(\varvec{z_{1}}) \cdots \int _{\varvec{z_{T}}}d\varvec{z_{T}}q(\varvec{z_{T}}) \sum _{t=1}^{T} \left[ \underbrace{\log p(\varvec{x_{t}}|\varvec{d_{t}})}_{\mathrm{{A}}} - \underbrace{\log \frac{q(\varvec{z_{t}})}{p(\varvec{z_{t}}|\varvec{d_{t-1}})}}_{\mathrm{{B}}} \right] .$$
(17)

The first term in expression (17) is the expected negative log-likelihood under q given that \(\varvec{d_{t}}\) depends on \(\varvec{z_{1:t}}\).

$$\displaystyle \mathrm {A}= {} - \int _{\varvec{z_{1}}}d\varvec{z_{1}}q(\varvec{z_{1}}) \cdots \int _{\varvec{z_{T}}}d\varvec{z_{T}}q(\varvec{z_{T}}) \sum _{t=1}^{T} \log p(\varvec{x_{t}}|\varvec{d_{t}})$$
(18)
$$= {} - \sum _{t=1}^{T} E_{q(\varvec{z_{1:t}})} \left[ \log p(\varvec{x_{t}}|\varvec{d_{t}})\right] .$$
(19)

In addition, the second term can be rewritten in terms of the Kullback–Leibler divergence (KLD) between \(q(\varvec{z_{t}})\) and \(p(\varvec{z_{t}}|\varvec{d_{t-1}})\), noting that \(\varvec{d_{0}}\) is independent of \(\varvec{z_{1:T}}\).

$${\mathrm{B}} = \int _{\varvec{z_{1}}}d\varvec{z_{1}}q(\varvec{z_{1}}) \cdots \int _{\varvec{z_{T}}}d\varvec{z_{T}}q(\varvec{z_{T}}) \sum _{t=1}^{T} \log \frac{q(\varvec{z_{t}})}{p(\varvec{z_{t}}|\varvec{d_{t-1}})}$$
(20)
$$= {} \int _{\varvec{z_{1}}}d\varvec{z_{1}}q(\varvec{z_{1}})\log \frac{q(\varvec{z_{1}})}{p(\varvec{z_{1}}|\varvec{d_{0}})} + \int _{\varvec{z_{1}}}d\varvec{z_{1}}q(\varvec{z_{1}}) \cdots \int _{\varvec{z_{T}}}d\varvec{z_{T}}q(\varvec{z_{T}}) \sum _{t=2}^{T} \log \frac{q(\varvec{z_{t}})}{p(\varvec{z_{t}}|\varvec{d_{t-1}})}$$
(21)
$$= {} D_{KL}[q(\varvec{z_{1}})||p(\varvec{z_{1}}|\varvec{d_{0}})] + \sum _{t=2}^{T} E_{q(\varvec{z_{1:t-1}})} \left[ D_{KL}[q(\varvec{z_{t}})||p(\varvec{z_{t}}|\varvec{d_{t-1}})] \right] .$$
(22)

A dummy distribution \(q(\varvec{z_{t}})\) can be replaced by the posterior distribution determined by the back-propagated prediction error \(q(\varvec{z_{t}}|\varvec{e_{t:T}})\). Thus, using Eqs. (19) and (22), the bound of the surprise can be written as,

$$\begin{aligned} -\log p(\varvec{x_{1:T}}) \le&- \sum _{t=1}^{T} E_{q(\varvec{z_{1:t}}| \varvec{e_{t:T}})} \left[ \log p(\varvec{x_{t}}|\varvec{d_{t}})\right] + D_{KL}[q(\varvec{z_{1}} | \varvec{e_{1:T}} )||p(\varvec{z_{1}}|\varvec{d_{0}})] + \\&\sum _{t=2}^{T} E_{q(\varvec{z_{1:t-1}} | \varvec{e_{t-1:T}})} \left[ D_{KL}[q(\varvec{z_{t}} | \varvec{e_{t:T}})||p(\varvec{z_{t}}|\varvec{d_{t-1}})] \right] . \end{aligned}$$
(23)

In the experiment, we approximate the expectations of the negative log-likelihood and the KLD under q with a single sample to reduce computational cost.

$$\displaystyle -\log p(\varvec{x_{1:T}})\le {} \sum _{t=1}^{T} \left( -\log p(\varvec{x_{t}}|\varvec{d_{t}}) + D_{KL}[q(\varvec{z_{t}} | \varvec{e_{t:T}})||p(\varvec{z_{t}}|\varvec{d_{t-1}})] \right) .$$
(24)

Eventually, by introducing the meta-prior and considering the summation over different target sequences, the free-energy F (the bound of the surprise) in the PV-RNN is formulated as:

$$\displaystyle F= {} \sum _{s \in I_{\mathrm{S}}} \sum _{t=1}^{T^{(s)}} F_{t}^{(s)}$$
(25)
$$\begin{aligned} F_{t}^{(s)}= & {} - \underbrace{ \log p(\varvec{x_{t}^{(s)}}|\varvec{d_{t}^{(s)}})}_{\mathrm{{Accuracy}}} + \sum _{l=1}^{3}W^{(l)} \underbrace{ \left( D_{KL}[q(\varvec{z_{t}^{(l),(s)}} | \varvec{e_{t:T}^{(s)}})||p(\varvec{z_{t}^{(l),(s)}}|\varvec{d_{t-1}^{(l),(s)}})] \right) }_{\mathrm{{Complexity}}}. \end{aligned}$$
(26)

Here, \(W^{(l)}\) denotes the meta-prior at lth network level. The first term in expression (26) (negative accuracy term) is the negative log-likelihood. For simplicity, we assume that each sensory state follows a Gaussian distribution with unit variance. Then, the first term becomes just the prediction error between the real \(\varvec{x_{t}^{(s)}}\) and predicted \(\varvec{\hat{x}_{t}^{(s)}}\) sensations (plus a constant term, omitted here).

$$\displaystyle -\mathrm{{Accuracy}} = \sum _{i \in I_{\mathrm{Eo}} \vee I_\mathrm{Po}} \frac{1}{2} \left( x_{t,i}^{(s)}-\hat{x}_{t,i}^{(s)} \right) ^{2}.$$
(27)

This assumption sets the precision of the prediction error to a constant value and makes it easy to consider the relative precision of prior beliefs compared to prediction errors. In the experiment, we divided the accuracy term by the dimension of each exteroceptive and proprioceptive sensation.

On the other hand, the second term (complexity term) is the KLD between the posterior and prior distributions of the latent variables. Note that variables of the posterior are updated through both the accuracy and complexity terms, but those of the prior are updated only through the complexity term. Therefore, the complexity term represents only the influence of prior beliefs, which are controlled by the meta-prior W. Under the assumption that the prior and posterior distributions follow a multivariate Gaussian distribution with a diagonal covariance matrix, as described above, the KLD is computed analytically as16,17,

$$\begin{aligned} \displaystyle \mathrm{{Complexity}} = \sum _{i} \left( \log \frac{\sigma _{t,i}^{(s),p}}{\sigma _{t,i}^{(s),q}} + \frac{(\mu _{t,i}^{(s),p} - \mu _{t,i}^{(s),q})^2 + (\sigma _{t,i}^{(s),q})^2}{2(\sigma _{t,i}^{(s),p})^2} - \frac{1}{2} \right) . \end{aligned}$$
(28)

Here, \(i \in I_{\mathrm{Ez}} \vee I_{\mathrm{Pz}}\) (if \(l=1\)), \(i \in I_\mathrm{Az}\) (if \(l=2\)), and \(i \in I_{\mathrm{Cz}}\) (if \(l=3\)). In the experiment, we divided the complexity term by the dimension of latent variables for each area. Note that if \(l=3\) (executive area), the complexity term exists only at \(t=1\), although we write the equation in a general way for simplicity.

In the learning phase (Supplementary Fig. S1), synaptic weights \(\varvec{w}\) and adaptive variables \(\varvec{a}\) are updated to minimize the free-energy over all time steps and target sequences as,

$$F = \sum _{s \in I_{\mathrm{S}}} \sum _{t=1}^{T^{(s)}}F_{t}^{(s)}.$$
(29)

In the test phase after learning (Supplementary Fig. S2), only adaptive variables \(\varvec{a}\) are updated, while synaptic weights are fixed. In this phase, the free-energy within a short time window H is summed as

$$F = \sum _{t^{'}=t-H+1}^{t}F_{t^{'}}.$$
(30)

Using the summed free-energy, adaptive variables \(\varvec{a_{t-H+1:t}}\) within the time window of all areas are updated, whereas the time window slides as the network time step t is incremented.
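For illustration, the sliding-window objective in Eq. (30) can be sketched as below; the per-step free-energy values are assumed to be given by the computation described in Eq. (26).

```python
# Sketch of Eq. (30): sum the per-step free-energy over the window [t - H + 1, t];
# only the adaptive variables inside the window are optimized with weights fixed.
def windowed_free_energy(per_step_free_energy, t, H=10):
    start = max(0, t - H + 1)   # clip the window at the first time step
    return sum(per_step_free_energy[s] for s in range(start, t + 1))
```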

In both learning and test phases, we used the Adam optimizer45 for parameter updates, where the partial derivative of the free-energy with respect to each parameter is calculated by the BPTT algorithm.

Experimental environment

We set a 3-axis robotic arm in a simulated square space \([-1,1]\times [-1,1]\). The lengths of the robot’s links are 0.1, 0.3, and 0.5. Each joint angle was limited to range from 0 to \(\pi\) [rad] and normalized to range from \(-0.8\) to 0.8 to match the range of the neural network output. In addition, the PV-RNN receives the 2-dimensional position of a red object as an exteroceptive sensation. During task execution, the robot receives 5-dimensional sensations every 250 ms.

Learning

The PV-RNN learned to reproduce target sensorimotor sequences prepared in advance. First, we recorded 24 sequences of joint angles while the experimenter manually manipulated the left arm of a physical robot (Rakuda, developed by Robotis). For each sequence, the experimenter performed 10 repetitions generating a random movement and returning to the set posture within five seconds (20 time steps). Therefore, the length of each sequence is 200 time steps. We used joint angles from left shoulder pitch, left shoulder roll, and the left elbow of Rakuda as 3-dimensional proprioception data of the simulated robot arm. Next, we prepared exteroception data for a self-produced context by setting the object position as the hand position of the simulated arm robot that was calculated by forward kinematics using the recorded joint angle data. In this fashion, we obtained 24 target sequences for self-produced contexts in which exteroceptive and proprioceptive sensations are correlated. Finally, we prepared target sequences for externally produced contexts by shuffling the combination of exteroceptive and proprioceptive sequences. By doing this, we obtained 24 target sequences for externally produced contexts in which exteroceptive and proprioceptive sensations are not correlated. This shuffling procedure ensures that total numbers of changes in sensations are the same for self-produced and externally produced contexts in the learning phase. The PV-RNN learned to reproduce the prepared 48 training datasets via free-energy minimization.
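As an illustration of this dataset construction, the following sketch computes self-produced exteroception by planar forward kinematics of the 3-link arm (link lengths 0.1, 0.3, and 0.5) and builds the externally produced pairs by shuffling; the angle convention, the use of unnormalized joint angles, and the permutation scheme are simplifying assumptions.

```python
# Sketch of the dataset construction: exteroception for the self-produced
# context is the hand position from forward kinematics; the externally
# produced context shuffles the exteroceptive/proprioceptive pairings.
import numpy as np

LINK_LENGTHS = np.array([0.1, 0.3, 0.5])

def hand_position(joint_angles):
    """Planar forward kinematics: cumulative joint angles give link directions;
    the hand position is the 2-D end point of the last link."""
    cumulative = np.cumsum(joint_angles)
    return np.array([np.sum(LINK_LENGTHS * np.cos(cumulative)),
                     np.sum(LINK_LENGTHS * np.sin(cumulative))])

def build_datasets(proprio_sequences, rng=None):
    """proprio_sequences: array of shape (24, 200, 3) with recorded joint angles."""
    rng = np.random.default_rng(0) if rng is None else rng
    extero_self = np.array([[hand_position(q) for q in seq]
                            for seq in proprio_sequences])
    # Externally produced context: shuffle which exteroceptive sequence is
    # paired with which proprioceptive sequence, removing the correlation.
    perm = rng.permutation(len(proprio_sequences))
    extero_external = extero_self[perm]
    return (proprio_sequences, extero_self), (proprio_sequences, extero_external)
```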

Online inference

We prepared an additional 8 exteroceptive sequences as test data that were used in externally produced contexts in the test phase. Before a test trial, initial states of adaptive variables \(\varvec{a_{1}}\) in all areas were set to the median values obtained for 24 training datasets of the self-produced context developed through learning. Based on the initial posterior states corresponding to self-produced contexts, the robot first moved the object by itself during time steps 0–100. The robot controlled its joint angles via active inference using the PID controller, for which proprioceptive predictions were used as target joint angles. Then, during time steps 100–200, the environmental context was shifted to the externally produced context, although the robot kept generating spontaneous behaviors by itself via active inference. In the externally produced context, the object position was set from test data. The goal of the robot was to flexibly recognize the environmental change by updating adaptive variables via free-energy minimization. The online inference process was performed based on an interaction between top-down prediction generation and bottom-up posterior updates. In the top-down prediction generation process, the PV-RNN generated sensory predictions \(\varvec{\hat{x}_{t-H+1:t}}\) corresponding to time steps from \(t-H+1\) to t, based on the posterior of latent states \(\varvec{z_{t-H+1:t}}\). In the bottom-up modulation process, the free-energy at each time step in time window H was calculated using prediction errors for exteroception and proprioception, for which target sensations were the real object position and joint angles from \(t-H+1\) to t. Adaptive variables \(\varvec{a_{t-H+1:t}}\) in the time window were updated to minimize the free-energy summed over time steps, and sensory predictions within the time window were re-generated using the updated posterior states. By repeating top-down prediction generation and bottom-up posterior updates for a certain duration, the PV-RNN generated the prior of latent states for time step \(t+1\). The generated prior was used to initialize the posterior for time step \(t+1\), and predictions about sensations for time step \(t+1\) were generated from the posterior. Using proprioceptive predictions for time step \(t+1\) as the target joint angles, the robot moved the joint angles using the PID controller. At the same time, the robot received exteroceptive sensations at time step \(t+1\). After that, the robot’s time step was incremented and the online inference process was performed for the newly received sensations. This inference process, in which recognition and prediction in the past are reconstructed from current sensations, corresponds to a “postdiction” process.
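The online inference loop can be summarized schematically as follows; the pvrnn, robot, and pid interfaces (predict_window, free_energy, update_posterior, predict_next, observe, and so on) are hypothetical placeholders, and window bookkeeping is simplified.

```python
# Schematic of the online ("postdiction") inference loop; interfaces are
# hypothetical stand-ins, not the actual implementation.
def online_inference(pvrnn, robot, pid, total_steps=200, H=10, n_updates=50):
    observations = [robot.observe()]        # joint angles and object position
    for t in range(total_steps):
        start = max(0, t - H + 1)
        # Bottom-up posterior updates: repeat top-down prediction and free-energy
        # minimization within the sliding window, with synaptic weights fixed.
        for _ in range(n_updates):
            predictions = pvrnn.predict_window(start, t)
            F = pvrnn.free_energy(predictions, observations[start:t + 1])
            pvrnn.update_posterior(F)       # Adam step on adaptive variables only
        # Generate the prior for step t + 1, initialize its posterior, and predict.
        extero_pred, proprio_pred = pvrnn.predict_next()
        # Active inference: realize the proprioceptive prediction via PID control,
        # then receive the new sensation from the environment.
        robot.set_joint_angles(pid.step(proprio_pred, robot.joint_angles()))
        observations.append(robot.observe())
    return observations
```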

Parameter settings

The dimension of the latent variables \(\varvec{z}\) in the exteroceptive and proprioceptive areas was 1, and that in the association area was 3, so the total dimension of latent neurons in the sensory and association areas matched the sensory dimension. The dimension of the latent variable in the executive area was 1. A preliminary experiment showed that a smaller number of latent neurons in the association area led to a lower level of sensory attenuation, supporting the idea that sensory attenuation is a consequence of representing sensorimotor correlation in the association area (Supplementary Fig. S4a). In addition, we confirmed that removing the executive-level latent state strongly decreased the level of sensory attenuation, suggesting an important role of the executive-level latent state (Supplementary Fig. S4b). The numbers of deterministic neurons in the exteroceptive, proprioceptive, and association areas were all 15. In a preliminary experiment, we evaluated the development of sensory attenuation with 10, 15, or 20 deterministic neurons and confirmed that the setting of 15 neurons showed the largest sensory attenuation (Supplementary Fig. S9a). As the simplest multiple-timescale setting, we set the time constant \(\tau\) of half of the deterministic neurons (8 neurons) to 2 and that of the remaining neurons (7 neurons) to 4. We used the same multiple-timescale property for both the sensory and association areas, because we expected the PV-RNN to control which levels of the network hierarchy are used to represent sensations, depending on the context. Indeed, the experimental results show that the PV-RNN mainly used the association area in the self-produced context and both the sensory and association areas in the externally produced context, even though the timescale properties of the sensations were the same for the two contexts. In a preliminary experiment, we confirmed that the multiple-timescale setting led to greater sensory attenuation than a single-timescale setting with \(\tau =2\) for all deterministic neurons (Supplementary Fig. S9b). Synaptic weights were initialized with random values using the default method implemented in PyTorch. Biases of deterministic neurons were initialized with, and fixed to, random values drawn from a Gaussian distribution N(0, 10), whose variance is close to the firing-threshold variability found in biological neurons46, as well as to the optimal value in a spiking neural network model47 and a recurrent neural network model41. We trained 10 networks with different initial synaptic weights for each hyper-parameter setting for quantitative evaluations. In the learning phase, parameters including synaptic weights \(\varvec{w}\) and adaptive variables \(\varvec{a}\) were updated 200,000 times with the Adam optimizer, using the same settings as in the original paper: \(\alpha = 0.001\) (learning rate), \(\beta _{1}=0.9\), and \(\beta _{2}=0.999\). In the test phase, adaptive variables \(\varvec{a}\) were updated 50 times at each time step for a time window of length \(H = 10\) with \(\alpha = 0.09\). We chose this test-phase setting from the combinations of \(\alpha =\{0.001, 0.005, 0.01, 0.03, 0.05, 0.07, 0.09, 0.1, 0.2, 0.3, 0.4, 0.5\} \times H=\{10, 15\}\) by evaluating levels of prediction error in the baseline meta-prior setting (\(W^{(1)}=W^{(2)}=W^{(3)}=0.005\)).
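
For reference, the short sketch below summarizes the deterministic-unit settings in code, using a standard MTRNN-style leaky-integrator update. The update equation, variable names, and recurrent-weight scaling are illustrative assumptions, not the exact implementation used in the study.

```python
import torch

torch.manual_seed(0)
D_NEURONS = 15                                     # deterministic units per area
tau = torch.tensor([2.0] * 8 + [4.0] * 7)          # fast half (tau = 2) and slow half (tau = 4)
bias = torch.randn(D_NEURONS) * 10 ** 0.5          # fixed biases drawn from N(0, 10), variance 10
W = torch.randn(D_NEURONS, D_NEURONS) / D_NEURONS ** 0.5   # assumed recurrent weight scaling

def mtrnn_step(u_prev):
    """One leaky-integrator update: a larger tau gives slower, smoother dynamics."""
    d_prev = torch.tanh(u_prev)
    return (1 - 1 / tau) * u_prev + (1 / tau) * (W @ d_prev + bias)

u = torch.zeros(D_NEURONS)
for _ in range(20):                                 # 20 time steps, one behavioral cycle
    u = mtrnn_step(u)
```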

Statistical analysis

We used paired t-tests for statistical analyses of network behaviors, such as the posterior response and the level of the prior sigma. All statistical tests were two-tailed, and the significance level was set at \(p<0.05\). Because this study is an original computational simulation experiment, it was difficult to estimate the effect size in advance, and no statistical methods were used to pre-determine sample size. Given the high reproducibility of computational simulations, we chose the minimum sample size that seemed statistically testable (10 samples). Indeed, even with 10 samples, paired t-tests reported clear differences between the self-produced and externally produced contexts (e.g., \(t(9)=-3.38; p=0.0082\) for posterior response and \(t(9)=-3.32; p=0.0089\) for prior sigma). Therefore, we concluded that a larger sample size would not have substantially changed our main result. Data analyses were conducted using R software (version 3.3.2).
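
For illustration, a paired t-test of this kind can be reproduced as below with synthetic per-network values; the numbers are invented for demonstration only, and the reported analyses themselves were carried out in R.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical per-network averages (10 trained networks) of sensory-level posterior response.
post_self = rng.normal(0.20, 0.05, size=10)                  # self-produced context
post_external = post_self + rng.normal(0.08, 0.03, size=10)  # externally produced context

t_stat, p_val = stats.ttest_rel(post_self, post_external)    # two-tailed paired t-test, df = 9
print(f"t(9) = {t_stat:.2f}, p = {p_val:.4f}")
```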

Figure 1

Training setting. (a) The hierarchical recurrent network updates synaptic weights and posterior beliefs in all areas via minimization of free-energy, determined by the prediction error (PE) and the Kullback-Leibler divergence (KLD) between the posterior and the prior. (b) Spontaneous sensorimotor patterns learned by the recurrent network.

Figure 2

Test setting and an example result. (a) Online inference of latent states and action generation by a PID controller, with synaptic weights fixed. (b) Executive-level posterior distribution. (c) Association-level posterior and prior distributions. (d) Sensory-level posterior and prior distributions in exteroceptive and proprioceptive areas. (e) Prediction error (PE) and real sensations for exteroception and proprioception. For clarity, proprioception is represented as the 2-dimensional hand position, although the actual proprioception comprises 3-dimensional joint angles.

Figure 3

The self-organized mechanism for the transition between sensory attenuation in self-produced contexts and sensory amplification in externally produced contexts. For each sensorimotor context, a particular free-energy minimum in the network was developed through learning, characterized by the corresponding top-down precision control and bottom-up prediction-error (PE) flow. Consequently, in self-produced contexts, the recurrent network adjusted mainly the posterior in the association area \(\varvec{z}^{\varvec{(2),q}}\) to minimize prediction errors, by increasing the precision of the priors in sensory areas \(\varvec{z}^{\varvec{(1),p}}\) and decreasing the precision of the prior in the association area \(\varvec{z}^{\varvec{(2),p}}\). On the other hand, in externally produced contexts, it adjusted the posteriors in both sensory and association areas (\(\varvec{z}^{\varvec{(1),q}}\) and \(\varvec{z}^{\varvec{(2),q}}\)) by decreasing the precision of the priors in both sensory and association areas. The shift of the sensorimotor context triggered modulation of the posterior in the executive area \(\varvec{z}^{\varvec{(3),q}}\) as well as the transition from one free-energy state to another in the network (bottom time series). This induced a qualitative change in network behavior and a shift between attenuating and amplifying prediction-error-induced responses in sensory areas. The black line in the free-energy time series shows the 20-step moving average (values within the first 19 steps are averages from the initial to the current step). The red line is the average over the first 100 steps (self-produced context) or the last 100 steps (externally produced context). Values are averages over 80 test trials performed by 10 different networks.

Figure 4

Sensory-level neural response in the test phase and its development in the learning phase. (a) Box-and-whisker plots showing the difference in sensory-level posterior response between self-produced and externally produced contexts. (b) Box-and-whisker plots showing the difference in sensory-level prior sigma between the two contexts. In (a,b), each plot is an average over 100 time steps, 2 sensory areas, and 8 test trials for each of 10 trained networks (center line, median; cross, mean; box limits, upper and lower quartiles; whiskers, 1.5\(\times\) interquartile range). **\(p<0.01\). (c) Development of the posterior response for the two contexts in the exteroceptive and proprioceptive areas. (d) Development of sensory-level prior sigma. In (c,d), each value is averaged over 200 time steps, 24 training datasets, and 10 different networks in each learning epoch.