Exploration-based model learning with self-attention for risk-sensitive robot control

Model-based reinforcement learning for robot control offers the advantages of overcoming concerns on data collection and iterative processes for policy improvement in model-free methods. However, both methods use exploration strategy relying on heuristics that involve inherent randomness, which may cause instability or malfunction of the target system and render the system susceptible to external perturbations. In this paper, we propose an online model update algorithm that can be directly operated in real-world robot systems. The algorithm leverages a self-attention mechanism embedded in neural networks for the kinematics and the dynamics models of the target system. The approximated model involves redundant self-attention paths to the time-independent kinematics and dynamics models, allowing us to detect abnormalities by calculating the trace values of the self-attention matrices. This approach reduces the randomness during the exploration process and enables the detection and rejection of detected perturbations while updating the model. We validate the proposed method in simulation and with real-world robot systems in three application scenarios: path tracking of a soft robotic manipulator, kinesthetic teaching and behavior cloning of an industrial robotic arm, and gait generation of a legged robot. All of these demonstrations are achieved without the aid of simulation or prior knowledge of the models, which supports the proposed method ’ s universality for various robotics applications.


INTRODUCTION
The kinematics and the dynamics of robots are essential elements for precision control that guarantee robust and stable task completion [1][2][3] .Most robot control schemes rely on models for different tasks, including motion optimization, task scheduling, prioritized control, and the construction of accurate models has thus been one of the top priorities in conventional control.The models for rigid-bodied robots typically rely on the geometries and the energy method, which estimate their shapes and positions 1,[4][5][6][7] .However, calculating the dynamic features of every component is often tedious and always involves errors in the model [8][9][10][11][12][13][14][15] .Alternatively, the techniques of black-box model identification using machine intelligence have been proposed for obtaining models without considering mechanical characteristics.Moreover, its control strategy can also be trained by employing reinforcement learning (RL).Meanwhile, model-based RL methods are frequently studied in robotics applications due to the efficiency of data usage and avoiding computational burdens induced by iterative policy evaluation and improvement [16][17][18][19][20][21][22][23][24] .
Despite the advantages, the control policy generated using model-based methods still gives rise to unexpected motions or undesired output behaviors during policy learning.Thus, it is risky to apply RL directly to real-world robots without preprocessing in a simulated environment 15,[25][26][27][28][29] .This has led to the enhancement of RL methods that can be implemented in real-world applications, including simulation-to-real (sim2real) methods 13,30 , uncertainty-based modeling 31 , exploration techniques [32][33][34] , and robust optimal control 35,36 .For instance, adding Kullback-Leibler divergence to the optimization objective limits sudden changes in the desired trajectory 18 .Employing a Gaussian process-based model provides the degrees of uncertainties during model learning for determining the exploration level or the robustness of the controller 37,38 .Domain randomization 39,40 or adversarial training 41 also provides robustness by learning from various situations or incorporating worst-case scenarios [40][41][42] .
In this paper, we demonstrate how the self-attention mechanism can be employed in model learning for real-world robotics applications without simulation.Two types of models, namely the kinematics and the dynamics models, depict the behavior of robot motion.Our method utilizes a self-attention layer in both models to overcome the issues that can arise from the direct application of RL methods in a real-world environment.Specifically, the selfattention layer in the kinematics model determines the exploration region by adjusting the cost function for the robot's movement range.On the other hand, the self-attention in the dynamics model detects possible perturbations during the learning process and manages the quality of the dataset.
Assuming trajectory tracking control using model-based RL, the cost function can be defined as the sum of errors between the reference and the executed trajectory.During the initial optimization phase, the resulting control policy could be inaccurate due to a poorly learned model.Using the control policy directly on the robot may be inadequate since the output behavior of the control is uncertain.To address this limitation, we propose the incorporation of a self-attention layer into the kinematics model, aiming to enhance the control performance and provide more predictable behaviors 43,44 .Although the kinematics model is not explicitly dependent on time, constructing a self-attention chain based on time allows us to look closer on the data with low self-referential rates, which are considered as abnormal data or regions that are insufficiently learned.The scaling of the desired trajectory can be determined from the self-attention matrix of the kinematics model.
Additionally, by constructing a self-attention chain for dynamics, we can detect perturbations in the robot caused by unintended external forces.The dynamics model learned in our proposed method considers the relationship between the control inputs and the robot's configuration states while excluding the effects of external forces [45][46][47] .Consequently, the addition of external forces leads to a decrease in self-referential rates within the self-attention matrix of the dynamics model.
To implement the proposed self-attention mechanism, the time series of the input is connected to the encoder network, which is connected to the decoder in series, using feedforward neural networks (FNNs).The self-attention layer exists between the encoder and the decoder networks, as shown in Fig. 1b.If a single FNN connecting the input and the output at a single timestep (Fig. 1a) is cloned and arranged in parallel over time, and the selfattention layer is connected between the outputs of the networks (Fig. 1b), then the self-attention matrix will eventually be a diagonal matrix when the model is fully trained (Fig. 1c).In other words, each input component primarily refers to itself.Now, imagine the procedure mentioned earlier, i.e., the first-time exploration and confronting perturbations.The corresponding input-output pair will never be shown to the network models.In this case, referring to itself will not be beneficial since the abnormal input data are not sufficiently learned to represent the hidden features.Instead, the self-attention mechanism will refer to the other inputs to extract meaningful features; in other words, a macroscopic change will occur in the self-attention matrix (Fig. 1d).Thus, by analyzing the trace values of the self-attention matrix for the kinematics and dynamics models, we can adjust the exploration task space or detect disturbances.That is, a sudden change in the kinematics attention matrix results in freezing the searching space and commanding the robot to move in its proximity.On the other hand, a change in the dynamics attention matrix prevents the current dataset from being learned, since it may have been contaminated by disturbances.
We evaluated the performance for each network structure and learning method through an ablation study in a robot simulation environment.In addition, we applied the proposed algorithm to actual robot systems, showing that they were able to learn the local models in real time and complete target tasks.More specifically, our algorithm was used to teach an industrial robot arm to play the piano, control the motion of a soft robotic manipulator and generate a waking gait of a quadrupedal robot.Each robot was able to perform its desired task within 5 min in real time without any assistance from the simulated environment or prior knowledge of the robot.The results indicate that the proposed algorithm can be used universally in various robotic applications.

Algorithm overview
The overall architecture of the robust online model identification algorithm consists of various types of neural networks and their derivative operations with detection modules.Figure 2 schematically summarizes the flow of the algorithm, indicating that it consists of a clockwise outer loop and multiple branch lines.
When dealing with the limited knowledge of a robot's kinematics and dynamics, the algorithm aims to effectively control the robot to achieve a predefined task.However, there are three essential pieces of information that must be accessible either through sensors or computations.First, information regarding the energy used for robot actuation, such as the magnitude of the applied force, the input current to the motor, or the input pressure to the pneumatic or hydraulic system should be available.Second, the robot's joint state resulting from actuation, commonly referred to as the configuration-space state, needs to be measured using encoders or similar sensors.Lastly, the task-space state should correspond directly to the ongoing task being performed.The goal of the algorithm is to ensure that the robot's task state is in line with the desired trajectory, using these three types of information, typically obtained through path planning.Throughout the rest of this paper, we will specifically refer to these three components as the control input, the configuration state, and the task state, respectively.The aforementioned three types of information are used as variables of the kinematics and the dynamics models.Specifically, the kinematics encompasses the correlation between the configuration state and the task state, whereas the dynamics model captures the interplay between the control input and the configuration state.Both of these models adopt a neural network architecture, as shown in Fig. 1a.At each timestep, new input and state information is acquired, which is then stored in a replay buffer to further train the neural network approximations of the kinematics and the dynamics models.
Detailed computation process of outer loop of Fig. 2 is explained from now.First, the robot receives control inputs and executes them for a single timestep (Fig. 2a).Then, the agent observes the robot's states, namely the configuration and task states (Fig. 2b).Using the observed states and given inputs, the kinematics and dynamics equation components (matrices J, M, C, and g) are computed (Fig. 2c).The agent then prepares the next target task state that was predefined from path planning (Fig. 2d).Finally, the control inputs for the next timestep are determined from the model-based feedback control law, reflecting the desired task state, the kinematics model and the dynamics model (Fig. 2e).The next target task states should be prepared in advance from planning algorithms, such as tree-based searching or demonstration by kinesthetic teaching 48,49 .
When constructing a control law in the outer loop (Fig. 2e), we employ the approximated kinematics and dynamics models.A twostep control process is used; first, from the desired task state x t+1,des , a desired configuration state q t+1,des is derived from the kinematics model, and then the actual control input τ t+1,des is obtained from the dynamics model.However, the approximated model remains uncertain due to insufficient learning, particularly due to their blackbox nature.This may yield peculiar output values in regions that are insufficiently trained, or with the existence of perturbation.
Two branch lines indicating internal operations, colored green and blue, is the realization of incorporating safe-aware control design.The blue line updates the kinematics and dynamics models.
Input-output tuples for training are sampled from the replay buffer to avoid temporal correlations and for regularization (Fig. 2f).The two branch lines are shown in green, considering both the kinematics and the dynamics self-attention models, as shown in Fig. 1b.The self-attention matrix of the kinematics model constrains the task space by adjusting the damping factor of the kinematic Jacobian matrix (Fig. 2g).This process changes the possible range of q t,des even if the range of desired x t,des is intact.On the other hand, the self-attention matrix of the dynamics model detects any external perturbations and passes the result to the blue line, determining whether the current input-output tuple is valid or not (Fig. 2h).If an anomaly is detected, then the current input-output tuple is rejected and not included in the replay buffer.
In summary, a total of four neural networks are used in this algorithm.There is one network for modeling the kinematics and another for modeling the dynamics.Additionally, there are selfattention networks of the kinematics and the dynamics that have been expanded to be used for determination of the task space determination and robust masking, respectively.In the following sections, we will provide a detailed explanation of these networks.

Network structures and objectives
Kinematics model.The kinematics model in the algorithm (Fig. 1) were fitted with a function approximator based an FNN.To explain the model, we note that three measurable variables are used: the input τ 2 R m , the configuration state q 2 R m , and the task state x 2 R m .
The rigid-body kinematics make a connection between the configuration state and the task state.Their forward kinematics can be expressed as x = f(q), where f : R m !R n maps from the configuration state to the task state.Its differential form in continuous-time domain is generally given as where an overdot means the derivative with respect to time, and J 2 R n m is the kinematic Jacobian matrix.We use a discretized Fig. 2 The overall schematic of the algorithm that consists of the main loop for execution of the robot (yellow line) and the calculation through the agent (red line).The kinematics and the dynamics network are trained using a radial basis neural network sampled from the replay buffer (blue line).The self-attention network for both models is utilized for the robust masking and constraining of the task space.a Control input is applied to the real-world robot.b Input, state, and output are collected and then sent to the replay buffer.c Forward model prediction is done for calculating self-attention matrix.d Next desired task-space trajectory is given.e Control law is generated, reflecting the updated model from f. From self-attention matrices, g task-space determination and h rejection of perturbed dataset are possible.
time-domain kinematics as where T 2 R þ is a sampling interval for discretization that does not raise aliasing.Rearranging the discretized form in terms of x t+1 yields where we replace first terms as J 0 for the final result.
Results of the discretized form of Kinematics is approximated by employing a radial basis neural network (RBNN) that contains a single hidden layer with a radial-basis activation function in its structure.The proposed structure can be interpreted as a concatenation of a local linear model.We analytically define the hidden features as where S 2 R m m is the positive-definite covariance matrix, c i 2 R m is the center of basis, and a total of N features are given so that Φ 2 R N where its components are given with a subscript i.
The center c i is a trainable variable which is first initialized by sampling from the uniform distribution, where the covariance matrix is defined as a constant matrix S ¼ 2I (see Supplementary Document 1, Appendix A for analysis).The output layer of the RBNN is the weighted sum of the hidden features.Now, the Jacobian is expressed as where ϕ J;k 2 R n N for k ¼ k2Z f j0 k mg is trainable weights for the functions J and J 0 .The usage of the radial basis activation function also prevents unexpected peaks during the initial phase of online learning.The in-depth analysis will be discussed in a later section.We use the subscript expression J ΦJ and J 0ΦJ so the Jacobian model is approximated using the trainable weights.
A replay buffer is used to store the data τ; q; x and train the model for every timestep.Using the stored variables q t and q t+1 and the kinematics model Eq.(3), prediction xtþ1 can be derived.The objective for kinematics model is a mean-squared-error (MSE) between xtþ1 and x t+1 , as Dynamics model.Similar to the kinematics model, dynamics model shows the evolution of the configuration state from the input.The vector equation of dynamics model is given as where an overdot means the derivative with respect to time, M 2 R m m is the inertia matrix, C 2 R m m is the Coriolis matrix, and g 2 R m is the vector term for compensating gravity, in which M; C, and g are unknown and need to be identified using an FNN.Acceleration and velocity terms are also transformed into discretetime format.In particular, the following model is used for learning: where we do not make the equation into affine format like kinematics model.Since the Coriolis term uses both q t and q t−1 for the input, the feature vector Φ used in the kinematics model is expanded to handle both terms as where S 0 2 R 2m ð Þ ð2mÞ is the positive-definite covariance matrix, and c 0 i 2 R 2m is the center of basis.Finally, the approximation for M; C and g is given as where N are trainable weights for matrices M, C, and g, respectively.
The optimization objective for the dynamics model is given as where τtþ1 is defined in Eq. ( 8), and τ t+1 ,q t ,q t−1 ,q t+1 are restored from the replay buffer.
Kinematics self-attention model.We define the self-attention network for the kinematics, which considers the original kinematics model but expand its network structure to utilize time correlation using self-attention structure.After then, the self-attention score of the kinematics model is used for determining the task space, constraining the robot's reachable area as shown in Fig. 2g.The self-attention network for the kinematics uses the configuration state q as an input that passes through the encoder layer, as shown in Fig. 1a.Then, a total of k duplicates of the encoder layer are constructed, in which the input of each encoder layer becomes an input q from time t to time t − k, as shown in Fig. 1b.The duplicated encoder layers share the hidden weights and the gradients.In the matrix form, a total of k + 1 recent configuration states are concatenated to be expressed as a matrix q tÀk:t 2 R m ðkþ1Þ .The matrix passes through a fully connected encoder layer (θ E,kin ) and a position-wise encoding layer (θ PE ), indicating that each input originates from a different timestep.The weight parameters for the fully connected encoder layer and the position-wise encoding layer are given as where θ PE is not trainable but originates from the transformer structure, indicating the relative position of each vector component 39 .The results of encoding are then given as (13) in which the encodings are for query, key, and value, respectively, for the self-attention layer, and the operator Cat k+1 means that the input column vector is concatenated k + 1 times so that Cat kþ1 b θE;kin À Á becomes a matrix.In a typical transformer structure, different weight matrices are multiplied to the encoding of query, key, and value to prevent an excessive self-attention score, given to itself.In contrast, our approach aims to give an excessive selfattention score to itself since the kinematics model can accurately infer the output from its own input without time-correlation.
The encoding matrix now passes through a self-attention structure.The self-attention matrix and the output of the selfattention layer are given as Attn Score;kin q enc;Q ; q enc;K À Á ¼ softmax Attn Out;kin q enc ð Þ ¼ q enc;V Attn Score;kin q enc;Q ; (14) which is a dot-product attention with a scalable parameter d 50 .Finally, the self-attention output passes through the decoder layer which consists of two fully connected layers, predicting the task state given as (15) where ReLU refers to the rectified linear unit activation function 51 .Similar to the encoder layer, the duplicated decoder layers share the weights and the gradients.
Optimization objective for the kinematics self-attention network is composed of two terms.First, an MSE is used to calculate difference between the predicted xtÀk:t and the ground truth x t −k:t .Next, discrepancy between the self-attention score matrix and the identity matrix is added, written as Lðx tÀk:t ; x tÀk:t Þ :¼ jjvecðx tÀk:t À x tÀk:t Þjj 2 þtr ðAttn Score;kin ðq enc;Q ; q enc;K Þ À IÞðAttn Score;kin ðq enc;Q ; q enc;K Þ À IÞ > h i ; (16) where the operator vec(⋅) is a vectorization of the given matrix.Minimizing only the MSE for output prediction ultimately means learning the kinematics model without considering self-attention connection.When using only the former term as the learning criterion, the inclusion of a self-attention layer becomes redundant.Consequently, the learning outcome of the self-attention network solely based on MSE minimization can yield multiple local optima.Our objective is to converge towards a scenario where all self-attention results become self-referential, representing one of the local optimal solutions.As previously discussed, this is accomplished by ensuring that the encoding weights for query, key, and value are uniformly applied, prioritizing higher self-attention scores when referring to the same time step rather than other time steps.Furthermore, if the selfattention score decreases, i.e., if the input that depends on the information from other time steps rather than one's own encoding information is entered, we treat it as an abnormal situation.
Dynamics self-attention model.The aforementioned selfattention structure can be applied to the dynamics model as well.In the dynamics model, we map from the input τ to the configuration state q.In particular, Eqs. ( 12)-( 16) can be reorganized for the dynamics self-attention network as in which the input of the encoding layer is now changed into τ.
Then, the attention score and the attention output can be written as (19) and the prediction layer as (20) where we predict q for the output of the dynamics self-attention network.Finally, the optimization objective of the dynamics selfattention network is given as Utilizing self-attention structure for control and learning Task space determination by a self-attention kinematics network.
We aim to utilize the changes in values of the self-attention matrix over time to monitor the learning status of the kinematics network and to detect abnormal situations, such as incomplete learning or perturbations.This will help us expand or narrow the task space of the robot accordingly.The kinematics of the robot is given as x t+1 = J 0 (q t ) + J(q t )q t+1 under an assumption of discretization, the current state x t is only dependent on the variables x t−1 , q t , and q t−1 .
Returning to the self-attention network, if the model is sufficiently learned minimizing the objective Eq. ( 16), then the self-attention matrix is similar to the diagonal matrix (see Fig. 1c).In other words, a query of the specific time does not need information far from itself.However, if the robot moves slightly out of the area where it was previously moving, the new data that have not been seen in the training dataset become part of the input of the neural network.As a result, the encoded feature for the new input may produce a random output value since the new input has not been trained before, and the self-attention matrix will then no longer be diagonal.This outcome now serves as an indicator of the new region in the task space that has not been frequently visited.
Utilizing the property of the self-attention matrix, we design a way of constraining the task space region to ensure safe exploration and avoid unexpected movements of the robot.Let our desired task trajectory be given as x 1:T;des 2 R n T Our optimization objective is to make the robot follow the desired trajectory, but depending on the learning process, we restrict the robot's range of motion using the following setting: min q tþ1;des where we define λ t 2 R þ as a safety factor.Setting λ t → 0 yields the desired configuration state to recover the desired task state.On the other hand, leveraging λ t gives penalty when the desired configuration state becomes larger (i.e., the robot moves in a larger region).A solution for the optimization is then given as which is a linear-quadratic optimization solution 2 .We notice that the larger λ t makes the task space the smaller, and the smaller λ t makes the robot the more easily move to the other space.Thus, simply setting λ t inversely proportional to the trace value of the self-attention matrix can be a solution.However, this does not handle a long-term effect, where the safety factor should finally decrease to zero to complete the task, but the trace value of the selfattention matrix has an upper limit of k + 1 (i.e., the softmax operator makes the sum of the column be equal to 1).In other words, the trace value walks sideways but is still capable of detecting the outliers.To address this issue, we suggest the following update rule: α t :¼ tr Attn Score;kin q enc;Q ; q enc;K À Á À Á ; where a constant γ ¼ fγ 2 Rjγ< À 1g is called a leveraging factor.For example, if α t increases by 1%, then λ t decreases by |γ|%, following the inverse relationship.
The proposed update rule expresses a long-term decay of λ t by the volatility drag effect.Since modeling the behavior of α t is nontrivial, making it difficult to predict the behavior, but will be related to the current value, let α t follow the geometric Brownian motion with a stochastic differential equation (SDE) as with μ; σ 2 R, and W t is Wiener noise.Then, the N-step solution of the λ t , considering the SDE and Eq. ( 24), is given as where T 2 R þ is the real-time interval between timesteps.The solution indicates that even if the index value α t goes sideways with a rectangular pattern, the long-term value of λ t will decrease and asymptotically converge to zero.The safety factor still captures the local behavior of α t that observes the abnormal data in the task space.
Perturbation detection by a self-attention dynamics network.We considered the dynamics self-attention network to filter out the perturbed training data.Note that this method captures inputoutput outliers and is used to train the kinematics and the dynamics networks but not related to robust control.The filtering method uses a simple threshold.If the trace value of the dynamics attention score matrix exceeds a certain threshold, the dataset of that time is included in the replay buffer as where δ is the perturbation buffer time and ϵ is the threshold to determine the outlier.If it does not exceed the threshold, it is considered to have been affected by external disturbances, such as external forces, which correspond to the term in the dynamics equation that cannot be observed, as likely given f ext in If a perturbation is detected, then δ samples after the perturbation are neglected since the robot needs time to recover to the stable point.In addition, since the kinematics attention score α t is also affected by the perturbation, we do not update the safety factor using Eq. ( 26) while the perturbation is detected.New data is stored in the replay buffer, which has a predetermined fixed size.Once the stored data exceeds the predetermined size, the oldest data is replaced with the new incoming data in a sequential manner.If a threshold is set and only the datasets that meet the condition are added to the replay buffer, many data may not be stored in the replay buffer in the very early stages of training.However, since the same encoding feature is used for query, key, and value, even in the early stages where the learning is not well-established, the trace value is expected to reach a certain level.As a result, the threshold value ϵ in Eq. ( 27) is determined to be at the average level of the initial short period.
Control strategy.From the planned desired task state x t,des , we find the desired configuration state q t,des from Eq. ( 23) (Fig. 2e).Then, we define the state error e t ≔ q t − q t,des and the augmented velocity error r t :¼ _ e t þ Λe t with a positive-definite gain matrix Λ 2 R m m .Then, we can define a passivity-based control as ) with a positive-definite gain matrix K, to make the closed-loop dynamics as where Γ is the residual term induced by the modeling error of the neural network approximation 2 .If Γ = 0, then the error exponentially converges to zero.However, the existence of the model error guarantees the boundedness of the error, given the boundedness of Γ.Hence, feedback control will be heavily dependent on the feedback terms in Eq. ( 29).To avoid unexpected movements induced by the feedforward terms, we initialize ϕ M ,ϕ C , and ϕ g to zero.Finally, the algorithm described in Fig. 2 Obtain self-attention matrices.
α t ¼ tr Attn Score;kin q enc;Q ; q enc;K À Á À Á ; Save the result into the replay buffer using Eq. ( 27) for i samples {x,q,τ} from the replay buffer do Lkin;attn ¼ jjvecðxtÀk:t À xtÀk:tÞjj 2 þtr½ðAttnScore;kinðq enc;Q ; q enc;K Þ À IÞðAttnScore;kinðq enc;Q ; q enc;K Þ À IÞ > L dyn;attn ¼ jjvecðq tÀk:t À q tÀk:t Þjj 2 þtr½ðAttn Score;dyn ðτ enc;Q ; τ enc;K Þ À IÞðAttn Score;dyn ðτ enc;Q ; τ enc;K Þ À IÞ > Optimize L kin , L dyn , L kin,attn , L dyn,attn and update parameters.end end Validation in the simulation A virtual robot control environment (PyBullet) was used to show the effectiveness of our approach in the simulation 52 .The robot was asked to make two different motions.One is to travel back and forth between two points (Task 1, Fig. 3a), and the other is to follow a predefined circular trajectory (Task 2, Supplementary Fig. 1a).The overall evolutionary trajectories over time for both cases are displayed in three-dimensional plots (50 s in simulation time), as shown in Fig. 3b and Supplementary Fig. 1b, where the darker curves indicate the more recent activities.The result of path tracking control with simultaneous model learning is shown in Fig. 3c-e, displaying the three-axis positions of the end-effector, also with the mean-squared tracking error in Fig. 3f. Figure 3g shows the values of λ t and α t over time.λ t decays and converges near zero over time and α t shows gradual increase with fluctuations.From the auxiliary indicators (e.g., λ t and α t ), the desired configuration state q t,des is determined to move more widely after time passes, as shown in Fig. 3h. Figure 3i, j shows the learning curve of four kinds of neural networks.Zero-regulation results of orientation (i.e., roll, pitch, and yaw angles) is given in Supplementary Fig. 2.
The effect of the outlier detection algorithm is shown in Fig. 4. In this task, we applied the same task as Task 1 to the robot, in which the tracking results with errors are shown in Fig. 3a-d.However, impact perturbations are applied at a certain timestep, as shown with the shaded areas in the curve.Factors λ t and α t are also influenced by the perturbation, as shown in Fig. 4e, in which λ t increases when a perturbation is applied to the robot.Figure 4f shows the value of β t , indicating that the value decreases almost to the initial value.Nevertheless, the learning curve of the selfattention networks are not interrupted since the outliers are eliminated from the replay buffer (Fig. 4g).Finally, to verify the feasibility of learning the kinematics, Fig. 4h shows the difference between the approximated Jacobian J ϕ J and the robot's real Jacobian J provided from the simulated environment by calculating the Frobenius norm.As the kinematics network converges relatively fast, as shown in Fig. 3j, learning the Jacobian matrix can approximate the real Jacobian matrix when passing through the desired trajectory.
Application 1: trajectory tracking of a soft robotic manipulator A custom-built soft robotic manipulator was used to test the proposed algorithm as an application.The soft manipulator is composed of three actuator modules, and each module is made of a pneumatic origami muscle actuator (POMA) encapsulated by a flexible 3D-printed mesh structure, as shown in Fig. 5a 53,54 .A laser pointer was attached to the tip of the manipulator so that the tip position was projected on a two-dimensional surface distant from the tip.A red-green-blue (RGB) camera was used to read the position of the laser projection on the screen.The target task was to track an 'S'-shaped curve on the screen in this application.
The robot was able to follow the reference trajectory, as shown in Fig. 5b.In the first trial, the robot barely made a movement.When the kinematics and the dynamics models were sufficiently trained, the robot was able to follow the full trajectory in real-time.Figure 5c-e show the secondary indicators: the kinematics manipulability (i.e., det(JJ ⊤ )), the tracking error, and the safety factor λ t for determining the exploration region, respectively.The safety factor λ t decreases over time, resulting in an increase in the kinematic manipulability.The results of real-time trajectory tracking with model learning and its input-output data logs are provided in Supplementary Materials (Supplementary Video 2 and Supplementary Data 1, respectively).
A comparison study was conducted to determine the effect of the proposed algorithm.Soft actor-critic (SAC), a widely used RL method, was employed as a comparison algorithm 55 .A reward function for the RL setting was given as where the first term is related to the tracking error, the second term penalizes rapid volatility of the configuration state, and the numbers 2, 0.05, and 10 are determined by the best-performance combination.We compare three use cases: our method, SAC with no prior, and SAC with a previously initialized network using our Fig. 3 Validation results in simulated environment of Task 1. a Round-trip task between two points (The model of the robot is from ref. 63 ).
b Overall positions of the end-effector over time for the trajectory tracking, where the darker color indicates the more recent time.c-e Threedimensional position of the end-effector in x, y, and z axes, respectively.f Mean-squared error between the reference and the actual trajectories.g Auxiliary indicators λ t and α t .h One of the desired configuration states q t,des , showing that the desired state expands over time.i, j Learning curve of four neural networks, self-attention kinematics and dynamics for determining α t and β t , and kinematics and dynamics network for determining q t,des and τ t .
method.The reward obtained by each method over time is shown in Fig. 5f.The result shows that SAC took a long period of time to receive a high reward, while our method quickly obtained a high reward.The tracking performance of the RL algorithm over a long period of time is visualized in Supplementary Materials (Supplementary Fig. 3).Merging the two methods eventually surpassed the performance of the two individual cases when the SAC network was sufficiently trained.The result implies that our algorithm can be used as an accelerator for the RL algorithm to solve high-level tasks, such as grasping, interaction, or tasks involved with model-based RL problems.
Application 2: autonomous manipulation of a robotic arm An industrial robotic arm (UR5e, Universal Robots) was used for another application, that is, play the piano.In this task, it was assumed that there was no information on the kinematics of the robot available.Instead, the kinematics model was learned by the proposed algorithm and used to perform the task.However, the dynamics model was not trained, since the robotic arm used in this experiment did not support real-time force control.The desired trajectory for playing the piano was obtained by kinesthetic teaching; a human expert moved the manipulator and demonstrated how to play the melody, and the orientation of the end-effector was sequentially saved in the buffer.A single task episode carried 200 timesteps, and 25 episodes (total 5000 timesteps) were executed.The overall result is summarized in Fig. 6. Figure 6a shows the global coordinate and the experimental setting for the kinesthetic teaching.The human expert's trajectory is provided in Supplementary Materials (Supplementary Data 2).Figure 6b shows the contour analysis of the robot movement in timesteps t = 1000, 2000, and 3000.A contour volume of the robot at t = 0 was valued at V 0 , and the volume ratio increased over time and reached 1.50 at t = 3000.In other words, the robot initially moved in the vicinity of its origin but finally broadened the task space.The full data can be found in Supplementary Materials (Supplementary Data 3). Figure 6c shows the reference and actual motions of the last episode.Figure 6d, e show the safety factor and the tracking error, respectively, both of which decreased over time.The overall process of kinesthetic teaching, model learning, and performance testing is visualized in Supplementary Materials (Supplementary Video 3).
The analysis of the results in detail shows that there exists a small peak in the actual trajectory, shown in Fig. 6c, i.In addition, the actual trajectory lagged behind the schedule of the reference, as shown in Fig. 6c, iii, meaning that the closed-loop feedback system had a boundedness property.Let the governing kinematics equation for the robot be x t+1 = f(x t ) + g(x t )q t , which is the first-order approximation for the kinematic Jacobian.However, we know the approximated functions f and ĝ where the discrepancy errors are defined as f :¼ f þ δ and ĝ :¼ g þ ϵ, respectively.Then, the feedback control becomes q t,des = (g + ϵ) + (x t+1,des − f − δ) using the approximated neural network model.The expected execution is given as x t+1,exp = f + gq t,des .The tracking error of the next state is now given as As a result, the tracking error is proportional to the model inaccuracies, δ and ϵ.If we choose a sufficiently rich neural network model, then we can minimize the model inaccuracy by the online model update algorithm.However, there exists a tradeoff between the tracking error and the operating frequency, since a complex neural network model degrades the performance of the real-time controller.In conclusion, the proposed algorithm accomplished the task of online tracking of a complex trajectory with the assistance of a human expert.Although a commercial robotic arm was used in this demonstration, the overall approach can be extended to any unknown or custom-built robot.

Application 3: gait training of quadruped robot
Controlling legged robots, such as quadrupedal robots or humanoids, often requires high performance controllers that consider the stability of the center of mass, the walking balance, and phase matching 17,56 .Unfortunately, our approach based on kinematics and dynamics did not fully utilize those properties.Instead, we copied the walking behavior of a quadrupedal creature and implemented it as the reference trajectory.We used a commercial quadruped robot (Laikago, Unitree) for a validation experiment.The goal was to make the robot walk straight.The result is presented in Fig. 7. Figure 7a, b summarize the coordinate settings and the transmission protocol for the agent.Figure 7c shows the horizontal and vertical motions of each foot.It is observed that the motions became larger as time passed, indicating that the overall trajectory was affected by the safety factor that decreased over time.This is also observed in Fig. 7d, where the vertical position gradually increased over time.The real-time operation of the robot can be found in Supplementary Materials (Supplementary Video 4), and the Cartesian position of the robot foot of a single operation can also be found in Supplementary Materials (Supplementary Data 4).
In this application, the robot was unable to sufficiently explore the kinematics due to the relatively large weight of the battery attached to the robot, and we artificially created an exploration term that did not influence the overall performance of the robot, using the nullspace exploration term.Let the feedback control be q t;des ¼ ĝþ ðx tþ1;des À f Þ.Then, we define the null-space of the Jacobian ĝ as ½I À ĝþ ĝ.We applied the desired configuration state as where we leave c t as an exploration constant decaying over time.By applying this control, the null-space exploration term eventually generates the internal motion that does not degrade the overall performance (see Supplementary Document 1, Appendix B).Last, as shown in Fig. 7a, the actual trajectories did not exactly follow a cycloid curve, as drawn at the top of Fig. 7a.We observed that this result came from the high friction between the foot and the ground.The friction effect shows a highly nonlinear behavior that is tedious to model, even with the aid of a neural network, since friction model functions are discontinuous (e.g., Newton's Fig. 5 Results of tracking control of a soft robotic manipulator.a Origami-based soft robotic manipulator for bending with pneumatic pressure inputs 50 .The manipulator was hung by fixing the base to the structure, and the laser module installed at the tip of the manipulator projected the light on the screen.A camera detects the trajectory of the projection and sends the positional information on the trajectory to the computer.b Reference and executed trajectories of the Episodes 1, 3, 5, and 7. c-e Auxiliary indicators, meaning the kinematic manipulability, the tracking error, and λ t , respectively.f The result of the comparison study (SAC).block model 57 ).This effect was inevitable in our model setting, indicating the limitation of the proposed algorithm.Nevertheless, our approach was able to successfully achieve the desired locomotion in three minutes from scratch.

Ablation study
We performed an ablation study to evaluate the effect of the four specific methods employed in the algorithm: masking for selfattention, use of the leveraging factor γ < −1, position-wise encoding for self-attention, and random initialization.The results of the study are presented in Table 1 and visualized in Fig. 8. Statistical tests on the mean and the standard deviation values were applied to the four ablation cases.
The effect of masking in the self-attention mechanism was first tested.Masking is a widely-used method to avoid referring to future inputs while training since a time series has a causality property.However, the result with masked self-attention showed no significant difference from that with non-masked self-attention because the proposed algorithm was used to find the abnormality in the datasets rather than to predict or generate them.
The next test was on the use of the leveraging factor.Updating the safety factor λ t by following the leveraged ratio of the kinematics attention score α t was used for gradually downsizing the factor to accelerate the exploration.The leveraging factor was fixed to γ = −1, and the performance was compared with that of the original algorithm.The update rule for λ t using the leveraged ratio showed significant differences in both the mean and the standard deviation values, implying the acceleration of both the exploration and convergence of the model with the leveraging strategy.In addition, leveraging improved the reproducibility of the experiment as a result of a low standard deviation value.
The effect of position-wise encoding for self-attention was also tested.In this case, the self-attention layer worked as a timerecurrency of the system and thus played an essential role in analyzing both the kinematics and the dynamics models.Therefore, position-wise encoding also showed a significant improvement in performance compared to the case without position-wise encoding.
Finally, the effect of initializing the kinematics network by sampling from the normal distribution was checked, and it was confirmed that initialization contributed to determining the starting point of exploration.Two cases of initialization, namely, random initialization and zero initialization, were compared.The result indicates that random-initialization of the kinematics network facilitated the system to reduce the tracking error and to accelerate the convergence.After all, if the network uses zero initialization, then the kinematics network is less willing to explore a new region at the initial operation.

Effect of random initialization
In this section, we explain why an RBNN was chosen rather than a multilayer perceptron structure.We describe the behavior of the output using a vanilla example.Let us consider an artificial neural network as a network that consists of multidimensional inputs, a single hidden layer with hidden units N H , and a scalar output.We can then express the mapping from the input x 2 R n to the output y2R as where b 1 2 R NH and b 2 2R are the bias terms, w 1 2 R n NH and w 2 2 R NH are the weights, and σ(⋅) is a nonlinear activation function.We denote all the weights (i.e., [b 1 ;b 2 ;vec(w 1 ); w 2 ]) by w, initialized by sampling from the prior distribution.If the sampling distribution is chosen as ) then it is straightforward to calculate the following property: 58 when a hyperbolic tangent is used as an activation function.The result indicates a kernel between x and x' and thus a Gaussian process when N H ! 1.However, when we use an RBNN, we can derive the covariance as, which is an enveloped decaying Gaussian function 58 .Comparing Eqs. ( 36) and (37), we can infer that both networks behave like a Gaussian process when their weights are randomly initialized, which is the same situation as when a robot is about to start  exploration and model learning.We showed that random initialization of the kinematics network gradually improved the tracking performance in the ablation study.However, an FNN defines the similarity between two inputs x and x′ as an inner product, where the RBNN defines the similarity as the Euclidean distance.Therefore, the covariance of the RBNN is consistent within small input evolution, where the robot should exercise caution in the initial learning phase since no information is given to the robot.Random initialization was not performed for the M, C, and g matrices, but zero-initialization was used instead.This is because we employed a two-step control strategy.The desired configuration state q t,des is first obtained using the desired task state x t,des , and then the control input τ t is further derived.When examining the equation for obtaining the control input using the desired configuration state, the inertia term and the Coriolis force term are multiplied by the acceleration and the velocity terms of the desired configuration state, respectively.During the initial stages of training, when the value of λ t is high, the acceleration and the velocity components exhibit diminished magnitudes.Consequently, the gradients associated with the matrices M and C are also small in magnitude.Therefore, the matrix g is trained relatively quickly in the initial stages.This prevents unexpected movements resulting from the inertial terms when the learning has just started.

Network structure and component
As previously mentioned in the results section, the proposed algorithm used a discretized kinematics model instead of an original continuous-time Jacobian model.By using this first-order model and the update rule by gradient descent, the overall model worked as a piecewise linear model that was able to best express the local behavior near the bin.Functions f and g are approximated by the RBNN with a predefined Gaussian activation function.When operating the robot, the kinematics model is similar to the locally linearized model near the closest center of the Gaussian activations.When updating the model, the weight update ratio considers the distance between the current state and the center.If the center is far from the current state, then the weights of the neural network corresponding to the center will rarely be updated.On the other hand, if the center is close to the current state, then the weights will be periodically updated, realizing continual learning of our algorithm.If we need to perform multiple tasks at once, then all we need is to update the network sequentially by task rather than reconstructing a new model.
Both the dynamics model and the dynamics self-attention model consider only the relationship between the control input and the configuration state, neglecting the influence of external forces and the geometric constraints inherent in the actual dynamics of a robot.While it is feasible to equip the robot with force sensors, external perturbations may appear on any locations on the robot, making them undetectable by the sensors.As a result, these outliers result in reduction in the dynamics selfattention score and are consequently excluded from the replay buffer.However, focusing solely on the relationships within the dynamics model can be influenced if any self-collisions occur during training.While the dynamics attention model can detect the onset of the self-collision through the temporal patterns of the dynamics self-attention matrix, it is difficult to distinguish these patterns from the pattern from the external perturbations.As a result, dataset exclusions from replay buffer may also occur if there are self-collisions.This implies the importance of meticulously designing the desired task to prevent such issues.In this work, the desired task trajectories are designed as conservatively as possible to ensure the robot does not enter self-collision state, and use automatic safety functions embedded in the robot platform.

Relation to previous methods
We compare our approach with previously developed RL algorithms that consider model-based methods or controloriented methods.Model-based RL methods by using a specific cost function regarding its trajectory constraints or incorporating model uncertainties have previously been developed 18,59 .In particular, an iterative linear quadratic regulator (iLQR) was used as a controller to generate guided samples 18 , and nonlinear lifting was used to depict highly nonlinear terms with model predictive control 59 .Both approaches use iterative optimization based on local linearization.Our method can be rearranged as an iLQR form by which yields iLQR recursive relation to calculate Q-function and value function as where subscripts mean Jacobians and Hessians with respect to x and q.The results finally derive the same solution as Eq. ( 23) by defining feedback control q tþ1;des ¼ q t À Q À1 qq;t Q q;t .As a result, our method follows the flow of model-based RL, where employing λ t and planning its value over time is the unique property of our method.

METHODS
This section discusses the environmental settings for the three validation experiments.All the source codes were written in Python 3, and all the hyperparameters on the experiments were determined by sweeping the parameters and selecting the best combination.

Learning implementation
For initial setting, we designated an initial position with position control and then changed it to a force control mode.We clipped the measured force, velocity, and acceleration values with a hyperbolic tangent function to prevent any undesired behaviors in the deep learning model.The learning rate for each neural network models was 0.005 with an Adam optimizer.The centers c i of the RBNNs were also updated using the gradient descent.The size of the replay buffer was set to be 6000 samples, and the model was updated every 20 timestep with minibatch size of 32.The size of the selfattention matrix was 20 × 20, in which we refer to 20 past samples for detecting the anomaly.A leveraging factor γ = −3 was determined from its best-performance combination, but we clipped the maximum volatility to be 20% of its previous value to prevent unexpected motions.Moreover, a lower limit for the safety factor λ t exists to avoid a kinematic singularity, which is often inevitable during the control manipulator (see Supplementary Document 1, Appendix C).To calculate the average value of β t for determining ϵ in Eq. ( 27), approximately 1500 ms are required in the simulated environment and 10 s in the real-world environment.We performed a gradient descent of the kinematics and the dynamics models from scratch for rapid online calculation and simplicity.On the other hand, we received the assistance of an automated back-propagation tool while optimizing the self-attention models (Adam optimizer 60 ).The selected hyperparameters were applied to both the simulated environment and the actual robot environments, but their values were slightly adjusted to account for differences in operation frequency, as summarized in Table 2.
The dynamics model requires the learned inertia matrix to adhere to the constraint of being symmetric positive-definite.To ensure this constraint, the Cholesky decomposition technique was employed during the learning process.This involves constructing a lower triangular matrix with positive diagonal elements, which can be multiplied by its transpose to yield a symmetric positivedefinite matrix.Consequently, the weight utilized in computing the inertia matrix is specifically designed to obtain a triangular matrix to match the Cholesky decomposition requirements.

Simulation environment
In the simulation for validation, we used the open-sourced PyBullet physics engine compatible with the Python environment.A six degree-of-freedom (DOF) robotic arm (KUKA) with a default setting in the PyBullet engine was used in the simulation.The overall environment was exposed to a gravitational field (À9:81 km=s 2 ).The PyBullet engine used a fixed timestep of 0.001 s in the simulation world.
Trajectory tracking was performed for the validation task.Two tasks, named Task 1 and Task 2, were executed, as shown in Fig. 3a and Supplementary Fig. 1a, respectively.In Task 1, a trajectory of back-and-forth motions composed of a sine wave over time was given, and in Task 2, a circular trajectory in the x − y plane was given.A feedback gain for force control was given as 25 in the diagonal terms of the matrix.
For the statistical test in the ablation study, we used statistics and visualization tools (Prism 9, GraphPad) for analysis and plotting.We assumed that each ablation group was independent of the others, and the dependent variables were the ordinals.Therefore, we conducted a Mann-Whitney test (i.e., Wilcoxon rank-sum test) to compare the mean values and a non-parametric two-sample t-test.In addition, two-sample F-test was performed to distinguish the repeatability of each experiment.A total of eight samples were used for each ablation case (n = 8), the asterisks indicate *P < 0.05, **P < 0.01, ***P < 0.001 and "ns" indicates nonsignificance between two groups.

Soft manipulator environment
A simple soft robotic manipulator, composed of three linear soft pneumatic actuators was used in this experiment 50 .Each actuator was made of an origami air chamber that was able to contract and expand along its length.The air chamber was fabricated with a 150-polyethylene terephthalate (PET) film and sealed with a polyamide film.Three independent linear motions allow the entire structure to bend in three different directions, eventually enabling the manipulator to span three-dimensional space.Therefore, the structure can be used for localizing the position of the end effector in three-dimensional space.For the conciseness of the system, we selected the task to track the designated path in a twodimensional plane.At the tip of the robot structure, a laser module (λ t = 650 nm, P = 5 mW, Epxfavjd, Amyta) was attached so that the manipulator projected the laser point on the test plane.The projected laser point was then captured by an RGB camera (Q2F-00014), and the data of the relative pixel positions were collected by using an open-source computer vision library (cv2) with a scaleinvariant feature transform.
A pneumatic input source for each actuator attached to the robot was controlled by a pressure regulator (ITV2011-212BS5, SMC Corp.) connected to a microcontroller (Arduino Mega).A Python environment in the laptop communicated with the microcontroller via the transmission control protocol (TCP) in real-time.The input τ 2 R 3 was an applied voltage to the pressure regulator, the configuration state q 2 R 3 was a current pressure level applied to the actuators, and the task state x 2 R 2 was the two-dimensional projection of the laser module.

Robotic arm environment
We validated the algorithm with a six-DOF industrial robotic arm.The robot was controlled in real-time by the transmission control protocol/internet protocol (TCP/IP) communication via the Python URBasic library 61 .The desired joint input was applied through the command for simultaneous position servoing in URBasic, which is a real-time torque control targeting a smoothed trajectory.The overall operation frequency of the system was 25 Hz, in which the computed torque control achieved the desired motion in the lookahead time of 0.15 s with a sliding feedback gain of 0.5.The robot performed only position control, so we neglected the dynamics model and considered position control with the kinematics model.
The desired trajectory of the task was obtained by collecting the end-effector trajectory through kinesthetic teaching.To perform teaching, admittance control (i.e., free-drive mode) was used with the least target mass and damper setting.The robot was complied with a safety setting in case of an emergency.
We demonstrated playing the piano.A human operator taught the robot to play the melody of the music (W. A. Mozart, "Ah, vous dirai-je, maman K. 265, TEMA) with a piano (Yamaha P-125).The end-effector for pressing the keyboards of the piano was made from silicone rubber (EcoFlex-0030).A tempo for teaching was 45 beats-per-minute (bpm), where the actual demonstration was in the tempo of 60 bpm by downsampling the trajectory.A contour analysis is shown in Fig. 6b, a postprocessed image using a contour analysis video processing program (ProAnalyst, Xcitex).

Quadruped robot environment
In the last application, we used a commercial 12-DOF quadruped robot (Laikago Pro, Unitree) and tried to make it walk.The robot was controlled by the torque command on each motor torque in real time.The overall communication was performed by the robot operating system (ROS) with the Python 2 environment via the user datagram protocol (UDP) communication.This communication was implemented by the Unitree libraries with an operation frequency of 500 Hz 62 .The states of the robot were also transmitted to the computer through the UDP communication.We consider each of the four legs as an individual agent.An input τ 2 R 3 , a configuration state q 2 R 3 , and a task state x 2 R 3 are the motor torque, the joint angle, and the Cartesian position of the foot, respectively.In order to account for the variation in motor torque due to the contact between the robot foot and the ground, we used both the actual torque applied to the motor and the target motor torque as the input values of the dynamics network.A cycloid-type curve was used as a reference trajectory, but it was slightly modified to overcome the relatively high friction between the foot and the ground.For the overall gait motions of the quadruped robot, we referred to the previous literature 17 .

Fig. 1
Fig. 1 Overall scheme of the kinematics network and its self-attention model for adjusting the exploration region.a The kinematics model consists of a fully connected layer with a radial basis activation function.b When we expand this model over the past k timesteps, we can correlate their encoded features using a self-attention mechanism.c If the model is normally working, the self-attention matrix will exhibit a large value in the diagonal position.The brighter color indicates the larger value.d If abnormal data are detected, the self-attention matrix will show a blurred image.

Fig. 4
Fig.4Results of perturbation test.While performing trajectory tracking, an external perturbation was applied to the end-effector.a-c Threedimensional position of the end-effector in x, y, and z axes, respectively.d Mean-squared error between the reference and the actual trajectories.e Auxiliary indicators λ t and α t .f Indicator β t determining the perturbation.g Learning curve of the self-attention network.h Meansquared error between the approximated Jacobian J ϕ J and the real Jacobian of the robot.

Fig. 6
Fig. 6 Results of autonomous manipulation of a robotic arm.a The robot is first taught how to play the piano, and it records the trajectory of the end-effector.b While training, the robot moves in the vicinity of its original point at first but gradually broadens its task space.V 0 is the initial volume of the robot body locus, and V is the current volume.c Orientation of the end-effector.d, e Auxiliary indicators: the safety factor and the tracking error, respectively.

Fig. 7
Fig. 7 Results of gait training of quadruped robot.a Overall schematic of the coordinate transformation for the agent.b Commercial quadrupedal robot communicating with the computer via TCP/IP using ROS.c Comparison between the reference and the actual trajectories.The darker color indicates the more recent time.d Vertical position of a single foot over time, together with the applied torque to the leg.

Fig. 8
Fig. 8 Results of the ablation study.Result of the ablation study showing no significance between the complete algorithm and the unmasked self-attention but showing significance among the other three ablation cases that are leveraging the safety factor, applying position-wise encoding, and randomly initializing the kinematics network.

Table 1 .
Results of the ablation study and the results of the statistical tests.

Table 2 .
List of hyperparameters used in each experiment.