When humans interact with their surroundings, they perform highly complex in-sequence tasks with seemingly minimal effort1,2,3. By virtue of our cognition, even long sequences of manipulation tasks appear to require very little effort4,5.

In contrast, observing the above from the perspective of robots as agents with embodied intelligence, achieving these physical interactions is currently far from trivial5,6 and solving complex sequential tasks over a long horizon remains an ongoing challenge7,8. Notably, a task as simple as retrieving a glass from a shelf, pouring in water and placing it onto a table may seem trivial, but from an embodied intelligence perspective remains considerably challenging. Essentially, successful sequential manipulation is achieved when (1) high-level skills are satisfied, (2) sensory events are predicted, (3) the end goals are known and (4) the sequences of different skills are conceptualized in our minds and more broadly by our nervous system3,9.

Nevertheless, robots can perform repetitive manipulation tasks with high precision, provided that these are confined to specific tasks10,11. Some of these tasks include picking and placing4,12, swing peg in hole13,14, catching in-flight objects15, insertion14,16 or solving a Rubik’s cube17. However, when it comes to solving a sequence of multiple tasks that vary in complexity, substantial challenges arise11.

To overcome these limitations, we developed the novel robotic manipulation network ROMAN, which is an event-based hybrid hierarchical learning (HHL) framework, visualized in Fig. 1, for hierarchical task learning. This mixture of experts (MoE)-based hierarchical approach is capable of solving complex long-horizon manipulation tasks. We evaluated the framework in simulation and validated its robustness during long-horizon sequential tasks against sensory uncertainties. Thereafter, we performed extensive ablation studies of the internal learning procedure, evaluated the effects of different demonstrations and benchmarked the performance of ROMAN when compared with monolithic neural networks (NNs). Our results demonstrate that, by recombining and fusing ROMAN’s core experts and skills together, our framework is able to solve considerably complex, long-horizon sequential manipulation tasks, commonly encountered in our everyday lives, with generalizing capabilities. In the remainder of this Article, we review the related work, present ROMAN’s results, discuss future work and elaborate on the technical details of our methodology.

Fig. 1: Capabilities of the ROMAN framework.

An HHL framework for hierarchical task learning, capable of solving long-time-horizon tasks that require successful activation and coordination of diverse expert skills to solve a sequence of non-interrelated tasks commonly necessary in robotics and physical interactions. The derivation of high-level specialized experts in ROMAN allowed us to construct a gating network that is trained for elevated task-level scene understanding, for the planning of complex sequential long-time-horizon tasks and for the successful and timely activation of low-level expert networks. We studied a set of seven specialized manipulation skills that are common in daily life and can be combined to create a higher level of manipulation skills. These specialized skills included (1) pushing a button, (2) pushing, (3) picking and inserting, (4) picking and placing, (5) rotating–opening, (6) picking and dropping and (7) pulling–opening. Unlike conventional planning methods or state machines, ROMAN exhibits adaptability in (1) randomized task sequences, (2) generalization outside demonstrated cases and (3) recovery and robustness against local minima. The ability of the gating network to achieve such versatility is attributed to (1) the HHL architecture in ROMAN’s core framework and (2) the high-level task decomposition of complex sequences by the various experts in the framework, allowing the central manipulation network (MN), which is a gating network, to be trained on high-level scene understanding and orchestration of experts. The system is based on an MoE architecture, which is able to successfully adapt to environmental demands, overcome various levels of uncertainties and most importantly learn with minimal human imitation.

Real-world impact of intelligent robotics

Pre-programming robots via analytical models can lead to suboptimal solutions due to simplified modelling of real-world dynamics, and online recomputation can be expensive and unable to account for dynamically changing physical properties. Current advances in artificial intelligence and machine learning offer a promising avenue to advance robot learning and embodied intelligence12,14,18,19.

The common reinforcement learning (RL) algorithms among related work are Proximal Policy Optimization (PPO)20 and Soft Actor–Critic21. Although PPO is on policy and generally less sample efficient than off-policy algorithms such as Soft Actor–Critic, PPO is less prone to instabilities and typically requires less hyperparameter tuning than Soft Actor–Critic20,21,22. For these reasons, we chose PPO as our RL algorithm.
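
The stability of PPO is usually attributed to its clipped surrogate objective, which limits how far a single update can move the policy. The sketch below illustrates that objective for one sample; it is a generic illustration rather than ROMAN's training code, and the function name and the clip range of 0.2 are illustrative assumptions.

```python
def ppo_clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Per-sample PPO clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio     -- probability ratio between the new and old policy for the action
    advantage -- advantage estimate for that action
    clip_eps  -- clip range (0.2 is an illustrative assumption)
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, a ratio above 1 + clip_eps earns no additional objective value, so the update has no incentive to move the policy further in that direction.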

Imitation learning and learning from demonstration

RL algorithms face challenges in dealing with complex tasks, particularly when rewards are sparse, which exacerbates the exploration–exploitation trade-off23,24,25. One major limitation is the need to generate their own experience from scratch26,27, which can require millions of state transitions and days of training due to the absence of prior knowledge19,28.

An alternative is to use imitation learning (IL), inspired by the prior knowledge that humans possess when learning motor tasks instead of starting from scratch29, whereby agents learn to emulate the demonstrated behaviour. This is also known as learning from demonstration, showing promising results in dexterous robotic tasks that would have been impossible to pre-program or substantially difficult to learn via conventional RL, due to the required degree of exploration and the necessity to carefully craft rewards for the desired behaviour12,23,26.

Most IL and learning from demonstration approaches depend on demonstrations from human experts. While some forms of demonstration could be substituted via conventional trajectory optimization12,30 or RL31,32,33, these methods generally require carefully designed costs or rewards and considerable interaction time between the robot and the environment.

One of the main IL algorithms used in related work is Behavioural Cloning (BC), which performs supervised learning on the policy from a set of demonstrated state–action transitions, showing promising success in robotic tasks8,12,34,35. However, BC has numerous limitations when used in isolation, such as lack of exploration, limited robustness towards new non-encountered states and dependence on large, near-optimal demonstrations36.
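
To make the supervised nature of BC concrete, the following minimal sketch fits a policy to demonstrated state–action pairs by least squares. It assumes a toy setting with scalar states and a linear policy a = w * s; the function names and the data are hypothetical and far simpler than the NN policies used in practice.

```python
def fit_linear_bc_policy(demos):
    """Fit a = w * s by least squares on demonstrated (state, action) pairs."""
    num = sum(s * a for s, a in demos)
    den = sum(s * s for s, _ in demos)
    if den == 0.0:
        raise ValueError("demonstrations must contain a non-zero state")
    return num / den


def bc_loss(w, demos):
    """Mean squared error between policy actions and demonstrated actions."""
    return sum((w * s - a) ** 2 for s, a in demos) / len(demos)


# Toy expert that always acts as a = 2 * s (hypothetical data).
demos = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = fit_linear_bc_policy(demos)
```

The policy is fitted only on the states present in the demonstrations, so nothing constrains its behaviour on states outside them, which is the limitation discussed next.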

Naively copying expert demonstrations via BC is prone to problematic performance when the agent visits states not encountered in the demonstrations due to covariate shifting errors that compound over time, which drives the need for large numbers of demonstrations36,37, leading to operator fatigue and hence degraded performance4,38. Even from a biological perspective, the sole and naive dependence on an expert to learn new skills is misguided25,27,39. Zaadnoordijk et al. provided a matching analogy whereby trial and error is a crucial part of our early lives: “Human infants are in many ways a close counterpart to a computational system learning in an unsupervised manner, as infants too must learn useful representations from unlabeled data”25. For machine learning, this suggests that learning in its core should not entirely depend on copying an ‘expert’, but rather encourage further exploration beyond imitation, to draw inspiration from a neurobiological standpoint27,39.

An alternative to overcome some of the limitations of BC is inverse RL, which infers the underlying reward function in observed demonstrations to explain the demonstrations and achieve a near-optimal behaviour36,40,41. One of the popular inverse RL algorithms is Generative Adversarial Imitation Learning (GAIL)36. In this framework, GAIL uses a second NN, known as a discriminator, responsible for distinguishing between agent- and expert-generated trajectories36.
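
As an illustration of how a discriminator can supply a learning signal, the sketch below converts a discriminator output into an imitation reward. It assumes that D is the estimated probability that a state–action pair is expert-generated and uses the form -log(1 - D); sign and labelling conventions differ between implementations, so this is one common choice rather than the specific form used in ROMAN.

```python
import math


def gail_reward(d_expert_prob: float, eps: float = 1e-8) -> float:
    """Imitation reward -log(1 - D).

    d_expert_prob -- discriminator's estimated probability that the
                     state-action pair came from the expert (assumed convention)
    eps           -- floor that avoids log(0) when D approaches 1
    """
    d = min(max(d_expert_prob, 0.0), 1.0)
    return -math.log(max(1.0 - d, eps))
```

Under this convention the reward grows as the agent's behaviour becomes harder to distinguish from the demonstrations.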

Hierarchical learning

Solving complex tasks using monolithic NNs through RL or IL can be challenging due to (1) long-horizon problems, whereby the computational complexity of approximating a policy is high, (2) the variability of the task requiring numerous subtasks and (3) sample complexities of dexterous tasks7,8,42,43,44. Moreover, the successful completion of a long-time-horizon task is contingent upon the successful completion of all subtasks in a particular sequence44. Finally, even using smaller subtasks to solve the problem44,45 can still be aggravated by considerable variations in their nature and limited task interrelation46.
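
The dependence of a long-horizon task on every subtask succeeding can be illustrated numerically. The sketch below assumes, purely for illustration, that subtask outcomes are independent, so that the probability of completing the sequence is the product of the subtask success probabilities; the value of 0.95 is hypothetical and not taken from the results.

```python
import math


def sequence_success_probability(subtask_success_probs):
    """Probability that every subtask in a sequence succeeds,
    assuming independent subtask outcomes (a simplifying assumption)."""
    return math.prod(subtask_success_probs)


# Seven subtasks that each succeed 95% of the time (hypothetical figure).
p_seven = sequence_success_probability([0.95] * 7)  # roughly 0.70
```

Even with reliable subtasks, the sequence-level success probability decays with the number of subtasks, which is why long horizons are demanding.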

Hierarchical learning (HL), whether used for RL or IL, can mitigate the above problems and alleviate some of these complexities19,47,48,49. HL offers multiple benefits when it comes to complex tasks associated with sparse rewards7, as it allows the decomposition of tasks into more approachable problems, that is, subtasks8. When these HL policies implement IL, commonly referred to as HIL, the differentiation between the specialized experts and the acquisition of specialized human skills in a teacher–student fashion is considered easier8,42,50.

A popular approach is the use of MoEs, where multiple task-specific experts are trained and specialized on a given subtask, with applications in computer graphics18,51 and robotics8,19,52. However, hierarchical reinforcement learning (HRL) still fundamentally depends on RL and hence is adversely affected by sparse rewards, complex planning tasks and difficulty in using prior knowledge8,44. HIL8,42 leverages expert demonstrations, unlike RL or HRL, to aid the overall training process and allow the demonstrator to isolate subtasks to facilitate solving longer, more complex and in-sequence tasks8,50.

Currently, in robotic manipulation, methods using MoEs trained with HRL or HIL are limited in the state of the art44,45. On the basis of previous work that introduced ensemble techniques in robot locomotion19 and human-centred teleoperation38, we are motivated to explore a new approach of IL using human-demonstrated tasks developing a suitable MoE architecture in the domain of robotic manipulation. This approach has the potential to extend beyond the original demonstrations and enable more complex manipulation tasks. Work similar to ours used an HRL approach to train a robotic gripper incorporating three experts: (1) approach, (2) manipulate and (3) retract44. While their results were validated against BC, showing higher (90%+) success rates when compared with RL, these studied tasks were limited to non-sequential tasks with short time horizons on a manipulator with a lower number of degrees of freedom, and restricted to three experts solving only picking and placing tasks44. In contrast, our work can train a single expert capable of solving picking and placing, and when combined with other experts specialized in rather high-level subtasks when compared with ref. 44 we can solve complex and long-horizon sequential tasks in manipulation.


This section presents the results for the ROMAN framework, which is composed of a modular hybrid hierarchical architecture to combine adaptive motor skills for solving complex manipulation tasks. It features a central manipulation network (MN) that activates specialized task-level experts in a required sequential combination, resulting in higher levels of manipulation capability and improved generalization to non-demonstrated situations. Moreover, the MN exhibits recovery capabilities by activating multiple expert weights to overcome local minima, which ultimately enhances the robustness for solving long-horizon sequential tasks.
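
The idea of a gating network assigning weights to several experts can be sketched as follows. The sketch assumes that the gating scores are normalized with a softmax and that the expert actions are blended linearly; both choices, and all names, are assumptions made for illustration and may differ from how the MN combines its experts.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of gating scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def blend_expert_actions(gating_scores, expert_actions):
    """Weight each expert's action vector by its gating weight and sum."""
    if len(gating_scores) != len(expert_actions):
        raise ValueError("one gating score per expert is required")
    weights = softmax(gating_scores)
    dim = len(expert_actions[0])
    return [
        sum(w * action[i] for w, action in zip(weights, expert_actions))
        for i in range(dim)
    ]


# Two hypothetical experts with 2-D actions; the gate strongly prefers the first.
action = blend_expert_actions([10.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

When one score dominates, the blend reduces to a single expert; comparable scores yield a mixture, which is one way several experts could be active at once.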

Specifically, our validation shows the robustness of ROMAN’s HHL approach against (1) high exteroceptive observation noise, (2) complex non-interrelated compositional subtasks, (3) long-time-horizon sequential tasks and (4) cases not encountered during the demonstrated sequences. ROMAN achieves behaviour beyond imitation through hybrid training and allows the dynamic coordination of experts to recover from local minima successfully, with examples depicted in Fig. 2. Our findings highlight the versatility and adaptability of ROMAN, enabling autonomous manipulation with adaptive motor skills.

Fig. 2: ROMAN demonstrated the ability to adapt to the scenarios beyond demonstrated sequences and exhibited dynamic recovery capabilities, by balancing exploitation and exploration via the HHL approach.

a,b, Policy adaptation of ROMAN during failures in pick and place and pick and drop subtasks, respectively. These intermediate failures are attributed to either an expert or a gating network error. In these instances, we show infrequent error cases (t = 1) of these experts, which, however, quickly re-adapt and regrasp the items (t = 2 to t = 4) to successfully complete the sequence. Most notably, this can be due to a combination of incorrect grasping of objects, expert trajectories or activation of sequences. c, The ability of the MN of the ROMAN framework to dynamically adapt in cases that were not encountered in the initial demonstrations, but rather those states were visited during RL training as the balance of exploitation and exploration, ultimately exhibiting new behaviours beyond imitation, leading to recovery capabilities from local minima. The figure represents 12 snapshots in time with a sequence from left to right and top to bottom, and the weight assignments by the MN highlighted.

To evaluate the scalability of a hierarchical architecture versus a single-NN approach, we compared ROMAN’s preliminary two-dimensional (2D) and final three-dimensional (3D) hierarchical architecture stage against monolithic NNs sharing an equivalent hybrid learning procedure. Snapshots of ROMAN completing long-horizon sequential tasks can be seen in Fig. 3, with examples of 2D and 3D operation depicted in Fig. 3c,d, respectively. Thereafter, we evaluated ROMAN’s final 3D stage composed of seven experts against (1) different levels of exteroceptive uncertainty, (2) extensive ablation studies of the internal hybrid learning procedure and (3) the effects of different numbers of demonstrations provided to the framework. All subsequent results from the experiments were conducted with identical network settings (states, actions and rewards), number of demonstrations and hyperparameter values to retain consistency. The overall architecture of ROMAN is depicted in Fig. 4, with the state space and settings of each NN specified in Table 1. More details on the hyperparameters and dimensions of the networks can be found in Supplementary Information and more specifically in Supplementary Tables 12, 13, 14 and 15. Information on the demonstrations can be found in Methods.

Fig. 3: Analysis of the MN observations using the t-SNE, with visualized snapshots showing ROMAN’s completion of sequential tasks in 2D and 3D scenarios.

The t-SNE projects the 29-dimensional MN state vector into two dimensions. Principal component analysis was used to warm-start the t-SNE projection. a, The depiction of the state vectors at the start of each of the seven case scenarios, sampled at 1,000 Hz for 1 s. A total of 1,000 samples were projected with a perplexity of 400. b, An illustration of the state vectors during the sequence of actions contained in each case scenario, sampled for the first 1.5 s of each expert sequence. Consequently, as these are sampled within the sequence of actions, they appear ‘trajectory’-like, since the robot and the objects manipulated by it are in motion during the sampling. A total of 1,500 samples were projected with a perplexity of 200. Six out of seven scenario cases are depicted, as in practice S1 only includes a single expert activation and hence was omitted from the analysis. c, ROMAN in its initial 2D stage depicting all five distinct subtasks managed by each expert. d, ROMAN in its final stage in the most complex setting and longest-time-horizon sequential tasks in full 3D space, with seven different experts.

Fig. 4: Hybrid hierarchical architecture of ROMAN composed of high-level experts and the gating NN, and the formation of the ROMAN framework.

a, The hybrid learning architecture of each high-level expert and gating NN. DoF, degrees of freedom. b, The higher hierarchical formation of ROMAN and how the experts are orchestrated and activated. The multilayer perceptrons for the NNs are visually depicted in both panels.

Table 1 Summary of ROMAN’s overall NN architecture, the state space of each NN and the settings of individual components in the hierarchical framework

Definition of success rate

Task success was attained and defined when all seven subtask goals depicted in Fig. 1 were satisfied. Consequently, to consider a scenario successful, all interrelated subtasks needed to be sequentially completed within the time limit.
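
The success criterion above can be written as a short check. In this sketch the per-subtask flags, the elapsed time and the time limit are hypothetical inputs, and the required ordering of subtasks is assumed to be enforced by whatever sets the flags.

```python
def scenario_succeeded(subtask_done, elapsed_s, time_limit_s):
    """A scenario counts as successful only if every required subtask goal
    is satisfied and the episode finished within the time limit."""
    return all(subtask_done) and elapsed_s <= time_limit_s


def success_rate(outcomes):
    """Fraction of successful scenarios, given a list of booleans."""
    if not outcomes:
        raise ValueError("at least one trial is required")
    return sum(outcomes) / len(outcomes)
```

A single unsatisfied subtask goal, or a timeout, marks the whole scenario as a failure, so the reported rates are sequence-level rather than per-subtask.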

Limitations of monolithic networks in long-horizon tasks

ROMAN’s preliminary version in two dimensions consisted of five experts (Fig. 3c) and was thereafter scaled up to three dimensions consisting of seven experts (Figs. 1 and 3d). Consequently, in this section we compare ROMAN’s preliminary and final stages against two monolithic single NNs with an equivalent hybrid learning procedure (shown in Fig. 4a) for two and three dimensions, respectively. These baseline evaluations allowed a direct comparison of a monolithic versus a hierarchical approach, to evaluate and demonstrate the advantages of a hierarchical task decomposition with an identical learning procedure. The single NNs had states identical to those of ROMAN’s MN and actions identical to those of ROMAN’s experts. To conduct a fair comparison, a total of N = 100 and N = 140 demonstrations were provided to the single NNs, accounting for ROMAN’s 2D and 3D cases composed of five and seven experts pretrained with N = 20 demonstrations, respectively. Table 2 details the robustness of ROMAN’s 3D case, including individual expert success rates.

Table 2 Summary of the results evaluated on increasing levels of Gaussian noise and uncertainties from the vision system for each expert and the main MN in ROMAN

The results shown in Table 3 and Table 4, for the 2D and 3D cases of the monolithic NN, respectively, suggest that a single NN is unable to solve the complex nature and long sequential task of our validated manipulation scenario given the same training procedure. While in two dimensions the single NN attains high success rates to some extent, these remain substantially lower than ROMAN’s, especially in increasing time horizons (S3, S4 and S5). Extending the dimensionality to three dimensions reveals that a monolithic NN is mostly unable to attain robust performance (S3), exhibiting complete failure in longer and more complex sequential cases (S4 and beyond). These results highlight the value of a hierarchical task decomposition as with ROMAN’s architecture. For more details and expansion regarding the monolithic NNs, including their architecture and hyperparameters, please consult Supplementary Tables 14 and 15.

Table 3 Experimental results of ROMAN: success rates are compared between a single NN and ROMAN in two dimensions with five experts
Table 4 Experimental results of ROMAN: success rates across all seven scenarios, between a single NN and ROMAN in three dimensions

Validation against exteroceptive uncertainty

This section presents all the results tested in 3D space with seven experts, with details shown in Table 1, to study the domain of robotics with complex settings. While scaling up to three dimensions with seven experts, the first objective was to evaluate the robustness of the hierarchical framework against different levels of Gaussian-distributed exteroceptive noise on the position states. The rationale for introducing noise in the exteroceptive states was to thoroughly evaluate the robustness of the framework against uncertainties under realistic conditions, since such states are typically more prone to noise than proprioceptive states in robotic systems19.
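
A minimal sketch of this kind of perturbation is given below. It assumes that zero-mean Gaussian noise is added independently to each coordinate of an observed position and that σ denotes the standard deviation in centimetres; the function name and the example position are hypothetical.

```python
import random


def add_position_noise(position_cm, sigma_cm, rng):
    """Return a copy of a position with independent zero-mean Gaussian noise
    of standard deviation sigma_cm added to each coordinate."""
    return [coord + rng.gauss(0.0, sigma_cm) for coord in position_cm]


rng = random.Random(0)  # fixed seed so the example is repeatable
noisy = add_position_noise([10.0, 20.0, 30.0], sigma_cm=0.5, rng=rng)
```

Only the observation passed to the networks is perturbed in such a scheme; the underlying simulated object positions stay unchanged.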

Evaluation against increasing levels of Gaussian noise

First, we validated each expert’s individual robustness, which is critical before evaluating the MN’s performance during sequential activation, to avoid failures being caused by individual expert performance. This minimized the covariance between the success rates of each expert and that of the MN. Table 2 shows that all individual experts, even when presented with higher levels of noise, are resilient against the tested levels of uncertainty. It is worth noting that all picking experts were slightly more prone to errors due to their higher complexity, in line with refs. 44,53.

Next, we evaluated the MN’s performance in coordinating the different experts in the hierarchy of ROMAN. From the given seven experts, we tested seven different randomized case scenarios, where each scenario requires addition of another expert, making the overall tasks more complex. Results in Table 2 show robust performance to different noise levels. Although adding more experts increases the dimensionality of the problem, our results show that the MN is sufficiently resilient in the most complex settings in scenarios 6 and 7. However, there still is a performance drop in scenarios 3, 4 and 5 when compared with 6 and 7, which is discussed in Results.

Evaluation of vision system

The next objective was to test the robustness of ROMAN against exteroceptive uncertainties from a simulated vision system. ROMAN and its experts, including the MN, were not trained with this vision detection module, but rather directly evaluated on it to test the feasibility and robustness of the framework to such a vision-based detection system. More details of the vision system can be found in Methods.

The results in Table 2 show that using a pretrained vision-based object detection module produces high success rates even amongst the most complex sequential tasks. Despite a slight decrease in success rates as more sequences are added, ROMAN exhibits robustness to the vision system, sustaining high success levels. The decrease in success rates in S6 and, to a lesser extent, in S7 can be attributed to the unboxing subtask, which is more prone to visual occlusion (Fig. 1), and to the similarity in the exteroceptive observations later analysed in a t-distributed stochastic neighbour embedding (t-SNE; Fig. 3).

Ablation study on ROMAN’s default learning approach

The next validation entails a comparison with state-of-the-art learning paradigms, including HRL and HIL approaches, similar to related work12,44. ROMAN makes use of BC to warm-start the policy via supervised learning and thereafter uses intrinsic rI (IL: GAIL) and extrinsic rE (RL) rewards via PPO for training, and we conduct ablations to the training procedure by excluding at least one of the previous paradigms.
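
The way in which the two reward signals enter training, and how an ablation removes one of them, can be sketched as follows. The weighted sum and the unit weights are assumptions made for illustration; the actual weighting used in ROMAN is not specified in this section.

```python
def combined_reward(r_extrinsic, r_intrinsic, use_rl=True, use_gail=True,
                    w_extrinsic=1.0, w_intrinsic=1.0):
    """Weighted sum of extrinsic (task) and intrinsic (imitation) rewards.

    Switching off a flag mimics an ablation that drops that signal.
    The weights are hypothetical placeholders.
    """
    total = 0.0
    if use_rl:
        total += w_extrinsic * r_extrinsic
    if use_gail:
        total += w_intrinsic * r_intrinsic
    return total
```

For example, combined_reward(1.0, 0.5, use_gail=False) keeps only the extrinsic term, corresponding to an RL-only configuration.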

The ablation results in Table 5 show that the exclusive use of rE (RL) exhibited complete failure, suggesting that the high complexity of the tasks is unattainable via random exploration of the action space. Using the rI provided by GAIL alone, or coupling it with rE (RL and GAIL), showed substantially higher success rates, but only for S1–S3; longer-horizon tasks remained unattainable.

Table 5 Experimental results of ROMAN: success rates across all seven scenarios, between different comparisons of HRL, HIL and their combinations, used to train ROMAN

From the related work7,12,44, we summarize that training with BC alone appears to yield rapid performance degradation as the time horizon increases. This is in line with our results for both BC and RL,BC at σ = ±0.5 cm noise. While a notable boost in success rates is observed, longer sequential tasks such as S4–S7 (which exhibit higher variance in the trajectories visited due to compounding of errors throughout the trajectory) show lower performance when compared with that of ROMAN’s default learning. Despite BC being a simple yet effective algorithm, its performance is greatly affected when presented with out-of-distribution states, in line with refs. 36,45,54. To further test this finding, we evaluate both BC and RL,BC on increased levels of noise of σ = ±1.0 cm and σ = ±2.0 cm.

At a noise level of σ = ±1.0 cm, we observe a slight drop in success rates for both BC and RL,BC. ROMAN’s default settings still attain the highest success rates. At a higher level of σ = ±2.0 cm noise, we observe a notable drop in success rates for both BC and RL,BC. Employing BC at such levels of uncertainty further highlights its limitation, and adding rE produces slightly but not substantially higher success rates. In comparison, ROMAN’s success rates drop slightly when compared with previous levels of noise, but it still retains considerably higher degrees of resilience, highlighting the value of avoiding naively imitating demonstrations.

We conclude that the proposed HHL approach is advantageous in overcoming increasing exteroceptive uncertainties and the complexities associated with longer-time-horizon sequential tasks. Further, in Results, we demonstrate that the HHL architecture of ROMAN dynamically adapts to situations that were not encountered in the demonstrated sequence, and extends beyond the imitated behaviour during training. This is attributed to ROMAN’s balance between exploitation and exploration.

Effects of demonstrations

Finally, we compared the effect of different numbers of demonstrations on the overall performance of ROMAN. We analyse the effects of N = 7, N = 21 and N = 42 demonstrations on the success rates across all scenarios for the MN. Our results in Table 6 show that a relatively small number of demonstrations for the MN (N = 21, which corresponds to only N = 3 demonstrations for each of the seven subtasks) is sufficient to give a reasonable success rate. Doubling the number of demonstrations to N = 42 yields higher success rates than for N = 21, yet the difference is marginal. A one-shot demonstration of each scenario (that is, N = 7) did not yield acceptable success rates during complex sequences as shown in S4–S7. More details regarding the demonstrations can be found in Supplementary Table 7.

Table 6 Experimental results of ROMAN: success rates based on the number of demonstrations provided to the MN

Adaptation to recover from local minima

We observed that occasionally experts could fail in retaining a firm grasp, resulting in dropping a grasped object. As shown by the success rates, this occurred fairly infrequently and was primarily limited to experts with picking tasks. Further evaluation found that, when such rare expert-level failures occurred, the MN began to recognize the subtask state and gradually learned a new weight assignment until the tasks were successful. The use of the HHL approach, balancing exploitation and exploration, enabled a positive adaptation of the learning agent to commence a regrasp procedure, as shown in Fig. 2a,b.

Moreover, the MN in ROMAN learns the ability to recover from local minima by rapidly switching experts when it is necessary to do so. During the sequential activation of the seven experts, the robot gripper could occasionally become stuck under the cabinet while retrieving the rack. During such cases, the MN would activate other experts to alter the trajectory and move the gripper away from the cabinet until it was collision free, and then recommence the task successfully, as shown in Fig. 2c. This result highlights the value of combining the advantages of IL and RL paradigms and leveraging intrinsic and extrinsic rewards, resulting in a robust performance in cases not encountered in the demonstrations. Examples can be found in Supplementary Video 1.

t-SNE analysis of the similarity of sequences

To qualitatively study the MN’s ability to activate the necessary experts on the basis of its observations, we conducted dimensionality reduction via t-SNE. This allowed us to evaluate the similarities in the observations of the MN and its ability to distinguish between different scenarios. The t-SNE plots are shown in Fig. 3a,b.

First, we conducted a t-SNE on the MN observations at the beginning of each scenario to analyse similarities between the MN observations in different scenarios. As shown in Fig. 3a, scenarios S1–S7 differ to a great degree, although S3, S4 and S5 present a slight overlap with each other because the state vectors of these three are relatively similar. This may explain why the MN does not always activate the correct sequence, particularly at the beginning of a sequence when the end effector’s start position is randomized (as opposed to being the ending position of a previous subtask), leading to slightly lower success rates.

Second, we also conducted a t-SNE on the MN observations of each separate activation for every scenario studied. Figure 3b reveals the similarities in the MN observations throughout different expert activations in each of the seven case scenarios. By sampling within the sequence of actions, we obtain a low-dimensional projection of the trajectory of the MN observation vectors during the expert activations. In essence, this is due to the change in the spatial states of the objects in the scene and the end effector being in motion during the sequence of actions.

Overall, Fig. 3b shows no notable overlaps between the activations of the different experts within each scenario, and suggests that the MN is capable of distinctly activating experts during the subtask completion. Thus, regarding the decreased performance for S3, S4 and S5 observed in Tables 2–6, on the basis of the slight overlap between MN observations analysed in Fig. 3a, we conclude that the failures that account for the slight drop in performance occurred at the beginning of the sequences due to the randomized initialization.


Results of ROMAN and its implications

The hierarchical task decomposition of ROMAN allows for task-level experts to be trained to achieve robust performance in considerably complex sequential tasks. Hence, it enables the MN to focus on orchestrating these high-level experts, rather than low-level skills, thereby offloading unnecessary complexity from the MN. Our results show that ROMAN can orchestrate notably more complex sequential tasks of longer time horizons and higher dimensionalities than similar work in physics-based manipulation8,44,45.

Moreover, ROMAN’s HHL architecture (Fig. 4) achieved successful adaptation to non-encountered scenarios and recovery from local minima that were not explicitly demonstrated. Hence, the results suggest that, although IL is effective in providing a baseline, achieving a balance between imitating the demonstrations and maximizing the extrinsic RL reward through random exploration is crucial for successful adaptation beyond the demonstrated behaviours. This balance between exploration and exploitation provided by ROMAN also shares common ground with biological studies27,39.

Finally, results show that ROMAN’s central MN was able to solve the most complex and longest-horizon sequential manipulation tasks skilfully. Further investigation also found a performance drop in some of the tasks with lower complexity, such as S3–S5, compared with more complex ones such as S6 and S7. The t-SNE analysis concluded that this is primarily due to the difficulty for the MN to differentiate between those states at randomized initialization of tasks. Future work can explore more sensory feedback to differentiate ambiguous cases, or design a ‘memory’ mechanism by expanding the observation with history states.

Future work

Future work includes extending ROMAN to higher-dimensionality problems, such as multiexpert HL and bi-manual manipulation. Moreover, to enable real-world deployment in future work, a vision system for exteroceptive information would be needed to predict object poses—for example, using AprilTags, or segmenting/detecting objects using RGB/RGB-D cameras. Additionally, a dynamic grasping controller that incorporates force control could further enhance the grasping performance.


ROMAN is characterized by an HHL approach. In this architecture, multiple experts specialize in diverse and fundamental types of manipulation tasks that are activated, in the correct sequence, by a primary gating network, the MN. ROMAN is therefore validated across different types of manipulation tasks commonly seen in robotics and physics-based interactions.

System overview

We validate our architecture in a complex medical laboratory setting, to highlight our approach in a setting where manipulation typically consists of (1) careful handling of small objects, (2) the necessity to perform multiple tasks and (3) the correct sequence of tasks to complete a long and complex end goal. The environment was constructed so as to derive as many subtasks as possible and thereby validate our method. We used the seven-degree-of-freedom Franka Emika robot in simulation with its default gripper in 3D space, based entirely on physics-based interactions with the environment. The system overview, including the simulation environment and the overall depiction of the ROMAN framework, is shown in Fig. 4. An architecture overview of the incorporated NNs in the ROMAN framework, including their individual states, actions, number of demonstrations and training time, is given in Table 1. More details of the system and simulation overview, and incorporated software tools55, including the general apparatus, can be found in the Supplementary Notes and more specifically Supplementary Note 1.

Vision system

As part of our preliminary investigation, we implemented a vision system using an RGB camera in the simulation to predict the poses of the different objects of interest (OIs). The vision system implements an object detection and pose estimation module based on the VGG-16 backbone architecture56. The system was initialized with pretrained weights on the ImageNet dataset and fine-tuned using a custom dataset, which was created by capturing the OIs from the simulated environment, including both the segmentation and labelling of the OIs. The output of the network predicted the poses of all OIs, specifically their 3D positions.

The rationale for testing with a camera set-up was to validate ROMAN’s robustness in a realistic setting, where pose prediction errors and visual occlusions naturally occur. When the target objects were occluded, the last known position was provided to the gating network. Since the pretrained object detection module from the vision system attained variable levels of positional error56, we added increasing levels of simulated Gaussian-distributed noise to all exteroceptive observations of all NNs, so as to further test ROMAN’s capabilities beyond its robustness to a vision system, which is in line with related work8,45. Overall, by introducing exteroceptive uncertainties, we can further assess the resilience of our framework and highlight the importance of a hybrid learning approach within a hierarchical architecture for solving complex sequential tasks.
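The two perturbations described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function names, the flat list representation of an observation and the use of `None` to signal an occluded object are all assumptions made for the example.

```python
import random


def add_gaussian_noise(observation, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to each value.

    observation: list of floats (hypothetical flat exteroceptive observation)
    rng: a random.Random instance, passed in so that runs are reproducible
    """
    return [value + rng.gauss(0.0, sigma) for value in observation]


def hold_last_known(name, detection, last_known):
    """Return the detected position, or the last known one if occluded.

    detection: (x, y, z) tuple, or None when the object is occluded (assumed
               convention for this sketch)
    last_known: dict updated in place with the most recent valid detection
    """
    if detection is not None:
        last_known[name] = detection
    # Returns None if the object has never been detected.
    return last_known.get(name)
```

Sweeping `sigma` over increasing values would then correspond to the increasing noise levels mentioned above.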

Learning approach and preliminaries

We make use of two IL algorithms, GAIL36 and BC57. These two algorithms, coupled with the RL algorithm PPO20, allowed us to successfully and robustly imitate complex daily activity tasks for the purpose of autonomous robotic operation and physics-based interactions entailing multiple tasks. The hybrid learning procedure used for both the expert NNs and the MN in ROMAN is illustrated in Fig. 4a, while Fig. 4b depicts the hierarchical framework formation. In particular, the training procedure is composed of two stages: in stage one, the policy is warm-started using BC; in stage two, the policy is updated via the PPO algorithm with rE and rI stemming from the environment (RL) and from the discriminator network (GAIL), respectively.

BC (warm-starting the policy)

Foremost, to warm-start the policy, we used BC for a given number of initial epochs. The cutoff point for BC was determined via preliminary investigations and training sessions on the performance of the policy and the complexity of the sequential tasks. Notably, the cutoff point of BC was increased when transitioning from the 2D to 3D version of ROMAN to account for the increased complexity. We avoided using exclusively BC throughout the training process, so as to allow the agent to explore further samples and improve upon demonstrated behaviours, while keeping the demonstration dataset small12,36. This is because BC is limited in its ability to generalize to out-of-distribution states, and is thus restricted to the trajectories seen in the provided demonstrations36,58. Most notably, this can lead to drifting errors when the agent encounters new trajectories outside those in the demonstrations36,59. In line with previous work concerned with robotic manipulation, sole dependence on BC should be avoided, and instead a viable alternative is to add a reward term when computing a separate RL gradient that corresponds to the BC loss45. In our work, using a dataset of state and action transitions \(({s}_{t}^{\mathrm{d}},{a}_{t}^{\mathrm{d}})\) provided by the demonstrator, we implement BC by training an NN policy \(\pi ({s}_{t})={a}_{t}\) using supervised learning to minimize the mean squared error loss between \({a}_{t}^{\mathrm{d}}\) and \({a}_{t}\) for the demonstration dataset.
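As a minimal sketch of the BC objective described above, the following function computes the mean squared error between the actions predicted by a policy and the demonstrated actions. The `policy` callable and the list-of-pairs dataset layout are hypothetical placeholders; in practice the policy would be an NN and the loss would be minimized by gradient descent.

```python
def bc_loss(policy, demonstrations):
    """Mean squared error between policy actions and demonstrated actions.

    policy: callable mapping a state (list of floats) to an action (list of floats)
    demonstrations: list of (state, demonstrated_action) pairs
    """
    squared_error = 0.0
    count = 0
    for state, demo_action in demonstrations:
        predicted = policy(state)
        for a, a_d in zip(predicted, demo_action):
            squared_error += (a - a_d) ** 2
            count += 1
    return squared_error / count
```

A policy that reproduces every demonstrated action exactly yields a loss of zero, which is the target of the warm-start stage.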

GAIL (commenced after BC and active throughout)

To effectively match human demonstration data over a period, also known as a horizon, we made use of inverse RL and, in this case, GAIL36. GAIL was used after BC’s cutoff point, at which GAIL commenced and was active throughout training to attempt to minimize the divergence between the agent’s policy and that of the demonstrator. However, GAIL was not directly used to update policy parameters; we instead make use of a proxy imitation reward signal obtained by GAIL, described further in this section.

This is achieved by sampling a set of expert (τE) and agent (τA) trajectories of states and actions (st, at). The expert trajectories are sampled from a given demonstration dataset while the agent trajectories are sampled from a generative model also known as the generator (G). The generator, however, instead of being rewarded solely by the environment, is rewarded by a scalar score provided by the discriminator (D), implemented as a separate NN. In this process, the discriminator attempts to differentiate between the expert and agent trajectories, rewarding the generator if the divergence between these trajectories decreases. The discriminator is also trained to become ‘stricter’, resulting in the generator, that is, the agent, improving its performance at imitating and converging towards the behaviour that was demonstrated by the human expert. This can be formulated as follows:

$${E}_{{\tau }_{\mathrm{E}}}[\nabla \,{\mathrm{log}}(D({s}_{t},{a}_{t}))]+{E}_{{\tau }_{\mathrm{A}}}[\nabla \,{\mathrm{log}}(1-D({s}_{t},{a}_{t}))]$$

where \({E}_{{\tau }_{\mathrm{E}}}\) and \({E}_{{\tau }_{\mathrm{A}}}\) denote expectations over the expert and agent trajectories sampled during training, which are provided as inputs to the discriminator network (D). The discriminator outputs a continuous value between 0 and 1, with a value closer to 1 meaning that the agent or generator is producing a trajectory closer to that of the expert, essentially minimizing the divergence and maximizing the imitation. Hence, D can be used as a reward signal to train G to mimic the expert’s demonstrated data. Moreover, to allow the agent to further explore additional actions that can lead to improved performance when compared with what was demonstrated, we modify the above formulation for the discriminator to use only the states (st) and not the actions (at) of the demonstrated trajectories. In turn, this leads to increased exploration, which should encourage behaviours beyond those encountered in a demonstrated sequence when coupled with RL (more details are described in Results and Discussion).

Consequently, we reformulate Equation (1) as with ref. 60:

$${E}_{{\tau }_{\mathrm{E}}}[\nabla \,{\mathrm{log}}(D({s}_{t}))]+{E}_{{\tau }_{\mathrm{A}}}[\nabla \,{\mathrm{log}}(1-D({s}_{t}))].$$

Sampling only the states for GAIL allowed us to be less restrictive in terms of imitation. Discriminating on both states and actions between the demonstrator and the agent, as in the original formulation of GAIL36, would potentially have prevented the agent from exploring other actions, which may in actuality lead to better adaptation based on the state space and avoid a ‘naive’ copying of identical imitation.
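The state-only objective can be illustrated with a short sketch of the corresponding discriminator loss, written as the negative of the quantity whose gradient appears in the equation above, averaged over sampled states. The callable `d` stands in for the discriminator NN and `eps` is a small constant added for numerical safety; both are assumptions of this example rather than details taken from our implementation.

```python
import math


def discriminator_loss(d, expert_states, agent_states, eps=1e-8):
    """State-only GAIL discriminator loss (lower is better for D).

    d: callable mapping a state to a score in (0, 1)
    expert_states: states sampled from the demonstration trajectories
    agent_states: states sampled from the agent's own trajectories
    """
    expert_term = sum(math.log(d(s) + eps) for s in expert_states) / len(expert_states)
    agent_term = sum(math.log(1.0 - d(s) + eps) for s in agent_states) / len(agent_states)
    # D is trained to score expert states near 1 and agent states near 0.
    return -(expert_term + agent_term)
```

Because only states are passed to `d`, two trajectories that visit the same states through different actions are indistinguishable to the discriminator, which is what leaves room for exploration.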

Using the above two IL algorithms considerably reduced the demonstration dataset needed to train the agents successfully in complex long-horizon sequential tasks, compared with related work12,44.

RL (exploration beyond imitation)

In addition to the IL approaches mentioned above, we also made use of a small task-related extrinsic reward signal. We use extrinsic rewards to provide a small contribution towards the final policy to avoid exclusive dependence on pure imitation. As described below, we use intrinsic (from IL) as well as extrinsic task-related rewards to update the policy, with the IL reward being scaled by the highest weight and by extension being the main learning signal provider. Most notably, this HHL architecture showed the ability to adapt to new cases that were not encountered during demonstrations, and resilience in the presence of sensor uncertainty. Specifically, this allowed ROMAN to recover from local minima during the most complex sequence activation of experts, even when the sequence is not activated precisely or errors occur in individual experts. We chose PPO as our RL algorithm because it is robust and flexible across various hyperparameter settings.

Denoting our policy πθ as an NN parameterized by weights θ, the PPO update at step k is given by

$${\theta }_{k+1}=\arg \mathop{\max }\limits_{\theta }{{\mathbb{E}}}_{s,a \sim {\pi }_{{\theta }_{k}}}\left[L(s,a,{\theta }_{k},\theta )\right]$$

with a clipped loss function L(s, a, θk, θ) that has a surrogate term, a value term and an entropy term20.
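For reference, the surrogate term of the standard PPO clipped objective can be sketched for a single sample as below. The clipping range `epsilon=0.2` is an assumed illustrative default, not a hyperparameter reported here, and the value and entropy terms of the full loss are omitted.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO clipped surrogate term.

    ratio: probability ratio pi_theta(a|s) / pi_theta_k(a|s)
    advantage: advantage estimate for the (s, a) pair
    epsilon: clipping range (assumed value for this sketch)
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum removes the incentive to move the ratio outside
    # [1 - epsilon, 1 + epsilon] in the direction favoured by the advantage.
    return min(ratio * advantage, clipped_ratio * advantage)
```

Averaging this term over samples drawn from the current policy gives the surrogate part of the objective maximized in the update above.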

Integration of BC, GAIL and RL

To learn to solve long and complex sequential tasks using limited demonstration data, we integrate a set of algorithms for an effective balance between exploitation and exploration. While using BC, we perform supervised learning on the policy using the demonstrations as a dataset: that is, policy updates are driven by the mean squared error loss on the demonstration dataset. While using GAIL and/or RL, we use PPO as the general-purpose algorithm to perform policy updates. We then combine these methods by using different reward terms for rI and rE, where intrinsic rewards are provided by the discriminator score from GAIL, and extrinsic rewards are provided by the environment as per the RL formalism.

Regarding GAIL, as mentioned above, we modify the original framework to use only states in the discriminator, instead of states and actions, hence making use of Equation (2). We define the intrinsic reward term as rI = −log(1 − D(st)), where D(st) ∈ (0, 1); this acts as a proxy reward term that can be used by PPO to maximize the GAIL objective. When training with GAIL and RL, we use a linear combination of reward terms such that r = rIwI + rEwE, with wI and wE fixed scaling parameters for intrinsic and extrinsic rewards, respectively. Our HHL control policy focuses more on imitation, that is, on the intrinsic compared with the extrinsic rewards: the rI are several orders of magnitude larger than the rE (wI > wE). Using the latter reward combination, the returns are computed as the discounted sum of rewards, and are used for the PPO update on the policy as in Equation (3).
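The reward computation described above can be sketched in a few lines. The function names and the discount factor argument `gamma` are illustrative assumptions; the weights passed to `combined_reward` are placeholders and not the values used in training.

```python
import math


def intrinsic_reward(d_score):
    """Proxy imitation reward r_I = -log(1 - D(s_t)), with D(s_t) in (0, 1)."""
    return -math.log(1.0 - d_score)


def combined_reward(r_i, r_e, w_i, w_e):
    """Linear combination r = r_I * w_I + r_E * w_E."""
    return r_i * w_i + r_e * w_e


def discounted_returns(rewards, gamma):
    """Discounted sum of future rewards at every step of an episode."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns
```

The intrinsic reward grows without bound as the discriminator score approaches 1, so states that look increasingly expert-like are rewarded increasingly strongly.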

ROMAN’s robustness is attributed to the above-employed hybrid learning architecture and in particular the combination of (1) using BC up to a given epoch for warm-starting policy optimization, (2) thereafter using the intrinsic reward provided by GAIL to further minimize the divergence of the agent and that of the expert demonstrator and finally (3) the addition of an extrinsic reward term from the RL paradigm to allow the agent to explore further and beyond what was demonstrated.

The individual NN architecture of each expert and the MN (the gating network), and the hierarchical architecture, are depicted in Fig. 4a,b, respectively. Figure 4b illustrates the hierarchical formation of ROMAN and, more specifically, that the exteroceptive information provided to each NN from the environment is determined by the objective of each expert and the relevance of that information for the successful completion of the given subtask. In contrast, the MN observes the entirety of the environment.

Demonstration acquisition and settings

All demonstrations were provided via keybindings from a generic keyboard, as shown in Fig. 4a. The keyboard was used to provide two levels of demonstrations. First, demonstrations were provided to the expert NNs, with keybindings corresponding to the velocity control of the robotic end effector and the binary state of opening or closing the gripper. The expert NNs shared identical actions and specialized in different manipulation skills. Second, given the pretrained expert NNs, a demonstrated sequence for the MN was provided via a set of different keybindings corresponding to the weight assignment of the incorporated experts in the hierarchical architecture. Therefore, the expert demonstrations were specific to each individual expert’s specialized skill and goal, which allowed these pretrained networks to be coordinated by the MN’s demonstrated sequence for the sequential activation of the task.

Two cameras in an orthographic projection were rendered onto a 2D monitor, visually displaying the environment from an upper and side-view perspective that allowed the human expert to observe the task and behaviour. In such a simulated environment, the determination of depth-associated distances is rendered easier for the human demonstrator, as shown by previous work1,4,61.

These demonstrations were used to warm-start the policy via BC and for the discriminator of GAIL in the form of an intrinsic reward to the PPO algorithm. A total of N = 20 demonstrations were provided to pretrain the ROMAN expert NNs, and a total of N = 42 demonstrations were provided to the MN, corresponding to N = 6 for each of the seven sequential scenarios. Our technical approach of using a keyboard for generating demonstration data and imitating trajectories for the expert NNs and the MN, as well as the corresponding technical implementations, was not derived from our previous work or other published works.


The physics engine NVIDIA PhysX allowed us to devise numerous tasks all containing physical properties and advanced physical characteristics such as hinges, linearly moving objects and spring joints. The full task is visually illustrated in Fig. 1, with the full sequence decomposed into its relevant subtasks in Fig. 3d.

The task was conceived and inspired by a medical laboratory setting, where frequently encountered manipulation tasks often involve a varying and flexible number of sequences of differing types of subtasks. The objective here was to retrieve a small vial, insert it into a rack and push them all together onto a conveyor belt. Within this workflow, we further derived additional subtasks while ensuring their interdependence. All derived tasks are common in robotic manipulation and physics-based interactions5. With a total of seven experts as in Fig. 1, we derived seven sequence activation cases, referred to as scenarios. The numbering of scenarios also indicates how many experts are involved, as each sequence builds upon the previous one by adding a new task to it. Finally, the episode terminated once either (1) the button next to the conveyor belt was pushed, (2) the maximum step count for the episode was reached or (3) the end effector deviated too far from the centre of the scene.
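The three termination conditions listed above amount to a simple disjunction, sketched here. The argument names and the thresholds `max_steps` and `max_distance` are hypothetical; their values are not specified in this section.

```python
def episode_done(button_pushed, step_count, max_steps, effector_distance, max_distance):
    """Return True if any of the three termination conditions holds.

    button_pushed: whether the button next to the conveyor belt was pushed
    step_count, max_steps: current step and the per-episode step limit
    effector_distance, max_distance: end-effector distance from the scene
        centre and the allowed limit (hypothetical parameter names)
    """
    return (
        button_pushed
        or step_count >= max_steps
        or effector_distance > max_distance
    )
```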

Expert network characteristics and architecture

A complex, long-horizon sequential task in full 3D space is decomposed into fundamental and high-level types of manipulation skill, henceforth referred to as experts. This allowed the validation of the robustness of the architecture over increasing complexity, uncertainty and dimensionality. The manipulation experts are derived with diverse and distinct specialized skills to cover a broad range of common tasks in real-life and robotic manipulation3,38. We ensured that these experts were not too closely interrelated to one another, thereby offering greater versatility and flexibility while used in combination. The total number of trained expert NNs is seven, as shown in Fig. 1 and listed below.

  • Pull Open (open drawer) [πPull]: expert responsible for pulling a linearly moving object, such as a sliding drawer.

  • Pick and Drop (unbox) [πPickDrop]: expert responsible for picking and dropping an object without regard to height offset. This is commonly seen when removing the lid or the cover of a disposable box to retrieve an OI.

  • Rotate Open (open cabinet) [πRotateOpen]: expert responsible for rotating a door handle configured around a single axis, a very common scenario seen when opening a cabinet or rotating drawer.

  • Pick and Place (place rack) [πPickPlace]: expert responsible for picking and placing an object carefully (with zero or close to minimal height drop).

  • Pick and Insert (insert vial) [πPickInsert]: expert responsible for picking and inserting an object with high precision in a particular docking target location.

  • Push (push rack and vial) [πPush]: expert responsible for pushing an object over a surface.

  • Push Button (push button) [πButton]: expert responsible for pushing a human-made switch or button.

Action space

All aforementioned experts are listed in the form of high-level abstract manipulation type (specific task on validated environment). All experts shared identical actions, including full end-effector velocities (α1, ±vx; α2, ±vy; α3, ±vz), as well as controlling the gripper state (α4, fxg). Sharing the same action space across experts is relevant to highlight the value of our proposed hierarchical framework, as expert specialization is not aided by constraining the actions available to each expert to those that are only relevant for its respective specialization.

State space

The state space of each expert was identical for the proprioceptive and sensory states; however, it differed for the exteroceptive states depending on each specialized manipulation skill and the relevant information from the environment for the successful completion of each individual task. Consequently, the exteroceptive states were decided on the basis of the nature of each expert’s specialized skill and end goal. This allowed each NN to focus only on its own core exteroceptive information relevant to its subtask, and omit non-relevant ones—as seen from a neuroscientific perspective, whereby during the human decision-making process the relevance of information during a motor task is determined and specified62. The state and action spaces, including the demonstration settings and training times for each expert and the MN, are detailed in Table 1.

High-level task decomposition

The derived experts were composed in such a way as to allow a high-level task decomposition, thereby offloading the central MN from combining a large number of low-level action-based experts that can otherwise be solved by a single subtask-based expert. This was made possible by virtue of the employed hybrid learning procedure in the hierarchical architecture of ROMAN, which incorporates and orchestrates multiple NNs specialized in subtasks to efficiently and effectively solve complex manipulation tasks over a long time horizon.

In contrast, most related work decomposed manipulation experts into rather basic action-based primitives or action-level skills12,44. While this allows for the derivation of more abstract cases, it does limit the potential of a hierarchical model. In particular, a decomposition of low-level action-based skills prevents, to a great extent, the gating network from learning high-level scene understandings or solving complex sequences, as it focuses more on composing skills such as picking and placing, which can be instead solved by one single expert. For example, in ref. 44, the skill of the picking and placing task was learned using a three-expert hierarchical architecture composed of (1) approaching, (2) manipulating and (3) retracting. ROMAN’s framework shows that, by virtue of the employed hybrid learning approach, the derivation of picking and placing as a single high-level expert is made possible. This is how our HHL architecture overcomes such limitations by deriving experts specialized in high-level subtask-based manipulation skills, offloading the MN in turn from lower-level skill supervision.

Moreover, each single task-level expert trained via the employed hybrid learning procedure has its own inherent robustness in facing new states during the exploration of the RL process. This allowed the gating network to be trained more effectively in solving highly complex sequences over long time horizons, without the need to learn how to recombine primitive action-based experts to achieve a subtask.

From the results, we observed that when the MN was switching between the different experts there was a possibility of dropping the object when suddenly switching from any expert involved with picking to another. This is due to the limited control interface of the gripper, which provides only binary commands for opening and closing. To compensate for this, a dead zone (DZ) is introduced to account for the expert switching process. This relationship is shown as

$$\,{{\mbox{DZ}}}\,({x}_{\mathrm{g}})=\left\{\begin{array}{ll}\,{{\mbox{close}}}\,\quad &\,{{\mbox{if}}}\,\,{x}_{\mathrm{g}}\in [-1.0,-0.9],\\ \,{{\mbox{remain the same}}}\,\quad &\,{{\mbox{if}}}\,\,{x}_{\mathrm{g}}\in (-0.9,0.9),\\ \,{{\mbox{open}}}\,\quad &\,{{\mbox{if}}}\,\,{x}_{\mathrm{g}}\in [0.9,1.0].\end{array}\right.$$

Hence, the DZ (∈ (0, 1)) implementation improved the overall stability of grasping, by only switching the gripper action to open or close when xg goes beyond the zone of (−0.9, 0.9). As a possible future work, one could potentially substitute the DZ implementation and enhance grasping control by incorporating a dynamic controller with force control or tactile sensing to render grasping more stable and reliable.
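The DZ mapping can be written directly from its piecewise definition. In this sketch the gripper states are represented as the strings "open" and "close", and the previous state is passed in explicitly; both are choices made for illustration.

```python
def dead_zone(x_g, previous_state):
    """Map a continuous gripper command x_g in [-1, 1] to a binary gripper state.

    previous_state: "open" or "close", kept whenever x_g lies in (-0.9, 0.9)
    """
    if x_g <= -0.9:
        return "close"
    if x_g >= 0.9:
        return "open"
    # Inside the dead zone: keep whatever the gripper was doing.
    return previous_state
```

For example, a command that drifts from −1.0 to 0.3 while the MN hands over between experts leaves the gripper closed, so the grasped object is not released.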

MN characteristics

The MN acts as a master control policy, overseeing the expert NNs and assigning weights (∈ (0, 1)) to them. The final output is defined as the sum of these weighted actions:

$$\mathop{\sum }\limits_{i=1}^{m}\mathop{\sum }\limits_{j=1}^{n}{\alpha }_{i}{w}_{j}$$

whereby the number (m = 4) of all actions (αi) of each expert is controlled by a set of weights (wj) corresponding to the total number (n = 7) of experts in the hierarchy. One of the main issues in assigning the weights is to ensure the sum of all weights does not exceed unity, which can lead to unwanted behaviour, most notably torques and forces going beyond the robot’s capabilities.

Hence, we normalize the sum of weights assigned by the MN to activate experts, using a normalized exponential function, that is, softmax, which provides a probability distribution to better isolate the expert activation during long sequences. This is represented as

$$\sigma {({{{\bf{z}}}})}_{i}=\frac{{\mathrm{e}}^{{z}_{i}}}{\mathop{\sum }\nolimits_{j = 1}^{K}{\mathrm{e}}^{{z}_{j}}}\quad \,{{\mbox{for}}}\,\,i=1,\ldots ,K\,\,{{\mbox{and}}}\,\,{{{\bf{z}}}}=({z}_{1},\ldots ,{z}_{K})\in {{\mathbb{R}}}^{K}$$

where σ is the softmax function and z is the input vector; each output is the standard exponential of the corresponding input, \({\mathrm{e}}^{{z}_{i}}\), divided by the sum of the exponentials of all K inputs. In our case, the input vector is a weight vector with K = 7 elements, each representing the weight of one of the seven distinct experts.
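A minimal sketch of the gating step follows: the raw MN outputs are normalized with softmax, and the expert actions are then blended with the resulting weights. The per-dimension weighted sum in `blend_actions` is one plausible reading of the weighted-action formula above and is an assumption of this example, as are the function names.

```python
import math


def softmax(z):
    """Normalized exponential; the outputs are positive and sum to 1."""
    z_max = max(z)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(value - z_max) for value in z]
    total = sum(exps)
    return [e / total for e in exps]


def blend_actions(expert_actions, weights):
    """Weighted sum of expert actions, computed per action dimension.

    expert_actions: list of n action vectors, each of length m
    weights: list of n weights, one per expert
    """
    n_dims = len(expert_actions[0])
    return [
        sum(w * action[i] for w, action in zip(weights, expert_actions))
        for i in range(n_dims)
    ]
```

Because the softmax weights sum to exactly one, the blended action is a convex combination of the expert actions and cannot exceed the largest command issued by any single expert.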

The observation space of the MN contains the union of all of the observation spaces of each individual expert. Consequently, the observation spaces of the experts consist of the environment states that are relevant to their task, while the MN essentially observes the entirety of the relevant subtasks to better distinguish which expert should be activated at which time step. Figure 4 depicts the overall ROMAN framework, highlighting the MN as a gating mechanism that centrally governs the control policy in the HHL control framework.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.