In an influential 2019 essay entitled ‘The Bitter Lesson’, machine learning researcher Richard Sutton observed that the main driver of progress in artificial intelligence (AI) is the continued scaling up of computational power1. This view predicts that although manual approaches that embed human knowledge and understanding in AI agents lead to satisfying advances in the short term, in the long run they only stand in the way of developing more general, scalable methods. This provocative conclusion has led to heated debates about the role of human ingenuity, but the ‘bitter lesson’ paradigm has more or less played out in natural language processing. By training scaled-up neural networks on as many text examples from the internet as possible, researchers solved the previously hard problem of producing human language without syntactic errors. Further scaling has produced general-purpose and multimodal models with billions of parameters, such as GPT-4, Claude, Gemini and Llama, that have game-changing applications in science and society.

Perhaps the time has arrived for robotics to learn its own bitter lesson and to benefit from substantial scaling up and the incorporation of large amounts of training data. At the recent annual International Conference on Robotics and Automation (ICRA), several experts debated the statement “Generative AI will make a lot of traditional robotics approaches obsolete”. The field could certainly do with new ideas: after decades of painstaking computational development and engineering, robotics methods for perception, motion planning, reasoning, grasping, manipulation and human–robot interaction are far from being able to navigate the complex and unpredictable human world. Indeed, deep learning methods are starting to compete with traditional approaches in robot control and sensor data processing. The promise of large generative AI models that, with sufficient training data, can generalize to different tasks and situations is tantalizing.

However, gathering training data for robots is costly and slow. In the ICRA panel, Jeannette Bohg from Stanford University made a back-of-the-envelope estimate that, to match the amount of data available for natural language processing from the streams of images and text produced by internet users, robotics training data needs to scale up by a factor of 27 million. This sounds daunting, but Bohg pointed out that there is no fundamental obstacle to achieving this goal. Researchers can rise to the challenge and put substantial effort into gathering good-quality robotics data. Notably, a recent community effort named ‘Open X-Embodiment’ has produced a dataset covering 22 robots, 527 skills and 160,266 tasks, which seems a sizeable start.

However, the feasibility of ever gathering sufficient data to develop a general-purpose robotics model is questionable. The complexity of real-world interactions is enormous, and high standards in reliability and robustness are needed. A high zero-shot performance of 50% or even 75% is an impressive achievement in the laboratory setting, but unacceptable in real-world interactions. In the debate, Chad Jenkins from the University of Michigan highlighted the problem of reliability and trust: can we be sure that a general-purpose robotics model is really going to work when we need it to? Although it might not be disastrous if a chatbot hallucinates answers, machines operating in the real world, interacting with humans, need to be safe and reliable. In Jenkins’ view, robotics will always need to turn to models based on physical understanding of the world.

Elsewhere at ICRA, researchers are already exploring the feasibility of using large vision–language models for their robots. The initial results show a promising jump in capabilities and robustness in scene understanding, human–robot interaction and even action planning. Large vision–language models such as GPT-4 and Gemini have absorbed internet-scale amounts of data from human users, and can arguably replicate a type of ‘common sense’ practical knowledge of the world that can potentially be used in robotics. It is also clear that this type of common-sense knowledge comes with substantial reliability issues and differs from human understanding2. Even so, the semantic knowledge of everyday concepts that comes naturally to large vision–language models could already be harnessed in robotics for scene understanding and interactions with humans.

But the complex problems that come with acting in a dynamically changing world remain. How robots can physically interact with their environment depends on their bodies and on the actions available to them (also called affordances). A next step is highlighted in the ‘SayCan’ project at Google Research, in which the PaLM model is grounded in the affordances of real-world mobile robots3. A related research direction is to develop vision–language models with an advanced, physical common-sense understanding of the world. An essential ingredient is the curated collection of examples from videos, to better capture the physical properties of objects and the physical effects of manipulating them4,5.
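To make the idea of grounding a language model in affordances more concrete, the following is a minimal sketch in Python and not the SayCan implementation. It assumes that each candidate skill comes with two scores: one from a language model for how useful the skill is for the instruction, and one estimating how likely the robot is to succeed at it in its current situation. The robot then picks the skill with the highest product of the two. The `Skill` class, the `select_skill` function and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    """A candidate skill with two illustrative (made-up) scores."""
    name: str
    language_score: float    # assumed: how useful the language model rates the skill
    affordance_score: float  # assumed: how likely the robot is to succeed right now


def select_skill(skills):
    """Return the skill with the highest product of the two scores."""
    if not skills:
        raise ValueError("no candidate skills")
    return max(skills, key=lambda s: s.language_score * s.affordance_score)


# Hypothetical candidates for an instruction such as "clean up the spill".
candidates = [
    Skill("pick up the sponge", 0.60, 0.10),  # relevant, but no sponge within reach
    Skill("go to the counter", 0.30, 0.90),   # less direct, but feasible right now
    Skill("pick up the apple", 0.05, 0.80),   # feasible, but irrelevant
]

best = select_skill(candidates)
print(best.name)
```

In this toy example the most relevant skill is not selected, because the robot cannot currently carry it out; a feasible intermediate step wins instead. This is the sense in which the robot's body and situation constrain what the language model proposes.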

There is substantial momentum in robotics, with big tech backing start-ups and initiatives6, and no doubt robots will become more prominent in society, given improvements in hardware and computational efficiency and the current advances in AI. Designing robots that can safely and reliably operate in the real world remains a challenging problem, but large vision–language models and generative AI are injecting the field with fresh ideas.