The study of arguments has an academic pedigree stretching back to the ancient Greeks, and spans disciplines from theoretical philosophy to computational engineering. Developing computer systems that can recognize arguments in natural human language is one of the most demanding challenges in the field of artificial intelligence (AI). Writing in Nature, Slonim et al.¹ report an impressive development in this field: Project Debater, an AI system that can engage with humans in debating competitions. The findings showcase how far research in this area has come, and emphasize the importance of robust engineering that combines different components, each of which handles a particular task, in the development of technology that can recognize, generate and critique arguments in debates.
Less than a decade ago, the analysis of human discourse to identify the ways in which evidence is adduced to support conclusions (a process now known as argument mining²) was firmly beyond the capabilities of state-of-the-art AI. Since then, a combination of technical advances in AI and increasing maturity in the engineering of argument technology, coupled with intense commercial demand, has led to rapid expansion of the field. More than 50 laboratories worldwide are working on the problem, including teams at all the large software corporations.
One of the reasons for the explosion of work in this area is that direct application of AI systems that can recognize the statistical regularities of language use in large bodies of text has been transformative in many applications of AI (see ref. 3, for example), but has not, on its own, been as successful in argument mining. This is because argument structure is too varied, too complex, too nuanced and often too veiled to be recognized as easily as, say, sentence structure. Slonim et al. therefore decided to initiate a grand challenge: to develop a fully autonomous system that can take part in live debates with humans. Project Debater is the culmination of this work.
Project Debater is, first and foremost, a tremendous engineering feat. It brings together new approaches for harvesting and interpreting argumentatively relevant material from text with methods for repairing sentence syntax (which enable the system to redeploy extracted sentence fragments when presenting its arguments; the role of their syntax-repair technology is modestly underplayed by the authors). These components of the debater system are combined with information that was pre-prepared by humans, grouped around key themes, to provide knowledge, arguments and counterarguments about a wide range of topics. This knowledge base is supplemented with ‘canned’ text — fragments of sentences, pre-authored by humans — that can be used to introduce and structure a presentation during a debate.
Project Debater is extraordinarily ambitious, both as an AI system and as a grand challenge for AI as a field. As with almost all AI research that sets its sights so high, a key bottleneck is in acquiring enough data to be able to compute an effective solution to the set challenge⁴. Project Debater has addressed this obstacle using a dual-pronged approach: it has narrowed its focus to 100 or so debate topics; and it harvests its raw material from data sets that are large, even by the standards of modern language-processing systems.
In a series of outings in 2018 and 2019, Project Debater took on a range of talented, high-profile human debaters (Fig. 1), and its performance was informally evaluated by the audiences. Backed by its argumentation techniques and fuelled by its processed data sets, the system creates a 4-minute speech that opens a debate about a topic from its repertoire, to which a human opponent responds. It then reacts to its opponent’s points by producing a second 4-minute speech. The opponent replies with their own 4-minute rebuttal, and the debate concludes with both participants giving a 2-minute closing statement.
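The speech order described above can be written down as a small schedule. This is only an illustrative sketch, not anything taken from Project Debater itself: the names `Speech`, `DEBATE_FORMAT` and `minutes_by_speaker` are hypothetical, the length of the human's first response is not stated above and is assumed here to be 4 minutes, and the order of the two closing statements is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Speech:
    speaker: str  # "system" or "human"
    kind: str     # informal label for the speech
    minutes: int  # allotted speaking time


# Order and lengths follow the description in the text, with two ASSUMPTIONS:
# the human's first response is taken to last 4 minutes, and the system is
# taken to give its closing statement first.
DEBATE_FORMAT = [
    Speech("system", "opening", 4),
    Speech("human", "response", 4),        # assumed length
    Speech("system", "second speech", 4),
    Speech("human", "rebuttal", 4),
    Speech("system", "closing", 2),        # assumed order
    Speech("human", "closing", 2),
]


def minutes_by_speaker(schedule):
    """Sum the allotted minutes for each speaker in the schedule."""
    totals = {}
    for speech in schedule:
        totals[speech.speaker] = totals.get(speech.speaker, 0) + speech.minutes
    return totals


if __name__ == "__main__":
    for speaker, minutes in minutes_by_speaker(DEBATE_FORMAT).items():
        print(f"{speaker}: {minutes} minutes")
```

Under these assumptions each side speaks for the same total time, which is what makes an audience comparison of the two performances meaningful.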
Perhaps the weakest aspect of the system is that it struggles to emulate the coherence and flow of human debaters, a problem associated with the highest level at which its processing can select, abstract and choreograph arguments. Yet this limitation is hardly unique to Project Debater. The structure of argument is still poorly understood, despite two millennia of research. Depending on whether the focus of argumentation research is language use, epistemology (the philosophical theory of knowledge), cognitive processes or logical validity, the features that have been proposed as crucial for a coherent model of argumentation and reasoning differ wildly⁵.
Models of what constitutes good argument are therefore extremely diverse⁶, whereas models of what constitutes good debate amount to little more than formalized intuitions (although disciplines in which the goodness of debate is codified, such as law and, to a lesser extent, political science, are ahead of the game on this front). It is therefore no wonder that Project Debater’s performance was evaluated simply by asking a human audience whether they thought it was “exemplifying a decent performance”. For almost two thirds of the debated topics, the humans thought that it did.
A final challenge faced by all argument-technology systems is whether to treat arguments as local fragments of discourse influenced by an isolated set of considerations, or to weave them into the larger tapestry of societal-scale debates. To a large degree, this is about engineering the problem to be tackled, rather than engineering the solution. By placing a priori bounds on an argument, theoretical simplifications become available that offer major computational benefits. Identifying the ‘main claim’, for example, becomes a well-defined task that can be performed almost as reliably by machine as by humans⁷,⁸. The problem is that humans are not at all good at that task, precisely because it is artificially engineered. In open discussions, a given stretch of discourse might be a claim in one context and a premise in another.
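The point that one stretch of discourse can be a claim in one context and a premise in another can be made concrete with a toy graph of support relations. This is a minimal sketch under stated assumptions, not a description of how Project Debater or any argument-mining system represents arguments: the class `ArgumentGraph`, its methods and the example statements are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class ArgumentGraph:
    """Toy model: statements are nodes, 'supports' relations are directed edges."""

    def __init__(self):
        self._supports = defaultdict(set)      # premise -> conclusions it supports
        self._supported_by = defaultdict(set)  # conclusion -> premises offered for it

    def add_support(self, premise, conclusion):
        """Record that `premise` is offered as support for `conclusion`."""
        self._supports[premise].add(conclusion)
        self._supported_by[conclusion].add(premise)

    def roles(self, statement):
        """Return the roles a statement plays, derived from its edges alone."""
        roles = set()
        if self._supported_by.get(statement):
            roles.add("claim")    # something is offered in support of it
        if self._supports.get(statement):
            roles.add("premise")  # it is offered in support of something else
        return roles


if __name__ == "__main__":
    graph = ArgumentGraph()
    # Hypothetical statements, invented for this example.
    graph.add_support("Subsidies lower the cost of preschool",
                      "Preschool becomes more accessible")
    graph.add_support("Preschool becomes more accessible",
                      "Preschool should be subsidized")
    print(graph.roles("Preschool becomes more accessible"))
```

In this sketch a role is not a fixed label on a sentence but a consequence of where the sentence sits in the graph, so the middle statement comes out as both a claim and a premise. Bounding the argument in advance amounts to cutting the graph at some edge, which is what makes a single 'main claim' well defined.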
Moreover, in the real world, there are no clear boundaries that delimit an argument: discourses that happen beyond debating chambers are not discrete, but connect with a web of cross-references, analogy, exemplification and generalization. Ideas about how such an argument web might be tackled by AI have been floated in theory⁹ and implemented in software: a system called DebateGraph (see go.nature.com/30g2ym4), for example, is an Internet platform that provides computational tools for visualizing and sharing complex, interconnected networks of thought. However, the theoretical challenges and socio-technical issues associated with these implementations are formidable: designing compelling ways to attract large audiences to such systems is just as difficult as designing straightforward mechanisms that allow them to interact with these complex webs of argument.
Project Debater is a crucial step in the development of argument technology and in working with arguments as local phenomena. Its successes offer a tantalizing glimpse of how an AI system could work with the web of arguments that humans interpret with such apparent ease. Given the wildfires of fake news, the polarization of public opinion and the ubiquity of lazy reasoning, that ease belies an urgent need for humans to be supported in creating, processing, navigating and sharing complex arguments — support that AI might be able to supply. So although Project Debater tackles a grand challenge that acts mainly as a rallying cry for research, it also represents an advance towards AI that can contribute to human reasoning — and which, as Slonim et al. put it, pushes far beyond the comfort zone of current AI technology.
Nature 591, 373–374 (2021)
1. Slonim, N. et al. Nature 591, 379–384 (2021).
2. Lawrence, J. & Reed, C. Comput. Linguist. 45, 765–818 (2020).
3. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. Preprint at https://arxiv.org/abs/1810.04805 (2018).
4. Feigenbaum, E. A. Proc. 5th Int. Joint Conf. Artif. Intell. 1014–1029 (Morgan Kaufmann, 1977).
5. van Eemeren, F. H. et al. Handbook of Argumentation Theory (Springer, 2014).
6. Hahn, U. Trends Cogn. Sci. 24, 363–374 (2020).
7. Levy, R., Bilu, Y., Hershcovich, D., Aharoni, E. & Slonim, N. Proc. COLING 2014, 25th Int. Conf. Comput. Linguist. Tech. Pap. 1489–1500 (2014).
8. Trautmann, D., Daxenberger, J., Stab, C., Schütze, H. & Gurevych, I. Proc. AAAI Conf. Artif. Intell. 34, 9048–9056 (2020).
9. Rahwan, I., Zablith, F. & Reed, C. Artif. Intell. 171, 897–921 (2007).