Main

Good generalization, roughly defined as the ability to successfully transfer representations, knowledge and strategies from past experience to new experiences, is one of the primary desiderata for models of natural language processing (NLP), as well as for models in the wider field of machine learning1,2. For some, generalization is crucial to ensure that models behave robustly, reliably and fairly when making predictions about data different from the data on which they were trained, which is of critical importance when models are employed in the real world. Others see good generalization as intrinsically equivalent to good performance and believe that, without it, a model is not truly able to conduct the task we intended it to. Yet others strive for good generalization because they believe models should behave in a human-like way, and humans are known to generalize well. Although the importance of generalization is almost undisputed, systematic generalization testing is not the status quo in the field of NLP.

At the root of this problem lies the fact that there is little understanding and agreement about what good generalization looks like, what types of generalization exist, how those should be evaluated, and which types should be prioritized in varying scenarios. Broadly speaking, generalization is evaluated by assessing how well a model performs on a test dataset, given the relationship of this dataset with the data on which the model was trained. For decades, it was common to exert only one simple constraint on this relationship: that the train and test data are different. Typically, this was achieved by randomly splitting the available data into training and test partitions. Generalization was thus evaluated by training and testing models on different but similarly sampled data, assumed to be independent and identically distributed (i.i.d.). In the past 20 years, we have seen great strides on such random train–test splits in a range of different applications (for example, refs. 3,4).

With this progress, however, came the realization that, for an NLP model, reaching very high or human-level scores on an i.i.d. test set does not imply that the model robustly generalizes to a wide range of different scenarios. We have witnessed a tide of different studies pointing out generalization failures in neural models that have state-of-the-art scores on random train–test splits (as in refs. 5,6,7,8,9,10, to give just a few examples). Some show that when models perform well on i.i.d. test splits, they might rely on simple heuristics that do not robustly generalize in a wide range of non-i.i.d. scenarios8,11, over-rely on stereotypes12,13, or bank on memorization rather than generalization14,15. Others, instead, present cases in which performance drops when the evaluation data differ from the training data in terms of genre, domain or topic (for example, refs. 6,16), or when they represent different subpopulations (for example, refs. 5,17). Yet other studies focus on models’ inability to generalize compositionally7,9,18, structurally19,20, to longer sequences21,22 or to slightly different formulations of the same problem13.

By showing that good performance on traditional train–test splits does not equal good generalization, these examples bring into question what kind of model capabilities recent breakthroughs actually reflect, and they suggest that research on the evaluation of NLP models is catching up with the fast recent advances in architectures and training regimes. This body of work also reveals that there is no real agreement on what kind of generalization is important for NLP models, and how that should be studied. Different studies encompass a wide range of generalization-related research questions and use a wide range of different methodologies and experimental set-ups. As yet, it is unclear how the results of different studies relate to each other, which raises the question: how should generalization be assessed, if not with i.i.d. splits? How do we determine what types of generalization are already well addressed and which are neglected, or which types of generalization should be prioritized? Ultimately, on a meta-level, how can we provide answers to these important questions without a systematic way to discuss generalization in NLP? These missing answers are standing in the way of better model evaluation and model development—what we cannot measure, we cannot improve.

Here, within an initiative called GenBench, we introduce a new framework to systematize and understand generalization research in an attempt to provide answers to the above questions. We present a generalization taxonomy, a meta-analysis of 543 papers presenting research on generalization in NLP, a set of online tools that can be used by researchers to explore and better understand generalization studies through our website—https://genbench.org—and we introduce GenBench evaluation cards that authors can use to comprehensively summarize the generalization experiments conducted in their papers. We believe that state-of-the-art generalization testing should be the new status quo in NLP, and we aim to lay the groundwork for facilitating that.

The GenBench generalization taxonomy

The generalization taxonomy we propose—visualized in Fig. 1 and compactly summarized in Extended Data Fig. 2—is based on a detailed analysis of a large number of existing studies on generalization in NLP. It includes the five main axes that capture different aspects along which generalization studies differ. Together, they form a comprehensive picture of the motivation and goal of the study and provide information on important choices in the experimental set-up. The taxonomy can be used to understand generalization research in hindsight, but is also meant as an active device for characterizing ongoing studies. We facilitate this through GenBench evaluation cards, which researchers can include in their papers. They are described in more detail in Supplementary section B, and an example is shown in Fig. 2. In the following, we give a brief description of the five axes of our taxonomy. More details are provided in the Methods.

Fig. 1: Graphical representation of our proposed taxonomy of generalization in NLP.

The generalization taxonomy we propose consists of five different (nominal) axes that describe (1) the high-level motivation of the work, (2) the type of generalization the test is addressing, (3) what kind of data shift occurs between training and testing and (4) what the source and (5) locus of this shift are. NP, noun phrase; VP, verb phrase; PP, prepositional phrase.

Fig. 2: Example of a GenBench evaluation card.

This example GenBench evaluation card describes a hypothetical paper with three different experiments. As can be seen in the first two rows, all experiments are practically motivated and test different types of generalization: cross-task generalization (square), cross-lingual generalization (triangle) and cross-domain generalization (circle). To do so, they use different data shifts and different loci. The task generalization experiment (square) involves a label shift from pretrain to test, the domain-generalization experiment (circle) a covariate shift in the finetuning stage, and the cross-lingual experiment (triangle) considers multiple shifts (covariate and label) across different stages of the modelling pipeline (pretrain–train and finetune train–test). All experiments use naturally occurring shifts. The LaTeX code for this card was generated with the generation tool at https://genbench.org/eval_cards.

Motivation

The first axis of our taxonomy describes the high-level motivation for the study. The motivation of a study determines what type of generalization is desirable, as well as what kind of conclusions can be drawn from a model’s display or lack of generalization. Furthermore, the motivation of a study shapes its experimental design. It is therefore important for researchers to be explicitly aware of it, to ensure that the experimental set-up aligns with the questions they seek to answer. We consider four different types of motivation: practical, cognitive, intrinsic, and fairness and inclusivity.

Generalization type

The second axis in our taxonomy indicates the type of generalization the test is addressing. This axis describes on a high level what the generalization test is intended to capture, rather than considering why or how, making it one of the most important axes of our taxonomy. In the literature, we have found six main types of generalization: compositional generalization, structural generalization, cross-task generalization, cross-lingual generalization, cross-domain generalization and robustness generalization. Figure 1 (top right) further illustrates these different types of generalization.

Shift type

The third axis in our taxonomy describes what kind of data shift is considered in the generalization test. This axis derives its importance from the fact that data shift plays an essential formal role in defining and understanding generalization from a statistical perspective, as well as from the fact that different types of shift are best addressed with different kinds of experimental set-up. On the data shift axis we consider three shifts, which are well-described in the literature: covariate, label and full shift. We further include assumed shift to denote studies that assume a data shift without properly justifying it. In our analysis, we mark papers that consider multiple shifts between different distributions involved in the training and evaluation process as having multiple shifts.

Shift source

The fourth axis of the taxonomy characterizes the source of the data shift used in the experiment. The source of the data shift determines how much control the experimenter has over the training and testing data and, consequently, what kind of conclusions can be drawn from an experiment. We distinguish four different sources of shift: naturally occurring shifts, artificially partitioned natural corpora, generated shifts and fully generated data. Figure 1 further illustrates these different types of shift source.

Shift locus

The last axis of our taxonomy considers the locus of the data shift, which describes between which of the data distributions involved in the modelling pipeline a shift occurs. The locus of the shift, together with the shift type, forms the last piece of the puzzle, as it determines what part of the modelling pipeline is investigated and thus the kind of generalization question that can be asked. On this axis, we consider shifts between all stages in the contemporary modelling pipeline—pretraining, training and testing—as well as studies that consider shifts between multiple stages simultaneously.
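
To make the structure of the taxonomy concrete, the sketch below encodes the five axes and one hypothetical experiment annotation as a small Python record. The class and value names are our own illustration, not an official GenBench schema, and the axis values are abridged from Fig. 1.

```python
# Illustrative sketch only: a hypothetical encoding of the five taxonomy axes.
# Axis values are abridged from Fig. 1; names are ours, not an official schema.
from dataclasses import dataclass
from enum import Enum

class Motivation(Enum):
    PRACTICAL = "practical"
    COGNITIVE = "cognitive"
    INTRINSIC = "intrinsic"
    FAIRNESS = "fairness and inclusivity"

class GeneralizationType(Enum):
    COMPOSITIONAL = "compositional"
    STRUCTURAL = "structural"
    CROSS_TASK = "cross-task"
    CROSS_LINGUAL = "cross-lingual"
    CROSS_DOMAIN = "cross-domain"
    ROBUSTNESS = "robustness"

class ShiftType(Enum):
    COVARIATE = "covariate"
    LABEL = "label"
    FULL = "full"
    ASSUMED = "assumed"
    MULTIPLE = "multiple"

class ShiftSource(Enum):
    NATURAL = "naturally occurring"
    PARTITIONED_NATURAL = "artificially partitioned natural data"
    GENERATED_SHIFT = "generated shift"
    FULLY_GENERATED = "fully generated"

class ShiftLocus(Enum):
    TRAIN_TEST = "train-test"
    FINETUNE_TRAIN_TEST = "finetune train-test"
    PRETRAIN_TRAIN = "pretrain-train"
    PRETRAIN_TEST = "pretrain-test"
    MULTIPLE = "multiple"

@dataclass
class GeneralizationExperiment:
    """One experiment annotated along the five GenBench axes."""
    motivation: Motivation
    generalization_type: GeneralizationType
    shift_type: ShiftType
    shift_source: ShiftSource
    shift_locus: ShiftLocus

# Hypothetical example: a practically motivated cross-domain experiment with a
# covariate shift between naturally occurring corpora, in a train-test set-up.
experiment = GeneralizationExperiment(
    Motivation.PRACTICAL, GeneralizationType.CROSS_DOMAIN,
    ShiftType.COVARIATE, ShiftSource.NATURAL, ShiftLocus.TRAIN_TEST)
```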

A review of generalization research in NLP

Using our generalization taxonomy, we analysed 752 generalization experiments in NLP, presented in a total of 543 papers from the anthology of the Association for Computational Linguistics (ACL) that have the (sub)words ‘generali(sz)ation’ or ‘generali(sz)e’ in their title or abstract. Aggregate statistics on how many such papers we found across different years are available in Fig. 3. For details on how we selected and annotated the papers, see Supplementary section A. A full list of papers is provided in Supplementary section G, as well as on our website (https://genbench.org). On the same website, we also present interactive ways to visualize the results, a search tool to retrieve relevant citations, and a means to generate GenBench evaluation cards, which authors can add to their paper (or appendix) to comprehensively summarize the generalization experiments they conducted (for more information, see Supplementary section B). In this section, we present the main findings of our analysis.
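
The selection criterion above can be illustrated with a minimal keyword filter. This is only a sketch of the criterion, not the script used for the review, and the input format (a list of records with title and abstract fields) is hypothetical.

```python
# Minimal sketch of the keyword criterion: papers whose title or abstract
# contains 'generalisation/generalization' or 'generalise/generalize' (as
# substrings). The input record format is hypothetical.
import re

PATTERN = re.compile(r"generali[sz](ation|e)", flags=re.IGNORECASE)

def matches_criterion(paper: dict) -> bool:
    text = f"{paper.get('title', '')} {paper.get('abstract', '')}"
    return PATTERN.search(text) is not None

papers = [
    {"title": "On compositional generalisation in seq2seq models", "abstract": ""},
    {"title": "A new dependency parser", "abstract": "We generalize prior work."},
    {"title": "Improving machine translation", "abstract": "No relevant keyword."},
]
selected = [p for p in papers if matches_criterion(p)]  # keeps the first two
```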

Fig. 3: Papers about generalization in the ACL anthology.

Visualization of the number of papers in the ACL anthology that contain the (sub)words ‘generalisation’, ‘generalization’, ‘generalise’ or ‘generalize’ in their title or abstract, over time, in absolute terms (left), as a percentage (middle) and compared to all papers (right). We see how both the absolute number of papers and the percentage of papers about generalization have starkly increased over time. On the right, we visualize the total number of papers and generalization papers published each year.


Overall trends on different axes

We begin by discussing the overall frequency of occurrence of different categories on the five axes, without taking into account interactions between them. We plot the relative frequencies of all axis values in Fig. 4 and their development over time in Fig. 5. Because the number of generalization papers retrieved from before 2018 is very low (Fig. 3a), we restricted the diachronic plots to the last five years. All other reported statistics are computed over our entire selection of papers.

Fig. 4: Relative occurrences of axes values across all analysed papers.

Visualization of the percentage of times each axis value occurs, across all papers that we analysed. Starting from the top left, shown clockwise, are the motivation, the generalization type, the shift source, the shift type and the shift locus.


Fig. 5: Relative occurrences of different motivations, shift types and shift loci over time.

Trends from the past five years for three of the taxonomy’s axes (motivation, shift type and shift locus), normalized by the total number of papers annotated per year.


Motivations

As we can see in Fig. 4 (top left), by far the most common motivation to test generalization is the practical motivation. The intrinsic and cognitive motivations follow, and the studies in our Analysis that consider generalization from a fairness perspective make up only 3% of the total. In part, this final low number could stem from the fact that our keyword search in the anthology was not optimal for detecting fairness studies (further discussion is provided in Supplementary section C). We invite researchers to suggest other generalization studies with a fairness motivation via our website. However, we also speculate that only relatively recently has attention started to grow regarding the potential harmfulness of models trained on large, uncontrolled corpora and that generalization has simply not yet been studied extensively in the context of fairness. Overall, we see that trends on the motivation axis have experienced small fluctuations over time (Fig. 5, left) but have been relatively stable over the past five years.

Generalization type

We find that cross-domain is the most frequent generalization type, making up more than 30% of all studies, followed by robustness, cross-task and compositional generalization (Fig. 4). Structural and cross-lingual generalization are the least commonly investigated. Similar to fairness studies, cross-lingual studies could be undersampled because they tend to use the word ‘generalization’ in their title or abstract less frequently. However, we suspect that the low number of cross-lingual studies is also reflective of the English-centric disposition of the field. We encourage researchers to suggest, via our website, cross-lingual generalization papers that we may have missed, so that we can better estimate to what extent cross-lingual generalization is, in fact, understudied.

Shift type

Data shift types (Fig. 4) are very unevenly distributed over their potential axis values: the vast majority of generalization research considers covariate shifts. This is, to some extent, expected, because covariate shifts are more easily addressed by most current modelling techniques and can occur between any two stages of the modelling pipeline, whereas label and full shifts typically only occur between pretraining and finetuning. More unexpected, perhaps, is the relatively high number of assumed shifts, which indicate studies that claim to test generalization but do not explicitly consider how their test data relate to their training data. The percentage of such assumed shifts has increased over the past few years (Fig. 5, middle). We hypothesize that this trend, which signals a movement of the field in the wrong direction, is predominantly caused by the use of increasingly large, general-purpose training corpora. Such large corpora, which are often not in the public domain, make it very challenging to analyse the relationship between the training and testing data and, consequently, make it hard to determine what kind of conclusions can be drawn from evaluation results. More promising, instead, is the fact that several studies consider multiple shifts, thus assessing generalization throughout the entire modelling pipeline.

Shift source

On the shift source axis (Fig. 4) we see that almost half of the reviewed generalization studies consider naturally occurring shifts: natural corpora that are not deliberately split along a particular dimension. As discussed later in this section, this type of data source is most prevalent in cross-task and cross-domain generalization studies, for which such naturally different corpora are widely available. The next most frequent categories are generated shifts, where one of the datasets involved is generated with a specific generalization property in mind, and artificially partitioned natural data, describing settings in which all data are natural but the way they are split between train and test is controlled. Fully generated datasets are less common, making up only 10% of the total number of studies.

Shift locus

Finally, for the locus axis (Fig. 4), we see that the majority of cases focus on finetune/train–test splits. Far fewer studies focus on shifts between pretraining and training or pretraining and testing. Similar to the previous axis, we observe that a comparatively small percentage of studies considers shifts in multiple stages of the modelling pipeline. At least in part, this might be driven by the larger amount of compute that is typically required for those scenarios. Over the past five years, however, the percentages of studies considering multiple loci and the pretrain–test locus—the two least frequent categories—have increased (Fig. 5, right).

Interactions between axes

Next we consider interactions between different axes. Are there any combinations of axes that occur together very often or combinations that are instead rare? We discuss a few relevant trends and encourage the reader to explore these interactions dynamically on our website.

What data shift source is used for different generalization types?

In Fig. 6 (top left), we show the relative frequency of each shift source per generalization type. We can see that the shift source varies widely across different types of generalization. Compositional generalization, for example, is predominantly tested with fully generated data, a data type that hardly occurs in research considering robustness, cross-lingual or cross-task generalization. Those three types of generalization are most frequently tested with naturally occurring shifts or, in some cases, with artificially partitioned natural corpora. Structural generalization is the only generalization type that appears to be tested across all different data types. As far as we are aware, very few studies exist that directly compare results between different sources of shift—for example, to investigate to what extent results on generated shifts or fully generated data are indicative of performance on natural corpora (such as refs. 23,24). Such studies could provide insight into how choices in the experimental design impact the conclusions that are drawn from generalization experiments, and we believe that they are an important direction for future work.

Fig. 6: Interactions between the various axes of our taxonomy.

The interaction between occurrences of values on various axes of our taxonomy, shown as heatmaps. The heatmaps are normalized by the total row value to facilitate comparisons between rows. Different normalizations (for example, to compare columns) and interactions between other axes can be analysed on our website, where figures based on the same underlying data can be generated.
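
As a purely illustrative aside, the row normalization described above amounts to dividing each cell of the co-occurrence matrix by its row total; the sketch below uses invented counts, not the data underlying Fig. 6.

```python
# Illustrative row normalization of a co-occurrence matrix between two axes
# (rows: generalization types, columns: shift sources); the counts are made up.
import numpy as np

counts = np.array([
    [ 5,  8, 12, 40],   # e.g. compositional
    [60, 20, 15,  5],   # e.g. cross-domain
])
row_normalized = counts / counts.sum(axis=1, keepdims=True)
# Each row now sums to 1, so cells are comparable within a row, as in Fig. 6.
```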


For which loci of shift are different generalization types studied?

Another interesting question to ask is for which locus different generalization types are considered (Fig. 6, top right). We observe that only cross-task generalization is frequently investigated in the pretrain–train and pretrain–test stages. For all other types of generalization, the vast majority of tests are conducted in the train–test or finetune train–test stage. In some cases, these differences are to be expected: as general-purpose pretrained models are usually trained on very large, relatively uncontrolled corpora, investigating how they generalize to a different domain without further finetuning is hardly possible, and neither is evaluating their robustness, which typically also requires more detailed knowledge of the training data. The statistics also confirm the absence of studies that consider compositional generalization from pretraining to finetuning or from pretraining to training, which is philosophically and theoretically challenging in such set-ups because of their all-encompassing training corpora and the fact that in (large) language models, form and meaning are conflated in one space. A final observation is the relative underrepresentation of studies with multiple loci across all generalization types, especially given the large number of studies that consider generalization in the finetuning stage or with the pretrain–train locus. Those studies have included multiple training stages but considered generalization in only one of them. We hope to see this trend change in the future, with more studies considering generalization in the entire modelling pipeline.

Which types of data shift occur across different loci?

Another interesting interaction is the one between the shift locus and the data shift type. Figure 6 (centre left) shows that assumed shifts mostly occur in the pretrain–test locus, confirming our hypothesis that they are probably caused by the use of increasingly large, general-purpose training corpora. When such pretrained models are further finetuned, experiments often consider either a shift between pretraining and finetuning where new labels are introduced, or a covariate shift in the finetuning stage; as such, they do not require an in-depth understanding of the pretraining corpus. The studies that do investigate covariate or full shifts with a pretrain–train or pretrain–test locus are typically not studies considering large language models, but instead multi-stage processes for domain adaptation.

How does motivation drive generalization research?

To discuss the relationship between the motivation behind a study and the other axes, we focus on its interactions with generalization type, shift locus and shift source, as shown in the bottom right half of Fig. 6. Considering first the relationship between motivation and generalization type (Fig. 6, centre right), we see that cross-domain, robustness, cross-task and cross-lingual generalizations are predominantly motivated by practical considerations; robustness generalization studies are also frequently motivated by an interest in understanding how models work intrinsically. We find that compositional and structural generalization studies are both frequently driven by cognitive motivations, which is to be expected given the importance of these concepts in human cognition and intelligence (for example, ref. 25). The motivation given most frequently for compositional generalization, however, is a practical one. Although in human learning, compositionality is indeed often associated with important practical properties—speed of learning, powerful generalization—as far as we know, there is little empirical evidence that compositional models actually perform better on natural language tasks. A similar apparent mismatch can be observed in Fig. 6 (bottom right) when focusing on the practical motivation. Practical generalization tests are typically aimed at improving models or at being directly informative of a model’s applicability. Nonetheless, more than 20% of the practically motivated studies use either artificially partitioned natural data or even fully generated data. To what extent could their conclusions then actually be informative of models applied in practical scenarios? These apparent mismatches between the motivation and the experimental set-up illustrate the importance of the motivation axis in our taxonomy—being aware of and explicit about a study’s motivation ensures that its conclusions are indeed informative with respect to the underlying research question.

Another interesting observation that can be made from the interactions between motivation and shift locus is that the vast majority of cognitively motivated studies are conducted in a train–test set-up. Although there are many good reasons for this, conclusions about human generalization are drawn from a much more varied range of ‘experimental set-ups’. For example, any experiments done with adults can be thought of as more similar to tests with a finetune train–test or pretrain–test locus than to the train–test locus, as adults have a life-long experience over which the experimenter has little control beyond participant selection. On the one hand, this suggests that generalization with a cognitive motivation should perhaps be evaluated more often with those loci. On the other hand, it raises the question of whether the field could take inspiration from experiments on human generalization for the challenging effort of evaluating the generalization of large language models, trained on uncontrolled corpora, in a pretrain–test setting. Although there are, of course, substantial differences between the assumptions that can reasonably be made about the linguistic experiences of a human and the pretraining of a language model, we still believe that input from experts who have extensively considered human generalization would be beneficial to improve generalization testing in these more challenging set-ups.

Conclusion

In this Analysis we have presented a framework to systematize and understand generalization research. The core of this framework consists of a generalization taxonomy that can be used to characterize generalization studies along five dimensions. This taxonomy, which is designed based on an extensive review of generalization papers in NLP, can be used to critically analyse existing generalization research as well as to structure new studies. The five nominal axes of the taxonomy describe why a study is executed (the main motivation of the study), what the study intends to evaluate (the type of generalization it addresses) and how the evaluation is conducted (the type of data shift considered, the source of this data shift, and the locus in which the shift is investigated).

To illustrate the use and usefulness of our taxonomy, we analysed 543 papers from the ACL anthology about generalization. Through our extensive analysis, we demonstrated that the taxonomy is applicable to a wide range of generalization studies and were able to provide a comprehensive map of the field, observing overall patterns and making suggestions for areas that should be prioritized in the future. Our most important conclusions and recommendations are as follows:

  • The goal of a study is not always perfectly aligned with its experimental design. We recommend that future work should be more explicit about motivations and should incorporate deliberate assessments to ensure that the experimental set-up matches the goal of the study (for example, with the GenBench evaluation cards, as discussed in Supplementary section B).

  • Cross-lingual studies and generalization studies motivated by fairness and inclusivity goals are underrepresented. We suggest that these areas should be given more attention in future work.

  • Papers that target similar generalization questions vary widely in the type of evaluation set-up they use. The field would benefit from more meta-studies that consider how the results of experiments with different experimental paradigms compare to one another.

  • The vast majority of generalization studies focus on only one stage of the modelling pipeline. More work is needed that considers generalization in all stages of training, to prioritize models whose generalizing behaviour persists throughout their training pipeline.

  • Recent popular NLP models that can be tested directly for their generalization from pretraining to testing are often evaluated without considering the relationship between the (pre)training and test data. We advise that this should be improved, and that inspiration might be taken from how generalization is evaluated in experiments with human participants, where control and access to the ‘pretraining’ data of a participant are unattainable.

Along with this Analysis we also launch a website, with (1) a set of visualization tools to further explore our results; (2) a search tool that allows researchers to find studies with specific features; (3) a contributions page, allowing researchers to register new generalization studies; and (4) a tool to generate GenBench evaluation cards, which authors can use in their articles to comprehensively summarize their generalization experiments. Although the review and conclusions presented in this Analysis are necessarily static, we commit to keeping the entries on the website up to date when new papers on generalization are published, and we encourage researchers to engage with our online dynamic review by submitting both new studies and existing studies we might have missed. By providing a systematic framework and a toolset that allow for a structured understanding of generalization, we have taken the necessary first steps towards making state-of-the-art generalization testing the new status quo in NLP. In Supplementary section E, we further outline our vision for this, and in Supplementary section D, we discuss the limitations of our work.

Methods

In this Analysis we propose a novel taxonomy to characterize research that aims to evaluate how (well) NLP models generalize, and we use this taxonomy to analyse over 500 papers in the ACL anthology. In this section, we describe the five axes that make up the taxonomy: motivation, generalization type, shift type, shift source and shift locus. A list of examples for every axis value is provided in Supplementary section C. More details about the procedure we used to annotate papers are available in Supplementary section A.

Motivation—what is the high-level motivation for a generalization test?

The first axis we consider is the high-level motivation or goal of a generalization study. We identified four closely intertwined goals of generalization research in NLP, which we refer to as the practical motivation, the cognitive motivation, the intrinsic motivation and the fairness motivation. The motivation of a study determines what type of generalization is desirable, shapes the experimental design, and affects which conclusions can be drawn from a model’s display or lack of generalization. It is therefore crucial for researchers to be explicit about the motivation underlying their studies, to ensure that the experimental set-up aligns with the questions they seek to answer. We now describe the four motivations we identified as the main drivers of generalization research in NLP.

Practical

One frequent motivation to study generalization is of a markedly practical nature. Studies that consider generalization from a practical perspective seek to assess in what kind of scenarios a model can be deployed, or which modelling changes can improve performance in various evaluation scenarios (for example, ref. 26). We provide further examples of research questions with a practical nature in Supplementary section C.

Cognitive

A second high-level motivation that drives generalization research is a cognitive one, which can be separated into two underlying categories. The first category is related to model behaviour and focuses on assessing whether models generalize in human-like ways. Human generalization is a useful reference point for the evaluation of models in NLP because it is considered to be a hallmark of human intelligence (for example, ref. 25) and, perhaps more importantly, because it is precisely the type of generalization that is required to successfully model natural language. The second, more deeply cognitively inspired category embraces work that evaluates generalization in models to learn more about language and cognition (for example, ref. 27). Studies in this category investigate what underlies generalization in computational models, not to improve the models’ generalization capabilities, but to derive new hypotheses about the workings of human generalization. In some cases, it might be difficult to distinguish cognitive from practical motivations: a model that generalizes like a human should also score well on practically motivated tests, which is why the same experiments can be motivated in multiple ways. In our axes-based taxonomy, rather than assuming certain experiments come with a fixed motivation, we rely on motivations provided by the authors.

Intrinsic

A third motivation to evaluate generalization in NLP models, which cuts through the two previous motivations, pertains to the question of whether models learned the task we intended them to learn, in the way we intended the task to be learned. We call this motivation the intrinsic motivation. The shared presupposition underpinning this type of research is that if a model has truly learned the task it is trained to do, it should also be able to execute this task in settings that differ from the exact training scenarios. What changes, across studies, is the set of conditions under which a model is considered to have appropriately learned a task. Some examples are provided in Supplementary section C. In studies that consider generalization from this perspective, generalization failures are taken as proof that the model did not—in fact—learn the task as we intended it to learn it (for example, ref. 28).

Fairness and inclusivity

A last but important motivation for generalization research is the desire to have models that are fair, responsible and unbiased, which we denote together as the fairness and inclusivity motivation. One category of studies driven by these concepts, often ethical in nature, asks questions about how well models generalize to diverse demographics, typically considering minority or marginalized groups (for example, ref. 5), or investigates to what extent models perpetuate (undesirable) biases learned from their training data (for example, ref. 17). Another line of research related to both fairness and inclusivity focuses on efficiency, both in terms of the amount of data that is required for a model to converge to a solution as well as the necessary amount of compute. In such studies, efficiency is seen as a correlate of generalization: models that generalize well should learn more quickly and require less data (for example, ref. 29). As such, they are more inclusively applicable—for instance to low-resource languages or demographic groups for which little data are available—they are more accessible for groups with smaller computational resources, and they have a lower environmental impact (for example, ref. 30).

Generalization type—what type of generalization is a test addressing?

The second axis in our taxonomy describes, on a high level, what type of generalization a test is intended to capture, making it an important axis of our taxonomy. We identify and describe six types of generalization that are frequently considered in the literature.

Compositional generalization

The first prominent type of generalization addressed in the literature is compositional generalization, which is often argued to underpin humans’ ability to quickly generalize to new data, tasks and domains (for example, ref. 31). Although it has a strong intuitive appeal and clear mathematical definition32, compositional generalization is not easy to pin down empirically. Here, we follow Schmidhuber33 in defining compositionality as the ability to systematically recombine previously learned elements to map new inputs made up from these elements to their correct output. For an elaborate account of the different arguments that come into play when defining and evaluating compositionality for a neural network, we refer to Hupkes and others34.
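
As a purely illustrative toy example of this definition (in the spirit of synthetic compositionality benchmarks, but with made-up data and rules), the sketch below shows a test input that recombines elements seen separately during training.

```python
# Toy illustration of compositional generalization (hypothetical data):
# the test input recombines elements ('jump' and 'twice') that were each seen
# during training, but never together.
train_data = {
    "walk": "WALK",
    "jump": "JUMP",
    "walk twice": "WALK WALK",
}
test_data = {
    "jump twice": "JUMP JUMP",  # the correct output follows from recombination
}

def interpret(command: str, lexicon: dict) -> str:
    """A compositional rule: 'X twice' maps to the interpretation of X, repeated."""
    if command.endswith(" twice"):
        base = command[: -len(" twice")]
        return f"{lexicon[base]} {lexicon[base]}"
    return lexicon[command]

lexicon = {"walk": "WALK", "jump": "JUMP"}
assert interpret("jump twice", lexicon) == test_data["jump twice"]
```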

Structural generalization

A second category of generalization studies focuses on structural generalization—the extent to which models can process or generate structurally (grammatically) correct output—rather than on whether they can assign correct interpretations to those outputs. Some structural generalization studies focus specifically on syntactic generalization; they consider whether models can generalize to novel syntactic structures or novel elements in known syntactic structures (for example, ref. 35). A second category of structural generalization studies focuses on morphological inflection, a popular testing ground for questions about human structural generalization abilities. Most of this work considers i.i.d. train–test splits, but recent studies have focused on how morphological transducer models generalize across languages (for example, ref. 36) as well as within each language37.

Cross-task generalization

A third direction of generalization research considers the ability of individual models to adapt to multiple NLP problems—cross-task generalization. Cross-task generalization in NLP has traditionally been strongly connected to transfer and multitask learning38, in which the goal was to train a network from scratch on multiple tasks at the same time, or to transfer knowledge from one task to another. In that formulation, it was deemed an extremely challenging topic. This has changed with the relatively recent trend of models that are first pretrained with a general-purpose, self-supervised objective and then further finetuned, potentially with the addition of task-specific parameters that learn to execute different tasks using the representations that emerged in the pretraining phase. Rather than evaluating how learning one task can benefit another, this pretrain–finetune paradigm instead gives a central role to the question of how well a model that has acquired some general knowledge about language can successfully be adapted to different kinds of tasks (for example, refs. 4,39), with or without the addition of task-specific parameters.

Cross-lingual generalization

The fourth type of generalization we include is generalization across languages, or cross-lingual generalization. Research in NLP has been very biased towards models and technologies for English40, and most of the recent breakthroughs rely on amounts of data that are simply not available for the vast majority of the world’s languages. Work on cross-lingual generalization is thus important for the promotion of inclusivity and democratization of language technologies, as well as from a practical perspective. Most existing cross-lingual studies focus on scenarios where labelled data are available in a single language (typically English) and the model is evaluated in multiple languages (for example, ref. 41). Another way in which cross-lingual generalization can be evaluated is by testing whether multilingual models perform better than monolingual models on language-specific tasks as a result of being trained on multiple languages at the same time (for example, ref. 42).

Generalization across domains

The next category we include is generalization across domains, a type of generalization that is often required in naturally occurring scenarios—more so than the types discussed so far—and thus carries high practical relevance. Although there is no precise definition of what constitutes a domain, the term broadly refers to collections of texts exhibiting different topical and/or stylistic properties, such as different genres or texts with varying formality levels. In the category of domain generalization, we also include temporal generalization, where the training data are produced in a specific time period and the model is tested on data from a different time period, either in the future or in the past (for example, ref. 43). In the literature, cross-domain generalization has often been studied in connection with domain adaptation—the problem of adapting an existing general model to a new domain (for example, ref. 44).

Robustness generalization

The last category of generalization research we consider on the generalization type axis is robustness generalization, which concerns models’ ability to learn task solutions that abstract away from spurious correlations that may occur in the training data and that are aligned with the underlying generalizing solution that humans associate with the task (for example, ref. 28). Research on robustness generalization usually focuses on data shifts that stem from varying data collection processes, which are generally unintended and can be hard to spot. Current work therefore focuses on characterizing such scenarios and understanding their impact. Many of these studies show that models do not generalize in the way we would expect them to, because the training data was in some subtle manner not representative of the true task distribution. For example, they may focus on how models generalize in the face of annotation artefacts (for example, ref. 45), across static and non-static splits (for example, ref. 46) and when certain demographics are under- or over-represented in the training data (for example, ref. 17).
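
To illustrate the kind of failure such studies probe, the sketch below uses entirely made-up data in which a superficial cue (an exclamation mark) correlates with the positive label in training but not in the shifted test set; a heuristic that exploits the cue succeeds on the training data and fails on the test data.

```python
# Hypothetical illustration of a spurious correlation: in the training data,
# '!' co-occurs with the positive label; in the test set that correlation breaks.
train_data = [
    ("great movie !", "positive"),
    ("loved every minute !", "positive"),
    ("boring and predictable", "negative"),
    ("a waste of time", "negative"),
]
test_data = [
    ("what a disaster !", "negative"),   # cue present, label negative
    ("quietly brilliant", "positive"),   # cue absent, label positive
]

def cue_classifier(sentence: str) -> str:
    """A heuristic that exploits the artefact rather than solving the task."""
    return "positive" if "!" in sentence else "negative"

train_acc = sum(cue_classifier(x) == y for x, y in train_data) / len(train_data)
test_acc = sum(cue_classifier(x) == y for x, y in test_data) / len(test_data)
print(train_acc, test_acc)  # 1.0 on the training data, 0.0 on the shifted test data
```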

Shift type—what kind of data shift is considered?

We have seen that generalization tests differ in terms of their motivation and the type of generalization that they target. What they share, instead, is that they all focus on cases in which there is a form of shift between the data distributions involved in the modelling pipeline. In the third axis of our taxonomy, we describe the ways in which two datasets used in a generalization experiment can differ. This axis adds a statistical dimension to our taxonomy and derives its importance from the fact that data shift plays an essential role in formally defining and understanding generalization from a statistical perspective.

We formalize the differences between the test, training and potentially pretraining data involved in generalization tests as shifts between the respective data distributions:

$$p\left(\mathbf{x}_{\mathrm{tst}},\,\mathbf{y}_{\mathrm{tst}}\right)\qquad \mathtt{test} \tag{1}$$
$$p\left(\mathbf{x}_{\mathrm{tr}},\,\mathbf{y}_{\mathrm{tr}}\right)\qquad \mathtt{training/finetuning/adaptation} \tag{2}$$
$$p\left(\mathbf{x}_{\mathrm{ptr}},\,\mathbf{y}_{\mathrm{ptr}}\right)\qquad \mathtt{pretraining} \tag{3}$$

These data distributions can be expressed as the product of the probability of the input data p(x) and the conditional probability of the output labels given the input data p(y|x):

$$p(\mathbf{x},\,\mathbf{y}) = p(\mathbf{x})\,p(\mathbf{y}\,|\,\mathbf{x}) \tag{4}$$

This allows us to define four main types of relation between two data distributions, depending on whether the distributions differ in terms of p(x), p(y|x), both or neither. Note that, for clarity, we focus on train–test shifts, as this is the most intuitive setting, but the shift types we describe in this section can be used to characterize the relationship between any two data distributions involved in a modelling pipeline. One of the four shift types constitutes the case in which there is no shift in data distributions—both p(xtr) = p(xtst) and p(ytr|xtr) = p(ytst|xtst). This matches the i.i.d. evaluation set-up traditionally used in machine learning. As discussed earlier, this type of evaluation, also referred to as within-distribution generalization, has often been reported not to be indicative of good performance for the more complex forms of generalization that we often desire from our models. We will not discuss this further here, but instead focus on the other three cases, commonly referred to as out-of-distribution (o.o.d.) shifts. In the following, we discuss the shift types we include in our taxonomy.

Covariate shift

The most commonly considered data distribution shift in o.o.d. generalization research is the one where p(xtst) ≠ p(xtr) but p(ytst|xtst) = p(ytr|xtr). In this scenario, often referred to as the covariate shift47,48, the distribution of the input data p(x) changes, but the conditional probability of the labels given the input—which describes the task—remains the same. Under this type of shift, one can evaluate if a model has learned the underlying task distribution while only being exposed to p(xtr, ytr).

Label shift

The second type of shift corresponds to the case in which the focus is on the conditional output distributions: p(ytst|xtst) ≠ p(ytr|xtr). We refer to this case as the label shift. Label shift can happen within the same task when there are inter-annotator disagreements, when there is a temporal shift in the data, or a change of domain (for example, the phrase ‘it doesn’t run’ can lead to different sentiment labels depending on whether it appears in a review for software or one for mascara). Label shift also occurs when there is a change in task, where it may even happen that not only the meaning of the labels, but the labels themselves change, for example, when shifting from language modelling (where the set of labels is the language vocabulary) to part-of-speech (POS) tagging.

Full shift

The most extreme type of shift corresponds to the case in which p(x) and p(y|x) change simultaneously: p(xtst) ≠ p(xtr) and p(ytst|xtst) ≠ p(ytr|xtr). We refer to this case as full shift. Full shifts may occur in language modelling tasks, where changes in p(x) directly translate into changes in p(y|x), when adapting to new language pairs in multilingual experiments (for example, ref. 49) or when entirely different types of data are used either for pretraining (for example, ref. 50) or for evaluation (for example, ref. 51).
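
To make the three o.o.d. shift types and the factorization in equation (4) concrete, the sketch below constructs toy sentiment distributions with invented probabilities in which only p(x) changes (covariate shift), only p(y|x) changes (label shift) or both change (full shift).

```python
# Toy discrete distributions illustrating the three o.o.d. shift types.
# All probabilities are invented for illustration only.

def joint(p_x, p_y_given_x):
    """p(x, y) = p(x) p(y | x), as in equation (4)."""
    return {(x, y): px * py
            for x, px in p_x.items()
            for y, py in p_y_given_x[x].items()}

# Training distribution: software reviews, where "it doesn't run" is negative.
p_x_train = {"it doesn't run": 0.5, "works flawlessly": 0.5}
p_y_given_x_train = {
    "it doesn't run": {"negative": 0.9, "positive": 0.1},
    "works flawlessly": {"negative": 0.1, "positive": 0.9},
}

# Covariate shift: p(x) changes, the labelling rule p(y|x) stays the same.
p_x_covariate = {"it doesn't run": 0.1, "works flawlessly": 0.9}
p_y_given_x_covariate = p_y_given_x_train

# Label shift: same inputs, but p(y|x) changes (mascara reviews, where
# "it doesn't run" is now a compliment).
p_x_label = dict(p_x_train)
p_y_given_x_label = {
    "it doesn't run": {"negative": 0.1, "positive": 0.9},
    "works flawlessly": {"negative": 0.1, "positive": 0.9},
}

# Full shift: both p(x) and p(y|x) change simultaneously.
p_x_full = p_x_covariate
p_y_given_x_full = p_y_given_x_label

p_train_joint = joint(p_x_train, p_y_given_x_train)  # the training p(x, y)
```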

Assumed shift

When classifying shifts in our Analysis, we mainly focus on cases where authors explicitly consider the relationship between the data distributions they use in their experiments, and the assumptions they make about this relationship are either well-grounded in the literature (for example, it is commonly assumed that switching between domains constitutes a covariate shift) or empirically verified. Nevertheless, we identify numerous studies that claim to be about generalization where such considerations are absent: it is assumed that there is a shift between train and test data, but this is not verified or grounded in previous research. We include this body of work in our Analysis and denote this type of shift with the label ‘assumed shift’.

Multiple shifts

Note that some studies consider shifts between multiple distributions at the same time, for instance to investigate how different types of pretraining architecture generalize to o.o.d. splits in a finetuning stage52 or which pretraining method leads to better cross-domain generalization in a second training stage53. In the GenBench evaluation cards, both these shifts can be marked (Supplementary section B), but for our analysis in this section, we aggregate those cases and mark any study that considers shifts in multiple different distributions as multiple shift.

Shift source—how are the train and test data produced?

We have discussed what types of shift may occur in generalization tests. We now focus on how those shifts originated. Our fourth axis, graphically shown in Fig. 1, concerns the source of the differences occurring between the pretraining, training and test data distributions. The source of the data shift determines how much control an experimenter has over the training and testing data and, consequently, what kind of conclusions can be drawn from a generalization experiment.

To formalize the description of these different sources of shift, we consider the unobserved base distribution, which describes all data considered in an experiment:

$$p\left(\mathbf{x}_{\mathrm{base}},\,\mathbf{y}_{\mathrm{base}},\,\boldsymbol{\tau}\right)\qquad \mathtt{base} \tag{5}$$

In this equation, the variable τ represents a data property of interest, with respect to which a specific generalization ability is tested. This can be an observable property of the data (for example, the length of an input sentence), an unobservable property (for example, the timestamp that defines when a data point was produced) or even a property relative to the model (architecture) under investigation (for example, τ could represent how quickly a data point was learned in relation to the overall model convergence). The base distribution over x, y and τ can be used to define different partition schemes to be adopted in generalization experiments. Formally, such a partitioning scheme is a rule \(f:\mathcal{T}\to \{\mathtt{true},\,\mathtt{false}\}\) that discriminates data points according to a property \(\boldsymbol{\tau}\in \mathcal{T}\). To investigate how a partitioning scheme impacts model behaviour, the pretraining, training and test distributions can be defined as

$$p\left(\mathbf{x}_{\mathrm{ptr}},\,\mathbf{y}_{\mathrm{ptr}}\right)=p\left(\mathbf{x}_{\mathrm{base}},\,\mathbf{y}_{\mathrm{base}}\,|\,f_{\mathtt{pretrain}}(\boldsymbol{\tau})=\mathtt{true}\right) \tag{6}$$
$$p\left(\mathbf{x}_{\mathrm{tr}},\,\mathbf{y}_{\mathrm{tr}}\right)=p\left(\mathbf{x}_{\mathrm{base}},\,\mathbf{y}_{\mathrm{base}}\,|\,f_{\mathtt{train}}(\boldsymbol{\tau})=\mathtt{true}\right) \tag{7}$$
$$p\left(\mathbf{x}_{\mathrm{tst}},\,\mathbf{y}_{\mathrm{tst}}\right)=p\left(\mathbf{x}_{\mathrm{base}},\,\mathbf{y}_{\mathrm{base}}\,|\,f_{\mathtt{test}}(\boldsymbol{\tau})=\mathtt{true}\right) \tag{8}$$

Using these data descriptions, we can now discuss four different sources of shifts.

Naturally occurring shifts

The first type of shift we include comprises naturally occurring shifts: shifts that arise naturally between two corpora. In this case, both data partitions of interest are naturally occurring corpora, to which no systematic operations are applied. For the purposes of a generalization test, experimenters have no direct control over the partitioning scheme f(τ). In other words, the variable τ refers to properties that naturally differ between collected datasets.

Artificially partitioned natural data

A slightly less natural set-up is one in which a naturally occurring corpus is considered, but it is artificially split along specific dimensions. In our taxonomy, we refer to these with the term ‘partitioned natural data’. The primary difference from the previous category is that the variable τ refers to data properties along which data would not naturally be split, such as the length or complexity of a sample. Experimenters thus have no control over the data itself, but they control the partitioning scheme f(τ).
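
A minimal sketch of such a controlled partition, assuming the property τ is sentence length and using made-up sentences: the corpus itself is left untouched and only the rule f(τ) that assigns examples to train or test is chosen by the experimenter.

```python
# Hypothetical partitioning of a natural corpus along sentence length:
# f(tau) sends short sentences to training and long ones to the test set.
corpus = [
    "the cat sleeps",
    "dogs bark loudly",
    "the old man who lives next door waters his plants every single morning",
]

def f_train(sentence: str, threshold: int = 6) -> bool:
    """Partitioning rule f over the property tau = sentence length in tokens."""
    return len(sentence.split()) <= threshold

train = [s for s in corpus if f_train(s)]       # short sentences
test = [s for s in corpus if not f_train(s)]    # held-out long sentences
```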

Generated shifts

The third category concerns cases in which one data partition is a fully natural corpus and the other partition is designed with specific properties in mind, to address a generalization aspect of interest. We call these generated shifts. Data in the constructed partition may avoid or contain specific patterns (for example, ref. 18), violate certain heuristics (for example, ref. 8) or include unusually long or complex sequences (for example, ref. 54), or it may be constructed adversarially, generated either by humans55 or automatically using a specific model (for example, ref. 56).

Fully generated

The last possibility is to use fully generated data. Generating data is often the most precise way of measuring specific aspects of generalization, as experimenters have direct control over both the base distribution and the partitioning scheme f(τ). Sometimes the data involved are entirely synthetic (for example, ref. 34); other times they are templated natural language or a very narrow selection of an actual natural language corpus (for example, ref. 9).

Locus of shift—between which data distributions does the shift occur?

The four axes that we have discussed so far demonstrate the depth and breadth of generalization evaluation research, and they also clearly illustrate that generalization is evaluated in a wide range of different experimental set-ups. They describe high-level motivations, types of generalization, data distribution shifts used for generalization tests, and the possible sources of those shifts. What we have not yet explicitly discussed is between which data distributions those shifts can occur—the locus of the shift. In our taxonomy, the shift locus forms the last piece of the puzzle, as it determines what part of the modelling pipeline is investigated and, with that, what kind of generalization questions can be answered. We consider shifts between all stages in the contemporary modelling pipeline—pretraining, training/finetuning and testing—as well as studies that consider shifts between multiple stages at the same time, as expressed by the data distributions that we have considered (for a graphical representation, see Extended Data Fig. 1).

We describe the loci of shift and how they interact with different components of the modelling pipeline with the aid of three modelling distributions. These modelling distributions correspond to the previously described stages—testing a model, training it, and potentially pretraining it:

$$p\left(\mathcal{Y}_{\mathrm{tst}}\,|\,\mathcal{X}_{\mathrm{tst}},\,\boldsymbol{\theta}^{*}\right)\qquad \mathtt{model} \tag{9}$$
$$p\left(\boldsymbol{\theta}^{*}\,|\,\mathcal{X}_{\mathrm{tr}},\,\mathcal{Y}_{\mathrm{tr}},\,\boldsymbol{\phi}_{\mathrm{tr}},\,\hat{\boldsymbol{\theta}}\right)\qquad \mathtt{training/finetuning/adaptation} \tag{10}$$
$$p\left(\hat{\boldsymbol{\theta}}\,|\,\mathcal{X}_{\mathrm{ptr}},\,\mathcal{Y}_{\mathrm{ptr}},\,\boldsymbol{\phi}_{\mathrm{pr}},\,\boldsymbol{\theta}_{0}\right)\qquad \mathtt{pretraining} \tag{11}$$

In these equations, ϕ broadly denotes the training and pretraining hyperparameters, θ refers to the model parameters, and \(\mathcal{X},\,\mathcal{Y}\) indicate sets of inputs (x) and their corresponding output (y). Equation (9) defines a model instance, which specifies the probability distribution over the target test labels \(\mathcal{Y}_{\mathrm{tst}}\), given the model’s parameters θ* and a set of test inputs \(\mathcal{X}_{\mathrm{tst}}\). Equation (10) defines a training procedure, by specifying a probability distribution over model parameters \(\boldsymbol{\theta}^{*}\in \mathbb{R}^{d}\) given a training dataset \(\mathcal{X}_{\mathrm{tr}},\,\mathcal{Y}_{\mathrm{tr}}\), a set of training hyperparameters \(\boldsymbol{\phi}_{\mathrm{tr}}\) and a (potentially pretrained) model initialization \(\hat{\boldsymbol{\theta}}\). Finally, equation (11) defines a pretraining procedure, specifying a conditional probability over the set of parameters \(\hat{\boldsymbol{\theta}}\), given a pretraining dataset, a set of pretraining hyperparameters \(\boldsymbol{\phi}_{\mathrm{pr}}\) and a model initialization. Between which of these stages a shift occurs impacts which modelling distributions can be evaluated. We now discuss the different potential loci of shifts.
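
Purely as an illustration of how the three modelling distributions chain together, the sketch below mirrors equations (9)-(11) as three placeholder functions; the names and signatures are our own and do not correspond to any existing library, and each pair of datasets passed to different stages marks a potential locus of shift.

```python
# Schematic sketch of the modelling pipeline in equations (9)-(11).
# All function bodies are placeholders; names and signatures are illustrative only.

def pretrain(pretrain_data, phi_pr, theta_0):
    """Equation (11): returns pretrained parameters theta_hat."""
    theta_hat = theta_0  # placeholder for a real pretraining procedure
    return theta_hat

def train(train_data, phi_tr, theta_hat):
    """Equation (10): returns trained/finetuned parameters theta_star."""
    theta_star = theta_hat  # placeholder for a real (fine)tuning procedure
    return theta_star

def evaluate(model_params, test_data):
    """Equation (9): the model distribution over test labels, here a stub."""
    return [None for _ in test_data]

# Potential loci of shift are the pairs of datasets fed to these stages:
#   pretrain_data vs train_data  -> pretrain-train locus
#   pretrain_data vs test_data   -> pretrain-test locus (train_data may be empty)
#   train_data    vs test_data   -> (finetune) train-test locus
theta_hat = pretrain(pretrain_data=[], phi_pr={}, theta_0={})
theta_star = train(train_data=[], phi_tr={}, theta_hat=theta_hat)
predictions = evaluate(theta_star, test_data=[])
```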

The train–test locus

Probably the most commonly occurring locus of shift in generalization experiments is the train–test locus, corresponding to the classic set-up where a model is trained on some data and then directly evaluated on a shifted (o.o.d.) test partition. In some cases, researchers investigate the generalization abilities of a single model instance (that is, a set of parameters θ*, as described in equation (9)). Studies of this type therefore report the evaluation of a model instance—typically made available by others—without considering how exactly it was trained, or how that impacted the model’s generalization behaviour (for example, ref. 57). Alternatively, researchers might evaluate one or more training procedures, investigating whether they result in model instances that generalize well (for example, ref. 58). Although these cases also require evaluating model instances, the focus of the evaluation is not on one particular model instance, but rather on the procedure that generated the evaluated model instances.

The finetune train–test locus

The second potential locus of shift—the finetune train–test locus—instead considers data shifts between the train and test data used during finetuning and thus concerns models that have gone through an earlier stage of training. This locus occurs when a model is evaluated on a finetuning test set that contains a shift with respect to the finetuning training data. Most frequently, research with this locus focuses on the finetuning procedure and on whether it results in finetuned model instances that generalize well on the test set. Experiments evaluating o.o.d. splits during finetuning often also include a comparison between different pretraining procedures; for instance, they compare how BERT models and RoBERTa models behave during finetuning, thus investigating both a pretrain–train shift and a finetune train–test shift at the same time.

The pretrain–train locus

A third possible locus of shift is the pretrain–train locus, between pretraining and training data. Experiments with this locus evaluate whether a particular pretraining procedure (equation (11)) results in models (parameter sets \(\hat{\boldsymbol{\theta}}\)) that are useful when further trained on different tasks or domains (for example, ref. 59).

The pretrain–test locus

Finally, experiments can have a pretrain–test locus, where the shift occurs between pretraining and test data. This locus occurs when a pretrained model is evaluated directly on o.o.d. data, without further training (that is, \(\mathcal{X}_{\mathrm{tr}},\,\mathcal{Y}_{\mathrm{tr}}=\emptyset,\,\emptyset\))—as frequently happens in in-context learning set-ups (for example, ref. 60)—or when a pretrained model is finetuned on examples that are i.i.d. with respect to the pretraining data and then tested on out-of-distribution instances. The former case (\(\boldsymbol{\theta}^{*}=\hat{\boldsymbol{\theta}}\)) is similar to studies with only one training stage in the train–test locus, but distinguishes itself by the nature of the (pre)training procedure, which typically has a general-purpose objective, rather than being task-specific (for example, a language modelling objective).

Multiple loci

In some cases, one single study may investigate multiple shifts between different parts of the modelling pipeline. Multiple-loci experiments evaluate all stages of the modelling pipeline at once: they assess the generalizability of models produced by the pretraining procedure as well as whether generalization happens in the finetuning stage (for example, ref. 61). Although those can be separately annotated in GenBench evaluation cards, in the analysis presented in this Analysis we group them together in a single category and denote those studies as having multiple loci.