A taxonomy and review of generalization research in NLP

The ability to generalize well is one of the primary desiderata for models of natural language processing (NLP), but what ‘good generalization’ entails and how it should be evaluated is not well understood. In this Analysis we present a taxonomy for characterizing and understanding generalization research in NLP. The proposed taxonomy is based on an extensive literature review and contains five axes along which generalization studies can differ: their main motivation, the type of generalization they aim to solve, the type of data shift they consider, the source by which this data shift originated, and the locus of the shift within the NLP modelling pipeline. We use our taxonomy to classify over 700 experiments, and we use the results to present an in-depth analysis that maps out the current state of generalization research in NLP and make recommendations for which areas deserve attention in the future.

With the rapid development of natural language processing (NLP) models in the last decade came the realization that high performance levels on test sets do not imply that a model robustly generalizes to a wide range of scenarios. Hupkes et al. review generalization approaches in the NLP literature and propose a taxonomy based on five axes to analyse such studies: motivation, type of generalization, type of data shift, the source of this data shift, and the locus of the shift within the modelling pipeline.


Introduction
Good generalisation, roughly defined as the ability to successfully transfer representations, knowledge, and strategies from past experience to new experiences, is one of the primary desiderata for models of natural language processing (NLP), as well as for models in the wider field of machine learning (Elangovan et al., 2021;Kirk et al., 2021;Lake et al., 2017;Linzen, 2020;Marcus, 1998, 2018;Schmidhuber, 1990;Shen et al., 2021;Wong and Wang, 2007;Yogatama et al., 2019, i.a.). For some, generalisation is crucial to ensure that models behave robustly, reliably, and fairly when making predictions about data different from the data that they learned from, which is of critical importance when models are employed in the real world. Others see good generalisation as intrinsically equivalent to good performance and believe that without it a model is not truly able to conduct the task we intend it to. Yet others strive for good generalisation because they believe models should behave in a human-like way, and humans are known to generalise well. While the importance of generalisation is almost undisputed - in the past five years, in the ACL Anthology alone over 1200 papers mentioned it in their title or abstract - systematic generalisation testing is not the status quo in the field of NLP.
At the root of this problem lies the fact that there is little understanding and agreement about what good generalisation looks like, what types of generalisation exist, and which should be prioritised in varying scenarios. Broadly speaking, generalisation is evaluated by assessing how well a model performs on a test dataset, given the relationship of this dataset with the data the model was trained on. For decades, it was common to exert only one simple constraint on this relationship: that the train and test data are different. Typically, this was achieved by randomly splitting available data into a training and a test partition. Generalisation was thus evaluated by training and testing models on different but similarly sampled data, assumed to be independent and identically distributed (i.i.d.). In the past 20 years, we have seen great strides on such random train-test splits in a range of different applications. Since the first release of the Penn Treebank (Marcus et al., 1993), F1 scores for labelled constituency parsing went from above 80% at the end of the previous century (Collins, 1996;Magerman, 1995) and close to 90% in the first ten years of the current one (e.g. Petrov and Klein, 2007;Sangati and Zuidema, 2011) to scores up to 96% in recent years (Mrini et al., 2020;Yang and Deng, 2020). On the same corpus, performance for language modelling went from per-word perplexity scores well above 100 in the mid-90s (Kneser and Ney, 1995;Rosenfeld, 1996) to a score of 20.5 in 2020 (Brown et al., 2020). In many areas of NLP, the rate of progress has become even faster in the recent past. Scores for the popular evaluation suite GLUE went from values between 60 and 70 at its release in 2018 (Wang et al., 2018) to scores exceeding 90 less than a year after (Devlin et al., 2019), with performances on a wide range of tasks reaching and surpassing human-level scores by 2019 (e.g. Devlin et al., 2019;Liu et al., 2019b;Wang et al., 2018, 2019). In 2022, strongly scaled-up models (e.g. Chowdhery et al., 2022) showed astounding performances on almost all existing i.i.d. natural language understanding benchmarks.
With this progress, however, came the realisation that, for an NLP model, reaching very high or human-level scores on an i.i.d. test set does not imply that the model robustly generalises to a wide range of different scenarios in the way humans do. In the recent past, we witnessed a tide of different studies pointing out generalisation failures in neural models that have state-of-the-art scores on random train-test splits (Blodgett et al., 2016;Khishigsuren et al., 2022;Kim and Linzen, 2020;Lake and Baroni, 2018;Marcus, 2018;McCoy et al., 2019;Razeghi et al., 2022;Sinha et al., 2021, to give just a few examples). Some show that when models perform well on i.i.d. test splits, they might rely on simple heuristics that do not robustly generalise in a wide range of non-i.i.d. scenarios (Kaushik et al., 2019;McCoy et al., 2019), over-rely on stereotypes (Parrish et al., 2022;Srivastava et al., 2022), or bank on memorisation rather than generalisation (Lewis et al., 2021;Razeghi et al., 2022). Others, instead, display cases in which performances drop when the evaluation data differs from the training data in terms of genre, domain or topic (e.g. Malinin et al., 2021;Michel and Neubig, 2018), or when it represents different subpopulations (e.g. Blodgett et al., 2016;Dixon et al., 2018). Yet other studies focus on models' inability to generalise compositionally (Dankers et al., 2022;Kim and Linzen, 2020;Lake and Baroni, 2018;Li et al., 2021b), structurally (Sinha et al., 2021;Weber et al., 2021;Wei et al., 2021), to longer sequences (Dubois et al., 2020;Raunak et al., 2019), or to slightly different task formulations of the same problem.
By showing that good performance on traditional train-test splits does not equal good generalisation, the examples above bring into question what kind of model capabilities recent breakthroughs actually reflect, and they suggest that research on the evaluation of NLP models is catching up with the fast recent advances in architectures and training regimes. Unfortunately, this body of work also reveals that there is no real agreement on what kind of generalisation is important for NLP models: different studies encompass a wide range of generalisation-related research questions, and they use a wide range of different methodologies and experimental setups. As of yet, it is unclear how the results of different studies relate to each other: how should generalisation be assessed, if not with i.i.d. splits? How do we determine what types of generalisation are already well addressed and which are neglected, or which types of generalisation should be prioritised? Ultimately, on a meta-level, how can we provide answers to these important questions without a systematic way to discuss generalisation in NLP? These missing answers are standing in the way of better model evaluation and model development: what we cannot measure, we cannot improve.

[Figure 1: Overview of the taxonomy. Generalisation studies have various motivations (1) and can be categorised into types (2). They involve data shifts (3), where the data can come from natural or synthetic sources (4): found 'in the wild' (e.g. different domains), curated splits on natural data (e.g. different lengths), generated evaluation data for natural training data (e.g. HANS) or natural evaluation data for a generated training set, or fully generated training and evaluation data (e.g. SCAN). These data shifts can occur in different stages of the modelling pipeline (5): from training to test data, from pretraining to training data, or from pretraining to test data.]
The current article introduces a new framework to systematise and understand generalisation research, and it is an attempt to provide answers to the questions above. We present a generalisation taxonomy, a meta-analysis of existing research on generalisation in NLP, a set of online tools that can be used by researchers to explore and better understand generalisation studies through our website, and we introduce evaluation cards that authors can use to comprehensively summarise the generalisation experiments conducted in their papers. We believe that state-of-the-art generalisation testing should be the new status quo in NLP, and with this work, we aim to lay the groundwork for facilitating this change.
In the remainder of this article, we first describe the five axes of our taxonomy (§2.1-2.5); these are the main axes along which generalisation studies differ. In §3, we present our analysis of the current state of generalisation research, grounded in a review of 449 papers and a total of 619 generalisation experiments. In §4, we summarise our main findings and make concrete recommendations for more sound and exhaustive generalisation tests in NLP research.

The generalisation taxonomy
We now begin a discussion of the five axes of the proposed generalisation taxonomy, which are also visualised in Figure 1 and summarised in Appendix E. The proposed taxonomy is intended to help understand generalisation research in NLP in hindsight, but it is also meant as an active device for characterising ongoing studies as well as work that is still to come. We facilitate this through evaluation cards - analogous to previously proposed model cards and the data sheets of Gebru et al. (2021) - which researchers can fill out for the experiments they conducted in their work and include in their paper. Doing so aids the cause of making generalisation evaluation the status quo, and enables effective monitoring of trends in generalisation research. An example of an evaluation card is provided in Figure 2; Appendix B elaborates on how to use the cards and also provides a single-column version of them. On our website, we provide a tool to automatically generate LaTeX code for evaluation cards.

Motivation: what is the high-level motivation for a generalisation test?
The first axis we consider is the high-level motivation of a generalisation study. We identified four closely intertwined goals of generalisation research in NLP, which we refer to as the practical, the cognitive, the intrinsic, and the fairness motivation. The motivation of a study determines what type of generalisation is desirable, it shapes the experimental design, and it affects which conclusions can be drawn from a model's display or lack of generalisation. It is therefore crucial for researchers to be explicitly aware of the motivation underlying their studies to ensure that the experimental setup aligns with the questions they seek to answer. As we will see in what follows, the same questions can often be asked with different underlying motivations, which sometimes makes it difficult to identify what exactly the motivation of a generalisation study is; often, studies may inform conclusions along all four dimensions. However, given the importance of the motivation for the implications and design of a study, we nevertheless try to identify the main guiding motive of each study in our review (§3), and we encourage researchers to be explicit about the motivation of their future studies.

Practical: in what settings can the model be used or improved?

One frequent motivation to study generalisation is of a markedly practical nature. Studies that consider generalisation from a practical perspective seek to assess in what kind of scenarios a model can be deployed, or which modelling changes can improve performance in various evaluation scenarios. An example of a research question that is often addressed with a primarily practical motivation is how well models generalise to different text domains or to data collected in different ways. For instance, Michel and Neubig (2018) consider how well machine translation models trained on canonical text can generalise to noisy data from an internet platform, and Lazaridou et al. (2021) investigate language model generalisation to texts written in different time periods. Other questions that are frequently addressed from a practical perspective concern biases in the training data, and whether models robustly generalise to datasets that do not share those biases, or whether they learnt spurious correlations due to that bias (e.g. Behnke et al., 2022).

Cognitive: does the model generalise like a human?
A second high-level motivation that drives generalisation research is cognitively oriented and can be separated into two underlying categories: one focusing on models and one aimed at learning about cognition and the language faculty in humans through computational models. The first category is related to model behaviour: human generalisation is a useful reference point for the evaluation of models in NLP because it is considered to be a hallmark of human intelligence (e.g. Lake et al., 2017;Marcus, 2003) and, perhaps more importantly, because it is precisely the type of generalisation that is required to successfully model natural language. Humans learn quickly, from fewer data than existing models, and they easily (compositionally) recombine concepts they already know to understand concepts they have never before encountered (Fodor and Pylyshyn, 1988;Linzen, 2020;Marcus, 2018). These feats are thus, arguably, important desiderata for models. In some cases, it might be difficult to distinguish cognitive from practical motivations: a model that generalises like a human should score well also on practically motivated tests, which is why the same experiments can be motivated in multiple ways. In our axes-based taxonomy, we rely on the motivations provided by the authors. Compositional generalisation experiments, for instance, can be cognitively motivated -e.g. when the authors suggest machines ought to generalise the way humans do -but also practically -e.g. when the authors question which machine learning techniques improve performance on benchmarks that happen to be used to test compositional generalisation.
The second, more deeply cognitively inspired category embraces work that evaluates generalisation in models to learn more about language and cognition (e.g. Baroni, 2021;Lakretz et al., 2021b;Marcus, 1999;McClelland and Plaut, 1999). Studies in this category investigate what underlies generalisation in computational models, not in order to improve the models' generalisation capabilities but to derive new hypotheses about the workings of human generalisation.

Intrinsic: does the model solve the task correctly?
A third motivation to evaluate generalisation in NLP models, which cuts through the two previous motivations, appertains to the question of whether models learned the task we intended them to learn, in the way we intended the task to be learned. The shared presupposition underpinning this type of research is that if a model has truly learned the task it is trained to do, it should be able to execute this task also in settings that differ from the exact training scenarios. What changes, across studies, is the set of conditions under which a model is considered to have appropriately learned a task. Researchers studying compositional generalisation (see §2.2.1), for example, assume that a correct understanding of language implies that the underlying compositional structure of language is captured; under that assumption, a model should not have trouble generalising to new inputs that are generated using the same compositional system. Others instead argue that true language understanding implies being able to use language across a wide variety of tasks (see §2.2.3). Yet others maintain that for a model to truly capture aspects of language understanding, such as relations of entailment between two sentences (e.g. Bowman et al., 2015a;Marelli et al., 2014;Williams et al., 2018), it should be able to do so across different datasets, even if those were sampled in a slightly different way (e.g. Talman and Chatzikyriakidis, 2019). In studies that consider generalisation from this perspective, generalisation failures are taken as proof that the model - in fact - did not learn the task as we intended it to learn it. Instead, it displayed behaviour that superficially made us think it did, for instance by relying on spurious patterns or non-generalisable heuristics.
Fairness and inclusivity: does the model generalise in a fair and responsible way?
A last yet very important motivation for generalisation research is the desire to have models that are fair, responsible and unbiased. One category of studies driven by these concepts, often ethical in nature, asks questions about how well models generalise to diverse demographics, typically considering minority or marginalised groups (e.g. Bender et al., 2021;Blodgett et al., 2016;Koh et al., 2021), or investigates to what extent models perpetuate (undesirable) biases learned from their training data (e.g. Dixon et al., 2018;Hutchinson et al., 2020;Sheng et al., 2019). Another line of research related to both fairness and inclusivity focuses on efficiency, both in terms of the amount of data that is required for a model to converge to a solution and in terms of the necessary amount of compute. In such studies, efficiency is seen as a correlate of generalisation: models that generalise well should learn more quickly and require fewer data (see, e.g. Marcus, 2018). As such, they are more inclusively applicable - for instance to low-resource languages or demographic groups for which little data is available - more accessible for groups with smaller computational resources, and they have a lower environmental impact (see, e.g. Strubell et al., 2019). Although not mentioned earlier in this section, studies on efficiency can naturally also be motivated by practical concerns, as well as by cognitive interests (e.g. comparing sample efficiency in humans and models).

Generalisation type: what type of generalisation is a test addressing?
The second axis in our taxonomy describes, on a high level, what aspects of generalisation a test is intended to capture, making it an important axis of our taxonomy. We identify and describe six types of generalisation that are frequently considered in the literature. Some types are rooted in knowledge about human generalisation, such as those that target compositional ( §2.2.1) or structural generalisation ( §2.2.2). Others, instead, are motivated by more practical concerns, such as generalisation across tasks ( §2.2.3), languages ( §2.2.4) and domains ( §2.2.5), or by an interest in analysing how robustly models generalise ( §2.2.6). An overview of generalisation types is presented in Figure 3.

Compositional generalisation
[Figure 3: The six generalisation types, explained in detail in §2.2.1-§2.2.6.]

The first prominent type of generalisation addressed in the literature is compositional generalisation, which is often argued to underpin humans' ability to quickly generalise to new data, tasks and domains (Fodor and Pylyshyn, 1988;Lake et al., 2017;Marcus, 2018;Schmidhuber, 1990). Because of this strong connection with humans and human language, work on compositional generalisation often has a primarily cognitive motivation, although practical concerns such as sample efficiency, quick adaptation and good generalisation in low-resource scenarios are frequently mentioned as additional or alternative drivers (Chaabouni et al., 2021;Linzen, 2020, to give just a few examples). While it has a strong intuitive appeal and clear mathematical definition (Montague, 1970), compositional generalisation is not easy to pin down empirically. Here, we follow Schmidhuber (1990) in defining compositionality as the ability to systematically recombine previously learned elements to map new inputs made up from these elements to their correct output. In language, the inputs are 'forms' (e.g. phrases, sentences, larger pieces of discourse), which are mapped to their meaning or interpretation. Since compositional generalisation is defined in terms of both an input and output space, it is usually evaluated in tasks such as sequence classification (e.g. Bowman et al., 2015b;Hupkes et al., 2018;Veldhoen et al., 2016), machine translation (e.g. Dankers et al., 2022;Liu et al., 2021;Raunak et al., 2019), semantic parsing (e.g. Finegan-Dollak et al., 2018;Keysers et al., 2019;Kim and Linzen, 2020;Shaw et al., 2021) or other kinds of generative tasks (e.g. Hupkes et al., 2020;Lake and Baroni, 2018). In such tasks, the input and output spaces are clearly distinct. As far as we are aware, there have not yet been many explicit systematic attempts to evaluate compositionality in (ungrounded) language models; there are, however, several studies that focus on structural generalisation in such models, which does not concern the ability of models to correctly interpret new inputs or to assign meanings to them, but only their ability to generalise with respect to input forms (see §2.2.2). If and how compositionality can be adequately evaluated in language models, where the input and output (form and meaning) are conflated in one space (the space defined by the language vocabulary), are questions that are yet to be answered. An interesting example of this open research line is the qualitative study conducted by Brown et al. (2020) to test if GPT-3 can use novel words correctly in a sentence; as another example, slightly further away from traditional forms of compositionality, Talmor et al. (2020) finetune pretrained masked language models on multi-hop composition in question answering.
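To make the definition above concrete, the following toy sketch (our own construction, loosely modelled on the SCAN setup of Lake and Baroni, 2018) shows a compositional split: every primitive and modifier that appears in the test command has been observed during training, but never in that particular combination, so producing the correct output requires systematically recombining learned elements.

```python
# Illustrative sketch (our own toy SCAN-style fragment): a compositional split,
# where the test set recombines primitives and modifiers that were each seen
# during training, but never in this particular combination.
INTERPRETATIONS = {"jump": "JUMP", "walk": "WALK"}

def interpret(command):
    """Map a command built from known primitives and 'twice' to its action sequence."""
    words = command.split()
    action = INTERPRETATIONS[words[0]]
    repetitions = 2 if "twice" in words[1:] else 1
    return " ".join([action] * repetitions)

train = ["jump", "walk", "walk twice"]   # 'jump' and 'twice' are both observed...
test = ["jump twice"]                    # ...but never combined during training.
print({c: interpret(c) for c in train})  # {'jump': 'JUMP', 'walk': 'WALK', 'walk twice': 'WALK WALK'}
print({c: interpret(c) for c in test})   # {'jump twice': 'JUMP JUMP'} - the generalisation target
```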

Structural generalisation
A second category of usually cognitively inspired generalisation studies focuses on the extent to which models can process or generate structurally (grammatically) correct forms, rather than on whether they can assign to forms correct interpretations. Unlike compositional generalisation, structural generalisation does not require an output space (the meaning or interpretation space; see §2.2.1). This makes it more straightforwardly evaluated in form-only models (i.e. language models). We distinguish two broad categories of structural generalisation: syntactic generalisation and morphological generalisation.
Syntactic generalisation Some structural generalisation studies focus specifically on syntactic generalisation: they consider whether models can generalise to novel syntactic structures or novel elements in known syntactic structures. The typical experimental setup involves training data designed to contain or exclude specific conditions: Jumelet et al. (2021) and Weber et al. (2021) remove specific grammatical environments from the training data and then test whether models nevertheless learn to generalise to such environments; Wei et al. (2021) vary word frequencies in the training corpus to investigate how syntactic rule learning in pretrained language models is affected by the frequencies observed in the training phase. It is unfortunately difficult to conduct this type of study using very large language models: the computational cost of training these models on multiple datasets is high, and generating specific test splits given knowledge of what is in the training data is often not possible, as large models' training data is often not in the public domain. Overall, the lack of control over the relationship between the training and the evaluation data of large language models makes it hard to assess to what extent the incidental examples reported for these models (most notably, in their respective release papers) are reflective of successful generalisation. This problem has only very recently begun to be acknowledged in the NLP community, with models now being openly released together with their training data (e.g. Scao et al., 2022).

Morphological generalisation A second category of structural generalisation studies focuses on morphological inflection, a popular testing ground for questions about human structural generalisation abilities. Papers focusing on morphological inflection (e.g. Corkery et al., 2019;Dankers et al., 2021;Kirov and Cotterell, 2018;Liu and Hulden, 2022;Malouf, 2017;McCurdy et al., 2020) are typically rooted in strong cognitive motivations. While most of this work considers i.i.d. train-test splits, recent studies have focused on how morphological transducer models generalise across languages (e.g. McCarthy et al., 2019;Pimentel et al., 2021a;Vylomova et al., 2020) as well as within each language (Calderone et al., 2021;Li and Wilson, 2021;Liu and Hulden, 2022;Pimentel et al., 2021b;Szolnok et al., 2021). These studies often take inspiration from the so-called wug tests used in psycholinguistics to assess human morphological generalisation to novel words (Berko, 1958;Marcus et al., 1995). They can potentially also be conducted with large language models, but the lack of access to their training data, as explained before, makes it difficult to determine whether the supposedly novel test words were truly never seen by the models.

Generalisation across tasks
A third direction of generalisation research considers the ability of individual models to adapt to multiple NLP problems. We refer to this ability as generalisation across tasks or cross-task generalisation. Along with the great advancements in NLP models in the past ten years, the nature of cross-task generalisation tests has changed quite substantially; we discuss this evolution in the current section.

Multitask learning Cross-task generalisation in NLP has traditionally been strongly connected to transfer and multitask learning (Collobert and Weston, 2008). In multitask learning, a model is either trained and evaluated on a set of tasks, or pretrained on some tasks and then adapted to others. As this setup favours approaches that benefit from positive transfer across tasks, it implicitly studies forms of cross-task generalisation. Examples of benchmarks that were originally meant to address this kind of cross-task transfer - although they are not used as such any longer - are multitask benchmarks such as DecaNLP, GLUE (Wang et al., 2018) and its successor SuperGLUE. More recent benchmarks formulate all tasks as sequence-to-sequence problems (e.g. Aribandi et al., 2022;Raffel et al., 2020;Xie et al., 2022) so that they can be addressed with a single, typically very large, text-to-text language model.
The pretrain-finetune paradigm Cross-task generalisation has traditionally been deemed an extremely challenging topic. This has changed with the relatively recent trend of models that are first pretrained with a general-purpose, self-supervised objective -usually (masked) language modelling -and then further finetuned with the addition of task-specific parameters that learn to execute different tasks using the representations that emerged in the pretraining phase. The popularisation of this pretrain-finetune paradigm has shifted thoughts on how to evaluate cross-task generalisation. Rather than evaluating how learning one task can benefit another, this paradigm instead gives a central role to the question of how well a model that has acquired some general knowledge about language can successfully be adapted, with task-specific parameters, to different kinds of tasks (e.g. Devlin et al., 2019;Howard and Ruder, 2018;Liu et al., 2019b;Peters et al., 2018). Interestingly, after finetuning, task performance is typically evaluated with random train-test splits, and thus generalisation within individual tasks is not necessarily considered.
In-context learning In recent years, the focus of cross-task generalisation studies has shifted even further, to scenarios which consider how well pretrained language models fare in different tasks without the addition of task-specific parameters. In the most extreme case, this implies evaluating a language model directly on a range of tasks without any further training. To do so, tasks are reformulated as text-completion problems, such that language models can be prompted directly with a question representing a specific task (zero-shot learning), potentially preceded by a few examples (few-shot learning) (Radford et al., 2019). The latter case, in which the intention is that models -without any parameter updates -'learn' from the examples given in the context, is often also referred to as in-context learning. Unfortunately, studies that investigate the relationship between the training and test data in such setups are rare, which leaves this young research area with many open questions. A different class of in-context learning studies comprises those that finetune a pretrained model with prompts from one set of tasks, and then evaluate it on another set of tasks (e.g. Sanh et al., 2022;Wei et al., 2022;Zhong et al., 2021). While the pretraining corpus is also uncontrolled in this case, at least the relationship between the finetuning training and test data can be monitored, and the performance on the test data with and without finetuning can easily be compared; nevertheless, few studies do so.
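As an illustration of this setup, the sketch below (our own minimal example; the exact prompt format varies across studies and models) reformulates a sentiment classification task as a text-completion problem, both in a zero-shot version and in a few-shot version with in-context demonstrations.

```python
# Illustrative sketch (format is ours, not a specific model's API): reformulating a
# classification task as a text-completion prompt for zero-shot and few-shot evaluation.
def make_prompt(test_sentence, demonstrations=()):
    """Build a sentiment prompt; with demonstrations it becomes a few-shot prompt."""
    parts = [f"Review: {text}\nSentiment: {label}" for text, label in demonstrations]
    parts.append(f"Review: {test_sentence}\nSentiment:")
    return "\n\n".join(parts)

zero_shot = make_prompt("The plot was predictable but the acting was superb.")
few_shot = make_prompt(
    "The plot was predictable but the acting was superb.",
    demonstrations=[("A waste of two hours.", "negative"), ("An instant classic.", "positive")],
)
print(few_shot)  # a language model is asked to complete the final 'Sentiment:' field
```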

Generalisation across languages
The fourth type of generalisation we include in our taxonomy is generalisation across languages, or cross-lingual generalisation. Research in NLP has been very biased towards models and technologies for English (Bender, 2011) and most of the recent breakthroughs rely on amounts of data that are simply not available for the vast majority of the world's languages. Work on cross-lingual generalisation is thus important not only from a practical perspective, but also for the promotion of inclusivity and the democratisation of language technologies. While the field of multilingual modelling is vast and naturally instigates interesting generalisation questions, relatively few papers in the area focus explicitly on cross-lingual generalisation. In this section, we discuss two main strands of research that do address this type of generalisation; in Appendix D, we provide a list of benchmarks that can be used to evaluate generalisation across languages.
Cross-lingual finetuning There are several ways in which cross-lingual generalisation can be evaluated. Most existing cross-lingual studies focus on scenarios where labelled training data is available in a single language (typically English) and the model is evaluated in multiple languages. A common approach to address this problem is to finetune a multilingually pretrained language model on task-specific annotations available in one or a few languages, and then transfer to other languages in a zero-shot fashion (e.g. Pires et al., 2019;Wu and Dredze, 2019). This setup tests to what extent a model's ability to solve tasks is invariant to the language of the labelled data used for training. It has been used to show, for instance, that Multilingual BERT (Devlin et al., 2019) finetuned on English labelled data generalises well to languages with different scripts, but exhibits some systematic deficiencies that affect language pairs with different word-order features, such as English and Japanese (Pires et al., 2019).
Multilingual learning A second way in which cross-lingual generalisation can be evaluated is by testing whether multilingual models perform better than monolingual models on language-specific tasks as a result of being trained on multiple languages at the same time. As is the case for multitask learning, approaches that are simultaneously trained on multiple languages (or multiple tasks) can be thought of as an implicit evaluation of generalisation across those languages (or across tasks). There is a large number of papers investigating multilingual models, usually for language modelling or machine translation (e.g. Aharoni et al., 2019;Al-Shedivat and Parikh, 2019;Costa-jussà et al., 2022;Fan et al., 2021). Most of these papers have as their main aim to introduce models that improve on multilingual tasks across the board and are not otherwise motivated by generalisation questions. Some, however, do include explicit generalisation experiments in their setup, for example to assess the dependence of generalisation on the amount of data available for different languages, or on the number of languages a model is exposed to during training (Aharoni et al., 2019).

Generalisation across domains
The next category is generalisation across different domains, a type of generalisation that is often required in naturally occurring scenarios -more so than the types discussed so far -and thus carries high practical relevance. While there is no precise definition of what constitutes a domain, domains broadly refer to collections of texts exhibiting different topical and/or stylistic properties, such as different genres or texts with varying formality levels. Examples of domains we found in the literature are fiction, letters, governmental documents, telephone calls, and face-to-face interactions (Williams et al., 2018), biomedical texts (Fried et al., 2019), texts collected from online sources such as ArXiv, Github and OpenSubtitles (Artetxe et al., 2021), or texts produced by different language communities, e.g. written in Standard American English and African-American English (Blodgett et al., 2016).

Domain adaptation Cross-domain generalisation has often been studied in connection with domain adaptation, the problem of adapting an existing general model to a new domain (Daumé III, 2007). This has been a very active research area in machine translation (Axelrod et al., 2011;Bertoldi and Federico, 2009;Chu et al., 2017;Chu and Wang, 2018;Freitag and Al-Onaizan, 2016;Hu et al., 2019;Joty et al., 2015;Koehn and Schroeder, 2007;Luong and Manning, 2015;Wang et al., 2017a,b), with several standard datasets (Michel and Neubig, 2018) and dedicated tracks in popular shared tasks like WMT (Bojar et al., 2019;Specia et al., 2020). It has also been studied in part-of-speech tagging (Blitzer et al., 2006), sentiment analysis (Blitzer et al., 2007) and language model pretraining (Gururangan et al., 2020), among other tasks.
Temporal generalisation Finally, domain generalisation is related to temporal generalisation, as investigated in studies where the training data is produced in a specific time period and the model is tested on data from a different time period, either in the future or in the past. So far, this problem has been studied only in a limited range of tasks, including language modelling and question answering (Lazaridou et al., 2021), named entity recognition in social media (Derczynski et al., 2016;Fromreide et al., 2014;Rijhwani and Preotiuc-Pietro, 2020), named entity disambiguation (Agarwal et al., 2018), document classification (He et al., 2018;Paul, 2018, 2019) and sentiment analysis (Lukes and Søgaard, 2018).

Generalisation in the context of robustness
The last category of generalisation research we consider on the generalisation type axis concerns models' ability to learn task solutions that abstract away from spurious correlations that may occur in the training data, and that are aligned with the underlying generalising solution that humans associate with the task (e.g. Gururangan et al., 2018;McCoy et al., 2019;Talman and Chatzikyriakidis, 2019). We refer to this type of generalisation as robustness generalisation. Research on robustness generalisation usually focuses on data shifts that stem from varying data collection processes. Different from most of the previous categories discussed in §2.2, such shifts are generally unintended and can be hard to spot. Current work, therefore, focuses on characterising such scenarios and understanding their impact. Many of these studies show that models do not generalise in the way we would expect them to, because the training data was in some subtle manner not representative of the true task distribution. Generalisation evaluation in the context of robustness can be driven by several different motivations: some studies are motivated by more practical concerns, others are conducted to gain a better perspective on intrinsic task understanding, and yet others are directed towards the development of fair and unbiased NLP models. In this section, we discuss three common scenarios of robustness evaluation.
Annotation artefacts A common scenario is one where there are annotation artefacts in the training data, which may result in an overestimation of a model's performance on a particular task. Artefacts occur frequently when datasets are collected through crowdsourcing, with undesired data properties being introduced in subtle ways as a result of how the annotation procedure was set up. Popular natural language inference datasets such as SNLI (Bowman et al., 2015a) and MultiNLI (Williams et al., 2018) have been found particularly susceptible to such artefacts. For example, Gururangan et al. (2018) and Poliak et al. (2018) showed that models can learn to make correct predictions for NLI instances by only looking at hypotheses, with spurious patterns in word choice and grammatical features (e.g. negation being indicative of the contradiction class) making it unnecessary for a model to use logical inference. The lack of true task understanding causes NLI models to generalise poorly across different datasets (Talman and Chatzikyriakidis, 2019). Besides NLI, other tasks such as question answering have also been reported to suffer from annotation artefacts (Jia and Liang, 2017;Kaushik and Lipton, 2018), even when the authors made a conscious effort to avoid such artefacts during the annotation process (Elazar et al., 2021).
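A common diagnostic for such artefacts is a premise-blind (hypothesis-only) baseline of the kind used by Gururangan et al. (2018) and Poliak et al. (2018). The sketch below illustrates the idea on a toy dataset of our own making; the specific classifier and features are illustrative assumptions, not the setup of any particular study.

```python
# Illustrative sketch (toy data ours): a hypothesis-only baseline for exposing
# annotation artefacts in NLI-style datasets.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# (premise, hypothesis, label) triples; the premise is deliberately ignored below.
train = [
    ("A man is playing a guitar.", "Nobody is playing music.", "contradiction"),
    ("A dog runs through a field.", "An animal is not moving.", "contradiction"),
    ("A woman reads a book.", "A person is reading.", "entailment"),
    ("Two kids are playing outside.", "Children are playing.", "entailment"),
]
hypotheses = [h for _, h, _ in train]
labels = [y for _, _, y in train]

# If a premise-blind classifier beats chance on held-out data, the dataset contains
# cues (e.g. negation words signalling contradiction) that models can exploit
# without performing any logical inference over the premise.
baseline = make_pipeline(CountVectorizer(), LogisticRegression())
baseline.fit(hypotheses, labels)
print(baseline.predict(["Nobody is reading."]))
```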
Standardised splits Another line of work questions the way we use data splits in general, especially the extent to which scores on standardised splits that remain static over time are reflective of a model's generalisation abilities. For instance, Gorman and Bedrick (2019) showed that models perform much worse on fully random train-test splits than the reported state-of-the-art performances on standardised random splits. Søgaard et al. (2021) go even further and advocate for the use of heuristic and adversarial splits, thanks to which a model's capability for generalisation is challenged directly -for instance by putting all longer sentences in the test set, or by splitting the data to maximise the difference between train and test set along a certain dimension.
Subpopulation bias A third scenario in which robustness and performance overestimation play a role is the case where certain demographics are under-or over-represented in the training data. As it may result in models that generalise poorly to specific demographic groups, this is a particularly harmful  case of overestimation. Toxicity classifiers, for example, suffer from unintended bias caused by certain identity terms being disproportionately represented in the training data (e.g. "I am a gay man" being assigned high toxicity scores because the word "gay" is often used in toxic comments; Dixon et al., 2018), and abusive language detection models exhibit gender bias caused by imbalances in the training data (Park et al., 2018). A way to detect such imbalances and thus systematically avoid cases of overestimation is evaluating models by their worst-group accuracy, rather than the average accuracy across all demographic groups (Koh et al., 2021).
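The following sketch (toy numbers ours) illustrates the difference between the two evaluation strategies mentioned above: a model can look acceptable under average accuracy while failing completely on an under-represented group, which worst-group accuracy makes visible.

```python
# Illustrative sketch (our own toy example): worst-group accuracy versus average
# accuracy over demographic groups (cf. Koh et al., 2021).
def group_accuracies(predictions, labels, groups):
    """Accuracy per group for parallel lists of predictions, gold labels and groups."""
    per_group = {}
    for pred, gold, group in zip(predictions, labels, groups):
        correct, total = per_group.get(group, (0, 0))
        per_group[group] = (correct + (pred == gold), total + 1)
    return {g: c / t for g, (c, t) in per_group.items()}

preds  = ["non-toxic", "toxic", "non-toxic", "toxic", "toxic"]
golds  = ["non-toxic", "toxic", "non-toxic", "non-toxic", "non-toxic"]
groups = ["majority", "majority", "majority", "minority", "minority"]

accs = group_accuracies(preds, golds, groups)
print(sum(p == g for p, g in zip(preds, golds)) / len(preds))  # average accuracy: 0.6
print(min(accs.values()))                                      # worst-group accuracy: 0.0
```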

Shift type: what kind of data shift is considered?
We have seen that generalisation tests may differ in terms of their motivation and the type of generalisation that they target. What they share, instead, is that they all focus on cases in which there is a form of shift between the data a model is (pre)trained on and the data that is used for evaluation. In other words, for some datasets (X1, Y1) and (X2, Y2) considered in the experimental setup, it holds that p(x1, y1) ≠ p(x2, y2). In the third axis of our taxonomy, graphically depicted in Figure 4, we describe the ways in which two datasets used in a generalisation experiment can differ. This axis adds a more formal dimension to our taxonomy and derives its importance from the fact that data shift plays an essential role in formally defining and understanding generalisation from a statistical perspective. We consider three main types of shift which are well-attested in the literature - covariate shift, label shift and full shift - and include two additional types of shift - assumed shift and multiple shifts - to account for studies that cannot be labelled with any of the three main shift types. We formalise the differences between the test, training and potentially pretraining data involved in generalisation tests as shifts between the respective data distributions p(x_tst, y_tst), p(x_tr, y_tr) and p(x_pr, y_pr). These data distributions can be expressed as the product of the probability of the input data p(x) and the conditional probability of the output labels given the input data p(y|x), that is, p(x, y) = p(x) p(y|x). This allows us to define four main types of relations between two data distributions, depending on whether the distributions differ in terms of p(x), p(y|x), both, or none. The last type constitutes the case in which there is no shift in the data distributions, i.e. both p(x_tr) = p(x_tst) and p(y_tr|x_tr) = p(y_tst|x_tst). This matches the i.i.d. evaluation setup traditionally used in machine learning. As discussed earlier, this type of evaluation, also referred to as evaluation of within-distribution generalisation, has frequently been reported not to be indicative of good performance for the more complex forms of generalisation that we often desire from our models. We will not further discuss it here, but instead focus on the other three cases, commonly referred to as out-of-distribution (o.o.d.) evaluation. Figure 4 summarises the types of distribution shift discussed in this section.
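As a concrete illustration of these definitions, the sketch below (our own toy example, using exact equality of empirical distributions, whereas real analyses would rely on statistical tests or divergence measures) labels the relation between two small datasets as no shift, covariate shift, label shift or full shift by comparing their empirical p(x) and p(y|x).

```python
# Illustrative sketch (toy data and exact-equality comparison are ours): classifying
# the relation between two datasets by comparing empirical p(x) and p(y|x).
from collections import Counter

def empirical_distributions(data):
    """Return empirical p(x) and p(y|x) for a list of (x, y) pairs."""
    px = Counter(x for x, _ in data)
    total = sum(px.values())
    px = {x: c / total for x, c in px.items()}
    pyx = {}
    for x, y in data:
        pyx.setdefault(x, Counter())[y] += 1
    pyx = {x: {y: c / sum(cnt.values()) for y, c in cnt.items()} for x, cnt in pyx.items()}
    return px, pyx

def shift_type(train, test, tol=1e-9):
    """Label the shift between two datasets as none / covariate / label / full."""
    px_tr, pyx_tr = empirical_distributions(train)
    px_te, pyx_te = empirical_distributions(test)
    same_px = all(abs(px_tr.get(x, 0.0) - px_te.get(x, 0.0)) < tol
                  for x in set(px_tr) | set(px_te))
    shared = set(pyx_tr) & set(pyx_te)  # p(y|x) is only comparable on shared inputs
    same_pyx = all(abs(pyx_tr[x].get(y, 0.0) - pyx_te[x].get(y, 0.0)) < tol
                   for x in shared for y in set(pyx_tr[x]) | set(pyx_te[x]))
    if same_px and same_pyx:
        return "no shift (i.i.d.)"
    if not same_px and same_pyx:
        return "covariate shift"
    if same_px and not same_pyx:
        return "label shift"
    return "full shift"

train = [("great movie", "pos"), ("terrible plot", "neg")] * 10
test = [("great movie", "pos")] * 5 + [("awful acting", "neg")] * 5  # p(x) changes, p(y|x) stays
print(shift_type(train, test))  # -> "covariate shift"
```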

Covariate shift
The most commonly considered data distribution shift in o.o.d. generalisation research is one where p(x_tst) ≠ p(x_tr) but p(y_tst|x_tst) = p(y_tr|x_tr). In this scenario, often referred to as covariate shift (Moreno-Torres et al., 2012;Storkey, 2009), the distribution of the input data p(x) changes, but the conditional probability of the labels given the input -which describes the task -remains the same. Under this type of shift, one can evaluate if a model has learned the underlying task distribution while only being exposed to p(x_tr, y_tr). Challenge sets such as HANS (McCoy et al., 2019), PAWS (Yang et al., 2019), or the COGS test set (Kim and Linzen, 2020) deliberately address these shifts, with examples being selected or generated to violate invalid heuristics that models are known or expected to follow. Covariate shift is also addressed in cross-domain generalisation and robustness evaluation studies, such as those conducted by Ryu et al. (2018) and Tan et al. (2019) on real-world datasets. Compared to the other shift types, covariate shift is the easiest to tackle without performing additional training or pre- and post-processing. As we will see in what follows, a common approach to address other, more complex shifts, is to turn them into covariate shifts.

Label shift
The second type of shift corresponds to the case in which the focus is on the conditional output distributions: p(x_tst) = p(x_tr) and p(y_tst|x_tst) ≠ p(y_tr|x_tr). We refer to this case as label shift, but it is also known as concept shift in the literature (Moreno-Torres et al., 2012). Label shift can happen within the same task when there are inter-annotator disagreements, a temporal shift in the data, or a change of domain (e.g. the phrase "it doesn't run" can lead to different sentiment labels depending on whether it appears in a review for software or one for mascara). Label shift also occurs when there is a change in task (as in §2.2.3). For instance, the same sentence might have a negative gold label in a sentiment classification task, but a positive label when the task is changed to toxicity identification. Or, in the case of a more extreme label shift, the labels themselves can change, for example when shifting from language modelling (where the set of labels is the language vocabulary) to POS-tagging. In NLP studies, label shift is often seen as an obstacle that needs to be overcome rather than as a setting in which models are directly evaluated: if the same example has contradictory labels in training and test data, it is unclear what decision at test time should be considered good generalisation behaviour. In practice, there are two main ways in which label shift is typically addressed. The first is to add a finetuning or adaptation stage in which a model is updated to represent the shift that occurred (e.g. Biesialska et al., 2020;Sun et al., 2020) or new parameters are added to represent newly introduced labels (Devlin et al., 2019; Howard and Ruder, 2018;Peters et al., 2018, i.a.). The second way to address label shift is to augment the input data with domain or task indicators (e.g. Brown et al., 2020;Raffel et al., 2020). We saw before that the phrase "it doesn't run" can be both positive and negative, depending on its domain of occurrence. By adding indicators that specify the domain, the problem can be converted into a covariate shift (or potentially even no shift, if both indicators are represented in the training and test distributions). Similarly, in prompting setups, where tasks are formulated as questions in natural language, label shifts caused by a change of task are turned into a different shift type that can be solved without further finetuning (see, e.g. Bach et al., 2022;Brown et al., 2020;Schick and Schütze, 2021).
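The conversion described above can be illustrated with a minimal sketch (the bracketed indicator format is our own assumption; actual systems use various tagging or prompting schemes): by prepending a domain indicator, the two occurrences of the ambiguous phrase become distinct inputs, so the conflicting labels no longer correspond to the same x.

```python
# Illustrative sketch (indicator format ours): turning a label shift into a covariate
# shift by prepending a domain indicator to the input, as discussed above.
software_reviews = [("it doesn't run", "negative")]
mascara_reviews = [("it doesn't run", "positive")]

def add_domain_indicator(examples, domain):
    # The same surface form now maps to distinct inputs, so p(y|x) no longer conflicts.
    return [(f"[{domain}] {text}", label) for text, label in examples]

train = add_domain_indicator(software_reviews, "software")
test = add_domain_indicator(mascara_reviews, "cosmetics")
print(train)  # [("[software] it doesn't run", 'negative')]
print(test)   # [("[cosmetics] it doesn't run", 'positive')]
```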

Full shift
The most extreme type of shift corresponds to the case in which p(x) and p(y|x) change simultaneously: p(x_tst) ≠ p(x_tr) and p(y_tst|x_tst) ≠ p(y_tr|x_tr). We refer to this case as full shift. Full shifts may occur in language modelling tasks, where changes in p(x) directly translate into changes in p(y|x), when adapting to new language pairs in multi-lingual experiments (e.g. Costa-jussà et al., 2022;Kodner et al., 2022), or when entirely different types of data are used either for pretraining (e.g. Papadimitriou and Jurafsky, 2020, who test if pretraining on music impacts learning language afterwards) or for evaluation (e.g. De Varda and Zamparelli, 2022, who evaluate generalisation to different languages). Full shifts can be addressed without retraining, because they do not necessarily imply that the same input x is assigned a different label at test time. Nevertheless, they are challenging, and, similarly to label shifts, they are often turned into different types of shifts that can be more easily addressed.

Multiple shifts
We have so far focused on the types of shifts that can occur between two data distributions. Some studies, however, consider shifts between multiple distributions at the same time, for instance to investigate how different types of pretraining architectures generalise to o.o.d. splits in a finetuning stage, or which pretraining method achieves better cross-domain generalisation in a second training stage. In our taxonomy, we label such cases as multiple shifts, and -at least in the current version -we do not distinguish between different configurations of multiple shifts (e.g. label+covariate, or covariate+covariate). We will discuss multiple shifts further in §2.5.

Assumed shift
When classifying shifts in our review, we will mainly focus on cases where authors explicitly consider the relationship between the data distributions they use in their experiments and where the assumptions they make about this relationship are either well-grounded in the literature (e.g. it is commonly assumed that switching between domains constitutes a covariate shift) or empirically verified. Nevertheless, we identify numerous studies that claim to be about generalisation where such considerations are absent: it is assumed that there is a shift between training and test data, but this is not verified or grounded in previous research. We include this body of work in our review and refer to the corresponding type of shift as assumed shift. Sometimes, the assumed shift is not explicitly checked because it is considered plausible given general linguistic knowledge (e.g. Wilcox et al., 2021). Other times, the relationship between training and test data is not investigated because the researchers do not have access to the training data. The BigBench benchmark, for instance, contains several tasks designed to measure generalisation, but the training datasets of the models investigated are not in the public domain. Yet in other cases, the training data is available to the authors of the paper, but no extensive analysis is presented (e.g. Brown et al., 2020;Chowdhery et al., 2022).

Shift source: how are training and test data obtained?
In the previous section, we discussed what types of shifts may occur in generalisation tests. We now focus on how those shifts originated: our fourth axis, graphically shown in Figure 5, concerns the source of the differences occurring between the pretraining, training and test data distributions. The source of the data shift determines how much control an experimenter has over training and test data and, consequently, what kind of conclusions can be drawn from a generalisation experiment. We distinguish four different sources of shifts: (i) naturally occurring shifts, (ii) artificially partitioned natural corpora, (iii) generated shifts and (iv) fully generated datasets.

[Figure 5: The four sources of shift. In a generated shift, the training set is natural but the test data is generated or selected (or vice versa, but that is rarer), e.g. adversarial or generated test data; in a fully generated setup, all data is generated, e.g. using a grammar or templates.]
To formalise the description of these different sources of shift, we consider the unobserved base distribution p(x, y, τ) which describes all data considered in an experiment. The variable τ represents a data property of interest, with respect to which a specific generalisation ability is tested. This can be an observable property of the data (e.g. the length of an input sentence), an unobservable property (e.g. the timestamp that defines when a data point was produced), or even a property relative to the model under investigation (e.g. τ could represent how quickly a data point was learned in relation to overall model convergence). The base distribution over x, y and τ can be used to define different partition schemes to be adopted in generalisation experiments. Formally, a partitioning scheme is a rule f : T → {true, false} that discriminates data points according to a property τ ∈ T. To investigate how a partitioning scheme impacts model behaviour, the pretraining, training and test distributions can be defined by conditioning the base distribution on (possibly distinct) partitioning schemes, i.e. p_pr(x, y) = p(x, y | f_pr(τ) = true), p_tr(x, y) = p(x, y | f_tr(τ) = true) and p_tst(x, y) = p(x, y | f_tst(τ) = true). Using these data descriptions, we can now discuss four different sources of shifts.

Naturally occurring shifts
The first scenario we consider is one in which shifts naturally occur between corpora. Both the data partitions of interest are naturally occurring corpora, to which no systematic operations are applied: for the purposes of a generalisation test, experimenters have no direct control over either the base distribution or the partitioning scheme f(τ). In other words, the variable τ refers to properties that naturally differ between collected datasets. Examples of naturally occurring shifts emerge from splits containing data from different annotators (Geva et al., 2019), sources or domains (e.g. Artetxe et al., 2021;Talman and Chatzikyriakidis, 2019), populations (e.g. Dixon et al., 2018;Talat et al., 2018), time periods (e.g. Lazaridou et al., 2021), or from different data collection procedures targeting the same task (Wang et al., 2018;Williams et al., 2018). In this category, we also include cross-task and cross-lingual generalisation studies in which all corpora involved are natural corpora (e.g. FitzGerald et al., 2022;Mishra et al., 2022).

Splits of natural corpora
A slightly less natural setup is one in which a naturally occurring corpus is used, but it is artificially split along specific dimensions. The primary difference with the previous category is that the variable τ refers to properties along which data would not naturally be split, such as the length or syntactic complexity of a sample. Experimenters thus have no control over the data itself, but they control the partitioning scheme f(τ). Raunak et al. (2020), for instance, split naturally occurring machine translation corpora such that longer sentences occur in the test data, and Weber et al. (2021) split a language modelling corpus such that the training data does not contain specific types of grammatical environments.
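The sketch below (our own minimal example; the threshold and sentences are illustrative) implements such a split as a partitioning scheme f(τ) in the sense of §2.4: τ is the sentence length, and the rule assigns long sentences to the test partition while training only on shorter ones, in the spirit of length-based generalisation splits.

```python
# Illustrative sketch (threshold and data ours): splitting a natural corpus with a
# partitioning scheme f(tau), where tau is sentence length and long sentences are
# held out for testing.
def f(tau, max_train_length=5):
    """Partitioning rule: True -> training partition, False -> test partition."""
    return tau <= max_train_length

corpus = [
    "the cat sleeps",
    "a dog barks loudly",
    "the old man reads a long book in the garden",
    "she quickly finished the report before the deadline passed",
]

train = [s for s in corpus if f(len(s.split()))]
test = [s for s in corpus if not f(len(s.split()))]
print(train)  # short sentences only
print(test)   # longer sentences, requiring generalisation beyond training lengths
```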

Generated shifts
The third category concerns cases in which one data partition is a fully natural corpus and the other partition is designed with specific properties in mind to address a generalisation aspect of interest. Data in the constructed partition may avoid or contain specific patterns (Bhargava et al., 2021;Cui et al., 2022;Dankers et al., 2022;Fancellu et al., 2017), violate certain heuristics (Dayanik and Padó, 2021;Libovický et al., 2022;McCoy et al., 2019), include unusually long or complex sequences (Lakretz et al., 2021a;Raunak et al., 2019), or it may be constructed adversarially, generated either by humans (Kiela et al., 2021) or automatically (e.g. Sakaguchi et al., 2021;Zellers et al., 2018).
In the examples provided above, the constructed partition always corresponds to the test data; the opposite -where instead the training data is synthetic or generated and the test data natural -is also possible, yet less common (e.g. Papadimitriou and Jurafsky, 2020).

Fully generated
The last possibility is to use only generated data. Generating data is often the most precise way of measuring specific aspects of generalisation as experimenters have direct control over both the base distribution and the partitioning scheme. Sometimes the data involved is entirely synthetic (e.g. Hupkes et al., 2020;Lake and Baroni, 2018), other times it is templated natural language or a very narrow selection of a natural language corpus (e.g Keysers et al., 2019; Kim and Linzen, 2020). Generated splits can vary in several different dimensions. Sometimes, τ is a simple observable data property. For instance, Hupkes et al. (2020) split their corpus based on the presence of particular function pairs P, implicitly setting τ = P ∈ x. In some cases, τ may also be defined relative to the τ of other examples, and can only be computed globally, such as in the case of maximum compound divergence splitting (Keysers et al., 2019). 10 2.5 Locus of shift: between which data distributions does the shift occur?
Locus of shift: between which data distributions does the shift occur?
The four axes that we have discussed so far demonstrate the depth and breadth of generalisation evaluation research, and they clearly illustrate that generalisation is evaluated in a wide range of different experimental setups. They describe the high-level motivations for studying generalisation in NLP models, the types of generalisation that have been frequently evaluated in the literature, the data distribution shifts used for generalisation tests, and the possible sources of those shifts. What we have not yet explicitly discussed is between which data distributions those shifts can occur: the locus of the shift. In our taxonomy, the shift locus forms the last piece of the puzzle, as it determines what part of the modelling pipeline is investigated and, with that, what kind of generalisation questions can be answered. We consider shifts between all stages in the contemporary modelling pipeline -pretraining, training and testing -as well as studies that consider shifts between multiple stages at the same time, as expressed by the data distributions that we have considered in §2.3 (for a graphical representation, we refer to Figure 6). Given these distributions, there exist five possible loci of shifts: shifts between the training and test data, between the finetuning training and test data, between the pretraining and finetuning training data, between the pretraining and test data, and between all data distributions.

Figure 6: The five loci of splits, along with the parts of the modelling pipeline they allow investigating.
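To make these five loci concrete, the sketch below encodes an experiment by the pair(s) of data distributions between which a shift is considered, and maps them to the corresponding locus label. The encoding is our own simplification for illustration; it is not tooling from, or a formal part of, the taxonomy.

```python
# Simplified illustration of the five loci of shift: an experiment is characterised
# by the pair(s) of pipeline stages between which its data distributions differ.

LOCI = {
    frozenset({("train", "test")}):          "train-test",
    frozenset({("finetune-train", "test")}): "finetune train-test",
    frozenset({("pretrain", "train")}):      "pretrain-train",
    frozenset({("pretrain", "test")}):       "pretrain-test",
}

def locus(considered_shifts):
    """considered_shifts: set of (stage_a, stage_b) pairs between which the data shifts."""
    key = frozenset(considered_shifts)
    if len(key) > 1:
        return "multiple loci"
    return LOCI.get(key, "assumed / unspecified")

# A comparison of several pretrained models that are also finetuned on an o.o.d.
# split touches two stages at once and therefore counts as having multiple loci:
print(locus({("pretrain", "train"), ("finetune-train", "test")}))  # multiple loci
print(locus({("pretrain", "test")}))                               # pretrain-test
```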
We describe the five loci of shift and how they interact with different components of the modelling pipeline with the aid of three modelling distributions. These modelling distributions correspond to the previously described stages -testing a model, training it, and potentially pretraining it:

p(Y_tst | X_tst, θ*)  (9)
p(θ* | X_tr, Y_tr, φ_tr, θ̂)  (10)
p(θ̂ | X_pr, Y_pr, φ_pr, θ_0)  (11)

In these equations, φ broadly denotes the training and pretraining hyperparameters, θ refers to the model parameters, and X, Y indicate sets of inputs and their corresponding outputs. Equation (9) defines a model instance, specifying a probability distribution over the target test labels Y_tst given the model's parameters θ* and a set of test inputs X_tst. Equation (10) defines a training procedure, specifying a probability distribution over model parameters θ* ∈ R^d given a training dataset X_tr, Y_tr, a set of training hyperparameters φ_tr, and a (potentially pretrained) model initialisation θ̂. Lastly, Equation (11) defines a pretraining procedure, specifying a conditional probability over the set of parameters θ̂, given a pretraining dataset X_pr, Y_pr, a set of pretraining hyperparameters φ_pr, and a model initialisation θ_0. 11 Between which of these stages a shift occurs impacts which modelling distributions can be evaluated. We now discuss the different potential loci of shifts.

The train-test locus
Probably the most commonly occurring locus of shift in generalisation experiments is the one between training and test data, corresponding to the classic setup where a model is trained on some partition of the data and then directly evaluated on a shifted (out-of-distribution) test partition. Studies with the train-test locus can assess two different parts of the modelling pipeline. In some cases, researchers investigate the generalisation abilities of a model instance. Studies of this type, therefore, report the evaluation of a single set of parameters θ* as described in Equation (9) -typically made available by others -without considering how exactly it was trained and how that impacted the model's generalisation behaviour. For example, a surge of studies considered the behaviour of the pretrained language model made available by Gulordava et al. (2018), to investigate how it generalises to, for instance, different syntactic constructions (e.g. Lakretz et al., 2019). 12 Alternatively, researchers might evaluate one or more training procedures, investigating if the training distribution results in model instances that generalise well -for example, to study how generalisation compares between different architectures (e.g. Saxton et al., 2019) or how it is affected by the amount of training data (e.g. Artetxe et al., 2021; Rae et al., 2021). While these cases also require evaluating model instances, the focus of the evaluation is not on one particular instance, but rather on the procedure that generated the (multiple) evaluated model instances.

The finetune train-test locus
The second potential locus of shift bears similarities to the first one but instead considers data shifts between the train and test data used during finetuning, and thus concerns models that have gone through an earlier stage of training. This locus occurs when a model is evaluated on a finetuning test set that contains a shift with respect to the finetuning training data (Damonte and Monti, 2021;Kavumba et al., 2022;Ludwig et al., 2022). Studies with a finetune train-test locus can evaluate the same parts of the modelling pipeline as studies with a train-test locus. However, studies that investigate the generalisation abilities of individual finetuned model instances are rare. More frequently, research with this locus focuses on the finetuning procedure and on whether it results in finetuned model instances that generalise well on the test set. Experiments evaluating o.o.d. splits during finetuning often also include a comparison between different pretraining procedures (e.g. they compare how BERT models and RoBERTa models behave during finetuning), thus investigating both a pretrain-train shift and a finetune train-test shift. We will mark them as having multiple loci, as will be further discussed in the last subsection.

The pretrain-train locus
A third possible locus of shift is between pretraining and training data. Experiments with this locus evaluate whether a particular pretraining procedure, as described in Equation (11), produces model initialisations that generalise well to the shifted data used in a later training stage -for instance, whether a general-purpose pretrained model can successfully be adapted to new tasks or domains during finetuning.

The pretrain-test locus
The fourth locus of shift is between pretraining and test data. This locus occurs when a pretrained model is evaluated directly on o.o.d. data, without further training (i.e. X_tr, Y_tr = ∅, ∅) -as frequently happens in in-context learning setups -or when a pretrained model is finetuned on examples that are i.i.d. with respect to the pretraining data and then tested on out-of-distribution instances. The former case (θ* = θ̂) is similar to studies with only one training stage in the train-test locus, but distinguishes itself by the nature of the (pre)training procedure, which typically has a general-purpose objective (e.g. a language modelling objective), rather than being task-specific. Furthermore, while generalisation studies with a train-test locus almost always explicitly consider the relationship between training and test data, this is frequently not the case with pretrain-test studies, where data shifts are assumed.

Multiple loci
The last option on the locus axis describes studies which simultaneously investigate multiple shifts between different parts of the modelling pipeline. We refer to these cases as generalisation tests with multiple loci. More explicitly, experiments of this type consider shifts both between the pretraining and the training data and between the training and the test data. 13 Multiple-loci experiments evaluate all stages of the modelling pipeline at once: they assess the generalisability of models produced by the pretraining procedure as well as whether generalisation is achieved in the finetuning stage (e.g. FitzGerald et al., 2022; Hu et al., 2020; Tu et al., 2020). Because multiple-loci experiments necessarily also contain multiple shifts, we mark them as multiple shifts on the shift type axis. The nature of the two shifts may not be the same, but in our analysis, we group them all into a single category. In our proposed evaluation cards (Appendix B), however, different loci within a single experiment can be recorded separately.

A review of existing generalisation research
We presented a taxonomy containing five categorical axes that can be used to characterise generalisation research. We now use the taxonomy to analyse a large amount of existing generalisation research and create a comprehensive map indicating which areas are covered and which are still unexplored. More specifically, we consider 619 generalisation experiments in NLP, presented in a total of 449 papers from the ACL Anthology that have the (sub)words generalisation, generalization, generalise or generalize in their title or abstract, and we label them with their values on the five taxonomy axes. In Appendix A, we provide more details on the selection procedure of the papers. The full list of papers is provided in Appendix F, as well as -in searchable form -on our website. 14 On the same website, we furthermore present interactive ways to visualise the results; a search tool to retrieve relevant citations; and a means to generate evaluation cards that authors can put in their paper or appendix to comprehensively summarise which generalisation experiments they conducted (for an example, we refer to Figure 2 and Appendix B). In this section, we present the main findings of our analysis.

Overall trends on different axes
We begin by discussing the overall frequency of occurrence of different categories on the five axes, without taking into account interactions between them. We plot the relative frequencies of all axis values in Figure 8 and their development over time in Figure 9. Because the number of generalisation papers retrieved before 2018 is very low (see Figure 7a), we restrict the diachronic plots to the last five years; all other statistics reported are computed over our entire selection of papers.

Motivations
As we can see in Figure 8 (top left), by far the most common motivation to test generalisation is the practical motivation. The intrinsic and cognitive motivations follow, whereas the studies in our review that consider generalisation from a fairness perspective make up only 3% of the total. In part, this low number could stem from the fact that our keyword search in the Anthology (see Appendix A for more information) was not optimal for detecting fairness studies, and we welcome researchers to submit other generalisation studies with a fairness motivation for review. However, we also speculate that attention for the potential harmfulness of models trained on large, uncontrolled corpora has only relatively recently started to grow, and that generalisation has simply not yet been studied extensively in the context of fairness. Overall, we see that trends on the motivation axis have experienced small fluctuations over the past five years (Figure 9a) but have remained relatively stable.

Generalisation type
We find that cross-domain is the most frequent generalisation type, making up more than 30% of all studies, followed by robustness, cross-task and compositional generalisation (Figure 8, left side). Structural and cross-lingual generalisation are the least commonly investigated. On the one hand, studies investigating structural generalisation may be underrepresented as they typically focus more on whether models can capture structures at all, rather than on whether they generalise to new structures. On the other hand, while cross-lingual studies may be undersampled as they tend to less frequently use the word 'generalisation' in their title or abstract (sometimes in favour of 'transfer'), we hypothesise that their low number is reflective of the English-centric disposition of the field. As for fairness studies, we encourage researchers to suggest cross-lingual generalisation papers that we may have missed via our website so that we can better estimate to what extent cross-lingual generalisation is in fact understudied.

Shift type
Data shift types (Figure 8, bottom) are very unevenly distributed over their potential axis values: the vast majority of generalisation research considers covariate shift. Given that covariate shift is more easily addressed by most current modelling techniques, and that it can occur between any two stages of the modelling pipeline -while label and full shift typically occur between pretraining and finetuning -this is, to some extent, to be expected. More unexpected, perhaps, is the relatively high number of assumed shifts, which correspond to studies that claim to test generalisation but do not explicitly consider how the test data relates to the data used at various stages of model training. The percentage of assumed shifts has in fact increased over the past few years (Figure 9b). We hypothesise that this trend, which signals a movement of the field in the wrong direction, is predominantly caused by the use of increasingly large, general-purpose training corpora. Such large corpora, which are often also not in the public domain, make it very challenging to analyse the relationship between the training and testing data and, consequently, to determine what kind of conclusions can be drawn from evaluation results. More promising, instead, is the fact that several studies consider multiple shifts, thus assessing generalisation throughout the entire modelling pipeline.

Shift source
On the shift source axis (Figure 8, bottom right), we see that almost half of the reviewed generalisation studies consider naturally occurring shifts, natural corpora that are not deliberately split along a particular dimension. As discussed later in the current section, this type of data source is most prevalent in cross-task and cross-domain generalisation studies, for which such naturally different corpora are widely available. The next most frequent categories are generated shifts, where one of the datasets involved is generated with a specific generalisation property in mind, and artificially partitioned natural data, describing settings in which all data is natural, but the way it is split between train and test is controlled. Fully generated datasets are less common, making up only 11% of the total number of studies.

Shift locus
Lastly, for the locus axis (Figure 8, top right), we see that the majority of cases focuses on (finetune) train-test splits. Far fewer studies focus on shifts between pretraining and training or between pretraining and testing. Similar to the previous axis, a comparatively small percentage of studies considers shifts in multiple stages of the modelling pipeline. At least in part, this might be driven by the larger amount of compute that is required for those scenarios. Over the last five years (Figure 9c), however, the percentage of studies considering multiple loci and pretrain-test loci -the two least frequent categories -has increased.

Interactions between axes
Next, we consider interactions between different axes. Are there any combinations of axes that occur together very often or combinations that are instead rare? We encourage the reader to explore these interactions dynamically on our website. Here, we discuss a few relevant trends.

What data shift source is used for different generalisation types?
In Figure 10a, we show the frequency of each data source per generalisation type, normalised by the total number of times that generalisation type occurs (i.e. row sum, to make patterns comparable between generalisation types). The shift source varies widely across different types of generalisation. Compositional generalisation, for instance, is predominantly tested with fully generated data, a data type that hardly occurs in research considering robustness, cross-lingual or cross-task generalisation. Those three types of generalisation are most frequently tested with naturally occurring shifts or, in some cases, with artificial splits of natural corpora. Structural generalisation is the only generalisation type that appears to be tested across all different data types. As far as we are aware, there exist very few studies that directly compare results between different sources of shift -for instance, to investigate to what extent results on generated shifts or fully generated data are indicative of performances on natural corpora. 15 Such studies could provide insight into how choices in the experimental design impact the conclusions that are drawn from generalisation experiments, and we believe that they are an important direction for future work.
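As a small worked example of the normalisation just described (with made-up counts rather than our actual annotation data), each row of a type-by-source count table is divided by its row sum so that the resulting proportions are comparable across generalisation types:

```python
import numpy as np

# Hypothetical counts: rows = generalisation types, columns = shift sources
# (natural, artificial split, generated, fully generated).
counts = np.array([
    [ 5, 10, 20, 65],   # compositional (made-up numbers)
    [60, 25, 10,  5],   # robustness (made-up numbers)
])

# Normalise by row sum so patterns are comparable between generalisation types.
relative = counts / counts.sum(axis=1, keepdims=True)
print(relative.round(2))
# [[0.05 0.1  0.2  0.65]
#  [0.6  0.25 0.1  0.05]]
```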

For which loci of shift are different generalisation types studied?
We observe that of all the generalisation types, only cross-task generalisation is frequently investigated in the pretrain-train and pretrain-test stages (Figure 10b). For all other types of generalisation, the vast majority of tests are conducted in the train-test or finetune train-test stage. In some cases, these differences are to be expected: as general-purpose pretrained models are usually trained on very large, relatively uncontrolled corpora, investigating how they generalise to a different domain without further finetuning is typically not possible, and neither is evaluating their robustness, which typically requires more detailed knowledge of the training data. The statistics also confirm the absence of studies that consider compositional generalisation from pretraining to finetuning or from pretraining to training, which, as we previously discussed (§2.2.1), is philosophically and theoretically challenging in such setups. A final observation is the relative underrepresentation of studies with multiple loci across all generalisation types, especially given the large number of studies that consider generalisation in the finetuning stage or the pretrain-train stage. Those studies have included multiple training stages but considered generalisation in only one of them. We hope to see this trend change in the future, with more studies considering generalisation across the entire modelling pipeline.

Figure 10c shows that assumed shifts mostly occur in the pretrain-test locus, which confirms our hypothesis that assumed shifts are likely caused by the use of increasingly large, general-purpose training corpora. When such pretrained models are further finetuned, studies often consider either a shift between pretraining and finetuning where new labels are introduced, or a covariate shift in the finetuning stage -as such, they do not require an in-depth understanding of the pretraining corpus. 16 When models are directly evaluated, however, the only shift that can be considered is the one between the very large pretraining corpus and the test corpus. This trend points to a substantial challenge when it comes to evaluating generalisation with limited knowledge about model pretraining.

How does motivation drive generalisation research?
To discuss the relationship between the motivation behind a study and the other axes, we focus in particular on its interactions with generalisation type, shift locus and shift source, as shown in Figure 10d-10f. Considering first the relationship between motivation and generalisation type (Figure 10d), we see that cross-domain, robustness, cross-task and cross-lingual generalisation are predominantly motivated by practical considerations, with robustness generalisation studies also frequently being motivated by an interest in understanding how models work intrinsically. We find that compositional and structural generalisation studies are both frequently driven by cognitive motivations -which is to be expected given the importance of these concepts in human cognition and intelligence. The motivation given most frequently for compositional generalisation, however, is a practical one. While in human learning, compositionality is indeed often associated with important practical properties -speed of learning, quick generalisation -as far as we know, there is little empirical evidence that compositional models actually perform better on natural language tasks.

A similar apparent mismatch can be observed in Figure 10f when focusing on the practical motivation. Practical generalisation tests are typically aimed at improving models or at being directly informative of a model's applicability. Nonetheless, more than 20% of the practically motivated studies use either artificially partitioned natural data or even fully generated data. To what extent could their conclusions then actually be informative of models applied in practical scenarios? These apparent mismatches between the motivation and the experimental setup demonstrate the importance of the motivation axis in our taxonomy -being aware of and explicit about a study's motivation ensures that its conclusions are indeed informative with respect to the underlying research question.

Another interesting observation that can be made from the interactions between motivation and shift locus is that the vast majority of cognitively motivated studies are conducted in a train-test setup. While there are many good reasons for this, conclusions about human generalisation are drawn from a much more varied range of 'experimental setups'. For instance, any experiment done with adults can be thought of as more similar to a test with a finetune train-test or pretrain-test locus than to one with a train-test locus, as adults have a lifetime of experience over which the experimenter has little control beyond participant selection. On the one hand, this suggests that generalisation with a cognitive motivation should perhaps be evaluated more often with those loci. On the other hand, it raises the question of whether the field could take inspiration from experiments on human generalisation for the challenging effort of evaluating the generalisation of large language models, trained on uncontrolled corpora, in a pretrain-test setting. While there are, of course, substantial differences between the assumptions that can reasonably be made about the linguistic experiences of a human and the pretraining of a language model, 17 we believe that input from domain experts who have extensively studied generalisation in humans might be very beneficial to improving model generalisation testing in these more challenging setups.

Conclusion
While the ability to generalise well is considered to be one of the primary desiderata for NLP models, there is very little agreement on what kind of generalisation behaviour modern-age NLP models should exhibit, and under what conditions that should be evaluated. For decades, generalisation has simply been evaluated with random train-test splits. The recent past, however, has seen several studies illustrating that models that exhibit near-perfect performances on such i.i.d. splits can sometimes drastically fail in a wide range of scenarios that require different forms of generalisation. This body of work demonstrates the need for more comprehensive generalisation testing, but it does not provide much guidance: different papers use different experimental setups, different types of data and even entertain different ideas about what it means for an NLP model to generalise well. As a consequence, even though its importance is almost undisputed, extensive state-of-the-art generalisation testing is not currently the standard in NLP. With our work, we aim to take the first step towards making it the new status quo. In this last section, we summarise our work, discuss its limitations, and sketch how we believe it can be used in the future.

The generalisation taxonomy
We presented a new framework to systematise and understand generalisation research. The core of this framework consists of a taxonomy that characterises generalisation studies along five dimensions. This taxonomy, which was designed based on an extensive review of generalisation papers in NLP, can be used to critically analyse existing generalisation research and to structure new studies. The five nominal axes of the taxonomy describe why a study is executed (the main motivation of the study), what the study intends to evaluate (the type of generalisation it aims to solve), and how the evaluation is conducted (the type of data shift considered, the source of this data shift, and the locus in which the shift is investigated). An overview of our taxonomy is provided in Figure 1; the axes are discussed in §2.1-2.5. For the reader's convenience, a concise summary is provided in Appendix E.

Existing work on generalisation: the taxonomy in action
To illustrate the use and usefulness of our taxonomy, we used it to analyse 449 papers from the ACL Anthology that have the (sub)words generali(s/z)ation or generali(s/z)e in their title or abstract. Through this analysis, we demonstrated that the taxonomy is applicable to a wide range of generalisation studies, and we were able to provide a comprehensive map of the field, observing overall patterns and making suggestions for areas that should be prioritised in the future. Our most important conclusions and recommendations are:
• The goal of a study is not always perfectly aligned with its experimental design. We advise that future work should be more explicit about motivations -which strongly impact what sort of generalisation is even desirable -and should incorporate deliberate assessments to ensure that the experimental setup matches the goal of the study. To facilitate this, we introduce evaluation cards (see Appendix B) that can be used to comprehensively report which generalisation experiments are conducted in a paper.
• Cross-lingual studies and generalisation studies motivated by fairness goals are underrepresented.
We suggest that these areas should be given more attention in future work.
• Papers that target similar generalisation questions vary widely in the type of evaluation setup they use. In our view, the field would benefit from more meta-studies that consider how the results of experiments with different experimental paradigms compare to each other.
• The vast majority of generalisation studies focuses on only one stage of the modelling pipeline. More work is needed that considers generalisation in all stages of training, to prioritise models whose generalising behaviour persists throughout their training curriculum.
• Recent popular NLP models that can be tested directly for their generalisation from pretraining to testing (e.g. in prompting setups, without any further model training) have often been evaluated without considering the relationship between the (pre)training and the test data. We suspect that this is because generalisation is particularly difficult to assess when large, uncontrolled training corpora are involved, and we suggest that inspiration might be taken from how generalisation is evaluated in experiments with human participants, where control over and access to the 'pretraining' data of a participant are unattainable.
While the review and conclusions presented in this paper are necessarily static, along with this paper we also launch a website, on which new entries can be added by authors. On this website, we furthermore provide a set of visualisation tools that make it possible to visualise our results in different ways and a set of search tools that allows browsing through the reviewed papers, finding studies with specific features, and collecting relevant paper references.

Future work
By providing a systematic framework and a set of concrete (online) tools that allow for a structured understanding of generalisation, we believe we have taken the necessary first step towards making state-of-the-art generalisation testing the new status quo in NLP. We hope that our taxonomy will prove useful in clarifying what type of generalisation is useful in which scenario; that it will help researchers define and characterise generalisation studies, systematically registering them with our proposed evaluation cards; and that our online overview of generalisation studies will continue to provide a comprehensive picture of what happens in the field of generalisation. Still, our work is by no means intended to be the end of the road. For example, while our taxonomy can make future generalisation research in NLP more comparable, structured and carefully designed, and while our survey suggests interesting research directions, this work does not provide standardised data or procedures for generalisation testing. We envision that important generalisation tests should be hosted on a shared platform, along with a leaderboard, to make generalisation testing more accessible and transparent, and that this platform should not be controlled by a single group of people but by a larger community of NLP researchers and domain experts. Lastly, in the same way that our thoughts on how generalisation should be evaluated have evolved with models in the past, such a platform should continue to evolve in the future. What we consider important to evaluate now might change next year, and when models get better at setups considered difficult today, we might discover new types of generalisation that we had not thought of before. How we evaluate models should reflect this constant evolution, and which tests are prioritised should thus change along with new models and knowledge.

Limitations
Designing a coherent, consistent, and at the same time usable taxonomy of generalisation research in NLP is a non-trivial task, which required substantial discussion among the authors. We finish this paper by reporting the main trade-offs in the decisions behind our work, concerning the definition of the taxonomy, the annotation process and the selection of papers to review.
First, we designed this taxonomy and its set of axes to highlight theoretically important but also practically functional distinctions between generalisation studies. In doing so, we opted for relatively coarse axis values, which allow drawing higher-level conclusions about the field as a whole. At the same time, this sometimes groups together papers that could be regarded separately. An example already discussed is that of studies with a pretrain-train locus, which by definition all include more than one training stage and investigate generalisation in the first one. This category thus contains both papers that use a general-purpose pretraining objective and then finetune on different tasks, and studies whose finetuning objective matches the pretraining objective (e.g. studies that consider domain adaptation in a continual learning setup). While those differences are, at least in part, reflected on other axes, in some cases it might be helpful to distinguish the two cases more explicitly. Something similar occurs on the shift type axis. The three formal shift types that we consider are statistically well-grounded, but shifts of the same type can still vary considerably. Whether the distance between two distributions is small or large might make a substantial difference to the difficulty of the generalisation problem, which is something that is currently not reflected in our taxonomy. Although quantifying differences between distributions is often problematic in practice, we believe that adjusting the taxonomy to capture the difficulty of generalising to a particular shift could be helpful in the future. More generally, we imagine that future experimental paradigms might call for the addition of values on some of the axes, or even the addition of new axes. We are already observing, for example, that new studies increasingly often include more than three modelling distributions. Our taxonomy can be naturally extended to account for modelling pipelines with an arbitrary number of learning stages.
A second limitation concerns the labelling and characterisation of individual studies. In the description of the axes and their different values, we aimed to be as comprehensive and precise as possible. In practice, however, there are always cases in which the actual category of a paper is debatable. Sometimes this occurs because the paper itself is not clear about what exactly it attempts to evaluate or about its motivation; we hope that our taxonomy will reduce the number of such cases in the future. In other instances, it is simply difficult to apply some concepts or distinctions, despite their theoretical sharpness, to concrete studies. A clear example of this challenge is the shift type. In theory, p(x), p(y|x) and p(y) are clearly defined concepts; in practice, it is usually impossible to estimate the actual difference between two (natural) distributions. Some might even argue that, in practice, train and test sets are virtually always distributionally different. For the purpose of systematising generalisation testing and characterising experiments, however, this is not a useful observation. In our taxonomy design and annotations, we aimed to make distinctions that we deemed useful, rather than relying on "true" but unknown differences between distributions.
Thirdly, in our paper selection and annotation, we deliberately excluded a few types of papers. For example, we did not include any studies that considered more than one modality. While we believe they are interesting to consider from a generalisation perspective, they are also more difficult to characterise within a single taxonomy, as they involve more distributions (with sometimes very different support) and thus more distribution shifts. We consider including such papers a compelling step for future work. Another set of papers that we excluded are those that do not conduct behavioural experiments but look at the generalisability of representations (e.g. probing papers). We do not see any a priori reason that they could not be characterised by our taxonomy, and we believe this would be a valuable enterprise. In particular, although marking the difference between behavioural and representational experiments might require updating the taxonomy, a comparison of behavioural and representational experiments with the same axis values might make for an interesting meta-study.
A last critical observation that we would like to make is that our work builds on the assumption that strong generalisation skills are considered crucial for models of NLP. While we generally believe this to be true, there might be cases where generalisation is not in fact needed. One could provocatively argue that, for LLMs trained on extremely large English datasets, the vast majority of application scenarios is, practically speaking, close to i.i.d., and that complex forms of generalisation are thus not needed. We abstain from judging whether and when this holds, but argue that when researchers believe that their setup requires no generalisation, they should clearly state so and explain why that is the case.

A Annotation setup
In this section, we describe the procedures we used for the selection of the papers in our review and their annotation.

A.1 Paper selection
An initial selection of manuscripts was made through a substantial preliminary literature review by the main authors of this paper. We then carried out a search through the ACL Anthology. We started by retrieving all papers that have the (sub)words generalisation, generalization, generalise or generalize in their title or abstract. In Figure 7, we see that the number of papers with those keywords grew substantially over time, both in absolute and relative terms. We manually checked the abstracts and titles of the resulting papers to remove those that were not, in fact, addressing a generalisation question (for instance, because they proposed a generalisation of a method, or because they used random train-test splits). Furthermore, we restricted ourselves to papers with one modality. We then annotated the resulting papers using the taxonomy presented in the previous sections. During the annotation process, we sometimes removed entries that upon further reading did not contain generalisation experiments, and we duplicated entries that contained multiple experiments with different values on one of our axes. The findings presented in this section encompass a total of 619 generalisation experiments, presented in 449 papers. The full list of papers can be found in the second bibliography at the end of this paper, as well as on our website. 18 While the conclusions in this -static -paper pertain only to this specific selection of papers, we intend to keep expanding the number of entries on our website, with existing papers we missed or as new generalisation papers are published.
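The keyword filter itself can be illustrated with a few lines of code (a schematic reconstruction under our own assumptions about the metadata format; the actual retrieval was done over the ACL Anthology and was followed by the manual checks described above):

```python
import re

# Schematic reconstruction of the keyword filter over title + abstract.
KEYWORDS = re.compile(r"generali[sz]ation|generali[sz]e", flags=re.IGNORECASE)

def matches(paper):
    """paper: dict with 'title' and 'abstract' fields (hypothetical format)."""
    text = f"{paper.get('title', '')} {paper.get('abstract', '')}"
    return bool(KEYWORDS.search(text))

papers = [
    {"title": "Compositional generalisation in seq2seq models", "abstract": "..."},
    {"title": "A new parser", "abstract": "We generalize beam search to ..."},
    {"title": "Dependency parsing with CRFs", "abstract": "..."},
]
candidates = [p for p in papers if matches(p)]
print(len(candidates))  # 2 -- both would still need a manual relevance check
```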

A.2 Annotation
The annotation of all selected papers was done collectively by the authors of this article. Each paper was given five labels by a first annotator, one for every axis of our taxonomy, and these labels were then checked by a second annotator. Disagreements were discussed between the two annotators, and for unresolved cases, a third annotator was consulted. As a guide, we used the diagram presented in Figure 11. An FAQ with common questions that occurred while using this diagram, which intends to capture our taxonomy but is naturally a simplified version of it, can be found on our website. In addition to the taxonomy axis values, we also annotated which task(s) the studies considered. If a paper performed the same experiment with multiple different tasks, we labelled it multiple tasks, used the overarching category (e.g. NLU) when possible, or marked it as multitask if the purpose was to show that a model can perform them all at the same time. If a paper contained multiple studies with different values on the same axis -e.g. a paper that considers both cross-domain and compositional generalisation, or uses both natural shifts and synthetic data -we recorded those experiments separately.

Figure 11: A graphical representation of our annotation process and an indication of where in a paper you might find the information required to complete the annotation. One paper can potentially contain multiple generalisation questions -e.g. both cross-domain and cross-task generalisation, or both generated shifts and splits using natural data. In that case, the diagram has to be walked through twice. Of course, the diagram is an aid that helps characterise papers but also simplifies the full taxonomy. On our website, we keep track of common questions that arise when using the diagram to characterise papers in an FAQ.

B Evaluation cards
In the main text of this paper, we have argued several times that we believe standardisation is an important requirement for state-of-the-art evaluation in NLP. To further push in this direction, we propose, on top of our taxonomy, a standardised way to indicate what kind of generalisation experiments a paper reports: evaluation cards. Evaluation cards (for an example, see Figure 2) allow all generalisation experiments conducted in a study to be visualised in a comprehensive way, and thus show at a glance how extensively a model was evaluated. In contrast to our review, in which multiple loci and shifts are grouped under one category for visualisation and analysis purposes, the evaluation cards do allow recording which shifts and loci are investigated in the same experiment. In the example card in Figure 2, for instance, the experiment indicated with the triangle considers a covariate shift in the finetuning stage (from one language to another), but through investigating multiple pretrained models it also investigates a label shift from pretraining to training. On our website, we provide a tool to generate (one- or two-column) evaluation cards that authors can include in the appendix of their papers. In Figure 12, we show how this tool is rendered in our web interface.

C Author contributions
To recognise individual author contributions, we detail those contributions below, following the Contributor Roles Taxonomy (CRediT) 19 introduced by Elsevier. Authors are listed in the order they appear on the author list.

D Multi-lingual benchmarks
While the field of multilingual modelling is vast and associated with many interesting generalisation questions, papers in this area do not often focus explicitly on generalisation. In this section, we provide a list of the most important available multilingual benchmarks which can be used to evaluate cross-lingual generalisation. Multilingual benchmarks or datasets are created in a variety of ways. Several benchmarks are created by translating monolingual benchmarks into different languages, usually through a professional translation service (Artetxe et al., 2020; Conneau et al., 2018; Ebrahimi et al., 2022; FitzGerald et al., 2022; Lewis et al., 2020; Li et al., 2021a; Longpre et al., 2021; Mostafazadeh et al., 2016; Ponti et al., 2020; Williams et al., 2018; Yang et al., 2019; Zhang et al., 2019). Other multilingual benchmarks have instead been built by separately annotating each language with the help of its native speakers (e.g. Adelani et al., 2021; Asai et al., 2021; Clark et al., 2020; Muller et al., 2021). Yet another way to construct multilingual benchmarks is to leverage existing resources that cover multiple languages. For instance, Wikipedia has been used as a resource to derive multilingual benchmarks (Botha et al., 2020; Liu et al., 2019a; Pan et al., 2017; Rahimi et al., 2019), and several multilingual summarisation datasets have been created by extracting article-summary pairs from online newspapers or how-to guides (e.g. Hasan et al., 2021; Ladhak et al., 2020; Nguyen and Daumé III, 2019; Scialom et al., 2020; Varab and Schluter, 2021). Various linguistic resources have also been exploited: for instance, the Universal Dependencies treebank (Nivre et al., 2020) has been used to evaluate cross-lingual part-of-speech tagging, and multilingual WordNet and Wiktionary have been used to build XL-WiC (Raganato et al., 2020), an extension of WiC (Pilehvar and Camacho-Collados, 2019) that reformulates word sense disambiguation in 12 languages as a binary classification task. Finally, in the same spirit as GLUE and SuperGLUE for English, several aggregated benchmarks include selected sets of benchmarks previously proposed by others (e.g. Hu et al., 2020; Liang et al., 2020; Ruder et al., 2021; Wang et al., 2022), which allow for evaluating cross-task and cross-language generalisation simultaneously.

E A concise summary of our taxonomy
For the convenience of the reader, in this section of the supplementary materials we provide a concise summary of our generalisation taxonomy. The taxonomy we propose is based on a detailed analysis of a large number of existing studies on generalisation in NLP, and it includes the five main axes along which those studies differ. The five axes capture different aspects of generalisation studies that together form a comprehensive picture of the motivation and goal of a study and provide information on important choices in the experimental setup. The first axis of our generalisation taxonomy (§2.1) is the high-level motivation for the study. The motivation of a study impacts or even determines what type of generalisation is desirable, as well as what kind of conclusions can be drawn from a model's display or lack of generalisation. Furthermore, the motivation of a study shapes its experimental design. It is therefore important for researchers to be explicitly aware of it, to ensure that the experimental setup aligns with the questions they seek to answer. We consider four different types of motivations: the practical motivation, the cognitive motivation, the intrinsic motivation, and the fairness and inclusivity motivation.
The second axis in our taxonomy (§2.2) indicates the type of generalisation the test is addressing. This axis describes on a high level what exactly it is that a generalisation test is intended to capture, rather than considering why or how, making it one of the most important axes of our taxonomy. In the literature, we have found six main types of generalisation: compositional generalisation, structural generalisation, cross-task generalisation, cross-lingual generalisation, cross-domain generalisation, and robustness generalisation.
The third axis in our taxonomy (§2.3) describes what kind of data shift is considered in the generalisation test. This axis adds a statistical interpretation to our taxonomy and derives its importance from the fact that data shift plays an essential formal role in defining and understanding generalisation from a statistical perspective, as well as from the fact that different types of shifts are best addressed with different kinds of experimental setups. On the data shift axis, we consider three shifts which are well-attested in the literature: covariate shift, label shift and full shift. We further include two additional types of shift -assumed shift and multiple shifts -to account for studies that cannot be labelled with any of the three main shift types.
In the fourth axis of our taxonomy (§2.4), we consider the source of the data shift used in the experiment. The source of the data shift determines how much control the experimenter has over the training and testing data and, consequently, what kind of conclusions can be drawn from an experiment. We distinguish four different sources of shifts: naturally occurring shifts, artificially partitioned natural corpora, generated shifts and fully generated datasets.
In the last axis of our taxonomy (§2.5), we consider the locus of the data shift or, in other words, the part of the modelling pipeline for which generalisation is investigated. The locus of the shift, together with the shift type, forms the last piece of the puzzle, as it determines what part of the modelling pipeline is investigated and thus the kind of generalisation question that can be asked. On this axis, we consider shifts between all stages in the contemporary modelling pipeline -pretraining, training and testing -as well as studies that consider shifts between multiple stages simultaneously.