Three types of incremental learning

Incrementally learning new information from a non-stationary stream of data, referred to as ‘continual learning’, is a key feature of natural intelligence, but a challenging problem for deep neural networks. In recent years, numerous deep learning methods for continual learning have been proposed, but comparing their performances is difficult due to the lack of a common framework. To help address this, we describe three fundamental types, or ‘scenarios’, of continual learning: task-incremental, domain-incremental and class-incremental learning. Each of these scenarios has its own set of challenges. To illustrate this, we provide a comprehensive empirical comparison of currently used continual learning strategies, by performing the Split MNIST and Split CIFAR-100 protocols according to each scenario. We demonstrate substantial differences between the three scenarios in terms of difficulty and in terms of the effectiveness of different strategies. The proposed categorization aims to structure the continual learning field, by forming a key foundation for clearly defining benchmark problems.


Supplementary Note 2: An example beyond the academic continual learning setting
Here we provide an example to illustrate the generalized versions of the three continual learning scenarios. As is done for the academic continual learning setting in the main text, the MNIST dataset is used to demonstrate how a given 'task-free' data stream [10][11][12][13] can be performed in three different ways. Moreover, using this example, we also perform an empirical evaluation to compare the generalized scenarios with each other and to test the efficacy of different computational strategies on each of them.
Task-free version of Split MNIST. We start by describing a task-free version of Split MNIST, in which there are no sharp boundaries between contexts and in which a context can be experienced more than once. We then illustrate how such a task-free continual learning experiment can be performed according to all three of the generalized versions of the continual learning scenarios defined in the main text.
The first step in setting up this continual learning experiment is defining the context set (i.e., the set of underlying distributions). We use the same context set as for the standard version of Split MNIST in the main text: the original MNIST dataset is split up into five contexts in such a way that each context contains two digits (Fig. 2.1A). This means that the non-stationary aspect of the data in this experiment can be described as "the type of digit, in a pairwise manner".
The second step to set up the experiment is defining the data stream (i.e., the set of experiences that are sequentially presented to the algorithm). In the academic continual learning setting, the data stream directly corresponds to the context set, as each experience simply contains all the training data of the corresponding context. Instead, here we follow the more general framework described by equation (1) in the main text, which states that the i-th observation of experience t is drawn from context c with probability p_{t,i}^c. We use probabilities p_{t,i}^c that are defined piecewise in terms of x_t^c = (t − (500 + (c − 2) · 2000)) mod 10,000, and that are zero otherwise, where x mod y is the modulo operator that returns the remainder of the division of x by y. The total number of experiences is 10,000 and the number of observations per experience is 128. The resulting data stream has gradual transitions between contexts and the first context is revisited (Fig. 2.1B). Similar gradual transitions between contexts were used by refs. 4,9.

Fig. 2.1 | A task-free data stream can be performed according to each of the three generalized continual learning scenarios. This figure illustrates three key components of a continual learning experiment: a, the context set specifies what aspect of the data changes over time; b, the data stream specifies how that aspect changes over time; and c, the scenario specifies how the aspect of the data that changes over time relates to the mapping that must be learned. Notation: X is the input space, C is the context space, Y is the within-context label space and G is the global label space. D_c is the underlying distribution of context c and p_{t,i}^c is the probability that the i-th observation in experience t is sampled from D_c.
Importantly, as for the standard version of Split MNIST, this task-free version of Split MNIST can be performed in three different ways depending on how the mapping that must be learned relates to the context space: with generalized task-incremental learning the algorithm must learn the mapping f: X × C → Y (i.e., context labels are known at test time), with generalized domain-incremental learning the mapping to be learned is f: X → Y, and with generalized class-incremental learning the mapping f: X → C × Y (or f: X → G) must be learned (Fig. 2.1C). As in the academic setting, domain- and class-incremental learning can be distinguished by whether the expected output is the within-context label or the global label.
Empirical comparison. Next, using the task-free version of Split MNIST outlined above, we set out to perform an empirical comparison to test how well the different computational strategies work in each of the three generalized continual learning scenarios. However, not all methods that were used for the empirical comparisons in the main text could be applied in a task-free setting in a straightforward manner, because some methods require context boundaries to perform certain consolidation operations (e.g., updating the regularization term in the loss function or replacing the copy of the model used for generating replay). A way around this is to instead perform these consolidation operations at regular intervals; in our experiments we did this after every 100 experiences. In other words, every 100 experiences were considered to be a new context. This approach resulted in feasible modified versions of the methods SI and LwF. On the other hand, for the methods EWC and FROMP this approach resulted in very high computational costs, while for the generative replay methods DGR and BI-R performance would likely fall sharply due to an insufficient number of iterations between consolidation operations for the generative models to converge. These methods were therefore left out of the comparison. For the methods ER and A-GEM, rather than filling up the memory buffer with randomly selected samples after finishing training on a context, we instead used reservoir sampling 14,15. For ER, the loss on the current experience and the loss on the replayed data were weighted equally: L_total = 0.5 L_current + 0.5 L_replay.
For the method iCaRL two modifications were made: reservoir sampling was used to fill the memory buffer (instead of the herding algorithm) and the feature extractor was trained using standard cross entropy loss with replay of samples from the memory buffer, using similar weighting as for ER above (instead of using the binary classification / distillation loss given by equation (28) in the main text). The approaches Separate Networks, XdG and Generative Classifier did not require modifications to work in this setting.
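Reservoir sampling, used above to fill the memory buffer, maintains a uniform random sample of a data stream without knowing the stream's length in advance: after any number of observations, each one seen so far has the same probability of being in the buffer. A minimal sketch (the `ReservoirBuffer` class is illustrative, not taken from any of the compared methods' code):

```python
import random

class ReservoirBuffer:
    """Maintain a uniform random sample of a data stream (Vitter's Algorithm R)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.n_seen += 1
        if len(self.items) < self.capacity:
            # Buffer not yet full: always store the item.
            self.items.append(item)
        else:
            # Replace a stored item with probability capacity / n_seen.
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.items[j] = item

buffer = ReservoirBuffer(capacity=100)
for x in range(10_000):
    buffer.add(x)
```

Because no context boundaries are needed, this buffering rule fits the task-free setting directly: the buffer is simply updated after every observation.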
As for the academic setting in the main text, in this more flexible continual learning setting we found clear differences between the three generalized continual learning scenarios (Table 2.1). Importantly, we found qualitatively similar results as in the academic setting with regard to which computational strategies work well in which generalized scenarios. In particular, parameter and functional regularization achieved reasonable performance with generalized task-incremental learning, but they did not work well with generalized domain-incremental learning and they completely failed with generalized class-incremental learning. Replay performed well in all three generalized scenarios, while template-based classification was able to match or surpass replay's performance with generalized class-incremental learning.

Table 2.1 | Results of the compared methods on the task-free version of Split MNIST (columns: Strategy, Method, Budget, GM, and performance per generalized scenario).

Experimental details. Unless stated otherwise above, the experimental settings for these experiments were similar to those used for the Split MNIST experiments in the main text. For example, the same optimizer, learning rate and base neural network architecture were used, and the setup of the output layer was analogous to the way it was set up in the academic setting. Similarly, the compared methods were implemented as described in the main text, except for the modifications discussed above. The hyperparameters for the methods XdG and modified SI were selected by a grid search.

Supplementary Note 3: An example of a mixture of scenarios
Here we provide an example illustrating how the use of a multidimensional context space can result in continual learning problems that consist of multiple scenarios. For this example we use the MNIST dataset with rotated images. The first dimension of the context space, denoted C 1 , divides the data into distinct contexts based on their class label, akin to the Split MNIST protocol (Fig. 3.1A). The second dimension of the context space, denoted C 2 , controls the amount by which a digit is rotated, akin to the Rotated MNIST protocol (Fig. 3.1B). These two dimensions of the context space can vary independently from each other: regardless of what digit an image contains, it can be rotated by all possible amounts. Importantly, each dimension of the context space could be performed according to a different scenario, which means that the resulting continual learning problem could be a mixture of two different scenarios (Table 3.1).
Real-world continual learning problems are often more complex and intertwined than this example. We believe that when solving such complex continual learning problems, it is often useful to break down the problem to its constituent parts by identifying what type(s) of continual learning must be done, as this will be an important determinant of what kind of continual learning strategies might be most appropriate.

Fig. 3.1 | A multidimensional context space.
Schematic illustrating a two-dimensional context space using the MNIST dataset with rotations. a, The first dimension C1 divides the digits based on their class label. b, The second dimension C2 controls the amount of rotation. Each dimension could be associated with a different scenario, allowing for the construction of continual learning problems that are mixtures of scenarios, see Table 3.1.
Table 3.1 | Rows indicate the scenario for the rotation dimension (C2); columns indicate the scenario for the digit dimension (C1).

|                | Task-IL (C1)                                      | Domain-IL (C1)                                  | Class-IL (C1)                                    |
| Task-IL (C2)   | Rotation is given, choice between 2 known digits  | Rotation is given, choice between odd and even  | Rotation is given, choice between all 10 digits  |
| Domain-IL (C2) | Rotation unknown, choice between 2 known digits   | Rotation unknown, choice between odd and even   | Rotation unknown, choice between all 10 digits   |
| Class-IL (C2)  | Identify rotation + choice between 2 known digits | Identify rotation + choice between odd and even | Identify rotation + choice between all 10 digits |

Supplementary Note 4: Strategies for continual learning
Here we discuss and categorize different strategies for continual learning with deep neural networks. We do not aim to provide an extensive review of continual learning methods (for such reviews, see refs. [16][17][18]), but rather we focus on the underlying computational strategies. Many recent methods for continual learning combine several of these strategies.
Context-specific components. One way to reduce interference between contexts is through context-specific components. That is, to use certain parts of a network only for specific contexts (Fig. 3A in the main text). A common example of context-specific components is the multi-headed layout, which means that there is a separate output layer for each context 19,20 . Context-specificity, or modularity, can also be imposed on a network by gating its nodes or weights in a different way for each context. Such context-specific gates could be learned, for example using evolutionary algorithms 21,22 , gradient descent 8 or Hebbian plasticity 23 , or they could be defined randomly and a priori 7 .
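As a concrete illustration of randomly defined, a priori context-specific gates (in the spirit of XdG, though this sketch is ours and not the method's actual implementation), each context can be assigned a fixed random binary mask over the hidden units, so that only the unmasked units are active for that context:

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_contexts, frac_masked = 1000, 5, 0.6

# A fixed, randomly chosen binary gate per context: each context uses
# only a subset of the hidden units (the others are masked to zero).
gates = np.stack([
    rng.permutation(np.arange(n_hidden) >= int(frac_masked * n_hidden))
    for _ in range(n_contexts)
])

def gated_hidden(h, context):
    """Apply the context-specific gate to a hidden activation vector."""
    return h * gates[context]

h = rng.standard_normal(n_hidden)
out = gated_hidden(h, context=2)
```

Because the gates are fixed before training, no extra learning is needed for the gating itself; the cost is that context identity must be available to select the right mask.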
Another way to achieve context-specificity is to add new components, and thus to dynamically expand the network, when a new context is learned [24][25][26] .
In the extreme version of this strategy, there is a completely separate network for each context. In this case, there is no forgetting at all. However, such a full segregation does not have good scaling properties and it precludes positive transfer between contexts. We believe that a specific case of this strategy, whereby the available computational resources (e.g., number of parameters) are divided over all contexts and a separate network is learned for each, is an important but often overlooked baseline for task-incremental learning problems, upon which successful task-incremental learning methods should be able to improve.
In principle, context-specific components can only be used with task-incremental learning, because context identity is required to select the correct context-specific components. However, when combined with an algorithm for identifying the context to which a sample belongs, this strategy can also be used with domain- or class-incremental learning [27][28][29][30][31][32]. In this regard it is important to point out that context identification is itself a class-incremental learning problem, as the mapping that must be learned is of the form f: X → C.
Parameter regularization. Another popular strategy for continual learning is parameter regularization. To reduce forgetting, this strategy encourages neural network parameters important for past contexts not to change too much when learning new contexts (Fig. 3B in the main text). This strategy can be motivated from a Bayesian perspective, as parameter regularization methods can often be interpreted as performing sequential approximate Bayesian inference on the parameters of a neural network 2,33,34 .
A common way to do parameter regularization is by adding a regularization term to the loss that penalizes changes to the network's parameters θ, weighted by an estimate of their importance for previous contexts (e.g., refs. 2,3,35-37):

L_total(θ) = L(θ) + λ ∥θ − θ*∥_Σ

whereby L is the loss on the current context, λ controls the strength of the regularization, θ* is the parameter vector relative to which changes are penalized (e.g., the value of θ after finishing training on the last context), Σ is an estimate of how important the parameters are for previous contexts and ∥·∥_Σ is a weighted norm. Typically, a weighted L2-norm is used, in which case the regularization term is given by ∥θ − θ*∥_Σ = (1/2)(θ − θ*)^T Σ (θ − θ*). Another way to do parameter regularization is by projecting gradients into a subspace orthogonal to the one important for past contexts, to encourage parameter updates that do not interfere with previously stored information [38][39][40]. A critical aspect of parameter regularization is the estimation of the importance of the parameters for previous contexts. A popular approach is to use the Fisher Information matrix of the last context 2, as under certain assumptions it reflects how much small changes to the parameters would increase the loss. Usually a diagonal approximation of the Fisher Information is used, which assumes that all parameters are independent, but this assumption can be relaxed by using alternative approximations 37,40,41. A drawback of the Fisher Information is that its computation can be costly. Several other parameter regularization methods have substantially lower computational overhead because they instead estimate a per-parameter importance online during training 3,35.
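With a diagonal importance estimate Σ (as in the diagonal Fisher approximation discussed above), the weighted L2 penalty reduces to a per-parameter sum. A schematic sketch (the function name and toy values are ours, not from any specific method):

```python
import numpy as np

def param_reg_loss(current_loss, theta, theta_star, importance, lam=1.0):
    """Parameter-regularized loss with a diagonal importance estimate:

    L_total = L + lam * (1/2) * sum_i importance_i * (theta_i - theta*_i)^2
    """
    penalty = 0.5 * np.sum(importance * (theta - theta_star) ** 2)
    return current_loss + lam * penalty

theta = np.array([1.0, 2.0, 3.0])        # current parameter values
theta_star = np.array([1.0, 1.0, 1.0])   # values after the previous context
omega = np.array([0.0, 1.0, 2.0])        # third parameter deemed most important
loss = param_reg_loss(0.3, theta, theta_star, omega)
```

Changing an unimportant parameter (importance 0) costs nothing, while changes to important parameters are penalized quadratically, which is how forgetting is discouraged.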
Parameter regularization does not require storing observations and it can be used, at least in theory, for all three scenarios. Many parameter regularization methods assume discrete contexts and knowledge about context switches during training, as that is when the regularization term is typically updated, but this assumption is sometimes relaxed 4,10 .
Functional regularization. An important issue with parameter regularization is that the behaviour of a deep neural network depends on its parameters in complex ways, which makes it challenging to accurately estimate the true importance of parameters for previous contexts. Rather than in parameter space, it might therefore be more effective to perform regularization directly in the function space of a neural network [42][43][44] (Fig. 3C in the main text). Functional regularization encourages the input-output mapping f_θ of the neural network not to change too much at a particular set of inputs, which we refer to as the 'anchor points', when training on a new context:

L_total(θ) = L(θ) + λ Σ_{x∈A} D(f_θ(x), f_θ*(x))

whereby L is the loss for the current context, λ controls the strength of the regularization, f_θ* is the input-output mapping relative to which changes are penalized (e.g., the input-output mapping of the network after finishing training on the last context), A ⊂ X is the set of anchor points and D is the divergence measure with which the difference between f_θ and f_θ* is quantified. A critical aspect of functional regularization is the selection of anchor points. The optimal set of anchor points contains all inputs from the previous contexts, but using that set is computationally very costly and requires storing all those inputs. An alternative, proposed by Li and Hoiem 45, is to use the inputs from the current context and to measure the divergence between f_θ and f_θ* with a knowledge distillation loss 46. An advantage of this is that it does not require storing observations, but there is no guarantee that the inputs from the current context are suitable anchor points. Another line of work formulates neural networks as Gaussian Processes 47, which allows for summarizing the input distributions of previous contexts with inducing points 43 or memorable inputs 44, and for performing functional regularization in a Bayesian framework.
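A schematic sketch of a functional-regularization penalty, using the KL divergence between the softmax outputs of the frozen and current networks at a set of anchor points (the toy linear "networks" and function names are purely illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def functional_reg_penalty(f_theta, f_theta_star, anchors):
    """Mean KL divergence between old and new output distributions at the anchors."""
    p_old = softmax(f_theta_star(anchors))   # frozen copy from the previous context
    p_new = softmax(f_theta(anchors))        # network currently being trained
    kl = np.sum(p_old * (np.log(p_old) - np.log(p_new)), axis=-1)
    return kl.mean()

# Toy linear "networks" on 2-D inputs; W_star plays the role of the frozen copy.
W_star = np.array([[1.0, 0.0], [0.0, 1.0]])
W_new = np.array([[1.0, 0.1], [0.0, 1.0]])
anchors = np.array([[1.0, 2.0], [0.5, -1.0]])
penalty = functional_reg_penalty(lambda x: x @ W_new.T, lambda x: x @ W_star.T, anchors)
```

The penalty is zero when the network's outputs at the anchor points are unchanged and grows as they drift, regardless of how the parameters themselves move.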
As for parameter regularization, most functional regularization methods can be used with all three continual learning scenarios. These methods typically assume context boundaries are known, but there is work relaxing this assumption 43 .

Replay. Another strategy for continual learning, referred to as 'replay' or 'rehearsal', is to complement the training data of a new context with data representative of previous contexts 48 (Fig. 3D in the main text). With exact or experience replay, observations from previous contexts are stored in a memory buffer and revisited when training on new contexts 15,[49][50][51][52][53]. Usually it is assumed that a limited amount of data can be stored, and an open research question is how to optimally select the samples to populate the memory buffer [52][53][54][55]. An alternative to storing observations is to learn a generative model to generate the data to be replayed 5,56-62. An issue with such generative replay is that incrementally training a generative model can be challenging, especially when the data are complex 63,64. A work-around can be to generate and replay latent features rather than raw inputs 5,61, although that usually requires pre-training of the lower, non-replayed layers of the network.
Typically, when replay is used, the objective is to jointly optimize the loss on the current and replayed data. An alternative approach, originally proposed by Lopez-Paz and Ranzato 65 , is to use the loss on the replayed data as inequality constraints when optimizing for the current context. The idea is that the loss for previous contexts should not increase, but that it should be allowed to decrease.
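One way to enforce such an inequality constraint to first order, used by A-GEM, is gradient projection: if the proposed gradient has a negative inner product with the gradient on the replayed data, it is projected so that the replay loss is not increased. A minimal sketch of that projection step (function name ours; the actual methods differ in how the reference gradient is computed):

```python
import numpy as np

def agem_project(g, g_ref):
    """Project gradient g so it does not (to first order) increase the replay loss.

    g:     proposed gradient on the current context.
    g_ref: gradient on the replayed data.
    If g and g_ref do not conflict, g is returned unchanged.
    """
    dot = g @ g_ref
    if dot >= 0:
        return g                      # no interference: keep g as-is
    # Remove the conflicting component along g_ref.
    return g - (dot / (g_ref @ g_ref)) * g_ref

g = np.array([1.0, -2.0])             # gradient on the current context
g_ref = np.array([1.0, 1.0])          # gradient on the replayed samples
g_proj = agem_project(g, g_ref)
```

After projection the update is orthogonal to the replay gradient, so the loss on previous contexts is left unchanged to first order while the current-context loss can still decrease.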
Replay is similar to functional regularization in the sense that both strategies protect past knowledge by operating in the function space of a network, but a difference is that replay additionally allows for continued training on previous contexts. While the goal of regularization is to preserve what was learned, replay can be thought of as aiming to promote what should have been learned. However, the distinction between replay and functional regularization is sometimes blurred. For example, a common approach for generative replay is to learn a generative model for the inputs, but to then label the generated inputs based on predictions made for them by a copy of the network as it was after finishing training on the previous context 5, 56,60,64 . This version of generative replay is therefore also a form of functional regularization.
Replay is suitable for all three continual learning scenarios. Moreover, there is a growing literature focused on developing replay-based methods for when there are no clear context boundaries [51][52][53]63,[66][67][68] . An important concern with replay is its computational efficiency, since it involves constantly retraining on all previously seen contexts. Promisingly, it has been suggested that, rather than replaying everything, it might be sufficient to only replay items that are similar to the new data 69 . Moreover, it has been shown that because 'not forgetting' is easier than 'learning', replaying relatively small amounts of data can already be enough 5 .
Template-based classification. Lastly, we discuss classification using templates, which can be used as a strategy for class-incremental learning. With template-based classification, a template (which can also be thought of as a model or a representation) is learned for each class, after which classification decisions are made based on which template is most suitable for the sample to be classified (Fig. 3E in the main text). A key advantage is that this rephrases an often challenging class-incremental learning problem as a typically more tractable task-incremental learning problem, whereby each 'task' is to learn a template for a specific class 70.
A popular approach, with roots in cognitive science, is to use 'prototypes' as class templates 71 . In deep learning, a prototype is usually the mean vector of a class in an embedding or metric space defined by a neural network 72,73 . Samples to be classified are then assigned to the class of the prototype to which they are closest (i.e., classification is performed based on a nearest-class-mean rule in the embedding space 74 ). With class-incremental learning, if the embedding network is fixed, this approach can be implemented by storing a single prototype per class 75 . However, the embedding network might need to be updated to better separate newly encountered classes. To prevent prototypes from drifting when the embedding network evolves, several methods store a number of examples for each class to recompute or update the prototypes after changes to the embedding network 54,76 . An alternative that relaxes the need to store data is to estimate and correct prototypes' drift based on the drift observed for data of the current context 77,78 .
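A minimal nearest-class-mean classifier along these lines (here with a fixed, identity embedding for simplicity; actual methods compute prototypes in the feature space of a learned embedding network):

```python
import numpy as np

class PrototypeClassifier:
    """Nearest-class-mean classification: one prototype (mean vector) per class."""

    def __init__(self):
        self.prototypes = {}   # class label -> mean embedding

    def add_class(self, label, embeddings):
        """Learn the template for one class: its mean in the embedding space."""
        self.prototypes[label] = embeddings.mean(axis=0)

    def predict(self, x):
        """Assign x to the class whose prototype is closest."""
        labels = list(self.prototypes)
        dists = [np.linalg.norm(x - self.prototypes[c]) for c in labels]
        return labels[int(np.argmin(dists))]

clf = PrototypeClassifier()
clf.add_class(0, np.array([[0.0, 0.0], [0.2, -0.2]]))   # classes from a first context
clf.add_class(1, np.array([[1.0, 1.0], [0.8, 1.2]]))
clf.add_class(2, np.array([[-1.0, 1.0]]))               # a later context: just add a prototype
pred = clf.predict(np.array([0.9, 1.1]))
```

New classes are incorporated simply by storing an extra prototype, which is what makes this layout attractive for class-incremental learning when the embedding is fixed.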
Another example of template-based classification is generative classification 70. In this case, the template learned for each class is a generative model, and the template's suitability for a test sample is measured by the likelihood of the sample under that generative model. As with generative replay, when a suitable feature extractor is available, generative classification can be performed on latent features rather than on the raw inputs. Generative classification does not require storing observations, but making classification decisions can be computationally expensive. A more efficient alternative could be to use an energy-based model and to compute for each class an energy value rather than a likelihood 9.
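A toy illustration of generative classification, with one diagonal Gaussian per class standing in for a learned generative model (actual generative classifiers use far richer per-class models, such as VAEs; this class and its values are ours):

```python
import numpy as np

class GaussianGenerativeClassifier:
    """One diagonal Gaussian 'template' per class; classify by likelihood."""

    def __init__(self):
        self.models = {}   # class label -> (mean, variance)

    def add_class(self, label, X):
        """Fit the per-class generative model (here just a diagonal Gaussian)."""
        self.models[label] = (X.mean(axis=0), X.var(axis=0) + 1e-6)

    def log_likelihood(self, x, label):
        mu, var = self.models[label]
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

    def predict(self, x):
        """Assign x to the class under whose model it is most likely."""
        labels = list(self.models)
        scores = [self.log_likelihood(x, c) for c in labels]
        return labels[int(np.argmax(scores))]

clf = GaussianGenerativeClassifier()
clf.add_class(0, np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2]]))
clf.add_class(1, np.array([[5.0, 5.0], [5.2, 5.0], [5.0, 5.2]]))
```

Note that a test sample is scored against every class model, which is why classification cost grows with the number of classes seen so far.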

Supplementary Note 5: Relevance for unsupervised and reinforcement learning
The main focus of this article is on supervised learning, and classification in particular, but aspects of the three continual learning scenarios described here are also relevant for unsupervised and reinforcement learning. Firstly, with both types of learning, an important distinction is whether it is clear from what underlying context an input or a problem comes: task-incremental (or context-aware) versus "not task-incremental" (or context-unaware). Secondly, when context identity is not provided, in principle it is also possible to distinguish whether or not context identity must be inferred. However, context inference is either a supervised problem (if context labels are provided during training) or an unsupervised problem (if context labels are not provided during training).
More generally, the perspective of how the non-stationary aspect of the data relates to the mapping to be learned, might be useful to generate more fine-grained categorizations of continual learning problems involving unsupervised or reinforcement learning.

Supplementary Note 6: Other categorizations of continual learning
A key contribution of this article is proposing a categorization of continual learning problems based on how the non-stationary aspect of the data relates to the mapping to be learned. We believe this particular categorization is useful, as underpinned by our experimental comparisons, but other relevant and often complementary categorizations can be made as well. For example, continual learning problems can also be categorized based on whether transitions between contexts are sharp or gradual 4,79,80 , whether a context can be experienced multiple times [81][82][83] , how the different contexts relate to each other (e.g., how similar are they?) 84,85 or how many observations there are in each experience (online versus batch-wise continual learning) 75,86 .
Another way to categorize continual learning problems is through the lens of dataset shift formalization 87,88. A first distinction here is between 'real concept drift', whereby the causal relation between the inputs and outputs changes, and 'virtual drift', whereby only the distribution from which data are sampled changes 79,80,89. Virtual drift has been further dissected into 'domain drift', where the input distribution changes in such a way that there is no change in the output distribution, and 'virtual concept drift', where there is a change in the output distribution 79. Although motivated from a different perspective, these types of drift can be related to the three scenarios described here: real concept drift associates with task-incremental learning, as often the algorithm must be informed about the causal change, while domain drift and virtual concept drift have intuitive links to domain- and class-incremental learning, respectively.
Some other recently proposed categorizations also have similarities with the three scenarios. A common distinction in the continual learning literature is between methods evaluated with a 'multi-headed' or a 'single-headed' layout 19,20 . A multi-headed layout is linked to task-incremental learning as it requires context identity to be known, while a single-headed layout does not. However, an important difference is that the multi-headed versus single-headed distinction is tied to the architectural layout of a network's output layer (i.e., this is a distinction at the algorithmic or implementational level), while the three scenarios reflect the conditions under which a model is evaluated (i.e., our categorization is at the computational level). This is relevant because, for example, a multi-headed output layer is not the only way to use context identity information 7,8,22,24,25 . Another distinction in the literature is between benchmarks with 'new instances' (NI) versus with 'new classes' (NC) 90,91 . NI benchmarks typically correspond to domain-incremental learning and NC benchmarks to class-incremental learning, but this mapping is not exact, as for example either type of benchmark could also be performed according to the task-incremental learning scenario.

Supplementary Note 7: Different ways of using context identity information
In the task-incremental learning scenario, the most common way to use context identity information is in the form of a multi-headed output layer (i.e., to have a separate output layer for each context). This is often a sensible way to use context identity information, but it is not the only way. Here, we show an example where context identity information is more efficiently used in another way.
We use the Permuted MNIST protocol (Fig. 1.1) with a sequence of 10 different permutations. We compare a selection of the methods that were included in the experiments reported in the main text. In the first set of experiments, denoted Multihead, context identity information is used in the 'standard' way, meaning that all methods use a multi-headed output layer. In the second set of experiments, denoted Singlehead + XdG, context identity information is instead used in the network's hidden layers, by combining each method with XdG. As can be seen from Table 7.1, for all compared methods, using context identity information in the network's hidden layers gives stronger performance than using a multi-headed output layer. For reference, we also report performances when context identity information is not used at all (Singlehead), which corresponds to the domain-incremental learning scenario.

Table 7.1 | Results of the compared methods on Permuted MNIST (columns: Strategy, Method, Budget, GM, and performance per experiment variant).

Experimental details. A sequence of ten contexts was used for the Permuted MNIST experiments. Context identity was provided during training. Each context contained all ten MNIST digits, with a different random permutation applied to the pixels in every context. Before being permuted, the original MNIST images were zero-padded to 32x32 pixels. No other pre-processing was performed. The standard training/test-split was used, resulting in 60,000 training and 10,000 test images per context. The base neural network had 2 fully-connected hidden layers of 1000 ReLU units each and a softmax output layer. Training was done for 5000 iterations per context, using mini-batches of size 128 and the ADAM-optimizer (β1 = 0.9, β2 = 0.999) with learning rate 0.0001.
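The context construction for Permuted MNIST can be sketched as follows: the same (flattened) images are reused in every context, with a different fixed random pixel permutation applied per context. The function below is illustrative (a tiny toy array stands in for the padded 32x32 = 1024-pixel inputs):

```python
import numpy as np

def make_permuted_contexts(images, n_contexts, seed=0):
    """Create Permuted MNIST-style contexts.

    `images` has shape (n_samples, n_pixels); each context applies its own
    fixed random permutation of the pixel positions to all images.
    """
    rng = np.random.default_rng(seed)
    n_pixels = images.shape[1]
    perms = [rng.permutation(n_pixels) for _ in range(n_contexts)]
    return [images[:, p] for p in perms], perms

toy = np.arange(12, dtype=float).reshape(3, 4)   # stand-in for flattened images
contexts, perms = make_permuted_contexts(toy, n_contexts=10)
```

Each context thus contains the same information as the original data, but in a different "encoding", which is what makes the input distribution non-stationary across contexts.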
Details of the compared methods are described in the Methods section in the main text. For EWC, to reduce the computational costs, the diagonal elements of the Fisher Information matrix were computed using 1000 samples (i.e., the outer summation in equation (5) in the main text was over 1000 randomly selected samples rather than over the entire training set). For DGR, the generative model was a variational autoencoder with the encoder and decoder network both containing 2 fully-connected hidden layers of 1000 ReLU units each. The latent variable layer had 100 Gaussian units.
With XdG, 60% of the network's hidden units were masked in each context. This percentage was selected based on a grid search in which XdG was used with the baseline None, and the outcome of this grid search was then also used for XdG in combination with the other methods. For EWC and SI, separate grid searches were performed to select the values of their hyperparameters.

Supplementary Note 8: Hyperparameters
Several of the continual learning methods compared in this article have one or more hyperparameters. In deep learning, the typical way of setting the values of hyperparameters is by training models on the training set for a range of hyperparameter values, and selecting those that result in the best performance on a separate validation set. This strategy has been adapted to the continual learning setting by training models on the full data stream with different hyperparameter values, using only every context's training data, and comparing their overall performances using separate validation sets (or sometimes the test sets) for each context. However, we would like to stress that this means that these hyperparameters are set (or learned) based on an evaluation that uses data from all contexts, which violates the continual learning principle of only being allowed to access each context's training data in the order specified by the data stream. Although it is tempting to think that it is acceptable to relax this principle for contexts' validation data, we argue here that it is not. A clear example of how using each context's validation data continuously throughout an incremental training protocol can lead to an unfair advantage is provided by ref. 92, in which after finishing training on each context a 'bias-removal parameter' is computed that optimizes performance on the validation sets of all contexts seen so far (see their section 3.3). Although the hyperparameters of the methods compared here are less influential than those in that report, we believe it is important to be aware of this issue associated with traditional grid searches in a continual learning setting, and that, at a minimum, influential hyperparameters should be avoided in methods for continual learning. Nevertheless, to give all methods the best possible chance, and to explore how influential the hyperparameters are, we did perform grid searches to set the values of the hyperparameters (see Figures 8.1 and 8.2).
Given the issue discussed above, we believe that using validation sets for these grid searches risks being misleading; we therefore evaluated the performances for all hyperparameter values using the contexts' test sets. For each grid search, all experiments were run once, after which 10 or 20 new runs were executed using the selected hyperparameter values to obtain the results in Tables 2 and 3 in the main text.