The Virtual-Environment-Foraging Task enables rapid training and single-trial metrics of rule acquisition and reversal in head-fixed mice

Behavioural flexibility is an essential survival skill, yet our understanding of its neuronal substrates is still limited. While mouse research offers unique tools to dissect the neuronal circuits involved, the measurement of flexible behaviour in mice often suffers from long training times, poor experimental control, and temporally imprecise binary (hit/miss) performance readouts. Here we present a virtual-environment task for mice that tackles these limitations. It offers fast training of vision-based rule reversals (~100 trials per reversal) with full stimulus control and continuous behavioural readouts. By generating multiple non-binary performance metrics per trial, it provides single-trial estimates not only of response accuracy and speed, but also of underlying processes like choice certainty and alertness (discussed in detail in a companion paper). Based on these metrics, we show that mice can predict new task rules long before they are able to execute them, and that this delay varies across animals. We also provide and validate single-trial estimates of whether an error was committed with or without awareness of the task rule. By tracking in unprecedented detail the cognitive dynamics underlying flexible behaviour, this task enables new investigations into the neuronal interactions that shape behavioural flexibility moment by moment.

animal succeeded. Note that correction trials were not analysed. Instead, they served to reinforce an increasing price of failure in terms of running distance, and as 'instruction trials' that allowed animals to memorize the correct choice (Rule 4). In the same vein, the fact that changing running direction on the treadmill was energetically costly ensured that, in contrast to easier response paradigms (e.g. licking), animals never responded spontaneously or randomly. We tested this in two animals by simply switching off the projection of the virtual environment for ~15 minutes. In the absence of a visible target, neither mouse ever changed running direction abruptly (data not shown). An example of the power of (in this case unintended) trade-offs is shown in Supp. Figure S1a: the animal's initial task performance is poor, but improves almost instantly when the settings of the treadmill are marginally adjusted (lateral gain increased by 10%) to make targets slightly easier to reach. In other words, the initial task configuration had set the 'price' of successful task performance too high for this animal to make learning an attractive option.

4) Frustration leads to superstition: As shown by [7], in the absence of clear evidence, mice tend towards 'superstitious' decision-making and stop responding to cues that could ensure success. As a consequence, for a training sequence to be successful, each training stage needs to be solvable in a majority of trials. It is also important not to entrain transient training steps for too long, in order to avoid frustration about unexpected rule changes. To keep a sufficient percentage of trials solvable, we a) test a range of orientation differences (ΔOri) from 90° to 5°, with easier trials serving as 'anchor trials', and b) manually guide animals towards the correct target after repeated error trials.

5) Avoid abstraction: Unsurprisingly, mice are not readily able to learn abstract associations.
It is therefore important to represent conceptual associations in a physical way. For example, rewarding animals immediately upon approaching the correct target made the reward association almost instant: animals began anticipatory licking within 50-200 trials (Supp. Fig. S1c, see also [8]). Given that trials in early training tended to last 2-10 seconds, this corresponds to a training time of 5-15 minutes before reward anticipation set in.
Similarly, to clearly signify time-outs, we used a dedicated time-out environment. This increased learning rates markedly compared to an unmarked time countdown in a nondescript environment (data not shown).

6) Utilize innate behaviours: Following from Rule 5, it is helpful to use innate behaviours to facilitate stimulus-response links. In this case, we created an environment that mimics foraging (pursuing a visual target to obtain food in a cluttered environment). This consideration also led us to abandon go/no-go tasks, since mice do not easily inhibit behaviours like running or licking for reward.

7) Detect, don't discriminate: Since we can assume that mice have a very limited capacity to attend to multiple objects simultaneously, choosing between two stimuli is more difficult than detecting one target against a background. Such a task still requires visual discrimination, but conceptually the animal searches for one target instead of comparing two. We therefore presented the target stimuli from the beginning, and only gradually introduced distractors (initially at low contrast) as a 'cluttered background'. This approach was vastly more successful than a training scheme we attempted in two mice, in which both types of stimuli were introduced immediately. In these cases, performance did not exceed chance level after three training sessions (data not shown). This principle has also been utilized, explicitly or implicitly, in other tasks (e.g. [1,9]).

[Figure legend fragment] …the number of handling sessions before training (right). Red circles: 12 animals trained in the original task (including the five animals that were subsequently trained on rule reversals). Asterisks: statistical significance of correlation coefficients (*p < 0.05; **p < 0.01). Bottom panels: Same, but for the smallest ΔOri reached. Animals that had experienced increased handling and habituation learned the task more quickly, but this was unrelated to the visual acuity or overall task performance that the animals ultimately reached.
Thus, handling apparently did not affect overall ability, but enhanced the speed with which each animal reached its optimal performance.

[Figure legend fragment] Note that at the time point covered by the red analysis windows, which corresponds to the measured reaction time, the slope deviates sharply from 1, but not before or after.

[Figure legend fragment] …performance metric (*family-wise p < 0.05; **family-wise p < 0.01 after Dunn-Sidak correction for multiple comparisons; see Methods). These correlations reveal that some performance metrics, e.g. hit index and reaction time, had a largely linear relation with stimulus difficulty (ΔOri). In contrast, metrics such as running speed and lick position were largely independent of stimulus difficulty. The psychophysical curves after rule reversals indicate that task performance in this context tended to be less consistently related to stimulus difficulty. This was especially true for the first reversal, whereas performance after the second reversal was more comparable to initial performance. The stimulus dependence of task performance after rule reversals likely decreases because at this point rule acquisition and frustration affect behavioural output more heavily, overriding pure stimulus processing. Note that for both reversals, performance was still close to optimal.

Note that the normalized difference for running speed is computed in the opposite direction from the other metrics (subtracting incorrect from correct trials rather than vice versa). The reason is that if animals predict trial outcomes, the other metrics (reaction time, path surplus and lick position) would be expected to be smaller in correct than in incorrect trials, while running speed would be expected to be greater in correct than in incorrect trials. Inset on the right: the EP index for the ten trials shown in a is computed as the average of the normalized differences shown in the four figure panels on the left.
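As a minimal sketch of this computation (the exact normalization is specified in the Methods; the function and variable names used here, such as `ep_index` and `running_speed`, are hypothetical illustrations, not the authors' code), the EP index for a block of trials could be computed as:

```python
import numpy as np

def ep_index(metrics, correct, speed_key="running_speed"):
    """Average normalized difference of performance metrics between
    incorrect and correct trials.

    metrics : dict mapping metric name -> per-trial values
              (e.g. reaction time, path surplus, lick position, running speed)
    correct : boolean array, True where the trial was correct
    """
    correct = np.asarray(correct, bool)
    diffs = []
    for name, vals in metrics.items():
        vals = np.asarray(vals, float)
        # Normalized difference: incorrect minus correct, scaled by the
        # metric's overall magnitude so the metrics are comparable.
        d = (vals[~correct].mean() - vals[correct].mean()) / np.abs(vals).mean()
        if name == speed_key:
            # Running speed is computed in the opposite direction
            # (correct minus incorrect), as described in the legend above.
            d = -d
        diffs.append(d)
    # EP index: average of the normalized differences across metrics.
    return float(np.mean(diffs))
```

On toy data in which the animal behaves as if it predicts trial outcomes (shorter reaction times and higher running speed on correct trials), the index comes out positive; if correct and incorrect trials are indistinguishable, it is zero.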
c) Distribution of the ratios between the EP index peak height in the first or second rule reversal and the EP index peak height in the original task. Red arrow: average of all 10 ratios (5 animals × 2 reversals). Note that the EP index peak after rule reversals tends to be equal to, or even slightly higher than, that in the original task, as indicated by a ratio close to, or larger than, 1. This suggests that after a rule reversal, animals did not merely abandon the previous task rule, which would lead to an EP index close to zero.
Instead, they appeared to actively adopt (and therefore anticipate) the new rule, resulting in a large positive peak of the EP index.

[Figure legend fragment] …i.e. with both anticipated and inadvertent errors. In contrast, High-Alert phases (marked by the trajectory moving sharply to the right) generally go along with a negative or low EP index, indicating that the errors occurring during this time are largely inadvertent. Black arrows point out instances of such High-Alert phases.
c) Relation between the proportions of anticipated errors made during spontaneous states of high and low alertness (same as Fig. 7b), but with the initial trials of task learning progressively removed. Shown are five animals in the original task (black dots) and two reversals (dark and light grey dots, respectively), with three data points missing due to an insufficient overall number (n < 5) of error trials. Left panel: proportion of anticipated errors computed from all trials beginning at training stage 5 (as in Fig. 7b), but with the first 50 trials excluded. Middle panel: same, but with the first 75 trials beginning at training stage 5 removed. Right panel: same, but with the first 100 trials removed. In all three cases, data points are concentrated above the diagonal (*p < 0.05; t-test for dependent samples). This indicates that the proportion of anticipated errors is significantly smaller for high- than for low-alertness trials, whether animals were still somewhat learning the task (Fig. 7b)