Response outcomes gate the impact of expectations on perceptual decisions

Perceptual decisions are based on sensory information but can also be influenced by expectations built from recent experiences. Can the impact of expectations be flexibly modulated based on the outcome of previous decisions? Here, rats perform an auditory task where the probability to repeat the previous stimulus category is varied in trial-blocks. All rats capitalize on these sequence correlations by exploiting a transition bias: a tendency to repeat or alternate their previous response using an internal estimate of the sequence repeating probability. Surprisingly, this bias is null after error trials. The internal estimate however is not reset and it becomes effective again after the next correct response. This behavior is captured by a generative model, whereby a reward-driven modulatory signal gates the impact of the latent model of the environment on the current decision. These results demonstrate that, based on previous outcomes, rats flexibly modulate how expectations influence their decisions.

The repeating bias build-up from previous trials impacts the rats' classification accuracy. Accuracy, defined as the fraction of correct responses, versus the current stimulus strength s in trials following different trial sequences. a-b, Congruent condition, in which accuracy was computed in the Repeating block for trials following several correct repetitions (a) or in the Alternating block following correct alternations (b). Insets show the color code for sequences of correct repetitions (blue) and alternations (red). Accuracy increased with the length of the congruent sequence, particularly at low stimulus strength (see statistics below). c-d, Incongruent condition, in which accuracy was computed in the Repeating block following correct alternations (c) and in the Alternating block following correct repetitions (d). In contrast to the congruent condition, the detrimental effect on accuracy of increasing the length of the incongruent sequence was independent of stimulus strength. A mixed-effects ANOVA with factors stimulus strength s, sequence length n, prior-block congruency (congruent/incongruent) and random factor animal revealed a significant three-way interaction n × s × congruency (F(1,786) = 38.57, p < 1e-9); separate analyses of the congruent and incongruent conditions yielded a significant two-way interaction n × s for the congruent condition (F(1,387) = 42.09, p < 1e-9) but not for the incongruent condition (F(1,387) = 2.12, p = 0.146). Dashed lines mark the maximum and minimum performance that can be reached for stimuli with s = 0 in each block if subjects adopted the strategy of always repeating or always alternating the previous response. In the Repeating block these values are the sequence repeating probability $P_{rep}$ and $1-P_{rep}$, respectively, whereas in the Alternating block they are $1-P_{rep}$ and $P_{rep}$. Dots show the median across n = 10 animals (Group 1). Error bars and shaded areas show the first and third quartiles.

Supplementary Figure 5.
GLM assessing the impact of current stimulus evidence and recent history on the rats' choices. a, Exemplar series of recent history trials used to model rat decisions at the current trial t. The Stimulus series depicts the amplitude of each acoustic frame for the low (green bars) and high frequency tones (purple bars). The Response series shows the rat decisions (green squares for Leftward and purple squares for Rightward choices). The Transition series shows the relation between two consecutive responses (blue for Repetition, red for Alternation) and the Outcome series indicates whether the responses were rewarded (orange) or not (black). These series are combined to generate the regressors, which are organized into the Current Stimulus (b), After-effect (c), Lateral (d) and Transition (e) modules. Because all regressors are ultimately summed together, the modules have no functional relevance and were only introduced for structural clarity. b, The amplitude difference between low and high tones in each acoustic frame of the current stimulus (pink box) is weighted separately by a stimulus kernel which is fitted for each animal (black dots). The outcome of this sum provides the Stimulus evidence $S_t$, which can take values ranging from strong Left to strong Right evidence (color bar with green-purple gradient; black tick shows the value of the example sequence). c, The net stimulus evidence, i.e. the sum of the stimulus evidence over all frames, of each of the previous trials (t-1, t-2, …; green and purple bars in gray box) is weighted by the after-effect kernel, providing the After-effect bias $A_t$. d, The Lateral module weights separately previous rewarded (r⁺) and unrewarded (r⁻) responses, which take the values -1 (Left), +1 (Right) or 0. The sum of these two series gives rise to the lateral bias $L_t$.
e, Transitions are considered separately depending on the outcome of the two trials in the transition: T⁺⁺ (rewarded-rewarded), T⁻⁺ (error-rewarded), T⁺⁻ (rewarded-error) and T⁻⁻ (error-error), which take the values -1 (alternation), +1 (repetition) and 0. The weighted sum of transition regressors is then multiplied by the previous response $r_{t-1}$ in order to yield the transition bias $T_t$, which provides Rightward vs Leftward evidence. f, The sum $S_t + A_t + L_t + T_t$ is then passed through a sigmoid function to yield the trial-by-trial probability of selecting a Right response. For each rat, the weights of the kernels (shown as black connected dots in the center gray boxes) were fitted to maximize the model's probability of generating the actual animal choices.

Supplementary Figure 6. Fitted weights quantifying the impact of the different modules on the animals' decisions. Average coefficients (n = 10 animals) obtained in the GLM when separately fitting the choices in trials after a correct (orange) and error response (black). a, Frame-by-frame influence of the current stimulus on the rats' decisions, commonly termed the psychophysical kernel, shows that the current stimulus, and in particular the first acoustic frames in the stimulus, had a strong impact on decisions. b, After-effect kernel quantifying the repulsion caused by the net sensory evidence of previous trials as a function of trial lag (i.e. number of trials back from the current trial). It shows that choices were biased away from the sides associated with previously presented stimuli. This repulsive side bias captured the effect of having heard the previous stimuli, above and beyond the bias introduced by the category of those stimuli, which was captured by the lateral bias (see Methods). It was consistent with an after-effect caused by sensory adaptation, in which a strong acoustic power at a given frequency would reduce the likelihood of perceiving that frequency in subsequent trials.
c-d, Fixed side bias (c) and Left and Right lapse rates (d) obtained in GLM fitting for trials following an error (black) or correct response (orange). e, Normalized variance of $X_t$ (X = S, A, L and T) averaged over animals (see Supplementary Methods) quantifies the overall relative impact of each module (current stimulus, after-effect, lateral bias and transition bias) on the rats' choices. f, Transfer coefficient $L_t \to L_{t+k}$ versus trial lag k, quantifying the degree to which the lateral bias at trial t affects the lateral bias on subsequent trials, calculated separately depending on the outcome of each trial (colored lines show rewarded choices and black lines error choices; see Supplementary Methods for details). Compare with the transfer coefficient of the transition bias shown in Fig. 5c. g, Transfer coefficient $T_t \to T_{t+k}$ for the transition bias versus trial lag k for individual animals when trial t was incorrect and trial t+1 was correct. The coefficient at t+2 was significantly larger than zero for all n = 10 rats (Wald test, p < 0.003). h, Average coefficients for transitions T⁺⁺ (n = 10 animals) obtained when the GLM is fitted separately depending on the outcome of the last two trials (see inset; e.g. −+ represents that t-2 was incorrect and t-1 was correct). When the last two trials were correct (++), the T⁺⁺ transitions occurring in the last ten trials had a strong influence (dark orange), whereas when the last trial was an error (+− or −−) they had no influence (gray and black dots). Crucially, the influence of previous T⁺⁺ transitions was again strong when the choice followed a −+ sequence (an error followed by a correct response; light orange dots), confirming the ability of the animals to rapidly recover the transition bias lost after the error when the subsequent trial was rewarded. Error bars in all panels indicate the first and third quartiles.

Supplementary Figure 7. GLM results for individual animals.
GLM weights obtained from each animal in Group 1 (rows) when separately fitting the after-correct (orange) and after-error choices (black). First two columns: Lateral weights from previously rewarded (r⁺) and unrewarded (r⁻) trials. Last four columns: Transition weights computed separately for T⁺⁺ (rewarded-rewarded), T⁻⁺ (error-rewarded), T⁺⁻ (rewarded-error) and T⁻⁻ (error-error). We tested whether the heterogeneity across animals of the T⁺⁺ kernel after correct responses was smaller than the heterogeneity of the r⁺ or the r⁻ kernels after correct responses. For that, we normalized the individual T⁺⁺ kernels by the maximum value across lags and animals and then summed over all the time lags in each animal. We did the same for the r⁺ and r⁻ kernels. The variance of the summed normalized T⁺⁺ kernels was significantly smaller than that of the r⁺ kernels (variance ratio 0.24, p = 0.0425, two-tailed F-test) and than that of the r⁻ kernels (variance ratio 0.094, p = 0.0016, two-tailed F-test). We did not compare the individual T⁺⁺ kernels after errors because their magnitude was negligible.

with Current stimulus and lateral modules (sensory + lateral), with Current stimulus and transition modules (sensory + transition), and the full GLM fitted separately for trials following correct responses and errors (full error/corrects). Comparison is quantified by the difference in Bayesian Information Criterion (BIC) with respect to the full model. We used the BIC definition as penalized log-likelihood (Schwarz, 1978), i.e. a larger BIC score means a better model (see Section 2.5). Points represent individual animals and vertical lines indicate the mean over rats. b, Decision weights for lateral regressors r⁺ (orange) and r⁻ (black) in the reduced sensory + lateral GLM fitted for all trials. The strong peak at t-2 for post-correct trials is incompatible with a decaying attraction or repulsion effect.
The peak is caused by the large T⁺⁺ weight at t-1 obtained in after-correct trials in the full model (Fig. 4b left). As explained in the Supp. Methods, regressors T⁺⁺ at t-1 and r⁺ at t-2 are identical when computed only in post-correct trials. c, GLM weights for a model that used separate regressors for correct repetitions (Rep⁺⁺, blue) and correct alternations (Alt⁺⁺, red solid curve). A third generic transition regressor (T, black) was also added. Amplitudes of the Rep⁺⁺ and Alt⁺⁺ weights were comparable (dashed red shows inverted Alt⁺⁺ weights), showing that correct repetitions and alternations generate symmetrical biases and supporting the idea of a single transition bias. The standard GLM with a single transition bias provided a better account of rat decisions than the modified model with separate Alt⁺⁺ and Rep⁺⁺ regressors (BIC difference > 20 for all animals, when fitting both models separately on trials following correct responses and errors). Error bars in all panels indicate the first and third quartiles.

Supplementary Figure 9. Rats use the same strategy in the two blocks. Average coefficients obtained in the GLM when separately fitting trials in the Repeating block (blue lines) and in the Alternating block (red lines). As in Fig. 4, weights were fitted separately for after-correct trials (bright colors) and after-error trials (light colors). a, Frame-by-frame influence of the current stimulus on the rats' decisions. b, Influence of the sensory after-effect caused by the net sensory evidence of previous trials as a function of trial lag (i.e. number of trials back from the current trial). c, Influence of the response side (Left vs Right) from previously rewarded (r⁺, left panel) and unrewarded (r⁻, right panel) trials. d, Influence of previous transitions (repetition vs. alternation) computed separately for T⁺⁺ (a rewarded trial followed by a rewarded trial), T⁻⁺ (error-rewarded), T⁺⁻ (rewarded-error) and T⁻⁻ (error-error).

Points in all panels show median coefficients across animals (Group 1, n = 10) and error bars in all panels indicate the first and third quartiles.

Supplementary Figure 10. Rats use the same strategy at the beginning and at the end of the blocks. Panels are as in Supplementary Fig. 9 but comparing the GLM weights fitted to all trials (light colored curves) and to only the first 50 trials of each block (bright colored curves). Accuracy at the beginning of the block was lower than during the rest of the block (trials 1-50 yielded an average accuracy of 0.730 and trials 50-200 yielded 0.755; two-tailed paired t-test, p < 1e-5). The strategy the animals used during the block transition was the same, as revealed by the similar GLM weights. Thus, the difference in accuracy was probably due to animals not having accumulated sufficient transition evidence congruent with the corresponding block (e.g. in switching from a Repetition block to an Alternation block they had to change $z_T$ from positive to negative). Accuracy did not change significantly at the end of the session (average 0.751 for the last 200 trials vs 0.747 for the rest of the trials; two-tailed paired t-test across animals, p = 0.30; similar results were obtained when focussing on the last 150 or 100 trials of each session). Points in all panels show median coefficients across animals and error bars indicate the first and third quartiles.

Supplementary Figure 11. GLM for rats performing an acoustic intensity discrimination task. Same as Supplementary Fig. 9 but for animals in Group 2 (n = 6 animals) running the Interaural Level Difference discrimination task. Average coefficients obtained in the GLM when separately fitting the choices in trials after a correct (orange) and error response (black). Points in all panels show median coefficients across animals and error bars indicate the first and third quartiles.

Supplementary Figure 12. History effects in uncorrelated stimulus sequences. a-c, Average GLM weights (Group 3, n = 9 animals) during initial sessions with uncorrelated stimulus sequences (light colored lines) and during subsequent correlated sessions with Repetitive and Alternating blocks (dark colored lines). Choices were fitted separately after-correct (top row) and after-error trials (bottom row). The lateral weights (a-b) do not vary substantially after introducing correlated sequences (except the coefficient of $r^{+}_{t-1}$). In contrast, the magnitude of the weights of the T⁺⁺ transitions (c) after correct responses increased substantially, whereas the magnitude after errors remained at zero. Error bars show the first and third quartiles. d, Normalized variance of $X_t$ (X = S, A, L and T) averaged over animals for each module (see labels over bars) before (black) and after introducing sequence correlations (gray). The blue dots show the normalized variance for each subject and error bars indicate the first and third quartiles.

A value of -1 corresponds to an extinction of the gating signal on the subsequent trial (i.e. a full blockade of the corresponding bias), while +1 corresponds to full recovery of the bias (i.e. gating equal to its maximum value of 1). g, Model comparison shows a minor improvement of the double-gating model compared to the Standard model. Values for the rest of the parameters are consistent with results from the Standard model. Bars show median coefficients across animals (Group 1, n = 10 animals) and error bars indicate the first and third quartiles.

Supplementary Figure 16. GLM analysis fitted to simulated data from the standard dynamical model. For each rat, we fitted the same GLM used with the experimental data (Fig. 4) to the simulated data from the standard dynamical model. Panels show the different GLM kernels as described in Figure 4. The obtained kernels were quite similar to the kernels obtained from the experimental data (compare the kernels shown here with those in Fig. 4).
Points in all panels show median coefficients across animals and error bars indicate the first and third quartiles.

Supplementary Figure 17. GLM for the optimal observer. We can formalize the optimal observer as building an expectation from previous trials of which side will be rewarded in the subsequent trial, and using this expectation as a prior to be combined with the stream of stimulus information. Because such a prior is updated on a trial-by-trial basis using the choices and outcomes from previous trials, it generates history-dependent choice biases that are captured by the transition and lateral kernels of the GLM. The precise form of those transition and lateral kernels will depend on how much knowledge of the structure of the task the observer has. A common assumption is that the observer does not know that the environment is structured in blocks of trials with a fixed repeating probability but she believes that there is a certain probability H (or hazard rate) that the probability to repeat the previous stimulus category changes to a new value randomly drawn from a fixed distribution (Yu and Cohen, 2008; Meyniel et al., 2016). This is exactly the Dynamic Belief Model (DBM) studied by Yu and Cohen (2008), who showed that the estimate of the repeating probability is very well approximated by a simple exponentially-weighted running average of recent repetitions and alternations of the rewarded side. In other words, in such a model the observer knows on every trial, independently of the response outcome, on which side the reward was (i.e. the stimulus category) and can therefore track the probability that the rewarded side is repeated or alternated. The time constant of the exponential decay is $\tau = \left[\log\frac{3}{2(1-H)}\right]^{-1}$ (Yu and Cohen, 2008). a-d, Transition kernels obtained from fitting the GLM to simulated data generated using the DBM.
The exponential weighting of the DBM gives rise to exponential transition kernels that decay as a function of trial lag, similarly to what we found for T⁺⁺ transitions in our rats (Fig. 4b). However, there were multiple differences between the transition kernels displayed by the rats and by the DBM. First, the DBM generated non-zero kernels in transitions involving error choices (i.e. T⁻⁺, T⁺⁻ and T⁻⁻) simply because the optimal observer extracted the same information from all previous transitions independently of whether the choices were rewarded or not. Thus, the kernels for T⁻⁺ and T⁺⁻ generated by the DBM were the exact negative of the T⁺⁺ kernel. This was because, for example, an R⁻R⁺ repetition (T⁻⁺ = 1) provided the same evidence favoring the alternation of the rewarded side as an L⁺R⁺ alternation (T⁺⁺ = -1). Using the same rationale, the kernel for T⁻⁻ in the DBM was exactly equal to the T⁺⁺ kernel. Second, the kernels of the DBM computed after error trials did not vanish (see black curves); instead, they were the negative mirror image of the corresponding kernels computed in after-correct trials (compare black and orange curves). This is simply because in the DBM accumulated evidence favored the repetition/alternation of the previous stimulus category $c_{t-1}$. In the GLM the transition evidence $z_T$, i.e. repetition ($z_T > 0$) or alternation ($z_T < 0$), was defined with respect to the last response $r_{t-1}$ (i.e. the transition bias was defined as $T = z_T \times r_{t-1}$). Because after errors the stimulus category and the choice had opposite signs ($r_{t-1} = -c_{t-1}$), the sign of the transition kernel after errors had to be reversed. For instance, if after a long series of correct repetitions there was an incorrect response, R⁺R⁺R⁺R⁺R⁻, the DBM would be biased towards repeating the side of the last stimulus category, L. This bias, built from a series of repetitions, would be captured in the GLM as accumulated evidence to alternate ($z_T < 0$) the last response R.
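The exponential approximation to the DBM described above can be illustrated with a minimal Python sketch. It assumes that the time constant takes the form $\tau = [\log(3/(2(1-H)))]^{-1}$, so that the per-trial decay factor is $2(1-H)/3$; the function names and the unnormalized running sum are illustrative choices, not the implementation used to generate the simulated data.

```python
import math

def dbm_time_constant(H):
    """Assumed time constant (in trials) of the exponential filter for hazard rate H."""
    return 1.0 / math.log(3.0 / (2.0 * (1.0 - H)))

def transition_evidence(transitions, H):
    """Exponentially weighted running sum of past transitions of the rewarded side.

    transitions: sequence of +1 (repetition) / -1 (alternation), oldest first.
    Returns an unnormalized evidence value: > 0 favors repetition, < 0 alternation.
    """
    decay = math.exp(-1.0 / dbm_time_constant(H))  # equals 2 * (1 - H) / 3
    z = 0.0
    for x in transitions:
        z = decay * z + x  # older transitions are discounted by one more factor
    return z

# Hypothetical example: three repetitions of the rewarded side with H = 0.25
print(transition_evidence([+1, +1, +1], H=0.25))
```

In this sketch every transition of the rewarded side contributes, whatever the outcome of the choice, which is why the DBM produces non-zero kernels for transitions involving errors.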
e-f , If the ideal observer was informed that the unconditioned probabilities to be rewarded in the Left vs the Right side were P( L )=P( R )=0.5, the DBM would display a null lateral kernel: the side of the rewarded response would not provide any information about the repeating probability. A more general ideal observer tracking both (1)

Behavioral training procedure
Training started with three sessions of handling and habituation to the behavioral box followed by five training stages.
• First stage: A lateralized sound was played from one speaker. 200 ms after stimulus onset, water was delivered from the corresponding side port independently of the animals' behavior. When the animals collected the reward, the sound was stopped and the trial finished. The side of the sound and water delivery alternated in blocks of 50 trials. After 400 trials the side was randomly interleaved. The first session lasted one night (with free access to food). Animals advanced to the next stage after completing 600 trials across sessions.
• Second stage: Rats learned to self-initiate the trials by poking in the center port. An LED in the center port was lit after the inter-trial interval (∼1 s), indicating to the animals that they could start a new trial. The LED remained on for 300 ms, cueing a fixation period during which the animals had to remain inside the port. Just after the LED switched off, the stimulus was presented (i) in both speakers in the frequency discrimination task and (ii) in the corresponding speaker in the level discrimination task. The rat could withdraw from the center port at any time after stimulus onset, but the sound went on until the animal entered a side port. Poking in the correct side port resulted in reward delivery, whereas incorrect poking was punished with a 5 s time-out and a bright light. When the rats had learned the task (performance higher than 90%) we moved to the third stage.
• Third stage: The sound now stopped when the animals left the center port, simultaneously with the center LED offset. The rewards were only available for 4 s. When 85% performance was reached, animals advanced to the next stage.
• Fourth stage: the stimulus difficulty was slightly increased on each session by presenting intermediate stimulus evidences. The final evidence values were adjusted to sample the psychometric curve uniformly. Once the final values were achieved the rats went on to the last stage.
• Fifth stage: the rats were exposed to the Repeating and Alternating blocks while keeping the trial structure unchanged.
In animal groups 2 and 3, stages three and four were reversed. Moreover, the decrease in the duration of the stimulus, from lasting until the animal poked on the side ports to lasting only until the withdrawal from the center poke, was done progressively across and within sessions using an automated adaptive method. The complete training process lasted on average 59 sessions per animal.

GLM model
In order to understand the relative influence of the different features of recent history and current sensory stimulation on the rats' decisions, we built a Generalized Linear Model (GLM) in which these different features are linearly summed to give rise to the probability that the rat selects the Right port in each trial t (Eq. ??) [Corrado et al., 2009, Busse et al., 2011, Gold et al., 2008, Akrami et al., 2018, Braun et al., 2018, Fründ et al., 2014, Nogueira et al., 2017]. Feature weights were fitted to individual rat decisions. Features were grouped in different modules:
1. The Current Stimulus module summarizes the influence of the current auditory stimulus on the decision, i.e. the strength of stimulus-response associations. The module includes a regressor $S_{t,f}$ representing the intensity difference between the two sounds in frame f of trial t, for f = 1, 2, ..., 8 (the influence of frames with f > 8 could be neglected, as reaction times longer than 400 ms, i.e. 8 frames, represented on average less than 2.9% of the trials).
2. The After-effect module captures the biases due to the presentation of previous stimuli. Features consisted of the overall sensory evidence of each trial $S^{\mathrm{sum}}_t = \sum_f S_{t,f}$, i.e. the sum of stimulus evidence over all frames, for previous trials $t-1, t-2, \ldots, t-5$. Another grouped regressor was created by summing the corresponding values of trials $t-6, t-7, \ldots, t-10$. This feature captures any specific potentiation (resp. adaptation) to previously heard tones that would lead to a bias towards (resp. away from) the side associated with the tone that dominated in previous trials. The regressors capturing this feature were built as for the Lateral module (see below). Although the physical stimulus was strongly correlated with the rewarded side (i.e. stimulus category), fluctuations across frames and variability of the stimulus duration made it possible to isolate the effect of the physical stimulus from that of the category (e.g. for stimuli with low stimulus strength, the sign of the overall physical stimulus $S^{\mathrm{sum}}_t$ was often at odds with the stimulus category, facilitating the separate weighting of the two features).
3. The Lateral module summarizes the contribution of the side and outcome of responses from previous trials to the lateral bias in each response, i.e. a history-dependent approach or avoidance bias towards either port. The module includes the following regressors:
(a) The side of correct responses $r^{+}_t = r_t\,\delta_{O_t,1}$, taking value -1 for correct Left responses, +1 for correct Right responses, and 0 for incorrect responses. The outcome variable $O_t$ equals +1 when trial t was rewarded and -1 when it was not. Here $\delta$ is the Kronecker delta, i.e. $\delta_{i,j} = 1$ if $i = j$, and $\delta_{i,j} = 0$ if $i \neq j$. Positive weighting of this feature captures a propensity to opt for previously rewarded responses (i.e. win-stay).
(b) The side of incorrect responses $r^{-}_t = r_t\,\delta_{O_t,-1}$, taking value -1 for incorrect Left responses, +1 for incorrect Right responses, and 0 for correct responses. Negative weighting of this regressor would capture a propensity to move away from previously non-rewarded responses (i.e. lose-switch).
We used regressors $r^{o}_t$, with $o \in \{+, -\}$, from each of the five preceding trials (i.e. $t-1, t-2, \ldots, t-5$) to measure the influence of previous history over that short window. Another grouped regressor was created by summing the corresponding values of trials $t-6, t-7, \ldots, t-10$.
4. The Transition module describes how repetitions and alternations in previous trials affect the propensity to select the same response as in the previous trial (i.e. repeat) or the other one (i.e. alternate). Transition regressors, taking values +1 for repetition and -1 for alternation, were separated depending on the outcome of the two successive trials making up the transition:
(a) The transitions between two consecutively correct responses $T^{++}_t = r^{+}_{t-1}\, r^{+}_t$, taking value +1 for repetition between the two correct trials (denoted in Figs. 2 and 7 as X⁺X⁺ and meaning two consecutive correct Left responses or two consecutive correct Right responses), -1 for alternation between the trials (Y⁺X⁺, Left followed by Right or Right followed by Left, both correct), and 0 if either trial t or t − 1 was incorrect.
(b) The transitions from a correct to an incorrect response $T^{+-}_t = r^{+}_{t-1}\, r^{-}_t$, taking value +1 for repetition between the two trials (e.g. Left correct followed by Left incorrect), -1 for alternation between the trials (e.g. Left correct followed by Right incorrect), and 0 for other trials.
(c) The transitions from an incorrect to a correct response $T^{-+}_t = r^{-}_{t-1}\, r^{+}_t$, taking value +1 for repetition between the two trials (e.g. Left incorrect followed by Left correct), -1 for alternation between the trials (e.g. Left incorrect followed by Right correct), and 0 for other trials.

(d) The transitions between two consecutively incorrect responses $T^{--}_t = r^{-}_{t-1}\, r^{-}_t$, taking value +1 for repetition between the two incorrect trials, -1 for alternation between the two incorrect trials, and 0 if either trial t or t − 1 was correct.
As for the lateral module, we used one regressor for each of the five preceding trials and a grouped regressor for the summed value of trials $t-6, t-7, \ldots, t-10$, obtaining a total of 4 × 6 = 24 regressors $T^{o,q}_t$, with $o, q \in \{+, -\}$. Note that regressors $r^{+}_{t-2}$ and $r^{-}_{t-2}$ were excluded from the fitting because they are redundant with transition regressors for the immediately preceding trial (see section 2.3, 'Indeterminacy on the Influence of transition at previous trial vs side of response two trials back').
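The construction of the Lateral and Transition regressors described above can be sketched in a few lines of Python. This is a minimal illustration, assuming responses coded as -1 (Left) / +1 (Right) and outcomes coded as +1 (rewarded) / -1 (error); the function names are hypothetical, the original fitting was done in Matlab, and trials without enough history are simply padded with zeros here.

```python
import numpy as np

def lateral_regressors(r, o):
    """Return (r_plus, r_minus): response side on rewarded and on error trials."""
    r, o = np.asarray(r), np.asarray(o)
    return r * (o == 1), r * (o == -1)

def transition_regressors(r_plus, r_minus):
    """T^{oq}_t = r^o_{t-1} * r^q_t; the first trial has no predecessor (set to 0)."""
    prev_plus = np.concatenate(([0], r_plus[:-1]))
    prev_minus = np.concatenate(([0], r_minus[:-1]))
    return {
        "++": prev_plus * r_plus,
        "+-": prev_plus * r_minus,
        "-+": prev_minus * r_plus,
        "--": prev_minus * r_minus,
    }

def lagged_design(x, short_lags=5, grouped=(6, 10)):
    """Columns: x at lags 1..5, plus one column summing x over lags 6..10."""
    x = np.asarray(x, dtype=float)
    n = len(x)

    def shift(k):
        out = np.zeros(n)
        if k < n:
            out[k:] = x[:n - k]  # value of x observed k trials back
        return out

    cols = [shift(k) for k in range(1, short_lags + 1)]
    cols.append(sum(shift(k) for k in range(grouped[0], grouped[1] + 1)))
    return np.column_stack(cols)

# Toy session: R+, R+, L+, L-, R+
r_plus, r_minus = lateral_regressors([1, 1, -1, -1, 1], [1, 1, 1, -1, 1])
T = transition_regressors(r_plus, r_minus)
X_transition = np.column_stack([lagged_design(T[k]) for k in ("++", "+-", "-+", "--")])
print(X_transition.shape[1])  # 4 outcome pairs x 6 lag columns
```

Each of the four transition types yields six columns (five single lags and one grouped lag), which gives the 24 transition regressors mentioned in the text.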
The probability of a Right response at trial t was then modeled as a sigmoidal function of the sum of the contributions of all modules. In this model, $\pi_L$ and $\pi_R$ represent the lapse rates for Left and Right responses (i.e. a fixed probability to choose the Left or Right response at any trial independently of current sensory evidence and previous history); $\Phi$ is the cumulative distribution function of the standard normal distribution; $\beta$ is the fixed side bias representing the preference of the animal, independent of recent history, to choose one response over the other. The fitting procedure involves a generalized Expectation-Maximization algorithm, implemented in Matlab [Fründ et al., 2014]. We used L2 regularization with a regularization term λ = 1.
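As an illustration of how lapse rates, side bias and the probit link can combine, the following Python sketch assumes the functional form $P(\mathrm{Right}) = \pi_R + (1 - \pi_L - \pi_R)\,\Phi(\beta + \mathrm{drive}_t)$, where $\mathrm{drive}_t$ stands for the summed module contributions. This form, as well as the function and argument names, is an assumption made for the example and should be checked against the model equation.

```python
from math import erf, sqrt

def std_normal_cdf(x):
    """Cumulative distribution function of the standard normal (Phi)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def p_right(drive, beta=0.0, lapse_left=0.0, lapse_right=0.0):
    """Probability of a Right response under the assumed lapse + probit form.

    drive: summed contribution of all modules at the current trial.
    beta: fixed side bias; lapse_left / lapse_right: lapse rates pi_L, pi_R.
    """
    return lapse_right + (1.0 - lapse_left - lapse_right) * std_normal_cdf(beta + drive)

# With no evidence and no bias, the choice is at chance
print(p_right(0.0))
# Lapses bound the probability between pi_R and 1 - pi_L
print(p_right(50.0, lapse_left=0.02, lapse_right=0.05))
```

Under this form, even overwhelming Rightward evidence cannot push the probability of a Right response above $1 - \pi_L$, which is how the lapse rates capture stimulus-independent errors.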

Impact of each module's estimated bias
We thus computed the variance of each of the terms $\gamma X_t$ and normalized it by the total variance, i.e. including the explained and unexplained behavioral variability [Fründ et al., 2014]. Indeed, when using the probit link function ($\Phi^{-1}$), the unexplained variability has a normal distribution with unit variance; this is because the probit regression model $p(r) = \Phi\left(\sum_k w_k X_k\right)$ is equivalent to $r = H\left(\sum_k w_k X_k + \eta\right)$, where $H$ is the Heaviside step function and $\eta$ is a noise term drawn from a standard normal distribution.
Normalized variance $NV_X$ for module X has the great advantage that it is largely invariant to the addition or subtraction of uncorrelated regressors in the GLM, just like the percentage of variance explained in linear regression.
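A minimal Python sketch of the normalized variance is given below. It assumes that the total variance is the variance of the summed module contributions plus the unit variance of the probit noise; the function name and the dictionary layout are hypothetical.

```python
import numpy as np

def normalized_variance(contributions):
    """Normalized variance of each module's per-trial contribution.

    contributions: dict mapping module name (e.g. "S", "A", "L", "T") to an
    array with that module's weighted sum of regressors on each trial.
    Assumption: total variance = variance of the summed contributions
    (explained part) + 1 (unit-variance noise implied by the probit link).
    """
    arrays = {k: np.asarray(v, dtype=float) for k, v in contributions.items()}
    explained = np.var(sum(arrays.values()))
    total = explained + 1.0
    return {k: np.var(v) / total for k, v in arrays.items()}

# Toy example with two uncorrelated, zero-mean module contributions
nv = normalized_variance({"S": [1, -1, 1, -1], "T": [1, 1, -1, -1]})
print(nv)
```

In this toy example each module has unit variance and the summed contribution has variance 2, so each module accounts for one third of the total variance once the unit-variance noise is included.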

Indeterminacy on the Influence of transition at previous trial vs side of response two trials back
While the GLM analysis allows us to tease apart the effects of lateral and transition history-dependent biases, one indeterminacy remains in this analysis regarding the respective contributions of the previous transition ($T_{t-1}$, i.e. the transition at t − 1) and the side of the response two trials back ($r_{t-2}$, i.e. the side at t − 2) when we perform the analysis separately for trials following an error and trials following a rewarded response. This can be seen in the following way: a weight ω for $T_{t-1}$ means that if trial t − 1 is a repetition trial, i.e. XX, there will be a bias ω towards repeating X at trial t. If, on the contrary, trial t − 1 is an alternation trial, i.e. XY, there will be a bias −ω towards repeating Y, which is equivalent to a bias ω towards selecting X at trial t (denoting by X the side of the response at trial t − 2 and by Y the alternate response). In both cases, increasing the weight for $T_{t-1}$ is equivalent to increasing the weight for $r_{t-2}$ by the same amount, i.e. increasing the lateral bias towards the side selected two trials back. When performing the analysis separately for trials after rewarded responses (first four rows in Table 1) and trials after error responses (last four rows in Table 1), there is an equivalence between the following pairs of regressors: after correct responses, regressors $T^{++}_{t-1}$ vs. $r^{+}_{t-2}$ (yellow cells in Table 1), and $T^{-+}_{t-1}$ vs. $r^{-}_{t-2}$ (gray cells); after error responses, regressors $T^{+-}_{t-1}$ vs. $r^{+}_{t-2}$ (blue cells), and $T^{--}_{t-1}$ vs. $r^{-}_{t-2}$ (green cells). In the GLM analysis, we thus computed a single weight for each pair of identical regressors (e.g. $T^{++}_{t-1}$ and $r^{+}_{t-2}$ after correct trials) that accounts for the contribution of both regressors in the pair to the history-dependent bias. We then attributed a posteriori the fitted weight to either the corresponding transition regressor or the lateral regressor, selecting the one that was most compatible with its value in the preceding trial (i.e. $r_{t-3}$ for $r_{t-2}$; $T_{t-2}$ for $T_{t-1}$).
In all four cases except T ++ t−1 , the lateral regressor was selected, as it smoothly interpolated the corresponding values from trials t − 1 and t − 3, whereas the corresponding transition regressors (T −+ , T +− , and T −− ) had values not significantly different from zero for earlier trials (see lighter dots in Figure 4a). By contrast, the transition regressor was selected in the case of T ++ t−1 , as it corresponded approximately to an exponential extrapolation of the weights of T ++ for earlier trials (see lighter dot in Figure 4b left), and was not compatible with the much smaller values of r + t−3 . Moreover, in the sensory+lateral GLM model that did not feature a transition module, the fitted kernels for the lateral module were non-monotonic, because the weight for r + t−2 was much larger than for r + t−1 (Supplementary Fig. 8b). This peculiar effect, a stronger impact of an event further back in time, is readily explained if this peak mostly represents the influence of the previous transition rather than of the side of the response two trials back. This attribution of weights was supported by correlation analyses with the neighbouring weights of T t−2 and r t−3 across the 25 animals. After correct trials, the weight of the undetermined regressor (T ++ t−1 , r + t−2 ) correlated strongly with T ++ t−2 (r = 0.67, p < 0.001) but not with r + t−3 (p > 0.1). By contrast, the weight of (T −+ t−1 , r − t−2 ) correlated with r − t−3 (r = 0.60, p = 0.0016) but not with T −+ t−2 (p > 0.2). After error responses, the weight of (T +− t−1 , r + t−2 ) correlated with r + t−3 (r = 0.71, p < 0.001) but not with T +− t−2 (p > 0.2). The weight of (T −− t−1 , r − t−2 ) did not correlate significantly with either T −− t−2 or r − t−3 (p > 0.2).

Table 1. Correspondence between T t−1 and r t−2 regressors.
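The equivalence between each T t−1 regressor and its r t−2 partner can be checked numerically. The sketch below is only an illustration on synthetic data, not the authors' analysis code. It assumes that responses are coded as ±1, that the transition regressor acts on the choice as a push towards repeating r t−1 (so its signed contribution is T t−1 × r t−1 ), and that the first superscript of T refers to the outcome at t − 2 and the second to the outcome at t − 1. The function name `transition_vs_lateral` is hypothetical.

```python
import numpy as np


def transition_vs_lateral(r, correct):
    """Compare each outcome-split T_{t-1} regressor with its r_{t-2} partner.

    r       : responses coded +1 (Right) / -1 (Left)
    correct : booleans, True where the response was rewarded
    Returns a dict mapping each regressor pair to True if the two columns
    are identical on the relevant subset of trials.
    """
    r = np.asarray(r)
    correct = np.asarray(correct, dtype=bool)

    # Align everything on the current trial t = 2 .. n-1.
    r_prev, r_two_back = r[1:-1], r[:-2]
    ok_prev, ok_two_back = correct[1:-1], correct[:-2]

    transition = r_prev * r_two_back   # T_{t-1}: +1 repetition, -1 alternation
    push = transition * r_prev         # side favoured by a positive weight

    identical = {}
    # b: outcome at t-1 (selects the trial subset); a: outcome at t-2.
    for b, subset in (("+", ok_prev), ("-", ~ok_prev)):
        for a, mask in (("+", ok_two_back), ("-", ~ok_two_back)):
            t_col = (push * mask)[subset]        # T^{ab}_{t-1}, signed
            r_col = (r_two_back * mask)[subset]  # r^{a}_{t-2}
            identical[f"T{a}{b} vs r{a}"] = bool(np.array_equal(t_col, r_col))
    return identical


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    responses = rng.choice([-1, 1], size=1000)   # synthetic, not rat data
    outcomes = rng.random(1000) < 0.75
    for pair, same in transition_vs_lateral(responses, outcomes).items():
        print(pair, same)
```

Because r t−1 squared is always 1, the product T t−1 × r t−1 reduces to r t−2 , which is why the two columns cannot be separated once trials are split by the outcome at t − 1.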

Alternative GLM models
As control models, we also fitted the following variants of the GLM to the data of each individual rat: (1) a sensory model in which all regressors related to history (after-effect, lateral and transition) were removed, (2) a sensory+lateral model in which all history regressors except the lateral were removed, and (3) a sensory+transition model in which all history regressors except the transition were removed. For both the sensory and the sensory+transition models, we added the response at the previous trial r t−1 as a regressor, in order to capture any overall fixed repeating bias. Model comparison was performed using the Bayesian Information Criterion (BIC) (Supplementary Fig. 8a). Comparison using the corrected Akaike Information Criterion (AICc) provided similar results. We also assessed whether repeating and alternating events each played a separate role in the formation of the transition bias, and whether these roles were mirror images of each other, which would suggest that animals were indeed conceptualizing both patterns, i.e. Rep vs Alt, as opposite sides of the same bias. For this, we compared our canonical model with a variant model where the transition regressors T o,q t described above were replaced by the following ones (Supplementary Fig. 8c):
1. The repetition of two consecutively correct responses Rep ++ t taking value 1 for repetition between the two correct trials (i.e. X + X + ), and 0 otherwise.
2. The alternation of two consecutively correct responses Alt ++ t taking value 1 for alternation between the two correct trials (i.e. Y + X + ), and 0 otherwise.
3. The generic transition between successive responses T t , independently of the outcome of these responses, taking value 1 for any repetition (i.e. XX) and -1 for any alternation (i.e. Y X).
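The three variant regressors listed above, and the two information criteria used for model comparison, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: responses are assumed to be coded ±1, the function names `variant_transition_regressors`, `bic` and `aicc` are hypothetical, and the criteria use the textbook definitions with k free parameters, n trials and maximised log-likelihood log L.

```python
import math

import numpy as np


def variant_transition_regressors(r, correct):
    """Build Rep++, Alt++ and the generic transition T from a trial sequence.

    Entry i of each output describes the transition between trial i and
    trial i + 1; lagging these columns for use in a GLM is left out here.
    """
    r = np.asarray(r)
    correct = np.asarray(correct, dtype=bool)

    both_correct = correct[1:] & correct[:-1]
    repetition = r[1:] == r[:-1]

    rep_pp = (both_correct & repetition).astype(int)    # 1 for X+ X+, else 0
    alt_pp = (both_correct & ~repetition).astype(int)   # 1 for Y+ X+, else 0
    generic_t = np.where(repetition, 1, -1)             # +1 XX, -1 YX, any outcome
    return rep_pp, alt_pp, generic_t


def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: k * ln(n) - 2 * ln(L)."""
    return k * math.log(n) - 2.0 * log_likelihood


def aicc(log_likelihood, k, n):
    """Akaike Information Criterion with the small-sample correction."""
    aic = 2.0 * k - 2.0 * log_likelihood
    return aic + (2.0 * k * (k + 1)) / (n - k - 1)


if __name__ == "__main__":
    rep_pp, alt_pp, generic_t = variant_transition_regressors(
        [1, 1, -1, -1, 1], [True, True, True, False, True]
    )
    print(rep_pp, alt_pp, generic_t)
```

With both criteria, the model with the lower value is preferred; BIC penalises additional parameters more heavily than AICc once n is large.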
2.5 Testing the complete reset hypothesis versus the gating hypothesis

3 Dynamic variable model of behaviour

Standard model with modulated transition bias
To implement the gating hypothesis, in which transition evidence is maintained in memory after errors but transiently does not affect choices, we developed a compact model in which the accumulated transition evidence z T was passed on from trial to trial and updated depending on whether the last choice was a repetition or an alternation. This variable maintained a running estimate of the transition statistics, and its transformation into the transition bias γ T was gated by a second variable c T by setting γ T = c T × z T × r t−1 . This modulatory variable c T was updated based only on each trial's outcome. Similarly to the transition evidence, the model also contained a variable z L that maintained the lateral evidence which, in the canonical version of the model (see below), had no modulatory mechanism, so that its value was simply used as the lateral bias γ L = z L . Compared with the GLM, which draws on each of the 10 previous trials in order to make a decision, implementing the latent variable model would reduce the working memory load to simply maintaining the value of three latent variables. We used a simple expression for the updating rules of the variables z L and z T (Eqs. 2 and 3). In these rules, λ X = (λ + X , λ − X ), with X = L, T , represents the leak parameters, which take a different value depending on the outcome of the current trial O t , thus allowing a rapid decay following errors (i.e. an after-error reset could be obtained with λ − X = 1). ∆ L = (∆ + L , ∆ − L ) and ∆ T = (∆ ++ T , ∆ +− T , ∆ −+ T , ∆ −− T ) are the update parameters, which take different values depending on the last trial's outcome O t for the lateral evidence or on the outcomes of the last two trials (O t , O t−1 ) for the transition evidence. The model assumed two symmetries: (1) between the effects of Right (r t = +1) and Left (r t = −1) responses on z L (Eq. 2), and (2) between the effects of repetitions (r t r t−1 = +1) and alternations (r t r t−1 = −1) on z T (Eq. 3).
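The trial-by-trial logic of the model can be sketched in a few lines. This sketch rests on assumptions that are not taken from the text: the evidence variables are assumed to follow a leaky update of the form z ← (1 − λ) z + ∆ × (signed event), chosen because it gives a full reset when λ = 1, and the gating variable is assumed to follow a saturating rule that keeps c T within [0, 1]. Only the relations γ T = c T × z T × r t−1 and γ L = z L are taken directly from the description above. The names `step`, `simulate` and `EXAMPLE_PARAMS`, and all parameter values, are hypothetical and are not fitted values.

```python
# Outcomes are coded +1 (correct) / -1 (error); responses +1 (Right) / -1 (Left).
EXAMPLE_PARAMS = {
    "leak_L": {1: 0.5, -1: 0.5},
    "delta_L": {1: 0.1, -1: 0.0},
    "leak_T": {1: 0.2, -1: 0.0},    # no leak after errors: evidence is kept
    "delta_T": {(1, 1): 1.0, (1, -1): 0.0, (-1, 1): 0.0, (-1, -1): 0.0},
    "delta_C": {1: 1.0, -1: -1.0},  # gate fully opens / fully closes
}


def step(state, r_t, r_prev, o_t, o_prev, p):
    """Update (z_L, z_T, c_T) after trial t and return the next-trial biases."""
    z_L, z_T, c_T = state

    # Assumed leaky accumulation: a leak of 1 would wipe the stored evidence.
    z_L = (1 - p["leak_L"][o_t]) * z_L + p["delta_L"][o_t] * r_t
    z_T = (1 - p["leak_T"][o_t]) * z_T + p["delta_T"][(o_t, o_prev)] * r_t * r_prev

    # Assumed gate rule: positive deltas move c_T towards 1, negative towards 0.
    d = p["delta_C"][o_t]
    c_T = c_T + d * (1 - c_T) if d > 0 else c_T + d * c_T

    gamma_L = z_L
    gamma_T = c_T * z_T * r_t  # r_t plays the role of r_{t-1} on the next trial
    return (z_L, z_T, c_T), gamma_L, gamma_T


def simulate(responses, outcomes, p, state=(0.0, 0.0, 1.0)):
    """Run the model over a sequence; returns (z_T, c_T, gamma_T) per trial."""
    trace = []
    for t in range(1, len(responses)):
        state, _, gamma_T = step(
            state, responses[t], responses[t - 1], outcomes[t], outcomes[t - 1], p
        )
        trace.append((state[1], state[2], gamma_T))
    return trace


if __name__ == "__main__":
    # Correct repetitions, then one error, then a correct response.
    for row in simulate([1, 1, 1, 1, 1, 1], [1, 1, 1, 1, -1, 1], EXAMPLE_PARAMS):
        print(row)
```

With these illustrative parameters, the error closes the gate so that the transition bias vanishes while z T is preserved, and the next correct response reopens the gate and restores the bias. A complete reset would instead correspond to setting the after-error leak of z T to 1.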
The modulatory variable c T was bounded between 0 (no influence of the transition bias on the decision) and 1 (maximal influence on the decision) and was updated according to the dynamics of Eq. 4. There were two real-valued parameters ∆ C = (∆ + C , ∆ − C ) that determined how the gating variable was updated following a correct trial or an error. Moreover, the update differed depending on the value of ∆ C obtained in the fitting procedure: positive ∆ C values led to an increase in c T , while negative values led to a decrease (Eq. 4). The value of ∆ C was bounded between -1 (c T

where the prefactor n O+ ( n OtOt−1 ) represents the number of trials yielding each outcome O t (combination of outcomes O t , O t−1 ). Using Eqs. 8-9, one parameter ∆ x in each of the update vectors (∆ L and ∆ T ) could then be determined from the values of all other parameters. We thus removed it from the list of free parameters and used the following equation to compute the gradient for the other (free) update parameters ∆ y :