Synaptic mechanisms of interference in working memory

Information from preceding trials of cognitive tasks can bias performance in the current trial, a phenomenon referred to as interference. Subjects performing visual working memory tasks exhibit interference in their trial-to-trial response correlations: the recalled target location in the current trial is biased in the direction of the target presented on the previous trial. We present modeling work that (a) develops a probabilistic inference model of this history-dependent bias, and (b) links our probabilistic model to computations of a recurrent network wherein short-term facilitation accounts for the dynamics of the observed bias. Network connectivity is reshaped dynamically during each trial, providing a mechanism for generating predictions from prior trial observations. Applying timescale separation methods, we can obtain a low-dimensional description of the trial-to-trial bias based on the history of target locations. The model has response statistics whose mean is centered at the true target location across many trials, typical of such visual working memory tasks. Furthermore, we demonstrate task protocols for which the plastic model performs better than a model with static connectivity: repetitively presented targets are better retained in working memory than targets drawn from uncorrelated sequences.

Inference model for updating target predictions. Interference increases error in working memory tasks with independent trials, but may improve performance in tasks with probabilistically structured sequences of visual targets. We propose this as a biological origin of interference: subjects assume some predictable temporal structure in their environment. In fact, sequential Bayesian updating can account for interference observed in working memory, given specific constraints on a probabilistic updating algorithm. The observer attempts to predict the probability of observing target angle θ_{n+1} = θ in trial n + 1, given the targets θ_{1:n} = {θ_1, θ_2, …, θ_n} observed in the previous n trials (Fig. 2A). However, the target θ_j on the j-th trial will only help predict the target θ_{n+1} on the (n + 1)-th trial if the distribution s_{n+1}(θ) from which targets are drawn remains the same between trial j and trial n + 1 [30]. The observer assumes the distribution from which presented targets are drawn changes stochastically at a fixed rate ε := P(s_{n+1}(θ) ≢ s_n(θ)). Most visual working memory protocols fix the distribution of target angles throughout the task (ε = 0) [1,3,6,13], as we do for most of our study, so an observer who assumes ε > 0 employs a potentially incorrect model to estimate this distribution. Subjects in psychophysical tasks can have a strong bias toward assuming environments change on a timescale of several seconds [21], and this bias is not easily trained away [23,31]. Combining these features of the model, the observer updates their predictive distribution for the target during the (n + 1)-th trial.
Our algorithm is based on models that compute a predictive distribution for a stochastically moving target, given a sequence of noisy observations [30,32]. The predictive distribution is computed using sequential analysis [22,33]: prior to trial n + 1, the observer has seen n targets θ_{1:n} = {θ_1, θ_2, …, θ_n}. The observer computes f_{θ_j}(θ) (Fig. 2B), the probability of observing the target θ_{n+1} = θ in the (n + 1)-th trial assuming the underlying probability distribution from which targets are sampled does not change from trial j to n + 1 (s_{n+1}(θ) ≡ s_j(θ)), for each trial j = 1, …, n. The true distribution of target angles θ remains uniform throughout most of our study, so the observer is applying suboptimal inference. Further details of our Bayesian nonparametric model are given in Methods.
The observer thus computes a predictive distribution L_{n+1,θ} = P(θ_{n+1} = θ | θ_{1:n}, ε), using the previous targets θ_{1:n} (Fig. 2A) to predict the subsequent target θ_{n+1}. If the observer assumes the distribution s_{n+1}(θ) from which targets are drawn in trial n + 1 changes stochastically with a rate ε ∈ (0, 1), recent observations will be weighted more in determining L_{n+1,θ} [21,22,30]. Each observation θ_j contributes to the current estimate of L_{n+1,θ} via the probability f_{θ_j}(θ) (Fig. 2B). Observations are weighted by assuming the observer has a fixed belief about the value ε, which specifies the average number of trials they expect the distribution s_n(θ) to remain the same. Leveraging techniques in probabilistic inference (see Methods), we find

L_{n+1,θ} = ε P_0 + Σ_{l=1}^{n−1} ε(1 − ε)^l (1/Z_l) Π_{j=n−l+1}^{n} f_{θ_j}(θ) + (1 − ε)^n (1/Z_n) Π_{j=1}^{n} f_{θ_j}(θ),   (1)

where Z_l normalizes each product of densities and P_0 := 1/360 is the uniform density for −180° ≤ θ < 180°. To understand Eq. (1), it is instructive to examine limits of the parameter ε that admit approximations or exact updates.
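To make the update concrete, here is a minimal numerical sketch of Eq. (1). The peaked density f_{θ′}(θ) below is an assumed exponential-of-cosine form (the N = 1 case discussed in Methods), and the concentration parameter kappa is illustrative rather than fitted:

```python
import numpy as np

theta_grid = np.arange(-180.0, 180.0, 1.0)   # 1-degree grid over [-180, 180)
dth = 1.0
P0 = np.full(theta_grid.size, 1.0 / 360.0)   # uniform density P_0

def f(theta_prev, kappa=2.0):
    """Assumed peaked density f_{theta'}(theta): normalized exponential of a
    cosine centered at the previously observed target theta_prev."""
    d = np.exp(kappa * np.cos(np.radians(theta_grid - theta_prev)))
    return d / np.trapz(d, dx=dth)

def predictive(targets, eps):
    """Eq. (1): mixture over run lengths l of normalized products of f's,
    weighted by P(l) = eps*(1 - eps)**l for l < n and (1 - eps)**n for l = n."""
    n = len(targets)
    L = eps * P0
    prod = np.ones_like(theta_grid)
    for l in range(1, n + 1):
        prod = prod * f(targets[n - l])        # product over the last l targets
        cond = prod / np.trapz(prod, dx=dth)   # normalize (the 1/Z_l factor)
        weight = eps * (1 - eps) ** l if l < n else (1 - eps) ** n
        L = L + weight * cond
    return L

L_next = predictive([30.0, -45.0, 80.0], eps=0.8)  # peaked near the last target
```

For ε near 1 the result is dominated by the most recent target, anticipating the truncation in Eq. (3) below.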
Static environments (ε → 0). In the limit ε → 0, the observer assumes the environment is static, so the predictive distribution is comprised of equal weightings of each observation (see Fig. 2C and [34,35]):

L_{n+1,θ} = (1/Z_n) Π_{j=1}^{n} f_{θ_j}(θ).   (2)

Figure 1. Interference in visuospatial working memory, and our corresponding recurrent network model with STF. (A) A visuospatial working memory task was administered in consecutive trials (schematics adapted from Papadimitriou et al. [13]). The subject fixates on the central (blue) dot and a target (red dot) appears at location θ_n on trial n. After the target disappears, the subject retains a memory of the target location during the delay-period (T_D^n and T_D^{n+1}, 0-6000 ms). Lastly, the subject makes a saccade (r_n and r_{n+1}) to the remembered target location. Papadimitriou et al. [13] found a systematic impact of the relative location (θ_n − θ_{n+1}) of the trial n target on the trial n + 1 response r_{n+1}. (B) Response biases in trial n + 1, ⟨r_{n+1} − θ_{n+1}⟩, depend on the relative location of the target (θ_n − θ_{n+1}) in trial n. Responses err in the direction of the previous target θ_n, but this tendency is non-monotonic in θ_n − θ_{n+1}. (C,D) The maximum average bias in trial n + 1 decreases with the intertrial interval T_I^n (panel C) and increases with the trial n + 1 delay-period T_D^{n+1} (panel D). (E) Schematic of our recurrent network model, showing excitatory (triangle) and inhibitory (circle) neurons. Connections between excitatory cells are distance-dependent. Effects of the inhibitory population are fast and spatially uniform, so the excitatory and inhibitory populations are merged into the single variable u(x, t). STF increases the strength of recently used synapses, described by the variable q(x, t). (F) A tuned input during the cue period (T_C) generates a bump of neural activity u(x, t) centered at x = θ_n that persists during the delay-period of trial n (T_D^n) and ceases after the response. After the intertrial interval (T_I^n), the bump initially centered at x = θ_{n+1} drifts towards the position of the bump in the previous trial (dotted line) due to the attractive force of STF. Input fluctuations are ignored here to highlight the bias in a single trial.
As has been shown previously, Eq. (2) can be written iteratively [36]:

L_{n+1,θ} = f_{θ_n}(θ) L_{n,θ} / ∫_{−180}^{180} f_{θ_n}(θ′) L_{n,θ′} dθ′,

suggesting such a computation could be implemented and represented by neural circuits. Temporal integration of tuned inputs has been demonstrated in both neural recordings [37-39] and circuit models [35,36,40] of decision-making tasks. Most oculomotor delayed-response tasks use a distribution of targets s(θ) that is constant across trials [1,3,6,13]. Therefore, Eq. (2) is the optimal strategy for obtaining an estimate of s(θ), assuming the observer has a correct representation of the probability f_{θ′}(θ). For instance, if the distribution s(θ) were peaked, repeated observations θ_{1:n} would gradually improve the observer's estimate of that peak in Eq. (2). In changing environments (ε > 0), recently observed targets are weighted more strongly than older targets, and the predictive distribution should down-weight the influence of past targets at a rate that increases with ε [22].
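A minimal sketch of this iterative computation, reusing the same assumed exponential-of-cosine density as above (redefined here so the snippet stands alone):

```python
import numpy as np

grid = np.arange(-180.0, 180.0, 1.0)

def f(theta_prev, kappa=2.0):
    d = np.exp(kappa * np.cos(np.radians(grid - theta_prev)))
    return d / np.trapz(d, dx=1.0)

def static_update(L_prev, theta_obs):
    """Iterative form of Eq. (2): multiply by f_{theta_n} and renormalize."""
    L_new = f(theta_obs) * L_prev
    return L_new / np.trapz(L_new, dx=1.0)

L = np.full(grid.size, 1.0 / 360.0)    # start from the uniform prior P_0
for th in [30.0, -45.0, 80.0]:         # every observation weighted equally
    L = static_update(L, th)
```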
Rapidly-changing environment (ε ≈ 1). Our work focuses on the limit where the environment changes rapidly, ε ≈ 1 (0 < 1 − ε ≪ 1), to account for biases that depend on the previous trial's target θ_n (see Methods for other cases). In this case, the predictive distribution for trial n + 1 is a single-peaked function centered at θ_n (Fig. 2C). The observer assumes the environment changes fast enough that each subsequent target is likely drawn from a new distribution (s_{n+1}(θ) ≢ s_n(θ)). This is a suboptimal strategy, but it matches the typical trends of interference in working memory. Applying this assumption to Eq. (1), the formula for L_{n+1,θ} is dominated by terms of order (1 − ε) and larger. Truncating to O(1 − ε) and normalizing the update equation (see Methods) then yields

L_{n+1,θ} ≈ [P_0 + (1 − ε) f_{θ_n}(θ)] / (2 − ε).   (3)

Thus, the dominant contribution from θ_{1:n} to L_{n+1,θ} is the target θ_n observed during the previous trial n (Fig. 2C), similar to the behavioral findings of Papadimitriou et al. [13].
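The normalization constant in Eq. (3) follows from one line of arithmetic; a worked version of the truncation step, keeping only the l = 0 and l = 1 terms of Eq. (1):

```latex
L_{n+1,\theta} \approx \varepsilon P_0 + \varepsilon(1-\varepsilon)\, f_{\theta_n}(\theta),
\qquad
\int_{-180^\circ}^{180^\circ} \left[\varepsilon P_0 + \varepsilon(1-\varepsilon)\, f_{\theta_n}(\theta)\right] \mathrm{d}\theta
= \varepsilon(2-\varepsilon),
```

so dividing by the total mass ε(2 − ε) gives the normalized update in Eq. (3).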

Figure 2. Updating the predictive distribution L_{n+1,θ}. The observer infers the predictive distribution for the subsequent target θ_{n+1} from prior observations θ_{1:n}, assuming a specific change rate ε of the environment. (A) A sequence of presented targets, θ_{1:3}. Note the environment is typically static (s_{n+1}(θ) ≡ s_n(θ)). (B) The probability f_{θ_j}(θ) is peaked and centered at θ_j, showing the observer's assumed probability that θ_{n+1} = θ, if θ_j is observed on trial j and the distribution remains the same in between (s_{n+1}(θ) ≡ s_j(θ)). (C) Evolution of the predictive distribution L_{n+1,θ} for static (ε = 0); slowly-changing (ε = 0.1); and rapidly-changing (ε = 0.8) environments. In static environments, all observations θ_{1:3} are weighted equally, whereas in the rapidly-changing environment, the most recent observation dominates. Note, sequential computations are trivial in the limit of a constantly-changing environment ε → 1, since the observer assumes the environment is reset after each trial. Prior observations provide no information about the present distribution, so the predictive distribution is always uniform: L_{n+1,θ} ≡ P_0.

In summary, a probabilistic inference model that assumes the distribution of targets is predictable over short timescales leads to response biases that depend mostly on the previous trial. We now demonstrate that this predictive distribution can be incorporated into a low-dimensional attractor model which describes the degradation of target memory during the delay-period of visual working memory tasks [10,41,42].

Incorporating suboptimal predictions into working memory. We model the loading, storage, and recall of a target angle θ using a low-dimensional attractor model spanning the space of possible target angles θ ∈ [−180°, 180°). These dynamics can be implemented in recurrent neuronal networks with slow excitation and broad inhibition [6,9,43]. Before examining the effects of neural architecture, we discuss how to incorporate the predictive distribution update, Eq. (3), into an associated low-dimensional model. Our analysis links the update of the predictive distribution to the spatial organization of attractors in a network. Importantly, working memory is degraded by dynamic fluctuations, so the stored target angle wanders diffusively during the delay-period [6,9,42].
During the delay-period of a single trial, the stored target angle θ(t) evolves according to a stochastic differential equation [10]:

dθ(t) = −U′_{n+1}(θ) dt + σ dξ(t),   (4)

where θ(t) is restricted to the periodic domain θ ∈ [−180°, 180°) and dξ is a standard white-noise process. Equation (4) can be derived as the low-dimensional projection for the location of a bump attractor in a recurrent network. The potential gradient −U′_{n+1}(θ) models spatial heterogeneity in neural architecture that shapes the attractor dynamics (Fig. 3A); we label the potential during trial n + 1 as U_{n+1}(θ). Classic models of bump attractors on a ring assume distance-dependent connectivity [9,43]. The case U′_{n+1}(θ) ≢ 0 accounts for spatial heterogeneity in connectivity that may arise from a combination of training and synaptic plasticity [10,44], or random components of synaptic architecture [45]. Potential models of bump attractors have been derived in detail for recurrent networks [46,47], and agree well with the qualitative dynamics of spiking network models [10,42]. The potential landscape of Eq. (4) is assumed to be updated during each trial, taking a new form at the beginning of trial n + 1. When U_{n+1}(θ) is constant, the potential is flat, so θ(t) evolves along a line attractor [46]. On the other hand, when the potential is heterogeneous, U′_{n+1}(θ) ≢ 0, θ(t) tends to drift toward one of a finite number of discrete attractors [10,42]. We will incorporate a process whereby previous targets are used to update the potential, so U_{n+1}(θ) is typically heterogeneous. The observer sees the target at the beginning of trial n + 1, θ(0) = θ_{n+1} (Fig. 3A), and the angle θ(t) evolves according to Eq. (4) during the delay-period, lasting T_D time units. After the delay-period, θ(T_D) is the recalled angle. Depending on the underlying potential U_{n+1}(θ), there will be a strong bias to a subset of possible targets. We derive a correspondence between the probabilistic inference model and the attractor model by assuming stationarity of U_{n+1}(θ) within each trial (see Methods); in the recurrent network model (Fig. 1E), we take these within-trial dynamics into account. Freezing U_{n+1}(θ) during a trial allows us to relate the statistics of the position θ(t) to the shape of the potential. Specifically, we relate the stationary density of Eq. (4) to the desired predictive distribution L_{n+1,θ} (see Methods). Thus, if information about the current trial's target θ_{n+1} is degraded, the probability distribution associated with the recalled target angle θ is L_{n+1,θ}. Focusing on the interference trends in Fig. 1, we aim to have the attractor structure of Eq. (4) represent the predictive distribution in Eq. (3). Our calculations relate the potential function in trial n + 1 to the probability f_{θ_n}(θ) generated by the trial n target (Fig. 3A), as can be implemented by a decaying plasticity process that facilitates synapses from neurons tuned to the previous target θ_n. The predictive distribution L_{n+1,θ} is therefore encoded by a potential U_{n+1}(θ) that shapes the dynamics of the attractor model. As we will show, this can be accomplished via STF (Fig. 3B).
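As an illustration, a minimal Euler-Maruyama simulation of Eq. (4) with an assumed single-well potential; all parameter values below are illustrative, not fitted:

```python
import numpy as np

def delay_period(theta0, U_prime, T_D=2000.0, dt=0.1, sigma=0.5, rng=None):
    """Euler-Maruyama integration of Eq. (4) on the ring [-180, 180)."""
    rng = rng or np.random.default_rng()
    theta = theta0
    for _ in range(int(T_D / dt)):
        theta += -U_prime(theta) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        theta = (theta + 180.0) % 360.0 - 180.0   # periodic wrapping
    return theta                                  # recalled angle theta(T_D)

# Single-well potential centered on a hypothetical previous target (60 deg):
U_prime = lambda th: 0.02 * np.sin(np.radians(th - 60.0))
recalled = delay_period(theta0=-30.0, U_prime=U_prime)
```

Over many runs, recalled angles drift from the cue toward the potential minimum, which is exactly how a well at θ_n biases recall toward the previous target.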
Short-term facilitation generates interference in working memory. We now show that a neuronal network model describing neural activity u(x, t) subject to STF q(x, t) can incorporate the predictive distribution updates derived above. Predictions are stored in the dynamically changing synaptic weights of a recurrent neuronal network. The recurrent network model spatially labels neurons according to their target orientation preference, determining the distance-dependent structure of inputs to the network. This is captured by a network with local excitation and effective inhibition that is fast and broad. Connectivity is shaped dynamically by STF (Fig. 1E). See Methods for more details.
A sequence of delayed-response protocols is implemented in the recurrent network by specifying a spatiotemporal input I(x, t) across trials (top of Fig. 1F). Each trial has a cue period of length T_C; a delay-period of length T_D^{n+1}; and an intertrial period of length T_I^{n+1} before the next target is presented. The network receives a peaked current centered at the neurons preferring the presented target angle θ_{n+1} during the cue period of trial n + 1; no external input during the delay-period; and a strong inactivating current after the delay-period [6,9]. The resulting bump attractor drifts in the direction of the bump from trial n, due to the STF at the location of the trial n bump (Figs 1F and 3B).
The mechanism underlying the intertrial bias is determined by projecting our recurrent network model onto a low-dimensional system that extends Eq. (4) to account for STF [28]. To reduce the recurrent network, we project the fast dynamics of bump solutions onto an equation for the bump's position θ(t) in trial n + 1; the STF variable q(x, t) determines an evolving potential function U(θ, t) that shapes the bump's position (Fig. 4). We use timescale separation methods (see Methods) to derive a set of stochastic differential equations, Eq. (6), which approximates the motion of the bump's position θ(t) and the location θ_q(t) of the STF during trial n + 1 (t_n < t < t_{n+1}). The slowly-evolving potential gradient −∂U(θ, t)/∂θ shaping the dynamics of θ(t) is a mixture of STF contributions from trial n (decaying U_n(θ, t)) and trial n + 1 (increasing U_{n+1}(θ, t)). The functions U_n(θ, t) obey linear dynamics, as shown in Fig. 4 (and see Methods). The bump position θ(t) moves towards the minimum of this dynamic potential, argmin_θ U(θ, t) (Fig. 4). The variable θ_q(t) is the location of STF originating in trial n + 1, and its position slowly moves toward the bump location θ(t) via the scaled circular difference d(θ − θ_q). The parametrized timescale τ of STF is inversely related to the observer's assumed environmental change rate ε in Eq. (3), since increasing ε corresponds to decreasing τ. While our derivation (see Methods) assumes STF is slow and weak, we find the approximation agrees well with the full system even for stronger STF.
The presence of STF provides two contributions to the slow dynamics of the bump position θ(t). The memory of the previous trial's target θ_n is reflected by the potential U_n(θ − θ_n), whose effect decays slowly during trial n + 1. This attracts θ(t), but the movement of θ(t) towards θ_n is slowed by the onset of the current trial's STF variable, initially centered at θ_{n+1}. The STF variable's center-of-mass θ_q(t) slowly drifts towards θ_n, which allows θ(t) to drift there as well. This accounts for the slow build-up of the bias, which increases with the length of the delay-period [13].
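A minimal sketch of this reduced description follows; the sinusoidal wells, their amplitudes, and the exponential build-up/decay of the two potential contributions are assumed functional forms chosen for illustration, not the exact terms of Eq. (6):

```python
import numpy as np

def circ_diff(a, b):
    """Shorter signed difference a - b on the ring [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0

def reduced_trial(theta_cue, theta_prev, T=2000.0, dt=0.1, A_prev=0.04,
                  A_curr=0.04, tau=1000.0, sigma=0.3, rng=None):
    """Bump position theta(t) in a potential mixing a decaying well at the
    previous target and a growing well tracking the STF center theta_q(t)."""
    rng = rng or np.random.default_rng()
    theta, theta_q = theta_cue, theta_cue
    for k in range(int(T / dt)):
        t = k * dt
        w_prev = A_prev * np.exp(-t / tau)           # trial-n STF decays
        w_curr = A_curr * (1.0 - np.exp(-t / tau))   # trial-(n+1) STF builds
        drift = (w_prev * np.sin(np.radians(circ_diff(theta_prev, theta)))
                 + w_curr * np.sin(np.radians(circ_diff(theta_q, theta))))
        theta += drift * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        theta_q += circ_diff(theta, theta_q) / tau * dt
    return circ_diff(theta, 0.0)   # recalled angle; bias = result - theta_cue
```

The competition between the two wells reproduces the qualitative picture above: the bump is pulled toward θ_prev, while the self-generated well at θ_q slows that drift.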
Target-and time-dependent trends match experimental observations. We now demonstrate that the bias observed in the visual working memory experiments of Papadimitriou et al. 13 can be accounted for by our recurrent network model (Fig. 1E) and our low-dimensional description of bump motion dynamics (Fig. 4).
To represent a sequence of working memory trials, targets (θ_1, θ_2, θ_3, …) were presented to the recurrent network, and the center-of-mass of the bump was recorded at the end of each delay-period, representing the response (r_1, r_2, r_3, …) (see Methods). The bias of responses was determined by computing the difference between the response and the presented target, r_n − θ_n. Means and variances of the bias were determined under each condition. Our results are summarized in Fig. 5, focusing on three conditions considered in Papadimitriou et al. [13]. First, we calculated the bias when conditioning on the angle between the trial n and trial n + 1 targets, θ_n − θ_{n+1} (Fig. 5A). Positive (negative) angles lead to positive (negative) bias; i.e., the bump drifts in the direction of the previous target θ_n. The bias is graded with the difference θ_n − θ_{n+1}. To expose this effect, we averaged across trials, since the recurrent network incorporates dynamic input fluctuations, as in bump attractor models of visuospatial working memory [6,9]. We also calculated the peak bias as a function of the intertrial interval (ITI), the time between the trial n response (r_n) and the trial n + 1 target presentation (θ_{n+1}). Consistent with Papadimitriou et al. [13], the peak bias decreased with the ITI (Fig. 5B). The mechanism for this decrease is the slow decay of the STF of synapses utilized in the previous trial. Finally, the peak bias increased with the delay within a trial, since persistent activity was slowly attracted to the location of the previous target (Fig. 5C). This slow saturation arises due to the slow kinetics of STF within a trial. The bias produced is self-reinforcing, as the synapses originating from the newly-activated neurons undergo STF.
Not only did our recurrent network model recapitulate the main findings of Papadimitriou et al. [13]; our low-dimensional description of the bump and STF dynamics exhibited the same properties (blue curves in Fig. 5). The mechanics underlying the bias can thus be described by a model of a particle evolving in a slowly changing potential (Fig. 4), shaped by the dynamics of STF. Having established a mechanism for the bias, we consider how different protocols determine the unconditional statistics of responses.
Task protocol shapes ensemble statistics. Visual working memory tasks are often designed such that sequential target locations are independent [6,9]. In such protocols, there is no advantage in using previous trial information to predict targets within the current trial. Nonetheless, these biases persist in the intertrial response correlations discussed in Papadimitriou et al. [13] and Fig. 5. On the other hand, interference might be advantageous for tasks in which successive target angles θ_{n+1} depend on the previous target θ_n. Consider object motion tracking tasks with transiently occluded objects [48,49]. The object's location prior to occlusion predicts its subsequent location when it reappears, so object memory that persists beyond a single trial can be useful for more naturally-inspired tasks. We demonstrate this idea by comparing the network's performance in working memory tasks whose targets are drawn from distributions with different intertrial dependencies (Fig. 6). As a control, we consider the case of independent targets θ_n and θ_{n+1} (Fig. 6A). Responses are normally distributed about the true target angle. The dynamics of the bump encoding the target are shaped by both input fluctuations and a bias in the direction of the previous target on individual trials. However, the directional bias is not apparent in the entire distribution of response angles, since it samples from all possible pairs (θ_{n+1}, θ_n). An ensemble-wide measure of performance is given by the standard deviation of the response distribution (σ ≈ 4.42). When the current target angle depends on the previous angle, the relative response distribution narrows (Fig. 6B). Memory of the previous trial's target θ_n stabilizes the memory of the current trial's target θ_{n+1}, decreasing the standard deviation of responses (σ ≈ 3.20). There is a high probability that the current target θ_{n+1} will be close to the previous target θ_n, so the timescale of the network's underlying inference process is reasonably well matched to the environment. However, such effects can be deleterious when the current angle θ_{n+1} is systematically skewed away from the previous angle θ_n. Protocols with this angle dependence lead to a systematic bias in the relative response distribution, so its peak is shifted away from zero (Fig. 6C).
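For concreteness, here is a sketch of how such target sequences might be generated; the von Mises step distribution and its parameters are illustrative choices, not the exact protocol used in our simulations:

```python
import numpy as np

def target_sequence(n_trials, kind="independent", kappa=2.0, mu=0.0, rng=None):
    """Generate target angles (degrees) with different intertrial dependencies:
    'independent' (uniform), 'correlated' (theta_{n+1} near theta_n), or
    'skewed' (theta_{n+1} offset clockwise by mu from theta_n)."""
    rng = rng or np.random.default_rng()
    thetas = [rng.uniform(-180.0, 180.0)]
    for _ in range(n_trials - 1):
        if kind == "independent":
            thetas.append(rng.uniform(-180.0, 180.0))
        else:
            center = thetas[-1] + (mu if kind == "skewed" else 0.0)
            step = np.degrees(rng.vonmises(0.0, kappa))  # peaked around 0
            thetas.append((center + step + 180.0) % 360.0 - 180.0)
    return np.array(thetas)
```

Feeding the 'correlated' sequence to the network narrows the relative response distribution, while the 'skewed' sequence shifts its peak, as in Fig. 6B,C.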
Our neuronal network model predicts that, if an intertrial bias is present in the computations of a neural circuit, it should be detectable by varying the intertrial dependence of the target angles θ_n. Furthermore, when there are strong local correlations between adjacent trials (P(θ_{n+1}, θ_n) is large for |θ_{n+1} − θ_n| small), responses are more accurate than for protocols with independent adjacent trial angles. Since the strength of the bias increases as the intertrial interval is decreased, due to the decay of STF, we expect the effect to be more pronounced for trials taken closer together.
Two timescales of memory degradation. Wimmer et al. [6] have shown that both the normal distribution of saccade endpoints and the observed changes in neural firing rates during the delay-period can be replicated by a diffusing bump attractor model. We have shown that a recurrent network with STF (Fig. 1E) still leads to a normal distribution of predicted response angles (Fig. 6A). Our model also provides new predictions for the dynamics of memory degradation, which we now compare with the standard diffusing bump attractor model [9,47] (Fig. 7). In a network with STF (Fig. 7A), bump trajectories evolve in a history-dependent fashion (Fig. 7B). Initially, bumps diffuse freely, but they are eventually drawn to their starting location by facilitated synapses (see also Fig. 4). This leads to two distinct phases of diffusion, as shown in plots of the bump variance (Fig. 7C). Rapid diffusion occurs initially, as the bump equilibrates to the quasistationary density determined by the slowly evolving potential (Fig. 4). Slower diffusion occurs subsequently, as spatial heterogeneity in the synaptic architecture gradually responds to changes in bump position via STF. Stabilizing effects of STF on bump attractors have been analyzed previously [28], but our identification of these multiple timescale dynamics is novel. This feature of the bump dynamics is not present in networks with static synapses (Fig. 7D). Here, bumps evolve as a noise-driven particle over a flat potential landscape (Fig. 7E), described by Brownian motion: a memoryless stochastic process [41,46]. Variance in bump position scales purely linearly with time (Fig. 7F), and the diffusion coefficient can be computed using a low-dimensional approximation [47].
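The two-phase variance growth can be illustrated with a toy version of the reduced model, in which the potential minimum slowly tracks the bump; all parameters below are illustrative:

```python
import numpy as np

def bump_variance(with_stf, T=5000.0, dt=0.1, sigma=0.5, A=0.05,
                  tau=1000.0, n_trials=2000, rng=None):
    """Across-trial variance of bump position over time. With STF, a slowly
    tracking potential minimum theta_q re-centers the bump, giving fast then
    slow variance growth; without it, growth is linear (pure diffusion)."""
    rng = rng or np.random.default_rng(0)
    theta = np.zeros(n_trials)
    theta_q = np.zeros(n_trials)
    var = np.empty(int(T / dt))
    for k in range(var.size):
        drift = A * np.sin(np.radians(theta_q - theta)) if with_stf else 0.0
        theta = theta + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_trials)
        if with_stf:
            theta_q += (theta - theta_q) / tau * dt
        var[k] = theta.var()
    return var  # compare bump_variance(True) with bump_variance(False)
```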
The qualitative differences between bump attractors with and without dynamic synapses should be detectable in both behavioral and neurophysiological recordings [6]. Moreover, the intertrial bias identified in recent analyses of behavioral data requires some mechanism, distinct from neural activity, for transferring information between trials [13], a role played by dynamic synapses in our model. In total, our model provides both an intuition for the behavioral motivation and the neurophysiological mechanisms that produce such interference.

Discussion
Neural circuit models of visual working memory tend to use neural activity variables as the encoders of target locations. Our computational models account for interference in visual working memory using both suboptimal Bayesian inference and STF acting on a recurrent network model of delay-period activity. The timescale and prior-target dependence of the attractive biases we observe correspond to psychophysical observations of behavioral experiments in monkeys [13]. STF evolves dynamically over seconds [45,50], matching the kinetics of interference in visual working memory. The link we have drawn between our two models suggests neural circuits can implement probabilistic inference using short-term plasticity.
Experimental predictions. More complete descriptions of the neural mechanics of visual working memory could be accomplished by analyzing the effects of correlations in sequential target presentations. Since responses in subsequent trials are shaped by the previous trial's target [13], computational models can be validated by determining how well their response distributions reflect trial-to-trial target correlations (Fig. 6). It is also possible that the introduction of target sequences whose distributions change in time could impact quantitative features of interference. For instance, implementing tasks with target sequences that have multiple-trial correlations may extend the timescale of interference beyond a single trial. Furthermore, our model predicts that multiple timescales emerge in the statistics of delay-period activity during a working memory task (Fig. 7). Variance of recall error increases sublinearly in our model, consistent with a recent reanalysis of psychophysical data of saccades to remembered visual targets [4,51]. The dynamics of our model are thus inconsistent with the purely linear diffusion of recall error common in bump attractor models with static synapses [6,9].
The idea that STF may play a role in working memory is not new [27,52], and there is evidence that prefrontal cortex neurons exhibit dynamic patterns of activity during the delay-period, suggestive of an underlying modulatory process [53]. However, it remains unclear how the presence of STF may shape the encoding of working memories. Our model suggests STF can transfer attractive biases between trials. Recent findings on the biophysics of STF could be harnessed to examine how blocking STF shapes behavioral biases in monkey experiments [54]. We predict that reducing the effects of STF would both decrease the systematic bias in responses and increase the amplitude of errors, since the stabilizing effect of STF on the persistent activity would be diminished [28].

Figure 6. Response distribution is shaped by the dependence P(θ_{n+1}|θ_n) between target angles in adjacent trials. (A) Visual working memory protocols typically use a sequence of targets with no trial-to-trial dependence (uniform P(θ_{n+1}, θ_n), shown for θ_n ≡ 0°) [6,9]. Relative responses (r_n − θ_n) are normally distributed about the true target. (B) The current target θ_{n+1} depends on the previous target θ_n according to a locally peaked distribution. The response distribution narrows (note the decreased standard deviation σ), since the target θ_{n+1} is often close to the previous target θ_n. (C) The current target θ_{n+1} is skewed clockwise from the previous angle θ_n. Responses are thus skewed counter-clockwise towards the previous target (note the average response r̄ is shifted). Numerical methods are described in Methods.

Comparison with previous work. The work of Papadimitriou et al. [13,55] also contains modeling studies accounting for some aspects of their experimental observations. Our computational model differs from and extends their findings in several important ways. We propose that interference can arise from a suboptimal inference process, which can be implemented by concrete synaptic mechanisms. This conclusion can only be drawn from a tractable model, allowing us to reduce our recurrent network's dynamics to the low-dimensional system, Eq. (6). Furthermore, Papadimitriou et al. [13] employ a two-store model of memory that is not linked to a specific physiological mechanism, whereas we propose STF and use a well-tested model of its kinetics [56]. Lastly, Papadimitriou et al. [55] present recorded data from the frontal eye fields showing no firing-rate tuning to the previous target at the onset of the current target. While this observation contradicts their purely activity-based description of the bias proposed in Papadimitriou et al. [13], it is not an issue for the STF-based bias we propose here. The mechanism we propose is gradual and initially silent within the current trial, revealing its effects toward the end of the delay-period, so it is consistent with the findings of Papadimitriou et al. [55].

Alternative neurophysiological mechanisms for intertrial bias.
Our study accounts for the biases observed by Papadimitriou et al. [13], who identified an attraction between the previous target and the current response. Strengthening synapses that originate from recently active neurons can attract neural activity states in subsequent trials. This is consistent with recent experiments showing latent, "activity-silent" working memories can be reactivated in humans using transcranial magnetic stimulation [57], suggesting working memory is maintained by mechanisms other than target-tuned persistent neural activity [27,53]. The principle of using short-term plasticity to store memories of visual working memory targets could be extended to account for longer timescales and more intricate statistical structures. Short-term depression (STD) could effect a repulsive bias on subsequent responses, since neural activity would be less likely to persist in recently-activated, depressed regions of the network [58]. In this way, STD could encode a predictive distribution for targets that are anti-correlated with the previously presented target.
Other physiological mechanisms could also shape network responses to encode past observations in a predictive distribution. Long-term plasticity is a more viable candidate for encoding predictive distributions that accumulate observations over long timescales. Consider a protocol that uses the same distribution of target angles throughout an entire experiment, but with this distribution biased towards a discrete set of possible angles [42]. For a recurrent network to represent this distribution, it must retain information about past target presentations over a long timescale. Many biophysical processes underlying plasticity are slow enough to encode information from such lengthy sequences [59,60]. Furthermore, the distributed nature of working memory suggests that there may be brain regions whose task-relevant neural activity partially persists from one trial to the next [61]. Such activity could shape low-level sensory interpretations of targets in subsequent trials.

Synaptic plasticity can stabilize working memory. Several modeling studies of working memory have focused on the computational capability of synaptic dynamics [62]. For instance, STF can prolong the lifetime of working memories in spatially heterogeneous networks, since facilitated synapses slow the systematic drift of bump attractor states [28,63]. This is related to our finding that STF reduces the diffusion of bumps in response to dynamic fluctuations (Fig. 7B), generating two timescales of memory degradation, as seen in the bump variance (Fig. 7C). This scaling may be detectable in neural recordings or behavioral data, since recall errors may saturate if stabilized by STF. Facilitation can also account for experimentally observed increases in spike-train irregularity during the delay-period in circuit models that support tuned persistent activity [64]. Alternatively, homeostatic synaptic scaling can compensate for spatial heterogeneity, which would otherwise cause persistent states to drift [10]. However, the short homeostatic timescales often suggested in models rarely match experimental observations [65].
Models of working memory have also replaced persistent neural firing with stimulus-selective STF, so that neuronal spiking is only required for recall at the end of the delay-period [27]. One advantage of this model is that multiple items can be stored in the dynamic efficacy of synapses, and the item capacity can be regulated by external excitation to meet different task load demands [29]. Our model proposes that STF plays a supporting rather than a primary role, and there is extensive neurophysiological evidence corroborating persistent neural activity as a primary working memory encoder [6,66].

Robust working memory via excitatory/inhibitory balance. Computational modeling studies have demonstrated that a balance of fast inhibition and slow excitation can stabilize networks, so they accurately integrate inputs [40,46,67]. Drift in the representation of a continuous parameter can be reduced by incorporating negative-derivative feedback into the firing rate dynamics of a network, similar to introducing strong friction into the mechanics of particle motion on a sloped landscape [68]. Fast inhibition balanced by slower excitation produces negative feedback that is proportional to the time-derivative of population activity. A related mechanism can be implemented in spiking networks wherein fast inhibition rapidly prevents runaway excitation, and the resulting network still elicits the highly irregular activity characteristic of cortical population discharges [69]. Mutually inhibiting balanced networks are similarly capable of representing working memory of continuous parameters [70], and extending our framework by incorporating STF into this paradigm would be a fruitful direction of future study.
Extensions to multi-item working memory. Working memory can store multiple items at once, and the neural mechanisms of interference between simultaneously stored items are the focus of ongoing work [71,72]. While there is consensus that working memory is a limited resource allocated across stored items, controversy remains over whether resource allocation is quantized (e.g., slots) [73,74] or continuous (e.g., fluid) [71,75]. Spatially-organized neural circuit models can recapitulate inter-item biases observed in multi-item working memory experiments, and provide a theory for how network interactions produce such errors [76,77]. In these models, each remembered item corresponds to an activity bump, and the spatial scale of lateral inhibition determines the relationship between recall error and item number [78]. The model provides a theory for attractive bias and the forgetting of items, since nearby activity bumps merge with one another. This is related to the mechanism of attractive bias in our model, but a significant difference is that previous models only required localized excitation, whereas we use STF. It would be interesting to identify the temporal dynamics of biases in multi-item working memory to see if they point to slower-timescale processes like short-term plasticity.
Tuning short-term plasticity to the environmental timescale. We have not identified a mechanism whereby our network model's timescale of inference could be tuned to learn the inherent timescale of the environment. There is recent evidence from decision-making experiments that humans can learn the timescale on which their environment changes and use this information to weight their observations toward a decision [21,79]. Our model suggests that the trial-history inference utilized by subjects in Papadimitriou et al. [13] is significantly suboptimal, perhaps because it is difficult to infer the timescale of relevant past-trial information. The complexity, sensitivity, and resource expense of optimal inference likely make it impossible to implement exactly in neural circuits [80,81]. This may explain why humans often use suboptimal methods for accumulating evidence [21,23,82]. Plasticity processes that determine the timescale of evidence accumulation may be shaped across generations by evolution, or across a lifetime of development. Nonetheless, metaplasticity processes can internally tune the dynamics of plasticity responses in networks without changing synaptic efficacy itself, and these changes could occur in a reward-dependent way [83,84]. Recently, a model of reward-based metaplasticity was proposed to account for the adaptive learning observed in a probabilistic reversal learning task [85]. Such a process could modify the timescale and other features of short-term plasticity in ways that improve task performance in working memory as well.

Conclusions
Our results suggest that interference observed in visual working memory tasks can be accounted for by a persistently active neural circuit with STF. Importantly, interference is graded by the time between trials and the time within a trial. The interplay of synaptic and neural processes involved in interference may have arisen as a robust system for processing visual information that changes on the timescale of seconds. More work is needed to determine how information about the environment stretches across multiple timescales to shape responses in cognitive tasks. We expect that identifying the neural origin of such biases will improve our understanding of how working memory fits into the brain's information-processing hierarchy.

Methods
Assumptions of the inference model. Our model performs nonparametric density estimation to approximate the distribution s_{n+1}(θ) from which a target θ will be drawn in trial n + 1. The observer assumes the possible distributions s(θ) are drawn from a function space s ∈ S according to the prior p(s). We assume that marginalizing over this space of distributions yields the uniform density, ∫_S s(θ) p(s) ds = P_0. One possibility is that the distribution s_{n+1}(θ) is constructed by drawing N-tuples a and ψ from uniform distributions over the hypercubes [0, a_max]^N and [−180°, 180°)^N and using the entries in an exponential distribution of a sum of cosines:

s(θ) = (1/Z_s) exp[ Σ_{j=1}^{N} a_j cos(ω_j(θ − ψ_j)) ],

where ω_j = jπ/180 and Z_s is a normalization constant. For instance, when N = 1, s(θ) is a von Mises-like density with a single peak at ψ_1. The probability f_{θ′}(θ) is defined under static conditions (s_{n+1}(θ) ≡ s_n(θ)) to separate out the dynamic effects of changes in the sampling distribution s_n(θ). We are performing nonparametric Bayesian estimation of the distribution, and the probability f_{θ′}(θ) is already marginalized over the space of distributions s(θ). Thus, we do not model the intermediate step of inferring the probability of each distribution s(θ) and marginalizing, but it could be computed by integrating over the prior on the function space:

f_{θ′}(θ) = ∫_S s(θ) f(s|θ′) ds.

Each observation θ′ would give the probability f(s|θ′) that the current distribution is s(θ). Integrating over the space of all distributions s ∈ S provides the probability that the next target will be θ, based on the previous observation θ′ alone and the assumption that the distribution remains the same from trial n to n + 1. Further details on the difference between parametric and nonparametric Bayesian estimation of densities can be found in Orbanz and Teh [86]. Note, we assume self-conjugacy, f_{θ′}(θ) = f_θ(θ′), which follows since the order of observations does not matter while the environment remains fixed. This relationship will also make the predictiveness of our model more apparent. It is important to note that the observer assumes the form of f_{θ′}(θ), but this is not necessarily the distribution an ideal observer should use. For illustration, we consider a family of distributions given by an exponential of cosines [87]:

f_{θ′}(θ) = (1/Z_f) exp[ Σ_{j=1}^{N} ā_j cos(ω_j(θ − θ′)) ].   (7)

A distribution like Eq. (7) would emerge from a generative model with distance-dependent spatial correlations in the ensemble of produced targets. The example f_{θ′}(θ) we use for comparison with our recurrent network with STF is close to the case of Eq. (7) with N = 1. A description of the parameters and variables in our model is provided in Table 1.
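A quick numerical check of the N = 1 case of Eq. (7) (the κ value is illustrative), verifying normalization and the self-conjugacy property used above:

```python
import numpy as np

grid = np.arange(-180.0, 180.0, 0.5)

def f(theta, theta_prev, kappa=2.0):
    """N = 1 case of Eq. (7): exponential of a single cosine."""
    Z = np.trapz(np.exp(kappa * np.cos(np.radians(grid))), dx=0.5)
    return np.exp(kappa * np.cos(np.radians(theta - theta_prev))) / Z

# Self-conjugacy f_{theta'}(theta) = f_{theta}(theta') holds because the
# density depends only on the circular distance between the two angles:
assert np.isclose(f(30.0, -45.0), f(-45.0, 30.0))
# The density integrates to one over [-180, 180):
assert np.isclose(np.trapz(f(grid, 0.0), dx=0.5), 1.0, atol=1e-3)
```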
Derivation of the probabilistic inference model. The observer's predictive distribution L_{n+1,θ} = P(θ_{n+1} = θ|θ_{1:n}, ε) is derived by computing the probability of observing θ_{n+1} given each prior observation θ_j, j = 1, …, n. Importantly, we must compute the probability of each run length l_n = l, l = 0, …, n, corresponding to the number of trials the assumed underlying distribution s_n(θ) has remained the same [30,32]. Knowing the probability of each run length informs us of how much to weight each observation θ_j, j = 1, …, n. In particular, l_n = n indicates the environment has remained the same since the first trial, and l_n = 0 indicates the environment changes between trials n and n + 1. Summing over all possible run lengths, the marginal predictive distribution is

L_{n+1,θ} = Σ_{l=0}^{n} P(θ_{n+1} = θ | l_n = l, θ_{1:n}) P(l_n = l | θ_{1:n}),   (8)

where P(θ_{n+1} = θ | l_n = l, θ_{1:n}) is the conditional predictive distribution assuming run length l_n = l, with the special case P(θ_{n+1} = θ | l_n = 0, θ_{1:n}) = P_0 (the uniform distribution), and P(l_n = l | θ_{1:n}) is the conditional probability of the run length l_n = l given the series of target angles θ_{1:n}.

Table 1. Parameters and variables of the inference model.
θ_n: target observed on trial n
s_{n+1}(θ): unknown target distribution the observer attempts to infer before trial n + 1
f_{θ_n}(θ): probability of observing target θ in the next trial given θ_n was observed previously, and the environment has not changed since
ε: assumed environmental change rate used to discount evidence
L_{n+1,θ}: predictive distribution: inferred probability of seeing target θ in trial n + 1, given past observations θ_{1:n} and assumed change rate ε
l_n: run length: number of trials prior to trial n + 1 with the same distribution
P_0: uniform prior: distribution assigning each θ ∈ [−180°, 180°) the same probability

We further simplify Eq. (8) as follows. First, utilizing sequential analysis, we find that if the present run length is l_n = l, the conditional predictive distribution is given by the normalized product of probabilities from the last l observations [22]:

P(θ_{n+1} = θ | l_n = l, θ_{1:n}) = (1/Z_l) Π_{j=n−l+1}^{n} f_{θ_j}(θ),   (9)

where Z_l is a normalization constant, and we have utilized our self-conjugacy assumption f_{θ′}(θ) ≡ f_θ(θ′). Next, we assume that observations provide no information about the present run length l_n, which would be a consequence of the observer making no a priori assumptions about the overall distribution from which the targets θ_{1:n} are drawn. Thus, the observer only uses their knowledge of the change rate of the environment ε to determine the probability of a given run length l_n = l, and the conditional probability can be computed as

P(l_n = l | θ_{1:n}) = P(l_n = l) = ε(1 − ε)^l for l < n, and P(l_n = n) = (1 − ε)^n.   (10)

Plugging Eqs (9) and (10) into the update Eq. (8), we find that the probability of the next target being at angle θ_{n+1} = θ, given that the previous n targets were θ_{1:n}, is

L_{n+1,θ} = ε P_0 + Σ_{l=1}^{n−1} ε(1 − ε)^l (1/Z_l) Π_{j=n−l+1}^{n} f_{θ_j}(θ) + (1 − ε)^n (1/Z_n) Π_{j=1}^{n} f_{θ_j}(θ).   (1)

The l = 0 term is uniform because the unconditioned probability of a particular target location is the uniform density P_0. In the limit of a slowly-changing environment (ε ≈ 0), truncating Eq. (1) to O(ε) yields the static update, Eq. (2), plus a small correction that discounts older observations.

Limit of rapidly-changing environment (ε ≈ 1). Here, we examine the case ε ≈ 1 (0 < 1 − ε ≪ 1), a rapidly-changing environment. Applying this assumption to Eq. (1), we find L_{n+1,θ} is dominated by terms of order (1 − ε) and larger; terms of order (1 − ε)^2 are much smaller. For instance, we can approximate to linear order, dropping terms of O((1 − ε)^2):

L_{n+1,θ} ≈ ε[P_0 + (1 − ε) f_{θ_n}(θ)].   (11)

Furthermore, we ensure the expression in Eq. (11) is normalized by dividing by its total mass ε(2 − ε), yielding

L_{n+1,θ} = [P_0 + (1 − ε) f_{θ_n}(θ)]/(2 − ε),   (3)

(Figs 2 and 3A). Higher-order approximations are obtained by keeping more terms from Eq. (1); e.g., a second-order approximation additionally retains the run-length-two term proportional to (1 − ε)² f_{θ_n}(θ) f_{θ_{n−1}}(θ).
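As a sanity check on Eq. (10), the run-length probabilities form a proper prior (they sum to one for any n and ε); a minimal verification:

```python
import numpy as np

def run_length_probs(n, eps):
    """Eq. (10): P(l) = eps*(1-eps)**l for l < n, and P(n) = (1-eps)**n."""
    return np.array([eps * (1 - eps) ** l for l in range(n)] + [(1 - eps) ** n])

p = run_length_probs(n=5, eps=0.8)
assert np.isclose(p.sum(), 1.0)   # a proper prior over run lengths
# For eps near 1, the mass concentrates on l = 0 and l = 1, which is why
# only the previous target survives in the truncation leading to Eq. (3).
```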
To relate the attractor model to this inference process, we consider the probability density p_{n+1}(θ, t) of the stored angle, which evolves according to the Fokker-Planck equation associated with Eq. (4):

∂p_{n+1}(θ, t)/∂t = ∂/∂θ[U′_{n+1}(θ) p_{n+1}(θ, t)] + (σ²/2) ∂²p_{n+1}(θ, t)/∂θ²,   (12)

where p_{n+1}(θ, t) is the probability density corresponding to the target angle estimate θ at time t. The initial estimate of the target is exact, θ(0) = θ_{n+1}, so p_{n+1}(θ, 0) = δ(θ − θ_{n+1}) is the initial condition. We summarize the constituent variables and model parameters in Table 2.
We now derive the form of U_{n+1}(θ) that leads to a stationary density corresponding to the predictive distribution L_{n+1,θ} in the limit t → ∞ of Eq. (12). The stationary density p̄_{n+1}(θ) is analogous to a predictive distribution represented by Eq. (4), since it is the probability the system represents when no information about the current trial's target θ_{n+1} remains. Thus, we build a rule to update U_{n+1}(θ) that mirrors the update of L_{n+1,θ} in Eq. (3). To obtain this result, we match the stationary density of Eq. (12) to the updated predictive distribution, p̄_{n+1}(θ) = L_{n+1,θ}. Solving Eq. (12) for its stationary solution, we find that during trial n + 1 the stationary density is p̄_{n+1}(θ) ∝ exp[−2U_{n+1}(θ)/σ²], so the matching condition requires U_{n+1}(θ) = −(σ²/2) ln L_{n+1,θ} + const.

Recurrent network model with STF. The variables and parameters of the recurrent network model are summarized in Table 3, and the evolution equations consist of one stochastic integrodifferential equation and one auxiliary differential equation:

τ_u du(x, t) = [−u(x, t) + ∫_{−180}^{180} w(x − y)(1 + q(y, t)) F(u(y, t)) dy + I(x, t)] dt + dW(x, t),
∂q(x, t)/∂t = −q(x, t)/τ + β(q_+ − q(x, t)) F(u(x, t)),   (15)

where u(x, t) describes the evolution of the normalized synaptic input at location x. The model Eq. (15) can be derived as the large-system-size limit of a population of synaptically coupled spiking neurons [89], and similar dynamics have been validated in spiking networks with lateral inhibitory connectivity [6,9]. We fix the timescale of the dynamics by setting τ_u = 10 ms, so time evolves according to units of a typical excitatory synaptic time constant [90]. This population rate model can be explicitly analyzed to link the architecture of the network to a low-dimensional description of the dynamics of a bump attractor, as described by Eq. (4).
Each location x in the network receives recurrent coupling defined by the weight function w(x − y) via the convolution ∫ w(x − y) F(u(y, t)) dy. We take this function to be peaked when x = y and decreasing as the distance |x − y| grows, in line with anatomical studies of delay-period neurons in prefrontal cortex [8]. We do not separately model excitatory and inhibitory populations, but Eq. (15) can be derived from a model with distinct excitatory and inhibitory populations in the limit of fast inhibitory synapses [43,67]. Thus, we have combined the excitatory and inhibitory populations, so w(x − y) takes on both positive and negative values. Our analysis can be applied to a general class of distance-dependent connectivity functions, given by an arbitrary sum of cosines

w(x − y) = Σ_n w_n cos(ω_n(x − y)),

where ω_n = nπ/180, and we use a single cosine to illustrate in examples: w(x − y) = cos(ω_1(x − y)). The nonlinearity F(u) converts the normalized synaptic input u(x, t) into a normalized firing rate, F(u) ∈ [0, 1]. We take this to be the sigmoid F(u) = 1/[1 + e^{−γ(u−κ)}] [91], with gain γ = 20 and threshold κ = 0.1 in numerical simulations. In the high-gain limit (γ → ∞), a Heaviside step function F(u) = H(u − κ) allows for explicit calculations [43,89].
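A compact sketch of Eq. (15) under simplifying assumptions: spatially white noise stands in for the correlated process C(x − y), the noise amplitude is illustrative, and the single-cosine weight function is used:

```python
import numpy as np

dx = 1.8
x = np.arange(-180.0, 180.0, dx)                       # preferred angles (deg)
W = np.cos(np.radians(x[:, None] - x[None, :]))        # w(x - y) = cos(x - y)
F = lambda u: 1.0 / (1.0 + np.exp(-20.0 * (u - 0.1)))  # sigmoid: gamma=20, kappa=0.1

def simulate(I_func, T, dt=0.1, tau_u=10.0, tau_q=1000.0,
             beta=0.01, q_plus=2.0, sigma=0.002, rng=None):
    """Euler-Maruyama integration of the rate model with STF, Eq. (15)."""
    rng = rng or np.random.default_rng()
    u = np.zeros(x.size)
    q = np.zeros(x.size)
    dy = np.radians(dx)                                # integration measure
    for k in range(int(T / dt)):
        Fu = F(u)
        rec = W @ ((1.0 + q) * Fu) * dy                # int w(x-y)(1+q)F(u) dy
        u += (-u + rec + I_func(x, k * dt)) / tau_u * dt \
             + sigma * np.sqrt(dt) * rng.standard_normal(x.size)
        q += (-q / tau_q + beta * (q_plus - q) * Fu) * dt
    return u, q

readout = lambda u: x[np.argmax(u)]   # bump-center readout (cf. Methods)
```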
Recurrent coupling is shaped by STF in active regions of the network (F(u) > 0), as described by the variable q(x, t) ∈ [0, q_+]; q_+ > 0 and β determine the maximal increase in synaptic utilization and the rate at which facilitation occurs [26,56]. For our numerical simulations, we consider the parameter values q_+ = 2 and β = 0.01, consistent with previous models employing facilitation in working memory circuits [27-29] and experimental findings on facilitation responses in prefrontal cortex [45,50]. The timescale of plasticity is slow, τ = 1000 ms ≫ τ_u = 10 ms, consistent with experimental measurements [26]. Our qualitative results are robust to parameter changes. Information from the previous trial is maintained by the slowly-decaying kinetics of the facilitation variable q(x, t), even in the absence of neural activity [27,29].
The effects of the target and the response are described by the deterministic spatiotemporal input I(x, t), which we discuss in more detail below. The noise process W(x, t) is white in time and has an increment with mean ⟨dW(x, t)⟩ ≡ 0 and spatial correlation function ⟨dW(x, t)dW(y, s)⟩ = C(x − y)δ(t − s)dtds. In numerical simulations, we take the correlation function C(x − y) to be distance-dependent with amplitude σ_W = 0.005, so the model recapitulates the typical 1-5% standard deviation in saccade endpoints observed in oculomotor delayed-response tasks with delay-periods from 1-10 s [1,4,6].
Implementing the sequential delayed-response task protocol. A series of oculomotor delayed-response tasks is executed by the network Eq. (15) by specifying a schedule of peaked inputs occurring during the cue periods of length T_C; no input during trial n's delay-period of length T_D^n; a brief and strong inhibitory input of length T_A after the response has been recorded; and then no input until the next trial. This is described by the spatiotemporal function

I(x, t) = I_0 exp[I_1(cos(x − θ_n) − 1)] for t ∈ [t_n, t_n + T_C); 0 for t ∈ [t_n + T_C, t_n + T_C + T_D^n); −I_R for t ∈ [t_n + T_C + T_D^n, t_n + T_C + T_D^n + T_A); and 0 otherwise,

for all n = 1, 2, 3, …, where t_n is the starting time of the n-th trial, which has cue period T_C, delay-period T_D^n, inactivation period T_A, and subsequent intertrial interval T_I^n. Note that the delay and intertrial interval times may vary from trial to trial, but the cue is always presented for the same period of time, as in Papadimitriou et al. [13]. The amplitude of the cue-related stimulus is controlled by I_0, and I_1 controls its sharpness. Activity from trial n is ceased by the global inactivating stimulus of amplitude I_R.
In numerical simulations, we fix the parameters T_C = 150 ms; T_A = 500 ms; I_0 = I_1 = 1; and I_R = 2. Target locations θ_n are drawn from a uniform probability mass function (pmf) over the discrete set of angles θ_n ∈ {−180°, −162°, …, 162°} to generate the statistics in Fig. 5A, which adequately resolves the bias-effect curves for comparison with the results of Papadimitriou et al. [13]. The peak bias in Fig. 5B is computed by varying the intertrial interval T_I^n, drawing θ_n randomly as in Fig. 5A and identifying the θ_n that produces the maximal bias for each value of T_I^n. Delay-periods are varied to produce Fig. 5C by drawing T_D^n randomly from a uniform pmf over the discrete set of times T_D^n ∈ {0, 200, …, 5000} ms for fixed θ_n. Correlated target sequences in Fig. 6 are generated with ε = 0.5; μ = 0 for local correlations (Fig. 6B) and μ = 90° for skewed correlations (Fig. 6C).
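For illustration, a stimulus-schedule generator compatible with the simulate() sketch above; epoch lengths are held fixed across trials to keep the sketch short (the actual protocol varies T_D^n and T_I^n), and the von-Mises-like cue profile is an assumed form consistent with I_0 (amplitude) and I_1 (sharpness):

```python
import numpy as np

def make_stimulus(thetas, T_C=150.0, T_D=2000.0, T_A=500.0, T_I=1000.0,
                  I0=1.0, I1=1.0, IR=2.0):
    """Return I(x, t) implementing the cue/delay/inactivation/intertrial
    schedule for a list of target angles thetas (degrees)."""
    T_trial = T_C + T_D + T_A + T_I
    def I(x, t):
        n = int(t // T_trial)
        if n >= len(thetas):
            return np.zeros_like(x)
        phase = t - n * T_trial
        if phase < T_C:                 # cue: peaked input at theta_n
            return I0 * np.exp(I1 * (np.cos(np.radians(x - thetas[n])) - 1.0))
        if phase < T_C + T_D:           # delay: no input
            return np.zeros_like(x)
        if phase < T_C + T_D + T_A:     # inactivation: strong suppression
            return -IR * np.ones_like(x)
        return np.zeros_like(x)         # intertrial interval
    return I

# e.g., stim = make_stimulus([60.0, -30.0]); u, q = simulate(stim, T=7300.0)
```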
The recurrent network, Eq. (15), is assumed to encode the initial target θ_n during trial n via the center-of-mass θ(t) of the corresponding bump attractor. The representation of the cue at the end of the trial is determined by performing a readout of the neural activity u(x, t) at the end of the delay time for trial n, t = t_n + T_C + T_D^n. One way of doing this would be to compute a circular mean over x weighted by u(x, t), but since u(x, t) is a roughly symmetric and peaked function of x, computing θ(t) := argmax_x u(x, t) is an accurate and efficient approximation [6,42]. The bias and relative saccade endpoint on each trial n are then determined by computing the difference θ(t) − θ_n (Figs 5, 6 and 7).
Deriving the low-dimensional description of bump motion. We analyze the mechanisms by which STF shapes the bias in subsequent trials by deriving a low-dimensional description of the motion of the bump position θ(t). To begin, note that in the absence of facilitation (β ≡ 0), the variable q(x, t) ≡ 0, and in the absence of noise (W(x, t) ≡ 0), the resulting deterministic version of Eq. (15) has stationary bump solutions U(x) that are well studied and defined by the implicit equation [43]

U(x) = ∫_{−180}^{180} w(x − y) F(U(y)) dy.

Assuming the stimulus I(x, t) presented during the cue period of trial n (t ∈ [t_n, t_n + T_C)) is strong enough to form a stationary bump solution, the impact of the facilitation variable q(x, t) and the noise W(x, t) on u(x, t) during the delay-period (t ∈ [t_n + T_C, t_n + T_C + T_D^n)) can be determined perturbatively, assuming |q| ≪ 1 and |dW| ≪ 1. Since τ_u ≪ τ, u(x, t) rapidly equilibrates to a quasi-steady-state determined by the profile of q(x, t). We thus approximate the neural activity dynamics as u(x, t) ≈ U(x − θ(t)) + Φ(x, t), where θ(t) describes the dynamics of the bump center-of-mass during the delay-period (|θ̇| ≪ 1), and Φ(x, t) describes perturbations to the bump's shape (|Φ| ≪ 1). Plugging this approximation into Eq. (15) and truncating to linear order yields a linear equation for Φ(x, t), Eq. (16). From numerical simulations, we know that the synaptic input variable remains finite, so any terms in the approximation u ≈ U + Φ should also be bounded, including Φ(x, t). We therefore obtain a bounded solution to Eq. (16) by requiring that its right-hand side be orthogonal to the nullspace V(x) of the adjoint linear operator. The resulting solvability condition yields the low-dimensional system, Eq. (6), in which the facilitation profile inherited from trial n remains centered at θ_n and decays as α e^{−(t − t_{n+1})/τ}, where α is a scaling constant and t_{n+1} is the starting time of trial n + 1 in the original time units t = τt_s/τ_u. The form of the probability f_{θ′}(θ) that can be represented is therefore restricted by the dynamics of the facilitation variable q(x, t). We can perform a direct calculation to identify how q(x, t) relates to the predictive distribution it represents in the following special case.
The function d(θ − θ_q), equal to θ − θ_q when |θ − θ_q| ≤ 180° and to −sign(θ − θ_q)(360° − |θ − θ_q|) otherwise, calculates the shorter difference on the periodic domain. As in our recurrent network, we use the parameters κ = 0.1; q_+ = 2; β = 0.01; and τ/τ_u = 100 to compare with network simulations in Fig. 5.
Numerical simulations of the neuronal network model. Numerical simulations of the recurrent network, Eq. (15), were performed in MATLAB using an Euler-Maruyama method with timestep dt = 0.1 ms and spatial step dx = 0.18°. Initial conditions were generated randomly by starting with u(x, 0) ≡ q(x, 0) ≡ 0 and allowing the system to evolve in response to the dynamic fluctuations for t = 2 s prior to applying the sequence of stimuli I(x, t) described for each numerical experiment in Figs 5, 6 and 7. Numerical simulations of Eq. (6) were also performed using an Euler-Maruyama method with timestep dt = 0.1 ms. The effects of the target θ_n on each trial n were incorporated by holding θ(t) = θ_n during the cue period t ∈ [t_n, t_n + T_C). Otherwise, the dynamics were allowed to evolve as described.
Data analysis. MATLAB was used for statistical analysis of all numerical simulations. The bias effects in Fig. 5 were determined by identifying the centroid of the bump at the end of the delay-period. Means were computed across 10^5 simulations each, and standard deviations were determined by taking the square root of the var command applied to the vector of endpoints. Histograms in Fig. 6 were computed from 10^5 simulations using the hist and bar commands applied to the vector of endpoints for each correlation condition. Bump positions in Fig. 7 were computed by determining the centroid of the bump at each timepoint, and 10^5 simulations were then used to determine the standard deviation and variance plots (using var again).