## Abstract

Sequential and persistent activity models are two prominent models of short-term memory in neural circuits. In persistent activity models, memories are represented in persistent or nearly persistent activity patterns across a population of neurons, whereas in sequential models, memories are represented dynamically by a sequential activity pattern across the population. Experimental evidence for both models has been reported previously. However, it has been unclear under what conditions these two qualitatively different types of solutions emerge in neural circuits. Here, we address this question by training recurrent neural networks on several short-term memory tasks under a wide range of circuit and task manipulations. We show that both sequential and nearly persistent solutions are part of a spectrum that emerges naturally in trained networks under different conditions. Our results help to clarify some seemingly contradictory experimental results on the existence of sequential versus persistent activity-based short-term memory mechanisms in the brain.


## Code availability

The code for reproducing the experiments and analyses reported in this article is available at https://github.com/eminorhan/recurrent-memory.

## Data availability

The raw simulation data used for generating each figure are available upon request.

## Additional information

**Publisher’s note:** Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Change history

### 25 January 2019

In the version of this article initially published online, the word “Experimental” in the abstract was misprinted as “Experimentalrep”. The extra letters have been removed, and the error has been corrected in the print, PDF and HTML versions of this article.

### 06 February 2019

The original and corrected figures are shown in the accompanying Publisher Correction.

## References

1. Fuster, J. M. & Alexander, G. E. Neuron activity related to short-term memory. *Science* **173**, 652–654 (1971).
2. Wang, X. J. Synaptic reverberation underlying mnemonic persistent activity. *Trends Neurosci.* **24**, 455–463 (2001).
3. Goldman, M. S. Memory without feedback in a neural network. *Neuron* **61**, 621–634 (2009).
4. Druckmann, S. & Chklovskii, D. B. Neural circuits underlying persistent representations despite time varying activity. *Curr. Biol.* **22**, 2095–2103 (2012).
5. Murray, J. D. et al. Stable population coding for working memory coexists with heterogeneous neural dynamics in prefrontal cortex. *Proc. Natl Acad. Sci. USA* **114**, 394–399 (2017).
6. Lundqvist, M., Herman, P. & Miller, E. K. Working memory: delay activity, yes! Persistent activity? Maybe not. *J. Neurosci.* **38**, 7013–7019 (2018).
7. Constantinidis, C. et al. Persistent spiking activity underlies working memory. *J. Neurosci.* **38**, 7020–7028 (2018).
8. Funahashi, S., Bruce, C. J. & Goldman-Rakic, P. S. Mnemonic coding of visual space in the monkey’s dorsolateral prefrontal cortex. *J. Neurophysiol.* **61**, 331–349 (1989).
9. Miller, E. K., Erickson, C. A. & Desimone, R. Neural mechanisms of visual working memory in prefrontal cortex of the macaque. *J. Neurosci.* **16**, 5154–5167 (1996).
10. Romo, R., Brody, C. D., Hernández, A. & Lemus, L. Neural correlates of parametric working memory in the prefrontal cortex. *Nature* **399**, 470–473 (1999).
11. Goard, M. J., Pho, G. N., Woodson, J. & Sur, M. Distinct roles of visual, parietal, and frontal motor cortices in memory-guided sensorimotor decisions. *eLife* **5**, e13764 (2016).
12. Guo, Z. V. et al. Maintenance of persistent activity in a frontal thalamocortical loop. *Nature* **545**, 181–186 (2017).
13. Baeg, E. H. et al. Dynamics of population code for working memory in the prefrontal cortex. *Neuron* **40**, 177–188 (2003).
14. Fujisawa, S., Amarasingham, A., Harrison, M. T. & Buzsáki, G. Behavior-dependent short-term assembly dynamics in the medial prefrontal cortex. *Nat. Neurosci.* **11**, 823–833 (2008).
15. MacDonald, C. J., Lepage, K. Q., Eden, U. T. & Eichenbaum, H. Hippocampal ‘time cells’ bridge the gap in memory for discontiguous events. *Neuron* **71**, 737–749 (2011).
16. Harvey, C. D., Coen, P. & Tank, D. W. Choice-specific sequences in parietal cortex during a virtual-navigation decision task. *Nature* **484**, 62–68 (2012).
17. Schmitt, L. I. et al. Thalamic amplification of cortical connectivity sustains attentional control. *Nature* **545**, 219–223 (2017).
18. Scott, B. B. et al. Fronto-parietal cortical circuits encode accumulated evidence with a diversity of timescales. *Neuron* **95**, 385–398 (2017).
19. Murray, J. D. et al. A hierarchy of intrinsic timescales across cortex. *Nat. Neurosci.* **17**, 1661–1663 (2014).
20. Runyan, C. A., Piasini, E., Panzeri, S. & Harvey, C. D. Distinct timescales of population coding across cortex. *Nature* **548**, 92–96 (2017).
21. Sussillo, D., Churchland, M. M., Kaufman, M. T. & Shenoy, K. V. A neural network that finds a naturalistic solution for the production of muscle activity. *Nat. Neurosci.* **18**, 1025–1033 (2015).
22. Cueva, C. J. & Wei, X. X. Emergence of grid-like representations by training recurrent neural networks to perform spatial localization. Preprint at https://arxiv.org/abs/1803.07770 (2018).
23. Banino, A. et al. Vector-based navigation using grid-like representations in artificial agents. *Nature* **557**, 429–433 (2018).
24. Wilken, P. & Ma, W. J. A detection theory account of change detection. *J. Vis.* **4**, 1120–1135 (2004).
25. Barron, A. R. Universal approximation bounds for superpositions of a sigmoidal function. *IEEE Trans. Inf. Theory* **39**, 930–945 (1993).
26. Zucker, R. S. & Regehr, W. G. Short-term synaptic plasticity. *Annu. Rev. Physiol.* **64**, 355–405 (2002).
27. Mongillo, G., Barak, O. & Tsodyks, M. Synaptic theory of working memory. *Science* **319**, 1543–1546 (2008).
28. Rose, N. S. et al. Reactivation of latent working memories with transcranial magnetic stimulation. *Science* **354**, 1136–1139 (2016).
29. Wolff, M. J., Jochim, J., Akyürek, E. G. & Stokes, M. G. Dynamic hidden states underlying working-memory-guided behavior. *Nat. Neurosci.* **20**, 864–871 (2017).
30. Hinton, G. E. & Plaut, D. C. Using fast weights to deblur old memories. *Proc. 9th Annual Conference of the Cognitive Science Society*, 177–186 (Erlbaum, 1987).
31. Sompolinsky, H. & Kanter, I. Temporal association in asymmetric neural networks. *Phys. Rev. Lett.* **57**, 2861–2864 (1986).
32. Fiete, I. R., Senn, W., Wang, C. Z. H. & Hahnloser, R. H. R. Spike-time-dependent plasticity and heterosynaptic competition organize networks to produce long scale-free sequences of neural activity. *Neuron* **65**, 563–576 (2010).
33. Klampfl, S. & Maass, W. Emergence of dynamic memory traces in cortical microcircuit models through STDP. *J. Neurosci.* **33**, 11515–11529 (2013).
34. Krumin, M., Lee, J. J., Harris, K. D. & Carandini, M. Decision and navigation in mouse parietal cortex. *eLife* **7**, e42583 (2018).
35. Rajan, K., Harvey, C. D. & Tank, D. W. Recurrent network models of sequence generation and memory. *Neuron* **90**, 128–142 (2016).
36. Orhan, A. E. & Ma, W. J. Efficient probabilistic inference in generic neural networks trained with non-probabilistic feedback. *Nat. Commun.* **8**, 138 (2017).
37. Ganguli, S., Huh, D. & Sompolinsky, H. Memory traces in dynamical systems. *Proc. Natl Acad. Sci. USA* **105**, 18970–18975 (2008).
38. Clevert, D. A., Unterthiner, T. & Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). Preprint at https://arxiv.org/abs/1511.07289 (2016).
39. Glorot, X., Bordes, A. & Bengio, Y. Deep sparse rectifier neural networks. In *Proc. 14th International Conference on Artificial Intelligence and Statistics* (2011).
40. Mante, V., Sussillo, D., Shenoy, K. V. & Newsome, W. T. Context-dependent computation by recurrent dynamics in prefrontal cortex. *Nature* **503**, 78–84 (2013).
41. Yamins, D. L. & DiCarlo, J. J. Using goal-driven deep learning models to understand sensory cortex. *Nat. Neurosci.* **19**, 356–365 (2016).
42. Wang, J., Narain, D., Hosseini, E. A. & Jazayeri, M. Flexible timing by temporal scaling of cortical responses. *Nat. Neurosci.* **21**, 102–110 (2018).
43. Kingma, D. P. & Ba, J. L. Adam: a method for stochastic optimization. Preprint at https://arxiv.org/abs/1412.6980 (2015).
44. Keshvari, S., van den Berg, R. & Ma, W. J. No evidence for an item limit in change detection. *PLoS Comput. Biol.* **9**, e1002927 (2013).

## Acknowledgements

This work was supported by grant no. R01EY020958 from the National Eye Institute. We thank the staff at the High-Performance Computing Cluster at New York University, especially S. Wang, for their help with troubleshooting.

## Author information

### Affiliations

#### Department of Neuroscience, Baylor College of Medicine, Houston, TX, USA

- A. Emin Orhan

#### Center for Neural Science, New York University, New York, NY, USA

- Wei Ji Ma

#### Department of Psychology, New York University, New York, NY, USA

- Wei Ji Ma


### Contributions

A.E.O. conceived the study and developed the research plan with input from W.J.M. In several iterations, A.E.O. performed the experiments and the analyses. A.E.O. and W.J.M. then discussed the results, which helped refine the experiments and the analyses. A.E.O. wrote the initial draft of the paper. A.E.O. and W.J.M. reviewed and edited later iterations of the paper.

### Competing interests

The authors declare no competing interests.

### Corresponding author

Correspondence to A. Emin Orhan.

## Integrated supplementary information

### Supplementary Fig. 1 Initial, untrained network dynamics for different (*λ*_{0}, *σ*_{0}) values.

The heat maps show the normalized responses of the recurrent units to a unit pulse delivered at time *t* = 0 to all units. Here, *λ*_{0} takes 10 uniformly spaced values between 0.8 and 0.98 (columns) and *σ*_{0} takes 10 uniformly spaced values between 0 and 0.4025 (rows).

### Supplementary Fig. 2 Normalized responses of the recurrent units in networks trained with strong initial network coupling and no regularization.

Each plot corresponds to an example trial from one of the six basic tasks. The SIs of the trials are indicated at the top of the plots. Trials are ordered by increasing SI from left to right. All trials shown here are from networks trained with *λ*_{0} = 0.96, *σ*_{0} = 0.313, *ρ* = 0. After training, all networks shown here achieved a test set performance within 25% of the optimal performance. In Supplementary Figs. 2–5, only the active recurrent units are shown.

### Supplementary Fig. 3 Normalized responses of the recurrent units in networks trained with weak initial network coupling and no regularization.

Each plot corresponds to an example trial from one of the six basic tasks. The SIs of the trials are indicated at the top of the plots. Trials are ordered by increasing SI from left to right. All trials shown here are from networks trained with *λ*_{0} = 0.96, *σ*_{0} = 0.134, *ρ* = 0. After training, all networks shown here achieved a test set performance within 50% of the optimal performance.

### Supplementary Fig. 4 Normalized responses of the recurrent units in networks trained with strong initial network coupling and strong regularization.

Each plot corresponds to an example trial from one of the six basic tasks. The SIs of the trials are indicated at the top of the plots. Trials are ordered by increasing SI from left to right. All trials shown here are from networks trained with *λ*_{0} = 0.96, *σ*_{0} = 0.313, *ρ* = 10^{−3}. After training, all networks shown here achieved a test set performance within 50% of the optimal performance.

### Supplementary Fig. 5 Normalized responses of the recurrent units in networks trained with weak initial network coupling and strong regularization.

Each plot corresponds to an example trial from one of the six basic tasks. The SIs of the trials are indicated at the top of the plots. Trials are ordered by increasing SI from left to right. All trials shown here are from networks trained with *λ*_{0} = 0.96, *σ*_{0} = 0.134, *ρ* = 10^{−3}. After training, all networks shown here achieved a test set performance within 50% of the optimal performance.

### Supplementary Fig. 6 Average normalized activity of recurrent units in an example network trained in the 2AFC task.

The network shown here was trained with *λ*_{0} = 0.96, *σ*_{0} = 0.313, *ρ* = 0. After training, the network achieved a test set performance within 0.1% of the optimal performance. As in ref. ^{16}, we divided the recurrent units into left-preferring and right-preferring ones based on whether they responded more strongly during correct left choices or during correct right choices. The upper panel shows the average normalized responses of the left-preferring units in the correct left and correct right trials, respectively. Similarly, the lower panel shows the average normalized responses of the right-preferring units in the correct left and correct right trials. As reported in ref. ^{16}, the trained network developed choice-specific sequences in the 2AFC task (cf. Fig. 2c in ref. ^{16}). Only the most active 150 units from each group are shown in this figure; as always, the original network contained 500 recurrent units. This figure also demonstrates that the sequences are consistent from trial to trial, since the sequential activity pattern does not disappear when the responses are averaged over multiple trials.

### Supplementary Fig. 7 A simplified model of recurrent dynamics.

A simplified model that only incorporated the ReLU nonlinearity and the mean recurrent connection weight profiles shown in the upper panel (with no fluctuations around the mean) qualitatively captured the difference between the emergent sequential vs. persistent activity patterns (lower panel, left and right plots, respectively). The networks simulated here had 500 recurrent units (only the most active 50 units are shown in the lower panel). All recurrent units received a unit pulse input at *t* = 0. The self-recurrence term in the recurrent connectivity matrix (not shown in the upper panel for clarity) was set to 1 in both cases. In the sequential case, the off-diagonal band was set to 0.09 in the forward direction and 0.01 in the backward direction, that is, *W*_{i,i−1} = 0.09 and *W*_{i−1,i} = 0.01. The recurrent units did not have a bias term and they did not receive any direct inputs during the trial other than the unit pulse injected at the beginning of the trial.

### Supplementary Fig. 8 Results for the clipped ReLU networks.

The clipped ReLU nonlinearity is similar to ReLU except that it is bounded above by a maximum value: that is, *f*(*x*) = clip(*x*, *r*_{min}, *r*_{max}), where *r*_{min} = 0 and *r*_{max} = 100. **a**, SI increased significantly with *σ*_{0}. Linear regression slope: 0.55 ± 0.28, *R*^{2} = 0.01 (two-sided Wald test, *n* = 280 experimental conditions, *p* = 0.049). In **a**–**c**, solid black lines are the linear fits and shaded regions are 95% confidence intervals for the linear regression. **b**, SI decreased significantly with *λ*_{0}. Linear regression slope: −3.87 ± 0.66, *R*^{2} = 0.11 (two-sided Wald test, *n* = 280 experimental conditions, *p* < 0.001). Note that this result differs from the corresponding result in the case of ReLU networks, where *λ*_{0} did not have a significant effect on the SI (Fig. 2c). **c**, SI decreased significantly with *ρ*. Linear regression slope: −418 ± 64, *R*^{2} = 0.13 (two-sided Wald test, *n* = 280 experimental conditions, *p* < 0.001). **d**, SI as a function of task. Overall, the ordering of the tasks by SI was similar to that obtained with the ReLU nonlinearity (Fig. 3a). However, note that training was substantially more difficult with the clipped ReLU nonlinearity than with the ReLU nonlinearity. Across all tasks and all conditions, ReLU networks had a training success rate (defined as reaching within 50% of the optimal performance) of ~60%, whereas the clipped ReLU networks had a training success rate of only ~9.3%. In particular, we were not able to successfully train any networks in the CD task and very few in the 2AFC task. As a consequence, some of the differences between the tasks ended up not being significant in the clipped ReLU case. Error bars represent mean ± standard errors across different hyperparameter settings. Exact sample sizes for the derived statistics shown in **d** are reported in Supplementary Table 1. **e**, **f**, Recurrent connection weight profiles (as in Fig. 6a–c) in conditions where SI > 4.8 and in conditions where SI < 3, respectively. The weights were smaller in magnitude in **f** because most of the low-SI networks were trained under strong regularization. Solid lines represent mean weights and shaded regions represent standard deviations of weights. Both means and standard deviations are averages over multiple networks.

### Supplementary Fig. 9 Changing the amount of input noise.

In these simulations, we set *ρ* = 0 and varied the gain of the input population(s), *g*. *g* = 1 corresponds to the original case reported in the main text; lower and higher values of *g* correspond to higher and lower amounts of input noise, respectively. **a**, Combined across all noise conditions, SI increased significantly with *σ*_{0}. Linear regression slope: 0.76 ± 0.08, *R*^{2} = 0.04 (two-sided Wald test, *n* = 2239 experimental conditions, *p* < 0.001). In **a**–**c**, solid black lines are the linear fits and shaded regions are 95% confidence intervals for the linear regression. **b**, *λ*_{0} did not have a significant effect on SI (two-sided Wald test, *n* = 2239 experimental conditions, *p* = 0.958). **c**, The input gain *g* slightly increased the SI. Linear regression slope: 0.04 ± 0.02, *R*^{2} = 0.003 (two-sided Wald test, *n* = 2239 experimental conditions, *p* = 0.003). **d**, Again, combined across all input noise levels, the ordering of the tasks by SI was similar to that obtained in the main set of experiments, where *g* = 1 (Fig. 3a). Error bars represent mean ± standard errors across different hyperparameter settings and noise levels. Exact sample sizes for the derived statistics shown in **d** are reported in Supplementary Table 1.

### Supplementary Fig. 10 Results for the lowest level of input noise (*g* = 2.5).

**a**, SI increased significantly with *σ*_{0}. Linear regression slope: 0.76 ± 0.18, *R*^{2} = 0.05 (two-sided Wald test, *n* = 365 experimental conditions, *p* < 0.001). In **a**, **b**, solid black lines are the linear fits and shaded regions are 95% confidence intervals for the linear regression. **b**, *λ*_{0} did not have a significant effect on SI (two-sided Wald test, *n* = 365 experimental conditions, *p* = 0.253). **c**, The ordering of the tasks by SI was similar to that obtained in the main set of experiments. Error bars represent mean ± standard errors across different hyperparameter settings. Exact sample sizes for the derived statistics shown in **c** are reported in Supplementary Table 1. **d**, **e**, Recurrent connection weight profiles (as in Fig. 6a–c) in conditions where SI > 4.9 and in conditions where SI < 2.8, respectively. Solid lines represent mean weights and shaded regions represent standard deviations of weights. Both means and standard deviations are averages over multiple networks.

### Supplementary Fig. 11 Results for the highest level of input noise (*g* = 0.5).

**a**, SI increased significantly with *σ*_{0}. Linear regression slope: 0.91 ± 0.21, *R*^{2} = 0.05 (two-sided Wald test, *n* = 361 experimental conditions, *p* < 0.001). In **a**, **b**, solid black lines are the linear fits and shaded regions are 95% confidence intervals for the linear regression. **b**, *λ*_{0} did not have a significant effect on SI (two-sided Wald test, *n* = 361 experimental conditions, *p* = 0.457). **c**, The ordering of the tasks by SI was similar to that obtained in the main set of experiments. Error bars represent mean ± standard errors across different hyperparameter settings. Exact sample sizes for the derived statistics shown in **c** are reported in Supplementary Table 1. **d**, **e**, Recurrent connection weight profiles (as in Fig. 6a–c) in conditions where SI > 4.6 and in conditions where SI < 2.3, respectively. Solid lines represent mean weights and shaded regions represent standard deviations of weights. Both means and standard deviations are averages over multiple networks.

### Supplementary Fig. 12 Schur decomposition of trained and random connectivity matrices.

**a**, Schur mode interaction matrices for the mean recurrent connectivity patterns shown in Fig. 6a–c. Only significant Schur modes with at least one interaction of magnitude greater than 0.04 with another Schur mode are shown here. **b**, The corresponding significant Schur modes. Networks with more sequential activity (SI > 5) have more high-frequency Schur modes than networks with less sequential activity (SI < 2.5). The random networks are close to normal.

### Supplementary Fig. 13 Results from networks explicitly trained to generate sequential activity as in ref. ^{35}.

**a**, **b**, Analogous to Fig. 6a, b; these panels show the recurrent weight profiles obtained in trained networks with ReLU and tanh nonlinearities, respectively. **c**, **d**, Example trials for the corresponding networks (trained with the same initial condition). Only networks with sequentiality index larger than 5.45 were included in the results shown here.

### Supplementary Fig. 14 Circuit mechanism that generates sequential vs. persistent activity in networks with alternative activation functions.

This figure is analogous to Fig. 6a, b, but the results shown are for networks with the exponential linear (elu) activation function (**a**) and networks with the softplus activation function (**b**). Note that the elu activation function typically produced larger SIs than softplus, hence slightly different SI thresholds were used in the two cases to determine low- and high-SI networks.
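The simplified model described in Supplementary Fig. 7 can be sketched in a few lines of NumPy. The self-recurrence of 1, the asymmetric band values (0.09 forward, 0.01 backward), the ReLU nonlinearity and the unit pulse delivered to all 500 units are taken from the caption; the discrete-time update *r*(*t*+1) = ReLU(*Wr*(*t*)), the symmetric band value of 0.05 for the persistent case and the per-unit normalization convention are assumptions, since the caption does not specify them.

```python
import numpy as np

def banded_weights(n, forward, backward, self_weight=1.0):
    """Tridiagonal recurrent weight matrix: self-recurrence on the
    diagonal, `forward` on the subdiagonal (W[i, i-1]) and `backward`
    on the superdiagonal (W[i-1, i])."""
    W = self_weight * np.eye(n)
    idx = np.arange(1, n)
    W[idx, idx - 1] = forward    # forward (unit i-1 -> unit i) weights
    W[idx - 1, idx] = backward   # backward (unit i -> unit i-1) weights
    return W

def simulate(W, n_steps):
    """Assumed discrete-time ReLU dynamics r(t+1) = max(0, W r(t)),
    starting from a unit pulse delivered to every unit at t = 0,
    with no bias terms and no further input during the trial."""
    n = W.shape[0]
    r = np.ones(n)                      # unit pulse to all units at t = 0
    rates = np.empty((n_steps, n))
    for t in range(n_steps):
        rates[t] = r
        r = np.maximum(0.0, W @ r)      # ReLU nonlinearity
    return rates

n, n_steps = 500, 100
# Sequential case: asymmetric off-diagonal band (values from the caption).
W_seq = banded_weights(n, forward=0.09, backward=0.01)
# Persistent case: a symmetric band of 0.05 is assumed here, since the
# caption does not give the symmetric band value.
W_per = banded_weights(n, forward=0.05, backward=0.05)

rates_seq = simulate(W_seq, n_steps)
rates_per = simulate(W_per, n_steps)

# Normalize each unit's response by its own maximum over the trial,
# as in the normalized-response heat maps.
norm_seq = rates_seq / rates_seq.max(axis=0, keepdims=True)
norm_per = rates_per / rates_per.max(axis=0, keepdims=True)
```

Plotting `norm_seq` and `norm_per` as heat maps (time × units) gives a direct visual comparison of the two connectivity regimes.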

## Supplementary information

### Supplementary Text and Figures

Supplementary Figs. 1–14 and Supplementary Table 1

### Reporting Summary
