Introduction

Understanding the principles governing long-term memory is a major challenge in theoretical neuroscience. The brain is capable of storing information for the lifetime of the animal while continually learning new information, so it must face the stability-plasticity dilemma: keep changing in order to learn new memories, but do so without erasing existing information. In humans, forgetting curves (retrieval probability vs. age of memory, sometimes referred to as retention curves) are found experimentally to decay gracefully with memory age, allowing for a non-zero probability of retrieval for memories tens of years old1,2,3,4. While retrieval probability curves decrease monotonically with memory age, the lifetime of individual memories is more intricate and seemingly stochastic: we might not be able to retrieve a memory from last week, but can retrieve a much older one. Thus, the retrieval of memories does not depend on memory age alone.

Early attractor neural network models of long-term memory suffer from catastrophic forgetting: when the number of encoded memories is below a critical value, memories are retrievable with high precision, but above that value none of the memories can be retrieved5,6,7. Incorporating synaptic decay into the circuit enables continual learning, such that at any point in time recent memories are stable. However, the predicted forgetting curves exhibit a critical memory age: all memories newer than this age are almost perfectly retrievable, while all older ones are destroyed (a palimpsestic behavior, in which old information is deleted in favor of new information)8,9,10,11,12,13. This is in contrast to the gracefully decaying forgetting curves in humans. Furthermore, the critical age is of the order of the synaptic decay time, hence memories older than this time cannot be retrieved. Another class of memory models that avoid catastrophic forgetting are models with bounded (continuous or discrete) synaptic strengths14,15. These models also give rise to a palimpsestic behavior qualitatively similar to that of synaptic decay models: only memories newer than a critical age are retrievable, while older ones are not.

One of the main methods of studying the mechanisms of human memory is through memory disorders. Amnesic patients show a variety of patterns of forgetting. One is anterograde amnesia: reduced retrieval of memories encoded after the onset of the disturbance to the circuit, presumably due to an inability to encode or store new memories. Another pattern is temporally-graded retrograde amnesia, in which the probability of retrieving memories encoded shortly before the pathology onset is lower than that of older events, giving rise to non-monotonic forgetting curves (an effect also known as Ribot’s law). Retrograde amnesia is typically explained by invoking memory consolidation theory, which posits that memories must go through a stabilization process that is disrupted by the proximal onset of the disturbance16,17,18,19,20,21,22.

In addition to possible cellular mechanisms, memory consolidation at the system level is mediated through a rehearsal process, reactivating memories in wakefulness or during sleep23,24,25,26,27. Several computational models have been proposed for memory consolidation through rehearsals28,29,30,31,32,33,34,35,36. However, all reported results were confined to a small number of memories: none demonstrated memory functionality and forgetting curves in a large circuit with a number of retrievable memories that scales with the number of neurons, and none derived the scaling of capacity and memory lifetime with the number of neurons and other intrinsic parameters.

Here we present a neural network model for lifelong continual learning and memory consolidation. Our model continuously stores patterns of activity by Hebbian learning, and combines synaptic decay with stochastic nonlinear reactivation of memories. The model generates intricate and rich forgetting behavior. Retrieval probability curves decay smoothly with memory age (exponentially or even as a power law), with characteristic times that can be orders of magnitude longer than the synaptic decay time. In addition, due to the stochasticity of the consolidation process, there is large variability in the survival of individual memories of the same age. We show that at any given time the number of retrievable memories scales as a power of the number of neurons, exhibiting the memory functionality expected of a robust neuronal circuit with distributed memories. The power approaches unity for high rehearsal rates, so that the capacity approaches linear scaling with the number of neurons.

Perturbations of the model circuit give rise to complex patterns of memory deficits, temporally-graded retrograde and anterograde amnesia, the details of which depend on the size as well as the nature of the perturbation. Our theory relates global measures of memory functionality (memory capacity, characteristic memory lifetime) to intrinsic cellular and circuit parameters, such as synaptic decay rate and reactivation statistics, and provides new insight into how the brain builds and maintains the body of memories available for retrieval at each point in an animal’s life.

Results

Model conceptual description

We model the lifelong memory acquisition and forgetting processes using a recurrent neural network that continuously experiences Hebbian learning of new activity patterns (“memories”). Following initial memorization, memories are strengthened by stochastic rehearsal events, each of which increases the Hebbian contribution of the rehearsed memory to the network synaptic connectivity matrix (“the memory efficacy”), thereby enhancing its retrievability. In addition, the synapses spontaneously degrade with time, mimicking the well-known synaptic turnover in biological neural networks. For simplicity, our model consists of a single neuronal population, and every synapse is allowed to be positive or negative. The retrieval of a memory at a certain moment in time is determined by both its current efficacy and its random interference with other stored memories. The rate of rehearsals of a given memory is self-consistently determined by its retrievability, such that a more easily retrievable memory will be revisited more frequently than a less retrievable one, and a memory that loses its retrievability will not be rehearsed anymore. In this framework, memories begin their life with a fixed initial efficacy, which is subsequently increased by each rehearsal and decays between rehearsals. As long as a memory is revisited frequently enough, its efficacy remains larger than the threshold efficacy required for retrieval. However, due to the stochastic nature of the rehearsal process, eventually there will be a period when the lag between two rehearsals is too long, the efficacy of the memory drops too low, and the memory becomes irretrievable and is therefore forgotten. The age at which a given memory is forgotten is therefore stochastic, and may range from the order of the synaptic decay timescale to several orders of magnitude longer. In the following sections we give a detailed description of our memory model, present its properties, and explain how these properties emerge from the model’s dynamics.

Model details

Our model is based on the sparse version of the Hopfield attractor network model of associative memory5,6. Memories are sparse7,37,38,39,40,41, uncorrelated N-dimensional binary activation patterns (N is the number of neurons) and are stored as fixed points of a recurrent neural network dynamics with binary neurons. We assume that the neural activation threshold is adjusted dynamically so that the population activity level maintains the same sparsity as the memories (see “Methods”; in “The effects of structural perturbations on memory function” this assumption is modified). Synaptic dynamics are governed by three processes: deterministic exponential synaptic decay8,11 (first term in Eq. (1)), Hebbian learning42 of new memories (second term in Eq. (1)), and Hebbian consolidation of old memories following their reactivation (third term in Eq. (1)),

$$\begin{aligned} J_{ij}(t+\Delta t)=(1-\Delta t / \tau )J_{ij}(t)+\sum _{l}\xi _{i}^{l}\xi _{j}^{l}\,\delta _{t,l}+b\sum _{k,\{t_k\}} \xi _{i}^{k}\xi _{j}^{k}\,\delta _{t,t_k} \end{aligned}$$
(1)

Here \(J_{ij}(t)\) is the strength of the synapse between neurons i and j at time t (symmetric in i and j), and \(\xi _i^l\) is the i-th element of the memory first introduced at time \(t=l\), given by:

$$\begin{aligned} \xi _i^l= {\left\{ \begin{array}{ll} \frac{1-f}{\sqrt{Nf(1-f)}} \text { with prob. } f \\ -\frac{f}{\sqrt{Nf(1-f)}} \text { with prob. } 1-f \end{array}\right. } \end{aligned}$$
(2)

Here f is the fraction of neurons active in a memory state (the sparseness level). According to the above equation, new memories enter in each time interval \(\Delta t\) and synapses decay at a rate \(1/\tau\), representing the finite lifetime of synapses43. The last term represents a Hebbian strengthening of old memories following a sequence of reactivation events that occur for memory k at times denoted by \(t_{k}\) (which will be specified below). The factor b denotes the size of synaptic modification due to a single consolidation event of an old memory, assumed to be smaller than the Hebbian amplitude of learning a new memory (i.e. \(b< 1\)). The resulting connectivity matrix can be written as:

$$\begin{aligned} J_{ij}(t)=\sum _{l} A_{l}(t) \xi _{i}^{l} \xi _{j}^{l} \end{aligned}$$
(3)

\(A_l(t)\) is the efficacy of memory l at time t. The ability to recall a memory depends on the level of noise, which originates from random interference with other memories. Its variance is proportional to the sum of the squares of all efficacies (see “Methods”):

$$\begin{aligned} \Delta ^{2}(t)= \frac{f}{N}\sum _{n} A_n^{2}(t) \end{aligned}$$
(4)
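To make this setup concrete, here is a minimal NumPy sketch (parameter values are illustrative, not those of any figure) that samples patterns according to Eq. (2), assembles the connectivity matrix of Eq. (3) from a given set of efficacies, and evaluates the interference noise of Eq. (4):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_patterns(P, N, f):
    """Sparse patterns of Eq. (2): (1-f)/sqrt(Nf(1-f)) with probability f,
    and -f/sqrt(Nf(1-f)) otherwise."""
    active = rng.random((P, N)) < f
    norm = np.sqrt(N * f * (1 - f))
    return np.where(active, (1 - f) / norm, -f / norm)

def connectivity(patterns, A):
    """Eq. (3): J_ij = sum_l A_l * xi_i^l * xi_j^l."""
    return (patterns * A[:, None]).T @ patterns

def interference_noise(A, N, f):
    """Eq. (4): Delta = sqrt((f/N) * sum_n A_n^2)."""
    return np.sqrt(f / N * np.sum(A ** 2))

# Toy usage with hypothetical numbers: efficacies decaying purely exponentially with age.
N, f, P, tau = 2000, 0.01, 300, 160.0
xi = make_patterns(P, N, f)
A = np.exp(-np.arange(P) / tau)
J = connectivity(xi, A)
print(interference_noise(A, N, f))
```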

We define the critical efficacy \(A_c\), as the efficacy for which a memory pattern loses its stability. \(A_c\) is proportional to the interference noise \(\Delta\), with the proportionality constant depending only on the sparseness:

$$\begin{aligned} A_c=a(f) \cdot \Delta \end{aligned}$$
(5)

Due to the reduced overlap between memories, the critical efficacy increases when f is decreased. The factor a(f) can be approximated by (see Supplementary Information (SI) section 1):

$$\begin{aligned} a(f) \approx 1.44 \sqrt{2\log {\left( \frac{1.9}{f}\right) }} \end{aligned}$$
(6)

This approximation holds for the f regime we consider throughout this study. For \(f=0.01\) (which we will use throughout the paper), \(a \approx 4.7\).

Pure forgetting

Without rehearsals, our model is similar to previous models of associative memory with forgetting8,10,11, in which memory efficacies decay exponentially with age, \(A_l(t)=\exp (-(t-l)/\tau )\) (Fig. 1a). Using Eq. (4), the interference noise equals \(\Delta ^2\approx f \tau /(2N)\). For \(\tau >\tau _0\), where \(\tau _0=2N/(f a^2(f))\) (see Eq. (5)), \(A_c\) increases above unity (the initial efficacy) and no memory will be retrievable. This global catastrophic forgetting is similar to the behavior of the Hopfield model after reaching memory capacity, where the interference effect is too strong and all memory states lose their stability5,6. If \(\tau < \tau _{0}\), recent memories are retrievable, while memories older than a critical age \(t_0 = \frac{\tau }{2} \log \left( \frac{\tau _0}{\tau } \right)\) are forgotten (Fig. 1b). Thus, for short decay times, this model allows for continual learning of recent memories without global catastrophic forgetting. However, it predicts an unrealistic age-dependent catastrophic forgetting, where all memories up to a critical age are almost perfectly retrievable, and all older memories are completely forgotten. This sharp transition happens despite the graceful exponential decay of efficacies with age, and results from the collective effects of memory stability in the network.
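As a quick numerical check of these expressions, a minimal sketch (assuming the a(f) approximation of Eq. (6)) reproduces the critical age quoted in the caption of Fig. 1:

```python
import numpy as np

def a_of_f(f):
    """Approximation of Eq. (6)."""
    return 1.44 * np.sqrt(2 * np.log(1.9 / f))

def pure_forgetting_critical_age(N, f, tau):
    """tau_0 = 2N/(f a^2(f)); a memory of age t is retrievable while exp(-t/tau) > A_c,
    i.e. for t < t_0 = (tau/2) log(tau_0/tau)."""
    tau0 = 2 * N / (f * a_of_f(f) ** 2)
    if tau >= tau0:
        return 0.0  # global catastrophic forgetting: no memory is retrievable
    return 0.5 * tau * np.log(tau0 / tau)

# Parameters of Fig. 1: the critical age comes out near 1.7*tau, i.e. roughly 0.5*N memories.
print(pure_forgetting_critical_age(N=8000, f=0.01, tau=2240))
```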

In what follows we will show that when stochastic rehearsals are taken into account, the behavior changes dramatically, generating more realistic memory forgetting trajectories and allowing for lifelong memories.

Figure 1

(a) Pure forgetting. A memory efficacy trajectory as a function of time (solid line). The critical efficacy \(A_c\) is plotted as a dashed line. (b) Overlap of the network state with a memory state as a function of the memory age. The overlap is a measure of memory retrievability: after initializing the network near a memory state, the overlap of the nearby attractor network activity will be close to unity for retrievable memories and small compared to one for irretrievable memories. Here \(N=8000\), \(f=0.01\), \(\tau =2240\). The catastrophic age here is \(\sim 1.73\tau\), resulting in a capacity (number of retrievable memories) of \(\sim 0.5N\). Note the very large value of \(\tau\) needed to support this capacity; this will be addressed in later sections.

Nonlinear stochastic reactivation

To specify the statistics of reactivations, we turn to the continuous-time version of Eq. (1), which yields the efficacy dynamics:

$$\begin{aligned} \frac{dA_{l}}{dt}=-\frac{1}{\tau } A_{l}+bR_{l}(t) \end{aligned}$$
(7)

With \(A_{l}(t)=0\) for \(t<l\) and \(A_{l}(l)=1\). The reactivations are modeled as a point process

$$\begin{aligned} R_{l}(t)=\sum _{\{t_l\}}\delta (t-t_l), \end{aligned}$$
(8)

where \(t_l\) are the times at which memory l was rehearsed. To specify the rate of the reactivation process, we hypothesize that this process is more likely to yield a Hebbian strengthening of memories with a large basin of attraction. The rationale is that during reactivation periods, the system is more likely to visit memories with large basins of attraction and to stay there for a significant period of time, triggering their Hebbian strengthening. In particular, memories that at some point in time lost their stability and are no longer attractors of the dynamics (i.e., have a vanishing basin of attraction) will not be reactivated, will experience fast pure decay, and will be forgotten. Hence we model reactivation events as inhomogeneous Poisson processes, with mean rate \(r_l (t)\equiv \langle R_l(t) \rangle\), which is proportional to the memory’s basin of attraction size:

$$\begin{aligned} r_l (t)=\lambda F(A_l(t)/A_c(t)) \end{aligned}$$
(9)

where \(\lambda\) denotes the maximal reactivation rate. As in Eqs. (4) and (5), at all times \(A_c(t)=a(f) \cdot \Delta (t)\). The nonlinear function F denotes the size of the basin of a memory and depends on the ratio of the memory efficacy to the critical efficacy \(A_c\) (Fig. 2a). We derive the function F by numerical calculation of memories’ basin of attraction sizes for different values of \(A/A_c\) (see “Methods” section and SI for details). At any given time, only memories with non-zero basin size (i.e., \(A_l(t)>A_c \rightarrow F>0\)) are retrievable and might be reactivated. Note that since the interference \(\Delta (t)\) depends on the efficacies of all memories (Eq. (4)), the reactivation rates of all memories are coupled in Eq. (9) via \(A_c\).

The approach to steady state of memory consolidation

It is useful to first consider the average dynamics, replacing the reactivation point process by its mean rate, Eq. (9). For a given \(A_c\), the resulting self-consistent equation for the steady state efficacies,

$$\begin{aligned} A_{\text {fp}}=b\lambda \tau F(A_{\text {fp}}/A_c), \end{aligned}$$
(10)

possesses two stable fixed points: one at zero and another where the two competing processes, decay and reactivation, balance each other (\(A_{\text {fp}}\), Fig. 2a). Due to the rapid saturation of the function F, over most of the parameter regime \(A_{\text {fp}}\sim b\lambda \tau\).
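The self-consistency of Eq. (10) can be illustrated with a short sketch; note that the basin-size function below is a hypothetical, rapidly saturating stand-in for the F derived numerically in the paper, used here only to show the fixed-point iteration:

```python
import numpy as np

def F_basin(x):
    """Hypothetical stand-in for the basin-size function F of Eq. (9):
    zero below the stability threshold x = A/A_c = 1, rapidly saturating above it.
    The paper derives F numerically (see Methods and SI)."""
    return np.where(x > 1.0, 1.0 - np.exp(-3.0 * (x - 1.0)), 0.0)

def consolidation_fixed_point(b, lam_tau, A_c, n_iter=200):
    """Iterate the self-consistency condition of Eq. (10), A = b*lambda*tau*F(A/A_c),
    starting from the encoding value A(0) = 1."""
    A = 1.0
    for _ in range(n_iter):
        A = b * lam_tau * F_basin(A / A_c)
    return A

# With b*lambda*tau = 1.5 and A_c = 0.4 the iteration settles very close to b*lambda*tau,
# reflecting the rapid saturation of F.
print(consolidation_fixed_point(b=0.3, lam_tau=5.0, A_c=0.4))
```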

To fully understand the system’s behavior we need to consider the dynamics of \(A_c\) itself as well as the stochastic nature of the process. Initially, when the first memories enter the system, \(A_c\propto \Delta\) is very small and the memory efficacies consolidate around the value \(A_{\text {fp}}\sim b\lambda \tau\). As more memories are encoded, the interference grows and so does the critical efficacy (red line in Fig. 2b). When the critical efficacy is large enough, fluctuations in reactivation times lead some memory efficacies to drop below \(A_c\), making these memories irretrievable. A steady state is achieved when the flux of memories arriving at the system and becoming consolidated is balanced by the rate of forgetting due to efficacies dropping below \(A_c\). At this stage, \(A_c\) reaches a fixed equilibrium value and so does the mean number of retrievable memories. The specific identity of the retrievable memories varies with time: some are forgotten while new ones are being consolidated. The distribution of efficacies at equilibrium (Fig. 2c) consists of two modes: the first is the contribution of the forgotten memories, below \(A_c\), which diverges at small A as \(p(A)=\tau / A\); the second, above \(A_c\), is a mode around \(A_{\text {fp}}\) representing the retrievable memories.

\(A_c\) increases as the amplitude b and the number of reactivations per decay timescale \(\lambda \tau\) increase, due to increased interference (Fig. 2d). For moderate reactivation strength, \(A_c\) is well below both the encoding strength \(A(0)=1\) and the consolidation fixed point, as seen in the examples in Fig. 2b,c. As the reactivation strength grows, \(A_c\) increases and approaches 1, adversely affecting the consolidation process, as will be seen in the next section.

Figure 2

Stochastic memory dynamics. (a) Blue: basin of attraction size F as a function of memory efficacy A. Orange dashed: \(A/\tau\), the negative of the deterministic decay term in Eq. (7). Importantly, F is zero for \(A<A_c\). Here \(\lambda \tau =5\), \(A_c=0.4\). (b) Example memory efficacies vs. age of the system. Memories enter with efficacy \(A(0)=1\), rehearsal efficacy \(b=0.3\). Most of them increase towards \(A_{fp}\approx b\lambda \tau \approx 1.5\) and fluctuate around it. Large enough fluctuations can take efficacies below \(A_c\) (e.g., cyan curve at \(age/\tau \approx 40\), yellow curve at \(age/\tau \approx 120\)). Some memories are alive for a very short time (e.g., green curve) and some for a very long time (e.g., red and blue curves). (c) Distribution of memory efficacies after saturation of \(A_c\). (d) Equilibrium values of \(A_c\) as a function of \(b\lambda \tau\) for different \(\lambda \tau\) values. Here \(\tau =160,\; N=8000\).

The forgetting curve

Importantly, in our model, the time of forgetting of memories at a given age is highly variable, ranging from a fraction of the decay time \(\tau\) (for unfortunate memories that weren’t rehearsed soon enough after learning) up to hundreds of \(\tau\) for well-rehearsed memories (Fig. 2b). Nevertheless, on average, memory retrievability decreases with memory age, and this is captured by the forgetting curve: the probability of retrieving a memory as a function of its age, after a steady state has been reached (Fig. 3a,b). This curve exhibits an exponential tail with a long time constant, denoted the consolidation time \(\tau _c\), a direct result of the long time required for a fluctuation in reactivation times large enough that a consolidated efficacy decreases from around \(A_{\text {fp}}\) to \(A_c\) (SI). The consolidation time enhancement factor \(\tau _c / \tau\) can reach several orders of magnitude, allowing memories to survive for very long times compared to the intrinsic timescales of the system (as shown in Fig. 3c,d). For fixed consolidation parameters \(\lambda \tau\) and b, the consolidation time normalized by \(\tau\) decreases with \(\tau\) due to increased interference (Fig. 3d).

For fixed \(\tau\), the consolidation time increases sharply with \(\lambda \tau\) and b (Fig. 3c). However, increasing these parameters may adversely affect the consolidation process. When \(b\lambda \tau\) is of order 1, most of the memories experience consolidation, as is the case in Fig. 3a. However, when \(b\lambda \tau \gg 1\), \(A_c\) is close to the encoding efficacy (Fig. 2d). As a result, a significant fraction of memories never get consolidated. The forgetting curve therefore exhibits an initial fast decay with a characteristic time \(\tau\), in addition to the slow decay time \(\tau _c\) (Fig. 3b). To quantify this effect, we measure the consolidation probability of memories, \(p_c\), defined as the chance of a memory efficacy reaching \(A_{fp}\) and thereby joining the long-lived memories. \(p_c\) decreases as a function of the reactivation strength from \(p_c=1\), for \(b\lambda \tau <1\), to zero for strong reactivation (Fig. 3c). The consolidation probability \(p_c\) and the consolidation time \(\tau _c\) are thus the key consolidation parameters.

Figure 3

(a, b) The forgetting curve. The probability of retrieval as a function of memory age. Blue dots: full network simulation results (see “Methods”). Red solid lines: results of a mean field approximation (“Methods”). An exponential fit with characteristic decay time \(\approx 18\tau\) is shown in green (dash-dot line) in (a), and a double exponential fit with characteristic decay times \(\approx \tau\) and \(\approx 38\tau\) in (b). The retrieval probability for pure forgetting is shown in black (dashed line). In (a) \(N=8000\), \(\tau =160\), \(\lambda \tau =5\), \(b=0.3\). In (b) same parameters as (a) except \(b=0.25,\; \lambda \tau =10\). (c) Blue (left y axis): Consolidation probability vs. \(b\lambda \tau\) for different \(\lambda \tau\) values. Green (right y axis): Consolidation time \(\tau _c\) normalized by synaptic decay time \(\tau\) vs. \(b \lambda \tau\) for different \(\lambda \tau\) values. Here \(\tau =160\). (d) Consolidation time \(\tau _c\) normalized by synaptic decay time \(\tau\) vs. \(\tau\) for different \(\lambda \tau\) values. Blue curve: \(\lambda \tau =5, b=0.3\). Green curve: \(\lambda \tau =10,b=0.25\).

Capacity increases as power law with network size

We define the network’s memory capacity as the number of retrievable memories (memories with \(A>A_c\)) in the equilibrium phase after a long encoding time (Fig. 4). The capacity can be evaluated as the area under the forgetting curve. Hence, it can be approximated as \((1-p_c)\tau +p_c\tau _c\), where \(\tau _c\) is the lifetime of consolidated memories, \(p_c\) is the fraction of memories that are consolidated, and the first term is the contribution from unconsolidated memories. As seen previously, \(\tau _c\) increases with \(b\lambda \tau\), while \(p_c\) decreases with \(b\lambda \tau\): fewer memories are consolidated, but the consolidated ones live longer. Maximal capacity is achieved when \(b \lambda \tau \approx A(0)=1\), the largest value that still allows \(100\%\) of the memories to get consolidated.

To assess the efficiency of information storage in the network it is important to evaluate the dependence of the capacity on the network size, N. In previous “pure forgetting” models8,10 the synaptic decay time was assumed to scale linearly with N, resulting in a memory capacity \(t_0\) that is proportional to N. The same holds for our model. However, this scaling requires extremely large, biologically implausible synaptic decay times for large networks. Here we assume instead that \(\tau\) is a property of individual synapses and is independent of network size. Under this condition, the capacity of the pure forgetting model increases only logarithmically with N (Fig. 4c).

Interestingly, we find that in our model, the capacity scales as a power law of the number of neurons, with a power that approaches unity for large \(\lambda \tau\) values (Fig. 4c,d). To approximate the power analytically (for the parameter range where \(p_c \approx 1\)), we first approximate \(A_c\), assuming that the main contribution to the interference noise comes from consolidated, retrievable memories (SI, sec.3):

$$\begin{aligned} A_c \approx \sqrt{\frac{f\cdot a^2(f) p_c \tau _c }{N}}b\lambda \tau \end{aligned}$$
(11)

Now, under the assumptions that \(p_c \approx 1\) and that the mean number of rehearsals per decay time is \(\sim \lambda \tau\), we get (SI):

$$\begin{aligned} \text {capacity} \approx \tau _c \approx \tau ^{\frac{1}{1+0.5\lambda \tau }} \left( \frac{N}{f a^2(f)}\right) ^{\frac{\lambda \tau }{2+\lambda \tau }} \end{aligned}$$
(12)

Figure 4c,d shows that this approximation gives a reasonable fit to the dependence of the capacity on N. Note the significant increase in capacity compared to the pure forgetting model.
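A short sketch of the capacity approximation of Eq. (12), with the a(f) approximation of Eq. (6) plugged in (illustrative parameter values):

```python
import numpy as np

def a_of_f(f):
    return 1.44 * np.sqrt(2 * np.log(1.9 / f))   # Eq. (6)

def capacity_eq12(N, f, tau, lam_tau):
    """Eq. (12): capacity ~ tau^(1/(1+0.5*lam_tau)) * (N/(f*a^2(f)))^(lam_tau/(2+lam_tau))."""
    power = lam_tau / (2.0 + lam_tau)
    return tau ** (1.0 / (1.0 + 0.5 * lam_tau)) * (N / (f * a_of_f(f) ** 2)) ** power

# Predicted power of N: 5/7 for lam_tau = 5, approaching 1 as lam_tau grows.
for N in (4000, 8000, 16000, 32000):
    print(N, capacity_eq12(N, f=0.01, tau=160, lam_tau=5))
```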

Figure 4

Memory capacity. (a) The number of retrievable memories divided by N as a function of \(b \lambda \tau\) for different average number of rehearsals per characteristic decay time (\(\lambda \tau\)) values. The dashed line shows the capacity of the pure forgetting model. Here \(N=8000,\; \tau =160\). (b) The number of retrievable memories divided by N as a function of \(\tau\) for different \(\lambda \tau\) values. (c) Capacity vs. N (logarithmic, base 10), solid lines show simulation results, dashed lines are analytical approximation. Here \(\tau =160,\; b=0.3\). (d) The power of N vs. \(\lambda \tau\) (black), and the analytical approximation (red). Here \(\tau =160,\; b=0.3\).

Inhomogeneity in initial memory encoding

So far we have assumed that all memories are initially encoded by Hebbian plasticity with the same amplitude \(A(0)=1\) (Eq. (1)). In reality, memories might differ in their encoding strength, for instance due to factors such as attention or emotional context. Thus, it is important to explore the effect of a distribution of initial encoding strengths. As long as most of the initial efficacies are in the neighborhood of \(b\lambda \tau\), the global memory properties such as \(A_c\), the forgetting curve, and the capacity are not affected drastically. However, individual memories with initial efficacy below \(A_c\) are forgotten, while memories with A(0) larger than \(b\lambda \tau\) have slightly enhanced consolidation properties, as confirmed in Fig. 5a for an exponential distribution of A(0) with mean 1.

To better elucidate the effect of inhomogeneity in A(0), we consider in Fig. 5b,c the case of a Bernoulli distribution, \(A(0) \in \{1, a_0\}\) with equal probability. For \(a_0\) small compared to \(b\lambda \tau =1.5\), the consolidation probability of memories with \(A(0)=a_0\) decreases drastically and vanishes for \(a_0\) below \(A_c \approx 0.39\) (Fig. 5b). When \(a_0\) increases above 1, the consolidation probability of these memories increases until it reaches 1 for \(a_0 \approx b \lambda \tau\). On the other hand, memories with \(A(0) =1\) are only moderately affected by changing \(a_0\). The mean lifetime of memories with \(A(0)=a_0 < 1\) drops considerably (Fig. 5a,c). This is, however, due to averaging the lifetime over all memories, including those that did not consolidate. Importantly, in our model, memories with small \(a_0\) that did reach the neighborhood of the fixed point have the same long lifetime as other consolidated memories, independent of the original encoding strengths, as shown by the dashed lines in Fig. 5c. Note that an inhomogeneous initial efficacy distribution alone will not give rise to memories with lifetimes much larger than the synaptic decay timescale, because of the exponential decay: for a lifetime of 10 times the decay timescale, the initial efficacy would have to be \(\text {exp}(10)\) times larger than the critical efficacy. In any initial efficacy distribution, such a memory will be extremely rare, because the critical efficacy is proportional to the square root of the distribution’s second moment.

Figure 5

Initial condition distribution. (a) Here A(0) for each memory is drawn from an exponential distribution with unity mean. Blue points are values for single memories, and the red line shows the mean. Note that the spread in lifetimes at each encoding strength is a result of the stochastic rehearsal process, which yields an exponential distribution of lifetimes and is present also in the uniform A(0) case. (b) Consolidation probability vs. \(a_0\), which is the initial efficacy of half of the memories (the others have \(A(0)=1\)). Values for memories introduced with \(A(0)=1\) are shown in blue, and for memories introduced with \(A(0)=a_0\) are shown in red. (c) Memory mean lifetime (time from insertion to forgetting) as a function of \(a_0\). Same scenario and coloring as in (a). Dashed lines are averaged lifetimes of consolidated memories only. Parameters: \(N=8000,\; f=0.01,\; \lambda \tau =5,\; b=0.3\)

The effects of structural perturbations on memory function

In this section we analyze the effect of damage to the circuit on memory storage and retrieval. In previous sections, we assumed for simplicity that the neuronal firing threshold is automatically adjusted to guarantee a fixed mean activation level, f (see “Methods”). Here we assume instead that the firing threshold is fixed, since we anticipate that part of the effect of a perturbation is the disruption of the activity level. Importantly, in the case of a constant threshold, the dependence of \(A_c\) on \(\Delta\) is not linear. It has a non-zero value for small \(\Delta\), reflecting the requirement that the encoding efficacy be large enough for neurons to cross the threshold. Above some critical \(\Delta\), \(A_c\) rises abruptly, causing all memories to lose stability due to over-activation of the network when the noise level is high (Fig. S4). At equilibrium \(\Delta\) is below but close to the critical value (for the presented parameter range). In this scenario, the properties of the unperturbed system are similar to those of the fixed-activity scenario, with a memory capacity that depends on the threshold value. For the presented results we used the threshold value 0.36, which maximizes capacity in unperturbed conditions (see SI).

Noisy synaptic dynamics

We first consider perturbations of the synaptic learning and consolidation processes by adding white noise \(\chi\) with diffusion coefficient D to the synaptic dynamics for all \(t\ge t_{\text {onset}}\),

$$\begin{aligned} \begin{aligned}{}&\frac{dJ_{ij}}{dt} =-\frac{1}{\tau } J_{ij}+\sum _{l}\xi _{i}^{l}\xi _{j}^{l}\left( \delta (t-l)+bR_{l}(t)\right) +\chi _{ij}(t), \\&\left\langle \chi _{ij}(t)\right\rangle =0,\\&\left\langle \chi _{ij}(t)\chi _{kl}(t')\right\rangle =D^2\delta (t-t')\delta _{ik}\delta _{jl} \end{aligned} \end{aligned}$$
(13)

The effect of this noise is approximately an additive contribution to the total variance of local fields

$$\begin{aligned} \Delta ^2(t)\approx \frac{1}{N}\sum _{l}A_l^2(t)+\frac{\tau D^2}{2N} \left( 1-\text {exp}\left( -2(t-t_\text {onset})/{\tau } \right) \right) \end{aligned}$$
(14)

After the noise onset, \(\Delta\) increases rapidly above its critical value, causing a sharp increase in \(A_c\) and blocking rehearsals for all memories. This in turn causes a rapid decrease in the magnitude of stored memory efficacies, leading to a decrease in \(\Delta\) below the critical value and a decrease in \(A_c\) to a value intermediate between its value before the onset and its value just after the noise onset. This new equilibrium value of \(A_c\), with reduced capacity, is reached over a time \(\sim \tau\) (Fig. 6a). Although the overall reduction in capacity may be mild for moderate D values, there is a large reduction in the retrieval probability of memories that were encoded around the perturbation onset time, due to the sharp transient increase in \(A_c\). In contrast, memories that had already been consolidated suffer only a mild reduction in survival probability (relative to unperturbed memories of the same age). Likewise, newly entered memories have a high probability of consolidation, since they experience the equilibrium value of \(A_c\), and their retrieval probability is similar to the unperturbed case (Fig. 6b).

Random synaptic silencing

Another perturbation we consider is the death of a fraction of the synapses. We model the effect of synaptic death by multiplying the connectivity matrix \(J_{ij}\) elementwise by a random binary \(\{0,1\}\) dilution matrix:

$$\begin{aligned} J_{ij} \rightarrow C_{ij} J_{ij}, \;\; C_{ij}={\left\{ \begin{array}{ll} 1 \text { with prob. } 1-p \\ 0 \text { with prob. } p \end{array}\right. } \end{aligned}$$
(15)

Unlike the additive noise considered above, the synaptic dilution process is multiplicative, reducing both the effective efficacy of each memory (by a factor \(1-p\)) and the interference noise \(\Delta\) (by a factor \(\sqrt{1-p}\)), and in general reducing the signal-to-noise ratio (“Methods”). After the dilution onset, \(A_c\) barely changes (due to the weak dependence of \(A_c\) on \(\Delta\) in the constant-threshold scenario), while all the efficacies are reduced, causing a reduction of retrievability that spreads over the entire age range (Fig. 6c,d), and a new equilibrium is reached slowly. Interestingly, neural adaptation (modeled here as a decrease in the neural activation threshold) can reduce the memory loss due to silencing (also reported in44,45,46) by reducing the minimum efficacy required for activation, i.e., \(A_c\), thereby recovering some of the gap between memory efficacies and \(A_c\) (Fig. 7). Thus, our model predicts a qualitative difference between the effects of the two types of perturbations: synaptic dilution affects memories of all ages, causing a reduction in capacity that develops over a long time and can be partially compensated for by threshold adaptation, while additive synaptic noise results in a deficit largely confined to the time of the perturbation onset, with fast convergence to a new equilibrium.
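A minimal sketch of the dilution perturbation of Eq. (15), assuming a NumPy connectivity matrix J; the comments restate the scaling factors quoted above:

```python
import numpy as np

rng = np.random.default_rng(1)

def dilute(J, p):
    """Eq. (15): silence a random fraction p of synapses with a symmetric {0,1} mask."""
    C = (rng.random(J.shape) >= p).astype(float)
    C = np.triu(C, 1)
    C = C + C.T          # symmetric mask, zero diagonal
    return C * J

# On average the effective efficacy of each memory is reduced by a factor (1-p),
# while the interference noise Delta is reduced only by sqrt(1-p),
# so the signal-to-noise ratio drops by a factor sqrt(1-p) (see "Methods").
```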

Figure 6

Perturbations and memory deficits. (a) The ratio between the capacity with and without injected noise vs. the diffusion coefficient D. (b) Retrieval probability vs. memory age with noisy synaptic dynamics (\(D=6\)). Noise onset was \(5\tau\) (green), \(10\tau\) (blue), \(20\tau\) (purple), or \(40\tau\) (brown) before the measurement. The control (black) is simulated with noiseless dynamics. (c) The ratio between the capacity with and without synaptic dilution vs. the silenced synapse fraction p. (d) Retrieval probability vs. memory age for random synaptic dilution (\(p=0.1\)). Coloring as in (b). (e) Same as (d), but with \(p=0.2\). Memories of all ages are affected, with some non-monotonicity caused by the small efficacies of newly learned memories, which drop more easily below \(A_c\). (f) Combination of synaptic dilution and noisy synaptic dynamics, \(D=6\) and \(p=0.1\). Coloring as in (b). Parameters: \(N=8000,\; \tau =160,\; \lambda \tau =10,\; b=0.25\)

Figure 7

Effect of threshold adaptation. Blue bars show capacity with threshold optimized for the noiseless case (\(\theta _0=0.36\)). Green bar shows capacity with threshold optimized for low dilution (p=0.1, \(\theta _1=0.31\)). Red bar shows capacity for threshold optimized for high dilution (p=0.2, \(\theta _2=0.29\)). Parameters: \(N=8000,\; \tau =160,\; \lambda \tau =10,\; b=0.25\)

Distribution of synaptic decay times

Experiments showing that the timescale of synaptic and spine turnover is variable43, together with observations of power-law memory retention curves in some memory studies1,2,3,4,47, motivate considering synaptic dynamics with heterogeneous decay time constants, yielding

$$\begin{aligned} J_{ij}(t)=\frac{1}{Nf(1-f)}\sum _{l}A_{l}^{ij}(t)\,(\xi _{i}^{l}-f)(\xi _{j}^{l}-f) \end{aligned}$$
(16)

where the efficacies \(A_{l}^{ij}(t)\) contributed by each synapse obey

$$\begin{aligned} \frac{dA_{l}^{ij}}{dt}=-\frac{1}{\tau _{ij}} A_{l}^{ij}+R_{l}(t) \end{aligned}$$
(17)

The mean efficacy of each memory is the average over these contributions,

$$\begin{aligned} A_{l}(t)=\left\langle A_{l}^{ij}(t)\right\rangle _{\tau _{ij}} \end{aligned}$$
(18)

where \(\langle ..\rangle _{\tau _{ij}}\) denotes the average over the distribution of synaptic time constants. Likewise, the noise term is proportional to the sum of second moments of the efficacies:

$$\begin{aligned} \Delta ^2(t)=\frac{f}{N}\sum _{n}\left\langle (A_{n}^{ij}(t))^{2}\right\rangle _{\tau _{ij}} \end{aligned}$$
(19)

As an example, we show the case where the decay time constants are power-law (Pareto) distributed, i.e., \(P(\tau )\propto \tau ^{-(\alpha +1)} \; ; \tau \ge \tau _0,\alpha >0\) (“Methods”). In the absence of rehearsals (pure decay), there will be global catastrophic forgetting for \(\alpha \le 1\), where the mean decay time, and therefore the interference noise, diverges. For \(\alpha >1\) there will be catastrophic age-dependent forgetting, as in the case of a uniform decay timescale (see13 and SI). With stochastic nonlinear rehearsals, for large \(\alpha\) the forgetting curve is approximately exponential, similar to the single-\(\tau\) case (Fig. 8b). This is because the dominant contribution comes from the shortest time \(\tau _0\). Interestingly, for intermediate values (\(1<\alpha <1.8\)), the forgetting curves have an approximately power-law decay (Fig. 8a). In this regime, the retrieval probability is affected by contributions from a broad range of time constants: neither the minimal \(\tau\) nor outliers with very large values are dominant.
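The effect of a Pareto distribution of decay times on the average decay kernel can be sketched as follows (illustrative parameters close to those of Fig. 8a; the averaging corresponds to Eqs. (17)-(18) for a single, unrehearsed encoding):

```python
import numpy as np

rng = np.random.default_rng(2)

alpha, tau_min = 1.5, 20.0
# Classical Pareto: P(tau) ~ tau^-(alpha+1) for tau >= tau_min
tau_ij = tau_min * (1.0 + rng.pareto(alpha, size=100_000))

# Average decay kernel <exp(-t/tau)> over synapses for one unrehearsed encoding.
t = np.logspace(0, 3.5, 50)
mean_kernel = np.array([np.exp(-ti / tau_ij).mean() for ti in t])
# For intermediate alpha this mixture of exponentials decays far more slowly than any
# single exponential, consistent with the broad range of contributing time constants.
```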

Figure 8

Power-law synaptic decay characteristic time distribution. (a) Retrieval probability as a function of memory age on log-log scale (blue), with power-law fit in black (dashed line, slope\(\approx 1\)). Here \(\lambda \tau =5\) for the empirical mean \(\tau\), \(b=0.25\), the power \(\alpha =1.5\). The minimal \(\tau\) is 20. (b) Goodness of fit (\(R^2\)) for the forgetting curve using an exponential function (orange, dash-dot) and power-law function (blue), as a function of the power parameter of the characteristic time distribution.

Discussion

Consolidation time scale

We have proposed a stochastic self-amplified memory consolidation mechanism and showed that it leads to smooth forgetting curves that extend much longer than the synaptic decay time. Our model provides estimates for global long-term memory properties such as the capacity of the network, the shape of the forgetting curve, and the average lifetime of memories. Translating the synaptic decay time to realistic times is hard. In rodents, spine turnover time is estimated to be of the order of several weeks in the hippocampus and up to a year in the cortex43,48,49,50,51. In humans, these times may be longer given the lower metabolic rate52,53; however, there is no direct experimental evidence. In addition, the model synaptic decay time \(\tau\) is in units of the mean interval between encodings of episodic memories, which is hard to estimate but is likely to be of the order of weeks. Thus, assuming a human spine turnover time of the order of months yields \(\tau\) of the order of tens of months, which could lead to a mean memory lifetime of several years. At present, these estimates are speculative.

Memory deficiencies

We have considered two types of perturbations to the memory circuit: synaptic death and increased synaptic noise. Both types of damage result in reduced retrievability of memories introduced prior to the damage onset, a phenomenon known in the literature as retrograde amnesia16,17,18,19. Due to the consolidation effect in our model, the amnesia is temporally graded: memories learned just before the noise onset are more severely affected than older ones, because they did not have enough time to consolidate and were more fragile at the onset time. This effect is more prominent in the case of increased synaptic noise than of synaptic dilution, due to the sharper drop in basin of attraction sizes after the noise addition. Perturbations also cause a drop in retrieval performance of new memories entering after the perturbation onset, a manifestation of anterograde amnesia. This effect is temporally graded as well, being more severe for memories introduced just after the onset, and is an especially prominent deficit in the additive noise case. Another interesting difference is the approach to a new equilibrium, which is fast in the case of additive noise but slow in the case of dilution. In addition, threshold adaptation can compensate for part of the memory dysfunction caused by dilution, but not for that caused by additive noise.

Our model also allows for exploration of transient perturbations (SI), where the damage lasts for a finite time window54. In this case there is again a temporally-graded retrograde amnesia. Interestingly, new memories introduced after the end of the event not only regain retrievability, but can even improve their retrievability compared to control. This is due to the increased forgetting rate during the event, which results in lower interference noise and increased rehearsal rate after the event end.

The predictions of our model should be contrasted with the pure decay model, in which similar perturbations reduce the capacity (the maximal age of retrievable memories) but do not introduce any non-monotonicity in the forgetting curve, which remains a step function with a reduced width.

Relation to previous models

In the classical theory of systems memory consolidation16,17,18,20, the interaction between the hippocampus (HC) and the cortex plays a central role: HC stores memories for short periods of time and, following rehearsals, memories are transmitted to the cortex for long-term storage. In the more recent Multiple Trace Theory (MTT)20,29, autobiographical memories are stored for the long term in both HC and cortex and consolidated through rehearsals that establish multiple memory traces in HC. This model shares some key elements with our theory, such as ongoing, lifelong consolidation of memories and rehearsals that make memories more robust to perturbations. However, it is unclear how MTT can scale to large numbers of stored memories. In addition, in29 the rehearsal statistics (new trace formation) do not depend on the robustness of the memories, nor does the model take into account interference between memories.

In the pseudo-rehearsal model28, new memories are learned in batches. After each batch is learned, there is a series of pseudo-rehearsals, which consist of learning fixed-point states found by random initialization of the network’s state. This model implements stochastic rehearsals explicitly, and the pseudo-rehearsals reduce the loss of old information. While our model uses Hebbian, one-shot learning of new memories, the pseudo-rehearsal model relies on multi-step gradient descent for each memory batch, which is generally a less biologically plausible learning algorithm. In addition, the authors present results only for a single, small network, and it is not clear how the results scale with the size of the network, nor how the increase in memory lifetime depends on the different properties of the model.

A few studies used neural network models where rehearsals are modeled as random visits of learned memories31,33,36, or as implicit rehearsals (via memory traces embedded in the noise correlations30). In33 the authors show the effects of rehearsals on a large number of learned memories. Yet they do not explicitly model the “hippocampus” part of the model where rehearsals occur, and they do not provide quantitative relations between the model’s performance and the different parameters, such as the number of neurons in the “neocortex” part. In30,31,36 the authors study the effect of rehearsals in networks of a specific size with a small number of memories that does not scale with the network size. It is not clear how their results scale with the network size, or how the enhancement in memory lifetime depends on the different network properties. The authors of30,36 consider rehearsals of a finite batch of previously encoded memories rather than lifelong learning as in our study. Benna and Fusi13 studied memory storage with complex synapses, where a consolidation process is implemented in the dynamics of the synapses themselves, with a cascade of synaptic characteristic times. They show that their mechanism gives rise to a power-law decay of the signal-to-noise ratio (SNR, equivalent to \(A_l(t)/\Delta (t)\) in our model) with age. However, this model still exhibits deterministic catastrophic age-dependent forgetting, such that all memories older than a critical age are non-retrievable, whereas all newer memories are almost perfectly retrievable. A recent phenomenological model47 derives a power-law form for memory retention curves with a power of 1 or smaller. A power close to 1 for intermediate ages is consistent with our result for a power-law distribution of synaptic decay times. However, at present, it is unclear whether the experimental paradigms and timescales in which a power law is observed are relevant to lifelong episodic memory.

Fiebig and Lansner35 proposed a three-component model, each component with a different synaptic decay rate, which performs continual learning with self-generated rehearsals. Similar to our work, they study the effects of perturbations and show similarities to human data. However, that work does not provide an analysis of the model, and does not explore the dependence on parameters such as network size, synaptic decay time, and rehearsal rate. Comparison with our results is also hampered by the fact that synaptic decay in their model is an active process, dependent on memory arrivals among other factors. In general, none of the past models provide a quantitative analysis of the memory capacity and the memory lifetime statistics while enabling lifelong learning and avoiding global and critical-age catastrophic forgetting.

Catastrophic forgetting in memory models vs. machine learning models

There is a fundamental difference between the nature of catastrophic forgetting in long-term memory network models, which we address here, and what is called catastrophic forgetting in machine learning, especially deep learning55,56,57. In the deep learning literature, it is assumed that if all the data were available all the time, the model would be able to learn from them and successfully solve the relevant task. In other words, the model is in a regime below its capacity limits, and the problem is the online, incremental presentation of the data, which causes stability-plasticity issues. In contrast, when modeling long-term memory we assume that there is too much information to be stored: even if the data were available all the time and learning were not online, we would still encounter catastrophic forgetting due to crossing the model’s capacity limit.

Limitations and future work

In this paper we do not explicitly model the rehearsal process, i.e., how the system moves between activation states and visits different attractors. Possible mechanisms could be destabilization of attractors by adaptation36,58 or transitions induced by random initialization processes31. These mechanisms will generate a rate of rehearsals per memory that depends on its basin of attraction size, as in our model, but whether the rate is simply proportional to the basin’s size, as we assume, is yet to be tested.

Our model can be extended in a variety of ways, including more biologically plausible neuronal and synaptic integration. For example, detailed neuron models with rich morphologies and dendritic structure might influence the memory capacity and lifetime, in addition to the number of neurons. In addition, our model does not obey Dale’s law, and restricting neurons to be either excitatory or inhibitory, while also allowing different long term and short term plasticity dynamics for the different populations, will likely introduce richer memory properties.

Studying other distributions of synaptic decay characteristic times, such as bimodal distributions where one synapse population decays much more slowly than another, could also give rise to interesting memory properties. A relevant example is an inhibitory population with slow synaptic decay interacting with an excitatory population with faster synaptic decay. We limit ourselves in this work to the mechanism of synaptic decay, while other mechanisms, such as bounded, discrete synapses14,15, can also prevent global catastrophic forgetting. Due to the qualitative similarity of the behavior of such models to synaptic decay models (palimpsestic behavior with critical-age catastrophic forgetting), we expect that they would respond similarly to the introduction of stochastic rehearsals. Additionally, our framework allows for analyzing the effect of other types of perturbations, such as post-traumatic stress disorder amnesia59,60.

Conclusions

The stochastic nonlinear rehearsal mechanism proposed in our work is, to the best of our knowledge, the first large-scale memory model that gives rise to realistic, gracefully decaying forgetting probability curves, with exponential or power-law tails depending on the synaptic decay rate distribution. Our model’s capacity scales as a power law of the number of neurons, with a power that approaches unity for a large mean number of rehearsal events per synaptic decay time. The capacity interpolates between two extreme cases: for low rehearsal rates it approaches the pure forgetting case (logarithmic scaling of the capacity with the number of neurons), where learning is incremental and synapses decay; for high rehearsal rates it approaches linear scaling of capacity with N, as in the static case without synaptic decay, where memories are stored (not necessarily incrementally, since they can remain available indefinitely) until capacity is reached, after which all memories become irretrievable. Our model predicts that the onset of a perturbation to the circuit, in the form of synaptic noise, leads to non-monotonic memory deficits that affect most strongly memories encoded around the perturbation onset time, which have not yet had a chance to consolidate. The richness of the model behavior in normal and diseased conditions provides a theoretical framework for predictions and testing against empirical data on human memory.

Methods

Network model

As described in the Results, memories are modeled as sparse, uncorrelated N-dimensional activation patterns (N is the number of neurons), and the synaptic dynamics are governed by three processes: deterministic synaptic decay with rate \(1/\tau\), Hebbian learning of new memories, and rehearsal of old memories, which is the central novelty of our model. In continuous time, Eq. (1) becomes:

$$\begin{aligned} \frac{dJ_{ij}}{dt}=-\frac{1}{\tau } J_{ij}+\sum _{l}\xi _{i}^{l}\xi _{j}^{l}\delta (t-l)+b\sum _{k}\xi _{i}^{k}\xi _{j}^{k}R_{k}(t)=-\frac{1}{\tau }J_{ij}+\sum _{l}\xi _{i}^{l}\xi _{j}^{l}\left( \delta (t-l)+bR_{l}(t)\right) \end{aligned}$$
(20)

The rehearsals are modeled as a point process

$$\begin{aligned} R_{l}(t)=\sum _{\{t_{l}^{n}\}}\delta (t-t_{l}^{n}) \end{aligned}$$
(21)

Inserting the ansatz (3), we get that \(A_{l}\) obeys the differential equation:

$$\begin{aligned} \frac{dA_{l}}{dt}=-\frac{1}{\tau } A_{l}+bR_{l}(t) \end{aligned}$$
(22)

With \(A_{l}(t)=0\) for \(t<l\) and \(A_{l}(l)=1\).

The single neuron dynamics are binary, and given by:

$$\begin{aligned} \sigma _i(t+dt)=\Theta (h_i(t)-\theta ) \end{aligned}$$
(23)

where \(\sigma _i(t)\) is the state of neuron i at time t, \(\Theta (x)\) is the Heaviside step function, \(h_i(t)\) is the local field (total input received by neuron i at time t):

$$\begin{aligned} h_i(t)=\sum _j J_{ij} \sigma _j(t) \end{aligned}$$
(24)

and \(\theta\) is a threshold, set at every time step such that the total activation of the network is maintained at fN. This can be thought of as the effect of an inhibitory population regulating the total population activity (in practice, in full simulations, at each time step we choose the fN neurons with the largest local fields, set their states to one, and set all the others to zero).
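A minimal sketch of this update rule (the fixed-activity version used in the full simulations), assuming a NumPy connectivity matrix and a binary state vector:

```python
import numpy as np

def update_state(J, sigma, f):
    """One synchronous step of Eqs. (23)-(24) with the fixed-activity constraint:
    compute the local fields and activate exactly the f*N neurons with the largest fields
    (i.e., the threshold is set dynamically at the f*N-th largest field)."""
    N = len(sigma)
    h = J @ sigma                         # local fields, Eq. (24)
    k = int(round(f * N))
    new_sigma = np.zeros(N)
    new_sigma[np.argsort(h)[-k:]] = 1.0
    return new_sigma
```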

Mean field equations, basins of attraction

We would like to find the relation between memory stability, measured by the memory pattern’s basin of attraction size, and the efficacies of that memory and of all other memories in the system. First, we define two useful quantities: \(f_{+}^{l}\) is the probability of a neuron being active in the current state given that it is active in the memory pattern \(\xi ^{l}\), and \(f_{-}^{l}\) is the probability of a neuron being active in the current state given that it is not active in the memory pattern \(\xi ^{l}\). In other words, \(f_{+}^{l}\) is the fraction of the neurons active in memory state l that are active in the current state, and \(f_{-}^{l}\) is the fraction of the neurons inactive in memory state l that are active in the current state. We will omit the l dependence of \(f_{\pm }\) from now on. In terms of these quantities,

$$\begin{aligned} f=\frac{1}{N}\sum _{i=1}^{N}\sigma _{i}(t)=ff_{+}+(1-f)f_{-} \end{aligned}$$
(25)

For clarity, in this section, instead of the memory pattern definition used above (Eq. (2)), we define the patterns in an equivalent, more explicit way:

$$\begin{aligned} \xi _i^l= {\left\{ \begin{array}{ll} 1 \text { with prob. } f \\ 0 \text { with prob. } 1-f \end{array}\right. } \end{aligned}$$
(26)

and we normalize the connectivity matrix accordingly.

The overlap between memory pattern l and the system’s state \(\sigma\) is

$$\begin{aligned} \begin{aligned} M_{l}&\equiv \frac{1}{Nf(1-f)}\sum _{j=1}^{N}(\xi _{j}^{l}-f)\sigma _{j}=\frac{1}{Nf(1-f)}\left( \sum _{j=1}^{N}\xi _{j}^{l}\sigma _{j}(t)-f\sum _{j=1}^{N}\sigma _{j}(t)\right) \\&=\frac{1}{Nf(1-f)}(Nff_{+}-Nf^{2})= \frac{1}{(1-f)}(f_{+}-(ff_{+}+(1-f)f_{-})=f_{+}-f_{-} \end{aligned} \end{aligned}$$
(27)

Now, assume that the current state is close to the memory state \(\xi ^{l}\). The input to a neuron i that is active in memory pattern l (\(\xi _{i}^{l}=1\)) is:

$$\begin{aligned} \begin{aligned} h_{i}^{+}&=\frac{1}{Nf(1-f)}\sum _{j=1;\;j\ne i}^{N}\sum _{n=1}^{T}A_n(\xi _{i}^{n}-f)(\xi _{j}^{n}-f)\sigma _{j}(t)= \\&= A_l(1-f)M_{l}+\frac{1}{Nf(1-f)}\sum _{j=1;\;j\ne i}^{N}\sum _{n=1;\;n\ne l}^{T}A_n(\xi _{i}^{n}-f)(\xi _{j}^{n}-f)\sigma _{j}(t) \end{aligned} \end{aligned}$$
(28)

Averaging \(h_{i}^{+}\) over memory realizations gives the mean \(A_l(1-f)M_{l}\), and the variance:

$$\begin{aligned} \Delta ^{2}=\frac{1}{N^{2}f^{2}(1-f)^{2}}\sum _{j,j'=1;\;j,j'\ne i}^{N}\sum _{n,n'=1;\;n,n'\ne l}^{T}A_n A_{n'} \langle (\xi _{i}^{n}-f)(\xi _{i}^{n'}-f)(\xi _{j}^{n}-f)(\xi _{j'}^{n'}-f)\sigma _{j}\sigma _{j'}\rangle \end{aligned}$$
(29)

Now, because the system state is close to the memory state \(\xi ^{l}\), we can assume it is uncorrelated with all other memory states. Hence, there are contributions only from terms with \(j=j',\;n=n'\):

$$\begin{aligned} \Delta ^{2}=f\cdot \frac{1}{N}\sum _{n=1;\;n\ne l}^{T}A_n^{2} \end{aligned}$$
(30)

Applying the central limit theorem, we approximate \(h_{i}^{+}\) by a Gaussian variable with mean \(A_l(1-f)M_{l}\) and variance \(\Delta ^2\). Now, \(f_{+}\) is the probability of \(h_{i}^{+}\) being larger than \(\theta\), which is given by the Gaussian tail function \(H(x)=\frac{1}{\sqrt{2\pi }}\int _{x}^{\infty }\text {exp}(-0.5t^{2})dt\):

$$\begin{aligned} f_{+}=H\left( \frac{\theta -A_l(1-f)M_{l}}{\Delta }\right) \end{aligned}$$
(31)

In a similar way (same noise term, mean equals \(-A_lM_{l}f\)) we find for \(f_{-}\) (for \(f\ll 1\)) :

$$\begin{aligned} f_{-}=H\left( \frac{\theta +A_lM_{l}f}{\Delta }\right) \end{aligned}$$
(32)

Note that we did not fix the threshold \(\theta\); instead we demanded a constant population activation f. Equations (25), (27), (31) and (32) allow us to write an equation for the overlap dynamics:

$$\begin{aligned} M_l(t+1)=G(M_l(t), A_l(t)/\Delta (t)) \end{aligned}$$
(33)

where

$$\begin{aligned} \begin{array}{cc} G(M_l, A_l/\Delta )=H\left( \theta ^*(M_l, A_l/\Delta )-\frac{A_l(1-f)M_{l}}{\Delta }\right) - H\left( \theta ^*(M_l, A_l/\Delta )+\frac{A_l M_{l}f}{\Delta }\right) \\ = H\left( H^{-1}(f(1-M_l))-\frac{A_l M_{l}}{\Delta }\right) -f(1-M_l) \end{array} \end{aligned}$$
(34)

and

$$\begin{aligned} \theta ^*(M_l, A_l/\Delta )=H^{-1}(f(1-M_l))-f\cdot M_l\cdot A_l/\Delta \end{aligned}$$
(35)

Therefore, the equation for the overlap fixed points is:

$$\begin{aligned} M_l=H\left( H^{-1}(f(1-M_l))-\frac{A_l M_{l}}{\Delta }\right) -f(1-M_l) \end{aligned}$$
(36)

Now, we numerically find the fixed points of M at a given \(A/\Delta\) by running the dynamics described by Eq. (33). Typically (for large enough \(A/\Delta\)), there will be a stable fixed point at \(M=0\), a stable fixed point \(0<M_s \le 1\), and an unstable fixed point \(0<M_{us}<M_s\). We approximate the basin of attraction size as the distance \(M_s-M_{us}\). This way, we obtain the basin size as a function of \(A/\Delta\), \(F(A/\Delta )\). We check the validity of our approximations by simulating the full neural network model and measuring the basin of attraction sizes numerically, and find good agreement (SI), which is the basis for the good agreement in retention curves between the mean field simulations and the full network simulation (Fig. 3a,b). We define the critical efficacy \(A_c\) as the efficacy for which \(M_s=M_{us}\) (i.e., the non-zero overlap solution loses stability). This happens approximately when \(M_s=M_{us} \approx 0.85\). As one can see from Eq. (36), the fixed points depend only on the ratio \(A/\Delta\) and on f, and therefore \(A_c/\Delta\) is only a function of f, so we can write \(A_c=a(f) \Delta\). For \(f=0.01\) (the typical value we use throughout the manuscript) we find numerically that \(a(f)\approx 4.7\). An analytical approximation for a(f) is given in the SI.
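The following sketch implements this procedure for the overlap map of Eqs. (33)-(36), using SciPy's erfc to build the Gaussian tail function H and its inverse; the bisection for the unstable fixed point and the 0.85 cutoff follow the description above:

```python
import numpy as np
from scipy.special import erfc, erfcinv

def H(x):
    """Gaussian tail function used in Eqs. (31)-(36)."""
    return 0.5 * erfc(x / np.sqrt(2.0))

def Hinv(y):
    return np.sqrt(2.0) * erfcinv(2.0 * y)

def G(M, A_over_Delta, f):
    """Overlap map of Eqs. (33)-(34) (small-f form of Eq. (36))."""
    return H(Hinv(f * (1.0 - M)) - A_over_Delta * M) - f * (1.0 - M)

def overlap_fixed_point(A_over_Delta, f=0.01, n_iter=500, M0=0.999):
    M = M0
    for _ in range(n_iter):
        M = G(M, A_over_Delta, f)
    return M

def basin_size(A_over_Delta, f=0.01):
    """Approximate basin size F ~ M_s - M_us: iterate the map from M close to 1 to find
    the stable retrieval fixed point M_s, then bisect for the basin boundary M_us."""
    M_s = overlap_fixed_point(A_over_Delta, f)
    if M_s < 0.85:           # retrieval solution lost (A below the critical efficacy)
        return 0.0
    lo, hi = 1e-4, M_s
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if overlap_fixed_point(A_over_Delta, f, M0=mid) > 0.5 * M_s:
            hi = mid         # mid converges to M_s, so it lies inside the basin
        else:
            lo = mid
    return M_s - hi

# For f = 0.01 the retrieval solution disappears around A/Delta ~ 4.7, i.e. a(f) ~ 4.7.
print(basin_size(6.0), basin_size(4.0))
```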

Numerical simulations

In our simulations, we first numerically solve the coupled stochastic differential equations for the efficacies (Eq. (7)). In theory, our model considers an infinite number of memories; in practice, we solve the equations for a finite but large number of memory efficacies, typically \(200\tau - 1000\tau\), chosen such that \(A_c\) saturates to its steady-state value. We measure time in units of the lag between the introduction of two consecutive memories (assumed constant). At every integration time step dt (small compared to all characteristic timescales of the system, typically \(dt=0.05/\lambda\)), a rehearsal event may occur for each memory with \(A_l\ge A_c\). We generate a uniform random number between 0 and 1 and compare it to \(\lambda \cdot F(A_l(t)/\Delta (t)) \cdot dt\); a rehearsal event of memory l happens at time t if the uniform random number is smaller than this value. This approximates the statistics of a non-homogeneous Poisson process. By averaging over many such realizations (typically 500), we calculate the efficacy histogram, the capacity (counting how many efficacies are above \(A_c\)), and the retrieval probabilities (by checking the probability that a memory introduced l time units in the past is retrievable now). These calculations are referred to as "mean field simulations"; they do not include generating random memories or building the connectivity matrix.
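
A minimal sketch of one realization of this loop is given below. It reuses the basin_size helper from the previous sketch (tabulated once and interpolated for speed) and assumes, consistent with Eqs. (37) and (43) below, that each efficacy decays exponentially with time constant \(\tau\) and jumps by b at every rehearsal; all parameter values are scaled-down illustrations rather than the values used in the paper.

```python
import numpy as np

# Illustrative, scaled-down parameters (not the values used in the paper)
N, f, tau, b, lam, a_f = 10_000, 0.01, 50.0, 1.0, 20.0, 4.7
T = 200                                   # memories, introduced one per unit time
dt = 0.05 / lam                           # integration step, small vs. all timescales
rng = np.random.default_rng(0)

# tabulate the basin-size function F(A / Delta) once and interpolate
grid = np.linspace(0.0, 15.0, 151)
F_tab = np.array([basin_size(a, f) for a in grid])
F = lambda x: np.interp(x, grid, F_tab)   # values above the grid are clamped to F(15)

A = np.zeros(T)                           # memory efficacies
next_mem, t = 0, 0.0
while t < T:
    if next_mem < T and t >= next_mem:
        A[next_mem] = 1.0                 # a new memory enters with unit efficacy
        next_mem += 1
    Delta = np.sqrt(f * np.sum(A ** 2) / N)        # local-field noise, Eq. (30)
    A_c = a_f * Delta                              # critical efficacy A_c = a(f) * Delta
    rate = lam * F(A / Delta) * (A >= A_c)         # rehearsal rate per memory
    A = A + b * (rng.random(T) < rate * dt)        # thinned non-homogeneous Poisson rehearsals
    A -= (A / tau) * dt                            # exponential synaptic decay
    t += dt

capacity = np.count_nonzero(A >= A_c)     # memories retrievable at the end of this realization
```

Averaging many such realizations (different random seeds) then yields the efficacy histogram, the capacity, and the retrieval probabilities described above.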

Full network simulation

When simulating the full network model, after generating the efficacies we randomly generate the memory patterns (binary vectors of dimension N) to be stored and build the connectivity matrix according to Eq. (3). To measure retrievability, we initialize the network state at a memory pattern and let the binary neuron dynamics run until they settle into a steady state; we then measure the overlap between the pattern and the steady-state activity. We say a memory is retrievable if the overlap is \(\ge 0.85\). To measure the basin of attraction sizes of the memory patterns, we generate an initial state by randomly flipping the states of units in the memory pattern (conserving the total activation fN) and run the dynamics until convergence; we then measure the overlap between the final state and the memory pattern. We keep increasing the number of flipped units until the final state has an overlap smaller than 0.85 with the memory state. We define the normalized basin size as the maximal number of flips still allowing for a large overlap, divided by 2fN, the maximal number of flips. Results are shown in the SI.
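
A corresponding full-network sketch is given below. The update rule used here, clamping the population activity to fN by activating the fN units with the largest local fields, is one hypothetical way to implement the fixed-activity constraint of the mean-field analysis; the network size, pattern count and equal efficacies are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(1)
N, f, T = 2000, 0.01, 50                # network size, sparseness, stored memories (illustrative)
k = int(f * N)                          # number of active units per pattern
A = np.ones(T)                          # efficacies (all equal here, for illustration only)

# sparse binary patterns with exactly fN active units, covariance-rule connectivity (cf. Eq. (3))
xi = np.zeros((T, N))
for mu in range(T):
    xi[mu, rng.choice(N, k, replace=False)] = 1.0
J = ((xi - f).T * A) @ (xi - f) / (N * f * (1.0 - f))
np.fill_diagonal(J, 0.0)

def overlap(sigma, pattern):
    return (pattern - f) @ sigma / (N * f * (1.0 - f))

def run(sigma, n_steps=100):
    # binary dynamics with the population activity clamped to fN active units per step
    for _ in range(n_steps):
        h = J @ sigma
        new = np.zeros(N)
        new[np.argsort(h)[-k:]] = 1.0
        if np.array_equal(new, sigma):
            break
        sigma = new
    return sigma

def flip_units(pattern, n_flip):
    # turn n_flip active units off and n_flip inactive units on (activation fN is conserved)
    state = pattern.copy()
    on, off = np.flatnonzero(pattern > 0.5), np.flatnonzero(pattern < 0.5)
    state[rng.choice(on, n_flip, replace=False)] = 0.0
    state[rng.choice(off, n_flip, replace=False)] = 1.0
    return state

# retrievability: start at the pattern itself and check the overlap of the final state
retrievable = overlap(run(xi[0].copy()), xi[0]) >= 0.85
# basin size: increase n_flip until the final overlap drops below 0.85
```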

Noisy synaptic dynamics

The synaptic dynamics in the presence of Gaussian noise is presented in Eq. (13). It is straightforward to show that a solution to the equation can be written as:

$$\begin{aligned} J_{ij}(t)=\sum _{l}\xi _{i}^{l}\xi _{j}^{l}A_{l}(t)+\intop _{0}^{t}e^{-(t-t')/\tau }\chi _{ij}(t')dt' \end{aligned}$$
(37)

with \(A_{l}(t)\) obeying Eq. (7) as before. The nonlinear effect of the noise arises through the self-consistent requirement that the rehearsal rate of memory l is proportional to the size of this memory's basin of attraction, which depends on \(A_l\) and on \(\Delta\). We calculate \(\Delta\) in the presence of the injected noise (Eq. (14)) in the same way as we calculated \(\Delta\) without noise above. Here there is a nontrivial mixed term involving the average \(\langle A_l(t) \chi _{ij}(t) \rangle\), which we found numerically to be negligible in the parameter range of interest.
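
For orientation, if one assumes the injected noise is white, \(\langle \chi _{ij}(t)\chi _{ij}(t')\rangle =\sigma _{\chi }^{2}\delta (t-t')\) (an illustrative assumption; the precise statistics are those specified in Eq. (13)), the noise term in Eq. (37) has the variance of a low-pass filtered white noise,

$$\begin{aligned} \left\langle \left( \intop _{0}^{t}e^{-(t-t')/\tau }\chi _{ij}(t')dt'\right) ^{2}\right\rangle =\sigma _{\chi }^{2}\intop _{0}^{t}e^{-2(t-t')/\tau }dt'=\frac{\sigma _{\chi }^{2}\tau }{2}\left( 1-e^{-2t/\tau }\right) \end{aligned}$$

which saturates at \(\sigma _{\chi }^{2}\tau /2\) for \(t\gg \tau\) and enters \(\Delta\) as an additional, memory-independent contribution to the local-field variance.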

Synaptic dilution

The random silencing is done by multiplying the connectivity matrix \(J_{ij}\) by a binary matrix:

$$\begin{aligned} J_{ij} \rightarrow C_{ij} J_{ij}, \;\; C_{ij}={\left\{ \begin{array}{ll} 1 \text { with prob. } 1-p \\ 0 \text { with prob. } p \end{array}\right. } \end{aligned}$$
(38)

We would like to calculate the effect of the dilution on the dynamics of the memory efficacies, and for that we need to find its effect on \(A_l(t)\) and on \(\Delta (t)\). Let us calculate the local field near memory l as before:

$$\begin{aligned} {h^l_{i}}^{+}=\sum _{j=1;\;j\ne i}^{N} C_{ij} J_{ij} \sigma _{j}(t) \end{aligned}$$
(39)

Taking the mean over memories and over \(C_{ij}\) realizations (denoted by []) we get:

$$\begin{aligned} \left[ {h^l_{i}}^{+} \right] =(1-p) \left\langle \sum _{j=1;\;j\ne i}^{N} J_{ij} \sigma _{j}(t) \right\rangle = (1-p) (1-f) M_l A_l \end{aligned}$$
(40)

Here 〈 〉 denotes an average over memory realizations. As one can see, the efficacies are scaled by a factor of \(1-p\). We assume here that correlations between \(A_l\) and \(C_{ij}\) can be neglected. The second moment is:

$$\begin{aligned} \left[ \left( {h^l_{i}}^{+} \right) ^2 \right] = \sum _{j,j'\ne i} \left[ J_{ij} J_{ij'} C_{ij} C_{ij'} \sigma _j \sigma _{j'} \right] = \sum _{j,j'\ne i} \left( (1-p) \delta _{jj'} +(1-p)^2(1-\delta _{jj'})\right) \left\langle J_{ij} J_{ij'} \sigma _j \sigma _{j'} \right\rangle \end{aligned}$$
(41)

There are two contributions arising from the \(C_{ij}\) randomness. Now, when calculating the local field variance, the term proportional to \((1-p)^2\) is exactly canceled by the squared mean, and we are left with:

$$\begin{aligned} \left[ \left( {h^l_{i}}^{+} - \left[ {h^l_{i}}^{+} \right] \right) ^2 \right] =(1-p)\sum _{j\ne i} \left\langle J_{ij}^2 \sigma _j^2 \right\rangle = (1-p) \tilde{\Delta }^2 \end{aligned}$$
(42)

where \(\tilde{\Delta }^2\) is the local field variance without dilution.

Now, we obtain the modified efficacy dynamics by using these expressions for the signal and the noise to calculate the basin of attraction sizes as before.
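
Combining Eqs. (40) and (42), the signal-to-noise ratio of the local field under dilution is

$$\begin{aligned} \frac{\left[ {h^l_{i}}^{+}\right] }{\sqrt{\left[ \left( {h^l_{i}}^{+}-\left[ {h^l_{i}}^{+}\right] \right) ^{2}\right] }}=\frac{(1-p)(1-f)M_{l}A_{l}}{\sqrt{1-p}\,\tilde{\Delta }}=\sqrt{1-p}\,\frac{(1-f)M_{l}A_{l}}{\tilde{\Delta }} \end{aligned}$$

so random silencing with probability p is equivalent to rescaling the effective ratio \(A_l/\Delta\) by a factor \(\sqrt{1-p}\) in the basin-size calculation.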

Non-uniform characteristic decay time

Assume that synapse \(J_{ij}\) has a decay rate \(\epsilon _{ij}\) and that all memories have unit initial efficacy. Memory l appears for the first time at time l. The learning dynamics is:

$$\begin{aligned} J_{ij}(t+\Delta t)=(1-\epsilon _{ij}\Delta t)J_{ij}+\sum _{l}\xi _{i}^{l}\xi _{j}^{l}\delta _{t,l}+b\sum _{k,n_{k}}\xi _{i}^{k}\xi _{j}^{k}\delta _{t,t_{k}^{n_{k}}} \end{aligned}$$
(43)

In continuous time,

$$\begin{aligned} \frac{dJ_{ij}}{dt}= & {} -\epsilon _{ij}J_{ij}+\sum _{l}\xi _{i}^{l}\xi _{j}^{l}\delta (t-l)+\sum _{k}\xi _{i}^{k}\xi _{j}^{k}R_{k}(t)=-\epsilon _{ij}J_{ij}+\sum _{l}\xi _{i}^{l}\xi _{j}^{l}\left( \delta (t-l)+R_{l}(t)\right) \end{aligned}$$
(44)
$$\begin{aligned} R_{k}(t)= & {} b\sum _{\{t_{k}^{n}\}}\delta (t-t_{k}^{n}) \end{aligned}$$
(45)

Assuming all synapses start at zero, the solution can be written as:

$$\begin{aligned} J_{ij}(t)=\sum _{l}\xi _{i}^{l}\xi _{j}^{l}\intop _{0}^{t}dt'e^{-\epsilon _{ij}(t-t')}\left( \delta (t'-l)+R_{l}(t')\right) =\sum _{l}\xi _{i}^{l}\xi _{j}^{l}A_{l}^{ij}(t) \end{aligned}$$
(46)

Let us define efficacies

$$\begin{aligned} A_{l}^{ij}(t)=\Theta (t-l)\intop _{0}^{t}dt'e^{-\epsilon _{ij}(t-t')}\left( A_{0}\delta (t'-l)+R_{l}(t')\right) \end{aligned}$$
(47)

\(A_{l}^{ij}\) obeys the differential equation:

$$\begin{aligned} \frac{dA_{l}^{ij}}{dt}=-\epsilon _{ij}A_{l}^{ij}+R_{l}(t) \end{aligned}$$
(48)

With \(A_{l}^{ij}(t)=0\) for \(t<l\) and \(A_{l}^{ij}(l)=1\).
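
As a concrete illustration, Eq. (48) with these initial conditions can be integrated per synapse with a simple Euler scheme. The sketch below uses hypothetical argument names and assumes the rehearsal times of memory l are supplied from elsewhere (for example, generated by the thinning procedure described in the simulations section):

```python
import numpy as np

def efficacy_trace(eps_ij, l, rehearsal_times, b=1.0, t_end=1000.0, dt=0.01):
    """Euler integration of Eq. (48) for a single synapse:
    dA/dt = -eps_ij * A + R_l(t), with A(t) = 0 for t < l and A(l) = 1."""
    times = np.arange(0.0, t_end, dt)
    A = np.zeros_like(times)
    jumps = set(np.round(np.asarray(rehearsal_times) / dt).astype(int))
    intro = int(np.searchsorted(times, l))      # first time index with t >= l
    if intro < len(times):
        A[intro] = 1.0                          # unit efficacy at first appearance
    for n in range(intro + 1, len(times)):
        A[n] = A[n - 1] - eps_ij * A[n - 1] * dt
        if n in jumps:
            A[n] += b                           # each rehearsal event adds b to the efficacy
    return times, A
```

Averaging such traces over decay rates drawn from \(\rho (\epsilon )\) gives the mean efficacy \(A_{k}(t)\) of Eq. (49).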

Given that the decay rates have a probability density \(\rho (\epsilon )\), let us define:

$$\begin{aligned} A_{k}(t)=\left\langle A_{k}^{ij}(t)\right\rangle _{\epsilon _{ij}}=\intop d\epsilon \rho (\epsilon )\left( \Theta (t-k)\intop _{0}^{t}dt'e^{-\epsilon (t-t')}\left( A_{0}\delta (t'-k)+R_{k}(t')\right) \right) \end{aligned}$$
(49)

and

$$\begin{aligned} A_{k}^{ij}(t)= & {} A_{k}(t)+\delta A_{k}^{ij}(t) \end{aligned}$$
(50)
$$\begin{aligned} J_{ij}(t)= & {} \sum _{l}\xi _{i}^{l}\xi _{j}^{l}A_{l}(t)+\sum _{l}\xi _{i}^{l}\xi _{j}^{l}\delta A_{l}^{ij}(t) \end{aligned}$$
(51)

Including normalization and sparseness considerations,

$$\begin{aligned} J_{ij}(t)=\frac{1}{Nf(1-f)}\left( \sum _{l}(\xi _{i}^{l}-f)(\xi _{j}^{l}-f)A_{l}(t)+\sum _{l}(\xi _{i}^{l}-f)(\xi _{j}^{l}-f)\delta A_{l}^{ij}(t)\right) \end{aligned}$$
(52)

Now let us calculate the mean local field on neuron i in a state near memory state k, and assume \(\xi _{i}^{k}=1\):

$$\begin{aligned} \begin{array}{c} h_{i}^{k}=\sum _{j\ne i}J_{ij}\sigma _{j}=(1-f)M_{k}A_{k}+\frac{1}{Nf}\sum _{j\ne i}(\xi _{j}^{k}-f)\delta A_{k}^{ij}(t)\sigma _{j}+ \\ +\frac{1}{Nf(1-f)}\left( \sum _{l\ne k,\;j\ne i}(\xi _{i}^{l}-f)(\xi _{j}^{l}-f)\sigma _{j}A_{l}(t)+\sum _{l\ne k,\;j\ne i}(\xi _{i}^{l}-f)(\xi _{j}^{l}-f)\sigma _{j}\delta A_{l}^{ij}(t)\right) \end{array} \end{aligned}$$
(53)

Taking an average over the memory realizations and the decay rates, we get:

$$\begin{aligned} \left\langle h_{i}^{k}\right\rangle _{\xi ,\epsilon }=(1-f)M_{k}A_{k} \end{aligned}$$
(54)

And the variance:

$$\begin{aligned} \left\langle \left( \delta h_{i}^{k}\right) ^{2}\right\rangle _{\xi ,\epsilon }=\frac{f}{N}\sum _{l\ne k}\left\langle (A_{l}^{ij})^{2}\right\rangle _{\epsilon }+\left( \left\langle (A_{k}^{ij})^{2}\right\rangle _{\epsilon }-A_{k}^{2}\right) \frac{1}{N^{2}f^{2}}\sum _{j\ne i}\left\langle (\xi _{j}^{k}-f)^{2}\sigma _{j}^{2}\right\rangle _{\xi } \end{aligned}$$
(55)

The second term does not include a summation over all memories, and it is therefore negligible for large N (the first term is O(1) while the second is \(O(N^{-2})\)). This leads to Eq. (19).

Power law \(\tau\) distribution

For each synapse we generated synaptic decay characteristic times from a power law (Pareto) distribution with density:

$$\begin{aligned} P(\tau _0)= {\left\{ \begin{array}{ll} \alpha \cdot \tau _0 ^{-(\alpha +1)} \;\; &{} \tau _0\ge 1 \\ 0 \;\; &{} \tau _0< 1 \end{array}\right. } \end{aligned}$$
(56)

In this distribution, the mean diverges for \(\alpha <1\). We scaled the resulting \(\tau _0^{ij}\) values by a uniform factor: \(\tau ^{ij}=2\omega \cdot N \cdot \tau _0^{ij}\). We fixed the value of \(\omega\), chose the average number of rehearsals per mean decay time, \(R_0\), and used it to set the \(\lambda\) parameter by dividing \(R_0\) by the empirical average of the generated decay times. Next, we solved the stochastic differential Eq. (48) numerically. The rehearsals are generated with time-dependent rates proportional to the basin of attraction size, now computed as a function of the mean and variance of the memory efficacies over all synaptic timescales.
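
A minimal sketch of this sampling and scaling step is given below, using inverse-transform sampling of the density in Eq. (56); the numerical values of \(\alpha\), \(\omega\), N and \(R_0\) are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, omega, N, R0 = 0.8, 0.05, 10_000, 10.0   # illustrative values
n_synapses = 100_000

# Pareto density alpha * tau0^-(alpha+1) on tau0 >= 1, via inverse-transform sampling
tau0 = (1.0 - rng.random(n_synapses)) ** (-1.0 / alpha)
tau = 2.0 * omega * N * tau0                    # uniform rescaling of the decay times

# lambda is set from the target number of rehearsals per (empirical) mean decay time
lam = R0 / tau.mean()
```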