Abstract
Animals need to adjust their inferences according to the context they are in. This is required for the multi-context blind source separation (BSS) task, where an agent needs to infer hidden sources from their context-dependent mixtures. The agent is expected to invert this mixing process for all contexts. Here, we show that a neural network that implements the error-gated Hebbian rule (EGHR) with sufficiently redundant sensory inputs can successfully learn this task. After training, the network can perform the multi-context BSS without further updating synapses, by retaining memories of all experienced contexts. This demonstrates an attractive use of the EGHR for dimensionality reduction by extracting low-dimensional sources across contexts. Finally, if there is a common feature shared across contexts, the EGHR can extract it and generalize the task even to inexperienced contexts. The results highlight the utility of the EGHR as a model for perceptual adaptation in animals.
Introduction
Inference of the causes of a sensory input is one of the most essential abilities of animals^{1,2,3} — a famous example is the cocktail party effect, i.e., the ability of a partygoer to distinguish a particular speaker’s voice against a background of crowd noise^{4,5}. This ability has been modelled by blind source separation (BSS) algorithms^{6,7}, by considering that several hidden sources (speakers) independently generate signal trains (voices), while an agent receives mixtures of signals as sensory inputs. A neural network, possibly inside the brain, can invert this mixing process and separate these sensory inputs into hidden sources using a BSS algorithm. Independent component analysis (ICA) achieves BSS by minimizing the dependency between output units^{8,9}. Numerous ICA algorithms have been proposed for both rate-coding^{10,11,12,13} and spiking neural networks^{14}.
Previously, we developed a biologically plausible ICA algorithm, referred to as the error-gated Hebbian rule (EGHR)^{15}. This learning rule can robustly estimate the hidden sources that generate sensory data without supervised signals. Importantly, it can reliably perform ICA in undercomplete conditions^{16}, where the number of inputs is greater than that of outputs. A simple extension of the EGHR can separate sources while removing noise within a single-layer neural network^{17}, by simultaneously performing principal component analysis (PCA)^{18,19} and ICA. The EGHR is expressed as a product of pre- and postsynaptic neuronal activities and a third modulatory factor, each of which can be computed locally (i.e., a local learning rule^{16}). In this sense, the EGHR is more biologically plausible than non-local engineering ICA algorithms^{10,11,12}. Because of these desirable properties, the EGHR is considered a candidate mechanism for neurobiological BSS^{20,21,22}, as well as a next-generation neuromorphic implementation^{23,24} for energy-efficient BSS.
The optimal inference and behavior often depend on context. Indeed, our perception and decisions reflect this context dependency, i.e., cognitive flexibility^{25}. Studies in primates have suggested that a contextual-cue-dependent dynamic process in the prefrontal cortex controls this behavior^{26,27,28}, and several computational studies have modeled it^{29,30,31,32}. Likewise, the context dependence of auditory perceptual inference has been modeled^{33}. In addition to experimental evidence, recent progress in machine learning has also addressed this multi-context problem, in an attempt to create artificial general intelligence^{34,35,36}. By implementing (task-specific) synaptic consolidation, a neural network can learn a new environment while retaining past memories, by protecting synaptic strengths that are important for memorizing past environments. These findings indicate the importance of multi-context processes for cognitive flexibility.
Unlike the above-mentioned tasks, BSS across several different contexts poses a particular difficulty. Conventional ICA algorithms assume the same number of input and output neurons^{10,11,12,37,38} and cannot straightforwardly perform multi-context BSS. After learning, the synaptic strength matrix of these algorithms converges to the inverse of the mixing matrix of the current context (or its permutation or sign-flip), which is generally different from that in the previous context. Hence, when the network subsequently encounters a previously learnt context, it needs to relearn the synaptic strengths from the very beginning. More involved engineering ICA algorithms, such as the nonholonomic ICA algorithm^{39} and the ICA mixture algorithm^{40,41}, are expected to perform multi-context BSS. However, a biological implementation of these non-local learning rules is unclear. Further, as we show below, they cannot learn to compress redundant inputs by extracting the underlying low-dimensional hidden sources.
Here we show that the EGHR can perform multi-context BSS when a neural network receives redundant sensory inputs. It can retain memories of previously experienced contexts and perform BSS immediately after switching back to a previously learnt context. This suggests that the EGHR can also be used as a powerful data compression method^{42}, since it extracts low-dimensional hidden sources across contexts, even though the data dimensionality grows in proportion to the number of contexts. Moreover, when a common feature is shared across contexts, the EGHR can extract it to perform BSS, while filtering out features that vary among contexts. Once learning is achieved, the network can perform BSS even in an inexperienced context, indicating a degree of generalization capability or transfer learning. We demonstrate that the EGHR with sufficiently redundant sensory inputs learns to distinguish birdsongs from their superpositions and retains this ability even after learning different sets of birdsongs. The rule finds a general representation that is capable of separating an unheard set of birdsongs. Finally, possible neurobiological implementations of the EGHR are discussed.
Results
Errorgated Hebbian rule (EGHR)
In a BSS task, several hidden sources (s) independently generate signal traces, while our agent receives their mixtures as sensory inputs (x). In this study, we considered a multi-context BSS task, in which a set of contexts with different mixing weights was used. Sensory inputs were randomly generated from one of these contexts for a period of time, with k (=1, …, C) being an index of context. Our experimental setup consisted of an N_{s}-dimensional vector of hidden sources \(s\equiv {({s}_{1},\ldots ,{s}_{{N}_{s}})}^{T}\) whose elements s_{i} independently follow a non-Gaussian distribution p(s_{i}), an N_{x}-dimensional vector of sensory inputs \(x\equiv {({x}_{1},\ldots ,{x}_{{N}_{x}})}^{T}\), and an N_{u}-dimensional vector of neural outputs \(u\equiv {({u}_{1},\ldots ,{u}_{{N}_{u}})}^{T}\) (Fig. 1). The sensory inputs in the kth condition were generated by transforming the hidden sources, i.e., the so-called generative process:

x = A^{(k)}s
Here A^{(k)} is the N_{x} × N_{s} mixing matrix for the kth context that defines the magnitude of inputs when each source generates a signal. To ensure that each A^{(k)} represents a different context and that each context has an ICA solution, the column vectors of the block matrix (A^{(1)}, A^{(2)}, …, A^{(C)}) are supposed to be linearly independent of each other. We designed the task such that these contexts appear sequentially or randomly. The neural outputs were expressed as sums of inputs weighted by an N_{u} × N_{x} synaptic strength matrix W, and calculated by:

u = Wx
It is well known that when a presynaptic neuron (x_{j}) and a postsynaptic neuron (u_{i}) fire together, Hebbian plasticity occurs, and the synaptic connection from x_{j} to u_{i}, denoted by W_{ij}, is strengthened^{43,44}. Because this constitutes associative learning, correlations between \({x}_{1},\ldots ,{x}_{{N}_{x}}\) and u_{i} are usually enhanced; thereby, correlations among neural outputs also increase. This process is distinct from the separation of signals (i.e., BSS), for which each neural output is expected to encode a specific source. To separate signals, we introduced a global scalar factor (i.e., a third factor) given by the sum of nonlinearly-transformed output units^{15}:

E(u) = −log p_{0}(u) = −Σ_{i} log p_{0}(u_{i})
Here p_{0}(u) is the prior distribution that the agent expects the hidden sources to follow; e.g., when p_{0}(u) is a Laplace distribution of mean zero and unit variance, then \(E(u)=\sqrt{2}(|{u}_{1}|+\ldots +|{u}_{{N}_{u}}|)+{\rm{const}}\). We supposed that this global factor modulates Hebbian plasticity. Recent experimental studies have reported that synaptic plasticity can be modulated by various neuromodulators^{45,46,47,48,49}, GABAergic inputs^{50,51}, or glial factors^{52}. Possible neurobiological implementations of the global factor are further discussed in the Discussion section. Overall, the synaptic strength matrix W is updated by the EGHR, with learning rate η, in the following way:

\(\dot{W}=\eta \langle ({E}_{0}-E(u))\,g(u){x}^{T}\rangle \)
where \(\dot{W}\) is the time derivative of W, 〈·〉 is the expectation over the input distribution, and g(u) ≡ dE(u)/du is a nonlinear function usually associated with a nonlinear activation function. A constant E_{0} scales the neural outputs; the output scale becomes equivalent to the source scale when \({E}_{0}=\langle -\,\mathrm{log}\,{p}_{0}(s)\rangle +1\). In short, the EGHR constitutes a Hebbian learning rule when the global factor is smaller than the threshold (E(u) < E_{0}); otherwise (E(u) > E_{0}), it becomes an anti-Hebbian rule. This mechanism makes output neurons independent of each other. The detailed derivation and theoretical proofs of the EGHR have been described in our previous reports^{15,17}. Briefly, the EGHR is derived as the gradient descent of the cost function L ≡ 〈(E(u) − E_{0})^{2}〉/2. This is the cost for having dependency among outputs, designed to measure the nonlinear correlation among the elements of u. Hence, the minimization of L makes the elements of u independent of each other. The formal relationship between the EGHR and ICA algorithms based on the infomax principle is described in ref.^{17}.
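To make the update concrete, here is a minimal numerical sketch of the rule (our illustration, not code from the original study). It assumes a unit-variance Laplace prior p_0(u), for which E(u) = √2(|u_1| + … + |u_{N_u}|) + const and g(u) = √2 sign(u); the function name `eghr_train`, the learning rate, and the sample count are arbitrary choices.

```python
import numpy as np

def eghr_train(x, n_out, eta=1e-3, seed=0):
    """Train a single-layer network with the error-gated Hebbian rule (EGHR),
    assuming a unit-variance Laplace prior p0(u)."""
    rng = np.random.default_rng(seed)
    n_in = x.shape[1]
    W = 0.1 * rng.standard_normal((n_out, n_in))
    const = 0.5 * np.log(2.0) * n_out               # -log of the Laplace normalizer
    E0 = n_out * (1.0 + 0.5 * np.log(2.0)) + 1.0    # E0 = <-log p0(s)> + 1
    for xt in x:
        u = W @ xt                                   # neural outputs u = Wx
        E = np.sqrt(2.0) * np.abs(u).sum() + const   # global (third) factor E(u)
        g = np.sqrt(2.0) * np.sign(u)                # g(u) = dE/du
        # Hebbian update when E < E0, anti-Hebbian when E > E0 (the "error gate")
        W += eta * (E0 - E) * np.outer(g, xt)
    return W

# Toy demo: two unit-variance Laplace sources mixed into six redundant inputs
rng = np.random.default_rng(1)
s = rng.laplace(scale=1.0 / np.sqrt(2.0), size=(50000, 2))
A = rng.standard_normal((6, 2))
x = s @ A.T
W = eghr_train(x, n_out=2)
```

On this toy problem the gate drives the output scale toward the source scale while decorrelating the outputs; a tuned learning rate and longer training would be needed for a fully clean separation.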
Memory capacity of the EGHR
First, we analytically derive the memory capacity of a neural network established by the EGHR. As the number of contexts increases, larger input dimensions are needed to retain information pertaining to past contexts in the neural network. For simplicity, we supposed that N_{u} = N_{s}. Because the network represents a linear inverse model of the generative processes, the goal of the multi-context BSS is generally given by:

WA^{(k)} = Ω^{(k)} for all k = 1, …, C
where Ω^{(k)} is an N_{u} × N_{s} matrix equivalent to the identity matrix, up to permutations and sign-flips. This is because the success of BSS is defined by a one-by-one mapping from sources to outputs. Thus, the multi-context BSS is achievable if and only if the set of mixing matrices (A^{(1)}, …, A^{(C)}) expresses a full-column-rank matrix (see Methods for the derivation). Hence, we found that the following conditions are necessary to achieve the multi-context BSS for a generic (A^{(1)}, …, A^{(C)}): (1) the input dimension needs to be equal to or larger than the number of contexts times the number of sources, N_{x} ≥ CN_{s}; and (2) the output dimension needs to be equal to or larger than the source dimension, N_{u} ≥ N_{s}. Note that the network learns a representation that compresses the sensory inputs, because the input dimension considered here is much greater than the output dimension.
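The existence claim behind these conditions can be checked directly with linear algebra (our illustration; it solves for W in closed form rather than running the EGHR). With N_s = 2 sources, N_x = 6 inputs, and C = 3 contexts, condition (1) holds, the concatenated mixing matrix generically has full column rank, and a single W inverts every context simultaneously (taking Ω^{(k)} = I for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
Ns, Nx, C = 2, 6, 3                               # sources, inputs, contexts
A = [rng.standard_normal((Nx, Ns)) for _ in range(C)]

# Condition (1): Nx >= C * Ns, so the block matrix (A1, ..., AC) can have
# full column rank; for random matrices this holds generically.
A_block = np.hstack(A)                            # Nx x (C * Ns)
assert np.linalg.matrix_rank(A_block) == C * Ns

# One weight matrix that inverts every context: solve W @ A_block = (I, ..., I).
target = np.hstack([np.eye(Ns)] * C)
W = target @ np.linalg.pinv(A_block)
for k in range(C):
    assert np.allclose(W @ A[k], np.eye(Ns), atol=1e-6)
```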
The memory capacity of the EGHR was empirically confirmed by numerical simulations (Fig. 2). Here, we supposed that two contexts generated inputs alternately. In each context, six-dimensional inputs were generated from two-dimensional sources with different mixing weights, as denoted by A^{(1)} and A^{(2)} (see top and middle rows in Fig. 2A). A neural network consisting of six input and two output neurons received the inputs and changed its synaptic strengths through the EGHR (i.e., training). After training, each neural output came to selectively respond to (i.e., encode) one of the two sources (bottom row in Fig. 2A). Thus, the network achieved separation of the sensory inputs into their sources without being taught the mixing weights (i.e., BSS).
Crucially, the neural network was able to retain the information learnt for all past contexts if provided with sufficiently redundant sensory inputs. This property is illustrated by the trajectories of the BSS error and EGHR cost function in Fig. 2B. We defined the BSS error for context k as the ratio of the second-largest to the largest absolute value, averaged over every row and column of matrix K^{(k)} ≡ WA^{(k)} (see Methods for the mathematical definition of the BSS error). Here K^{(k)} expresses the mapping from sources to outputs, which is equivalent to the covariance matrix between hidden sources and neural outputs, K^{(k)} = Cov(u, s). This definition of the BSS error ensures that the value is zero if and only if one source maps onto one output, and vice versa; otherwise, the value is positive and less than one. Moreover, the cost function of the EGHR was defined as the expectation of the square of the global factor: L = 〈(E_{0} − E(u))^{2}〉/2. Context 1 (red in Fig. 2A) was provided in the first session. Since synaptic strengths started from a random initial state, the BSS error at the beginning of the first session was large; then, the network learned an optimal set of synaptic strengths, and the error became zero, which was achieved by minimizing the cost function through gradient-descent updates. When context 2 (blue in Fig. 2A) was provided for the first time in the second session, the EGHR cost function transiently increased, as it needed to learn the new mixing matrix. An important point was revealed at the first step of the third session, in which context 1 was provided again. The BSS error was significantly smaller than that in the first session and close to zero from the beginning of this session, indicating that the network retained synaptic strengths that were optimized for context 1 even after learning context 2. After several iterations, the BSS error for both contexts converged to zero.
The success of learning was also confirmed by the trajectory of the EGHR cost function that also converged to the minimum value (Fig. 2B bottom).
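For reference, the BSS error can be written as a short helper (our naming): for each row and column of |K|, take the ratio of the second-largest to the largest entry, then average; the result is zero exactly when each output encodes a single source, up to permutations and sign-flips.

```python
import numpy as np

def bss_error(K):
    """Mean, over all rows and columns of |K|, of the ratio of the
    second-largest to the largest absolute value (0 for perfect BSS)."""
    K = np.abs(np.asarray(K, dtype=float))

    def ratio(v):
        v = np.sort(v)        # ascending: v[-1] is the max, v[-2] the runner-up
        return v[-2] / v[-1]

    rows = [ratio(K[i, :]) for i in range(K.shape[0])]
    cols = [ratio(K[:, j]) for j in range(K.shape[1])]
    return float(np.mean(rows + cols))
```

For example, `bss_error([[0, 1], [-1, 0]])` is 0 (a sign-flipped permutation, i.e., a perfect separation), while `bss_error([[1, 1], [1, 1]])` is 1 (fully mixed).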
These results show that the undercomplete EGHR increased the speed of re-adaptation to previously experienced contexts, suggesting that memory of past experiences was preserved within the network. Moreover, after several iterations, the network learned an optimal set of synaptic strengths that entertained both contexts. A key feature underlying this ability is the "null space" of the synaptic strength matrix. While only four (2 × 2) dimensions were required to express a mapping from two-dimensional sources to two-dimensional outputs in one context, the synaptic strength matrix still comprised eight (2 × 6 − 2 × 2) remaining degrees of freedom. This freedom spanned a null space in which synaptic strengths were equally optimized, with zero BSS error. Similarly, when two different contexts were considered, four degrees of freedom remained, as the overlap between the two eight-dimensional null spaces. To visualize such a null space, we projected synaptic strengths onto a subspace spanned by the first (PC1) and second (PC2) principal components of the trajectory of synaptic strengths (Fig. 2C). On this PC1–PC2 plane, a null space appears as a nullcline. Since the dynamics of synaptic strengths descend the slope of the cost function for the current context (1 or 2), synaptic strengths started from a random initial state and reached the nullcline of context 1 or 2, in turn. Crucially, this trajectory converged to the intersection of the two nullclines, where the synaptic strengths entertained both contexts. Because of this, the BSS error reached zero after iterative training; i.e., the network solved ICA for both contexts.
Furthermore, we examined the multi-context BSS by the EGHR using a large number of contexts (Fig. 3). Our agent received redundant (2000-dimensional) sensory inputs, comprising 100 sets (contexts) of mixtures of ten hidden sources (1000 sources in total), that were generated as products of the context-dependent mixing matrix and sources. Ten output neurons learned to infer each source from their mixtures by updating synaptic strengths through the EGHR. After training, we found that they successfully represented the ten sources for every context, without further updating synaptic strengths, as illustrated by the reduction of the BSS error for all 100 contexts (Fig. 3A) and the convergence of the covariance between sources and outputs to a diagonal matrix (up to permutations and sign-flips) (Fig. 3B). This was because the synaptic strengths had sufficient capacity and were formed to express the inverse of the concatenated mixing matrices from all contexts, which was further confirmed by the convergence of the synaptic strength matrix in the null space (Fig. 3C).
BSS in constantly time-varying environments
In the previous section, we described a general condition for the neural network to achieve the multi-context BSS. In special cases, where the mixing matrices of the individual contexts share common features, the neural network can perform the multi-context BSS beyond the maximum number of contexts described above. Here, we show that when contexts are generated from a low-dimensional subspace of mixing matrices, and are therefore dependent on each other, the EGHR can find the common features and use them to perform the multi-context BSS.
As a corollary of the property of the EGHR when provided with redundant inputs, the EGHR can perform BSS even when the mixing matrix changes constantly as a function of time (Fig. 4A). Without loss of generality, a time-dependent mixing matrix is expressed as the sum of time-invariant and time-variant components, as follows:

A(t) = A^{(0)} + A^{(1)}R(t)
where A^{(0)} is a full-column-rank constant matrix with the same size as A(t), A^{(1)} is a full-column-rank constant vertically-long rectangular (or square) matrix, and R(t) is a matrix composed of either smoothly or discontinuously changing functions of time. Each component of R(t) is supposed to change slowly on average, i.e., its time derivative is typically much smaller in magnitude than that of s(t). This condition is required to distinguish whether changes in inputs are caused by changes in the mixing matrix A(t) or in the hidden sources s(t). Formally, A(t) expresses infinitely many contexts along the trajectory of R(t). This is a more complicated setup than the standard BSS, in the sense that both the sources and the mixing matrix change in time. Nonetheless, the EGHR can achieve BSS for all contexts if a solution of the synaptic strength matrix that satisfies W(A^{(0)}, A^{(1)}) = (Ω, O) exists. Here, Ω represents the identity matrix up to permutations and sign-flips, and O represents a matrix of zero elements. Such a solution generally exists if and only if (A^{(0)}, A^{(1)}) is a full-column-rank matrix (see Methods for the derivation). The above condition means that the network performs BSS based on the time-invariant features A^{(0)} of the mixing matrix, while neglecting the time-varying features A^{(1)}R(t). This can be viewed as a way to compress high-dimensional data. It is distinct from the standard dimensionality-reduction approach by PCA, which would preferentially extract the time-variant features due to their extra variance. Moreover, the ability to perform dimensionality reduction is an important advantage of the EGHR over conventional ICA algorithms, such as the infomax-based ICA^{10,11}, natural gradient^{12} and nonholonomic^{39} algorithms, and the ICA mixture model^{40}, because these learning algorithms, by construction, do not learn effective dimensionality reduction in the multi-context BSS setup (see Methods for mathematical explanations).
In the simulation, we supposed R(t) to be a two-dimensional rotation matrix, \(R(t)=(\cos \,\omega t,\,\sin \,\omega t;\,-\sin \,\omega t,\,\cos \,\omega t)\), with an angular frequency of \(\omega =\sqrt{2}\pi /100\). The simulation showed a reduction in the BSS error (Fig. 4B). At the same time, K = WA converged to the identity matrix up to permutations and sign-flips, K → (0,1; −1,0) in this case, although A continuously changed in time (Fig. 4C,D). As illustrated in Fig. 4E, the synaptic matrix W became perpendicular to the time-varying features A^{(1)} (i.e., WA^{(1)} = O), by a monotonic reduction of the overlap between W and A^{(1)} (defined by the Frobenius norm of their product). After training, the overlap converged to zero. Hence, synaptic strengths were optimized regardless of R(t) at this solution, which enabled the network to perform BSS with a virtually infinite number of contexts. In addition, a neural network implementing the EGHR could learn W perpendicular to A^{(1)} in another simulation setup, where R(t) was a 2 × 2 matrix whose elements were modeled as Ornstein–Uhlenbeck (OU) processes with time constant τ = 10^{−3} (Fig. 4F). These results indicate that the EGHR can perform the multi-context BSS with a wide range of time-varying mixing matrices. Indeed, a mathematical analysis shows that the multi-context BSS is possible for a general time-varying matrix R(t), as long as it changes slowly enough (see Methods).
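The structure of this solution can be illustrated in closed form (our construction, which bypasses the EGHR dynamics): solve W(A^{(0)}, A^{(1)}) = (Ω, O) with a pseudoinverse, taking Ω = I, and check that WA(t) stays the identity while R(t) rotates.

```python
import numpy as np

rng = np.random.default_rng(0)
Ns, Nx = 2, 6
A0 = rng.standard_normal((Nx, Ns))                # time-invariant features
A1 = rng.standard_normal((Nx, 2))                 # time-variant features

def R(t):                                         # 2-D rotation matrix R(t)
    return np.array([[np.cos(t), np.sin(t)],
                     [-np.sin(t), np.cos(t)]])

# Solve W @ (A0, A1) = (I, O): W inverts A0 and is perpendicular to A1.
target = np.hstack([np.eye(Ns), np.zeros((Ns, 2))])
W = target @ np.linalg.pinv(np.hstack([A0, A1]))

# W separates the sources at every time, although A(t) = A0 + A1 R(t) changes.
for t in np.linspace(0.0, 10.0, 7):
    assert np.allclose(W @ (A0 + A1 @ R(t)), np.eye(Ns), atol=1e-6)
```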
Next, we demonstrated the utility of the EGHR, when supplied with redundant inputs, by using natural birdsongs and a time-variant mixing matrix that expressed a natural contextual change. Figure 5 illustrates the BSS task of two birdsongs when the birds moved around the agent; thereby, the mixing matrix changed in time according to the positions of the birds (see the entire movie at http://toyoizumilab.brain.riken.jp/dataset/Isomura2019/Isomura_Toyoizumi_SciRep2019_SupplementaryMovieS1.mp4). To obtain time-independent features, we assumed that the two birds moved around in non-overlapping areas. For simplicity, we also assumed that the two birds moved around at different heights. The agent received mixtures of the two birdsongs through six microphones with different direction preferences. In this setup, the z-axis positions of the birds were time-invariant while their x- and y-axis positions were time-variant, although the observer was not informed of this. By tuning synaptic strengths through the EGHR, neural outputs were established that inferred each birdsong, while the mixing matrix changed continuously. Crucially, after training, the mapping from the sources to the outputs (K = WA) became constant in time, although matrix A was time-dependent. More precisely, the EGHR found a representation where W satisfied W(A^{(0)}, A^{(1)}) = (Ω, O). Hence, neural outputs could separate the two birdsongs, although the amplitudes of the songs recorded by the microphones continuously changed depending on the positions of the birds.
Generalization for inexperienced environments
Finally, we examined the generalization capability of the multi-context BSS by the EGHR using natural birdsongs. For the sake of simplicity, we reduced Eq. (6) by considering an R(t) that changes discontinuously at the beginning of each session but is otherwise constant. Specifically, we considered the mixing matrix

A(v) = A^{(0)} + v_{1}A^{(1)} + … + v_{n}A^{(n)}
written using the context-independent matrix A^{(0)}, context-dependent matrices {A^{(1)}, …, A^{(n)}}, and a context vector v ≡ (v_{1}, …, v_{n}) that changes discontinuously at the beginning of each new session. The first term on the right-hand side of Eq. (7) corresponds to the context-independent (i.e., constant) component, which should be a full-column-rank matrix to provide an ICA solution. Similarly to the case with the continuously time-varying mixing matrix, the EGHR can establish a synaptic matrix W that expresses the pseudo-inverse of A^{(0)} up to permutations and sign-flips, while keeping W perpendicular to A^{(1)}, …, A^{(n)}, i.e., W(A^{(0)}, A^{(1)}, …, A^{(n)}) = (Ω, O, …, O). Notably, the EGHR can establish such a W by using only a handful of samples of v out of combinatorially many possibilities. This is because the mappings from sources to inputs are restricted to linear transformations; thereby, observations from a number of contexts of polynomial (likely quadratic) order suffice to identify the mapping for all contexts. This property is particularly useful when v is high-dimensional.
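A closed-form sketch (our illustration, again taking Ω = I) shows why such a W generalizes: once W inverts A^{(0)} and is perpendicular to each A^{(i)}, the contextual terms v_iA^{(i)} vanish under W for any context vector v, experienced or not.

```python
import numpy as np

rng = np.random.default_rng(0)
Ns, Nx, n = 2, 12, 4
A0 = rng.standard_normal((Nx, Ns))                       # context-independent part
Ai = [rng.standard_normal((Nx, Ns)) for _ in range(n)]   # context-dependent parts

def A(v):                                                # A(v) = A0 + sum_i v_i A_i
    return A0 + sum(vi * Aik for vi, Aik in zip(v, Ai))

# Solve W @ (A0, A1, ..., An) = (I, O, ..., O); Nx >= (n + 1) * Ns is needed.
M = np.hstack([A0] + Ai)
target = np.hstack([np.eye(Ns)] + [np.zeros((Ns, Ns))] * n)
W = target @ np.linalg.pinv(M)

# W then solves BSS for arbitrary v, including vectors never seen in training.
for v in [(1, 0, 0, 0), (0.25, 0.25, 0.25, 0.25), tuple(rng.random(4))]:
    assert np.allclose(W @ A(v), np.eye(Ns), atol=1e-6)
```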
In this demonstration, ten sets (contexts) of mixtures of ten birdsongs were introduced to our agent, with redundant sensory inputs composed of 100 mixed sound waves (Fig. 6). Those contexts were defined by random mixing matrices A^{(0)}, A^{(1)}, …, A^{(4)}. We trained the network using only ten contexts: v = (1,0,0,0), (½,½,0,0), (0,1,0,0), (0,½,½,0), (0,0,1,0), (0,0,½,½), (0,0,0,1), (½,0,0,½), (½,0,½,0), (0,½,0,½). At the beginning of each session, v was randomly selected from the above ten vectors, which provided a discrete random transition among the ten contexts. Ten output neurons learned to infer each birdsong from their mixtures, by updating synaptic strengths through the EGHR. After training, they successfully represented the ten birdsongs without further updating synaptic strengths. Crucially, the network could perform BSS even in an inexperienced context (for example, with v = (¼,¼,¼,¼)). This speaks to the generalization of the multi-context BSS to unseen test contexts.
We quantitatively showed that, as learning progressed, the BSS error for test contexts (defined using 20 randomly sampled v that were not experienced during training), as well as for trained contexts, decreased (Fig. 6A). The trajectory of the first two principal components (PC1 and PC2) of K exhibits their convergence to a fixed point in later sessions (Fig. 6B). Here, PC1 and PC2 together captured 63.4% of the total variance. Regardless of the given context, matrix K converged to a constant matrix equal to the identity matrix up to permutations and sign-flips. The convergence of W to this fixed point was validated by plotting the trajectories of the overlaps between W and the components of A (Fig. 6C). While the overlap between W and A^{(0)} increased as learning progressed, the overlap with the context-dependent components (A^{(1)}, …, A^{(4)}) decreased and converged to zero, showing that W became perpendicular to A^{(1)}, …, A^{(4)} through the EGHR. We conducted a series of simulations with different initial conditions and confirmed the reliability of convergence, although the convergence speed depended on the initial position of W relative to A^{(0)}, A^{(1)}, …, A^{(4)}. Hence, the learnt network could perform BSS with A(v) determined by an arbitrary v in the four-dimensional space, without further synaptic updating or transient error, even though the network was trained with only ten contexts. These results highlight the significant generalization capability of the neural network established by the EGHR and its robustness in performing BSS in inexperienced environments.
Discussion
While a real environment comprises several different contexts, humans and animals retain the experience of past contexts to perform well when they find themselves in the same context in the future. This ability is known as conservation of learning or cognitive flexibility^{25}. Although analogous learning is likely to happen during BSS, the conventional biological BSS algorithms^{37,38} must forget the memory of past contexts to learn a new one. Thereby, when the agent subsequently encounters a previously experienced context, it needs to relearn it from the very beginning. We overcame this limitation by using the described algorithm, the EGHR. The crucial property of the EGHR is that when the number of inputs is larger than the number of sources, the synaptic matrix contains a null space in which synaptic strengths are equally optimized for performing BSS. Hence, with sufficiently redundant inputs, the EGHR can make the synaptic matrix optimal for every experienced context. This is an ability that the conventional biologically plausible BSS algorithms do not have, due to the constraint that the number of inputs and outputs must be equal^{15}; however, we argue that this ability is crucial for animals to perceive and adapt to dynamically changing multi-context environments. It is also crucial for animals to generalize past learning to inexperienced contexts. We also found that, if there is a common feature shared across the training contexts, the EGHR can extract it and generalize the BSS result to inexperienced test contexts. This speaks to a generalization capability and transfer learning, implying the prevention of overfitting to a specific context; alternatively, one might see this as the extraction of a general concept across contexts. Therefore, we argue that the EGHR is a good candidate model for describing the neural mechanism of conservation of learning or cognitive flexibility for BSS.
Moreover, the process of extracting hidden sources in a multi-context BSS setup can be seen as a novel form of dimensionality reduction^{42}. If the input dimension is greater than the product of the number of sources and the number of contexts, the EGHR can extract the low-dimensional sources (up to context-dependent permutations and sign-flips), while filtering out a large number of context-dependent signals induced by changes in the mixing matrix. ICA algorithms for multi-context BSS^{39,40,41} and undercomplete ICA for compressing data dimensionality^{15,17,53} have been developed separately. Nevertheless, conventional ICA algorithms for multi-context BSS cannot learn efficient dimensionality reduction, and thus, to our knowledge, our study is the first to attempt dimensionality reduction in the multi-context BSS setup. This method is particularly powerful when a common feature is shared across contexts, because the EGHR can make each neuron encode an identical source across all contexts. Our results are different from those obtained using standard dimensionality-reduction approaches based on PCA^{18,19}, because PCA extracts subspaces of high-variance principal components and hence would preferentially extract the context-dependent varying features, given that each source has the same variance. Therefore, our study proposes an attractive use of the EGHR for dimensionality reduction.
It is worth noting that applying standard ICA algorithms to high-pass-filtered inputs cannot solve the multi-context BSS problem. This is because context-dependent changes in the mixing matrix not only change the means of the inputs, which can be removed by high-pass filtering, but also change the gain with which fluctuations of each source propagate to input fluctuations. Hence, the difference between contexts cannot be expressed as a linear ICA problem after high-pass input filtering. Therefore, the selective extraction of context-invariant features is an advantage of the EGHR. Moreover, if provided with redundant input, the EGHR can solve the multi-context BSS even if the context changes continuously in time, as we demonstrated in Figs. 4 and 5.
We demonstrated that a neural network learns to distinguish individual birdsongs from their superposition. Young songbirds learn songs by mimicking adult birds’ songs^{54,55,56,57}. A study reported that neurons in songbirds’ higher auditory cortex exhibit teacher-specific activity^{58}. One can imagine that those neurons correspond to the expectation of hidden sources (u), as considered in this study. Importantly, the natural environment that young songbirds encounter is dynamic, as we considered in Fig. 5. Therefore, the conventional BSS setup, which assumes a static environment or context, is not suitable for modeling this problem. It is interesting to consider that young songbirds might employ a computational mechanism similar to the EGHR to distinguish a teacher’s song from other songs in a dynamically changing environment.
Biological neural networks may implement an EGHR-like learning rule. The main ingredients of the EGHR are Hebbian plasticity and a third scalar factor that modulates it. Hebbian plasticity occurs in the brain depending on the activity levels^{44,59}, spike timings^{60,61,62,63}, or burst timings^{64} of pre- and postsynaptic neurons. In contrast, the third scalar factor can modify the learning rate and even invert Hebbian to anti-Hebbian plasticity^{50}, similarly to what we propose for the EGHR. In general, such a modulation forms the basis of a three-factor learning rule, a concept that has recently received attention (see^{20,65,66} for reviews) and is supported by experiments on various neuromodulators and neurotransmitters, such as dopamine^{45,46,47}, noradrenaline^{48,49}, muscarine^{67}, and GABA^{50,51}, as well as glial factors^{52}. (These factors may encode reward^{68,69,70,71,72}, likelihood^{73}, novelty/surprise^{74}, or the error from a prior belief^{15,17} to achieve various types of learning, implying the existence of a unified three-factor learning framework.) Importantly, the EGHR only requires such a signal, conveying global information to neurons, to achieve learning. Furthermore, a study using in vitro neural networks suggested that neurons perform simple BSS using a plasticity rule that differs from the most basic form of Hebbian plasticity, in which synaptic strengths are updated purely as a product of pre- and postsynaptic activity^{75,76}. A candidate implementation of the EGHR can be made with cortical pyramidal cells and inhibitory neurons: the former constituting the EGHR output neurons and encoding the expectations of hidden sources, and the latter constituting the third scalar factor and calculating the nonlinear sum of the activity of surrounding pyramidal cells. This view is consistent with the circuit structure reported for the visual cortex^{77,78}.
This empirical evidence supports the biological plausibility of the EGHR as a candidate model of neuronal BSS.
The local computation of the EGHR is highly desirable for neuromorphic engineering^{23,24,79,80}. The EGHR updates synapses by a simple product of pre- and postsynaptic neurons' activity and a global scalar factor. Because of this, less information transfer between neurons is required than in conventional ICA methods, which need non-local information^{10,11,12}, all-to-all plastic lateral inhibition between output neurons^{37,38}, or an additional processing step for decorrelation^{13}. The simplicity of the EGHR is a great advantage when implemented in a neuromorphic chip, because it can reduce the space required for wiring as well as the energy consumption. Furthermore, unlike conventional ICA algorithms that assume an equal number of input and output neurons, a neuromorphic chip that employs the EGHR with redundant inputs could perform BSS in multiple contexts, as allowed by the network memory capacity, without requiring re-adaptation. The generalization capability of the EGHR, as demonstrated in Fig. 6, is an additional benefit, as the EGHR captures the common features shared across training contexts to perform BSS in inexperienced test contexts.
Notably, although we considered a linear BSS problem in this study, multi-context BSS can be extended to nonlinear BSS, in which the inputs are generated through a nonlinear mixture of sources^{81,82}. A promising approach to this problem would be to use a linear neural network. A recent study showed that when the ratio of input-to-source dimensions and the number of sources are large, a linear neural network can find an optimal linear encoder that separates the true sources through PCA and ICA, thus asymptotically achieving zero BSS error^{83}. Because both this asymptotic linearization and multi-context BSS by the EGHR rely on high-dimensional sensory inputs, combining the two might be a useful approach to solving the multi-context nonlinear BSS problem.
In summary, we demonstrated that the EGHR can retain memories of past contexts and, once learning has been achieved for every context, can perform multi-context BSS without further updating synapses. Moreover, the EGHR can find common features shared across contexts, if present, and use them to generalize the learning result to inexperienced contexts. Therefore, the EGHR should be useful for understanding the neural mechanisms of flexible inference and sensory representation under dynamically changing environments, as well as for creating brain-inspired artificial general intelligence.
Methods
Model and learning rule
The neural network model and the learning rule used (the EGHR) are described in the Results section.
Definition of BSS error
For each column j of K^{(k)}, we identified the rows containing the maximum and second-maximum elements, \(i^{\prime} ={{\rm{argmax}}}_{i}{K}_{ij}^{(k)}\) and \(i^{\prime\prime} ={{\rm{argmax}}}_{i\ne i^{\prime} }{K}_{ij}^{(k)}\), and defined the BSS error of column j as the ratio of the values in the two rows: \({\varepsilon }_{j}^{c}={K}_{i^{\prime\prime} j}^{(k)}/{K}_{i^{\prime} j}^{(k)}\). Similarly, the BSS error of row i, \({\varepsilon }_{i}^{r}={K}_{ij^{\prime\prime} }^{(k)}/{K}_{ij^{\prime} }^{(k)}\), was obtained from the ratio of the maximum and second-maximum columns, where \(j^{\prime} ={{\rm{argmax}}}_{j}{K}_{ij}^{(k)}\) and \(j^{\prime\prime} ={{\rm{argmax}}}_{j\ne j^{\prime} }{K}_{ij}^{(k)}\). The BSS error for the whole K was defined as their average: \({\rm{BSS}}\,{\rm{error}}\equiv ({\varepsilon }_{1}^{c}+\ldots +{\varepsilon }_{{N}_{s}}^{c})/2{N}_{s}+({\varepsilon }_{1}^{r}+\ldots +{\varepsilon }_{{N}_{u}}^{r})/2{N}_{u}\).
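As an illustration, the BSS error above can be computed with a few lines of NumPy. This is a sketch, not the authors' code: it assumes elements are compared by absolute value (since sign-flips are allowed in the solution), and the function name `bss_error` is our choice.

```python
import numpy as np

def bss_error(K):
    """BSS error of a transform matrix K = W A^(k) (a sketch).

    For each column (and each row), take the ratio of the second-largest
    to the largest element in absolute value, then average the column
    and row errors with weight 1/2 each. Perfect separation gives 0."""
    K = np.abs(np.asarray(K, dtype=float))
    def second_to_first(M):
        top2 = -np.sort(-M, axis=0)[:2, :]   # two largest per column
        return top2[1] / top2[0]
    eps_c = second_to_first(K)               # column-wise errors
    eps_r = second_to_first(K.T)             # row-wise errors
    return 0.5 * eps_c.mean() + 0.5 * eps_r.mean()

# A permutation with a sign-flip is a perfect ICA solution:
print(bss_error([[0.0, -1.0], [1.0, 0.0]]))  # → 0.0
```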
Analysis of BSS solution: existence and linear stability
Supposing that N_{u} = N_{s}, we defined the transform matrix \({K}^{(k)}\equiv W{A}^{(k)}\).
For N_{x} ≥ N_{s}, ICA for context k is achieved when K^{(k)} is the identity matrix up to permutations and sign-flips. Hence, if and only if the column vectors of the block matrix (A^{(1)}, …, A^{(C)}) are linearly independent of each other, i.e., if and only if (A^{(1)}, …, A^{(C)}) is a full-column-rank matrix, an ICA solution that separates all sources for contexts 1, …, C exists. Namely, W achieves multi-context BSS when it satisfies \(W{A}^{(k)}={{\rm{\Omega }}}^{(k)}\) for every context k, where Ω^{(k)} is an N_{u} × N_{s} matrix equivalent to the identity matrix up to permutations and sign-flips. Regarding the ith row of matrix K^{(k)}, denoted by the row vector \(({K}_{i1}^{(k)},\ldots ,{K}_{i{N}_{s}}^{(k)})\), ICA is achieved when one element is one and the others are zero. Thus, there are many candidate sets of \(({W}_{i1},\ldots ,{W}_{i{N}_{x}})\) that can achieve ICA, because N_{x} is larger than N_{s}. Our numerical analyses showed that, among these potential solutions, the one nearest to the solution for the previous context is likely to be chosen. This can be understood as follows: when the network finds an ICA solution for all contexts, the error (i.e., the cost function of the EGHR), including transient periods between two contexts, is minimized; hence, according to gradient descent, synaptic strengths converge to such a solution as training progresses. Owing to this mechanism, the initial errors converge to zero when previously experienced environments are provided as stimuli.
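The existence condition can be checked numerically by stacking the context-wise mixing matrices horizontally and testing for full column rank. A minimal sketch (the function name is ours):

```python
import numpy as np

def ica_solution_exists(As):
    """True iff the block matrix (A^(1), ..., A^(C)) has full column
    rank, i.e., a W separating all sources in every context exists.
    This requires N_x >= C * N_s."""
    block = np.hstack(As)                      # N_x x (C * N_s)
    return bool(np.linalg.matrix_rank(block) == block.shape[1])

rng = np.random.default_rng(0)
Nx, Ns = 6, 2
A_three = [rng.standard_normal((Nx, Ns)) for _ in range(3)]  # 3*2 <= 6
A_four = A_three + [rng.standard_normal((Nx, Ns))]           # 4*2 > 6
print(ica_solution_exists(A_three))  # generically True
print(ica_solution_exists(A_four))   # False: rank <= N_x = 6 < 8
```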
We showed that a W satisfying K = WA = Ω gives a fixed point of the EGHR cost function, \(\dot{W}=-\partial L/\partial W=({E}_{0}-E(u))g(u){x}^{T}=O\), and thus gives an ICA solution, where A is a vertically long or square full-rank mixing matrix^{15,17}. Regarding BSS with a time-varying mixing matrix, from A = A^{(0)} + A^{(1)}R, the time derivative of K yields \(\dot{K}=\dot{W}({A}^{(0)}+{A}^{(1)}R)+W{A}^{(1)}\dot{R}=O\). Here, we assume that A^{(0)} and A^{(1)} are full-column-rank matrices and R is a general N_{R} × N_{s} time-varying matrix. Because \(\dot{W}=O\) holds at the fixed point, W gives an ICA solution if and only if \(W{A}^{(1)}\dot{R}=O\). Thus, W needs to satisfy W(A^{(0)}, A^{(1)}) = (Ω, O) to give a multi-context ICA solution. The condition for such an ICA solution to exist was obtained as follows: we considered this as a BSS problem such that \(x=({A}^{(0)},{A}^{(1)}){({s}^{T},{(Rs)}^{T})}^{T}\).
The singular value decomposition is given by \(({A}^{(0)},{A}^{(1)})=US({V}_{0}^{T},{V}_{1}^{T})\), where \(U\in {{\mathbb{R}}}^{{N}_{x}\times ({N}_{s}+{N}_{R})}\), \({V}_{0}\in {{\mathbb{R}}}^{{N}_{s}\times ({N}_{s}+{N}_{R})}\), and \({V}_{1}\in {{\mathbb{R}}}^{{N}_{R}\times ({N}_{s}+{N}_{R})}\) with \({V}_{0}{V}_{1}^{T}=O\) are orthogonal matrices, and \(S\in {{\mathbb{R}}}^{({N}_{s}+{N}_{R})\times ({N}_{s}+{N}_{R})}\) is a diagonal matrix of singular values. From this, W = ΩV_{0}S^{−1}X should hold to ensure W(A^{(0)}, A^{(1)}) = (Ω, O), where X is an orthogonal matrix satisfying XU = I. Hence, ICA solutions exist if and only if the column vectors of (A^{(0)}, A^{(1)}) are linearly independent of each other.
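The construction of W can be verified numerically. In the sketch below, NumPy's economy SVD is used, so X = U^{T} satisfies XU = I, and Ω is taken as the identity; W(A^{(0)}, A^{(1)}) = (Ω, O) then holds to machine precision:

```python
import numpy as np

rng = np.random.default_rng(1)
Ns, NR, Nx = 2, 3, 10
A0 = rng.standard_normal((Nx, Ns))
A1 = rng.standard_normal((Nx, NR))

# Economy SVD: (A0, A1) = U S (V0^T, V1^T), with U^T U = I
U, s, Vt = np.linalg.svd(np.hstack([A0, A1]), full_matrices=False)
V0 = Vt[:, :Ns].T               # Ns x (Ns + NR) block belonging to A0
Omega = np.eye(Ns)              # identity (up to permutations/sign-flips)

W = Omega @ V0 @ np.diag(1.0 / s) @ U.T   # W = Omega V0 S^{-1} U^T

print(np.allclose(W @ A0, Omega))  # True: sources are recovered
print(np.allclose(W @ A1, 0))      # True: context signals are filtered out
```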
Moreover, we analyzed a sufficient condition on the time constant of R(t) for the stability of the ICA solution. From our previous analysis, the linear stability for fixed points is determined by the following second differential form^{15,17}:
where \({{\rm{\Phi }}}_{ii}\equiv {\rm{cov}}[\mathrm{log}\,{p}_{0}({s}_{i}),g^{\prime} ({s}_{i}){s}_{i}^{2}]\) and \({{\rm{\Phi }}}_{ij}\equiv {\rm{cov}}[\mathrm{log}\,{p}_{0}({s}_{i}),g^{\prime} ({s}_{i})]{s}_{j}^{2}+{\rm{cov}}[\mathrm{log}\,{p}_{0}({s}_{j}),{s}_{j}^{2}]g^{\prime} ({s}_{i})\) for i ≠ j (note that cov[,] is the covariance). The magnitude of dW is assumed to be negligible owing to the small learning rate. The solution is linearly stable if and only if Φ_{ii} > −1 and Φ_{ij}Φ_{ji} > 1. When the change in R(t) is sufficiently slower than that in s(t) on average, i.e., when dK = dW(A^{(0)} + A^{(1)}R) + WA^{(1)} dR is sufficiently small, the above linear stability condition determines the stability of the fixed point. However, when R(t) changes faster than, or as fast as, s(t), dK is no longer a small fluctuation, because of the large dR, and therefore K may leave the neighborhood of the fixed point for a region where the second-order approximation is no longer accurate. Therefore, as long as the time constant of R(t) is chosen to ensure that the averaged fluctuation is small, so that K remains within the neighborhood of the fixed point, the EGHR with a time-varying mixing matrix has the same linear stability condition as the conventional EGHR without context switching.
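For reference, a single EGHR update of the kind analyzed in this section can be sketched as follows. A unit Laplace prior p0 is assumed here, for which g(u) = sign(u) and E(u) = Σ|u_i| up to a constant; the value of E0 and the function name are our choices for illustration.

```python
import numpy as np

def eghr_step(W, x, E0, eta=1e-3):
    """One error-gated Hebbian update (a sketch): dW ∝ (E0 - E(u)) g(u) x^T,
    where u = W x, E(u) = -log p0(u) is the error and g(u) = -d log p0(u)/du.
    Assumed unit Laplace prior: g(u) = sign(u), E(u) = sum(|u|)."""
    u = W @ x
    g = np.sign(u)               # nonlinearity from the Laplace prior
    E = np.sum(np.abs(u))        # error (negative log prior), up to a constant
    return W + eta * (E0 - E) * np.outer(g, x), u

W = np.eye(2)
W1, u = eghr_step(W, np.array([1.0, -2.0]), E0=2.0)
```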
Analysis of conventional ICA algorithms
Here we show that, unlike the multi-context EGHR, conventional ICA algorithms cannot be used for dimensionality reduction. Some of the ICA algorithms under consideration are written as \(\dot{W}\propto F(u(t),x(t))W\) or, equivalently,

\(W(t+1)=(I+\eta F(u(t),x(t)))W(t)\), (12)

in each discrete time step t (t = 1, 2, …, T) with learning rate η. The functional F specifies the individual learning rule; namely, the natural gradient algorithm^{12} takes \(F(u,x)=I-\langle g(u){u}^{T}\rangle \), and the nonholonomic algorithm^{39} takes \(F(u,x)=\langle {\rm{diag}}[g(u)\odot u]-g(u){u}^{T}\rangle \), where \(\odot \) denotes the element-wise product of two vectors and diag[⋅] denotes a diagonal matrix comprising a vector. This class of ICA algorithms cannot perform dimensionality reduction. Following Eq. (12), the synaptic strength matrix after training (i.e., at time T) is expressed as

\(W(T)={\prod }_{t=1}^{T}(I+\eta F(u(t),x(t)))\,{W}_{0}\), (13)
where W_{0} is the initial synaptic matrix. In dimensionality reduction, we are interested in horizontally long N_{u} × N_{x} matrices W and W_{0}, which compress the N_{x}-dimensional signal x into the N_{u}-dimensional output u with N_{u} < N_{x}. However, \({\prod }_{t=1}^{T}(I+\eta F(u(t),x(t)))\) changes the strength only within the N_{u} × N_{u} degrees of freedom, so this is equivalent to ICA of the N_{u}-dimensional signals W_{0}x that are already compressed by the non-optimal matrix W_{0}. Hence, this class of ICA algorithms can be used for separating already (suboptimally) compressed signals W_{0}x, but not for reducing the signal dimensionality. The infomax-based ICA algorithm^{10,11} has the same fixed point and linear stability conditions as the natural gradient algorithm; thus, again, it does not perform dimensionality reduction. Next, the ICA mixture model, a combination of ICA and a mixture model, was proposed to perform multi-context ICA by assigning one of multiple ICA models to each context^{40}. In this model, the pseudo-inverse of the synaptic matrix W^{k} for the kth model is updated instead of W^{k}, by \({\rm{d}}{({W}^{k})}^{+}/{\rm{d}}t\propto {z}_{k}(t){({W}^{k})}^{+}(I-g(u(t))u{(t)}^{T})\), where z_{k}(t) ∈ [0, 1] is the probability of the kth model being selected. Similar to Eq. (13), the pseudo-inverse of the synaptic strength matrix after training is expressed as

\({({W}^{k}(T))}^{+}={({W}_{0}^{k})}^{+}{\prod }_{t=1}^{T}(I+\eta {z}_{k}(t)(I-g(u(t))u{(t)}^{T}))\), (14)

which again indicates that the compression is determined by \({W}_{0}^{k}x\). Therefore, the ICA mixture model does not perform dimensionality reduction either. Hence, the use of multi-context ICA for dimensionality reduction is our novel contribution to the literature, beyond the original proposal of the EGHR or conventional multi-context ICA algorithms.
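The argument that updates of the form W ← (I + ηF)W cannot reduce dimensionality can be made concrete: whatever F is, W remains an N_u × N_u recombination of the rows of W_0, so its row space never grows. A sketch with a random stand-in for F:

```python
import numpy as np

rng = np.random.default_rng(2)
Nu, Nx = 2, 6
W0 = rng.standard_normal((Nu, Nx))

W = W0.copy()
for _ in range(50):
    F = rng.standard_normal((Nu, Nu))   # stand-in for I - <g(u)u^T>, etc.
    W = (np.eye(Nu) + 0.01 * F) @ W     # Eq. (12)-style update

# The rows of W still lie in the row space of W0:
print(np.linalg.matrix_rank(np.vstack([W0, W])))  # → 2 (= Nu, not 2*Nu)
```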
Simulation protocols
For Figure 5: two birdsongs were downloaded from Xeno-canto (https://www.xenocanto.org/132149, https://www.xenocanto.org/133054). Two hidden sources were created by trimming the first 60 s of these songs (with 4410-Hz time resolution) and normalizing them, to ensure that each source sequence had zero mean and unit variance. During training, the song sequences were repeated. To add stochasticity, each hidden source was defined as the sum of a song and a white-noise sequence generated from a Laplace distribution. The mixing matrix was defined by 6 × 2 random matrices, A^{(0)} and A^{(1)}, and a rotation matrix, \(R(t)\equiv (\cos \,\omega t,\,-\sin \,\omega t;\,\sin \,\omega t,\,\cos \,\omega t)\). The angular frequency ω was randomly set to −0.1π, 0, or 0.1π [rad/s], following a Markov process with a transition probability of 1/8820. The training time and learning rate were T = 4410 × 6000 [steps] and η = 10^{−7}.
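The continuously rotating mixture used here can be sketched as follows. The details of the state switching (a new ω drawn uniformly at each transition) are our assumption, as are the function names:

```python
import numpy as np

def rotation(theta):
    """2-D rotation matrix R(theta)."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

def mixing(theta, A0, A1):
    """Time-varying mixing matrix A = A0 + A1 R(theta)."""
    return A0 + A1 @ rotation(theta)

rng = np.random.default_rng(4)
A0 = rng.standard_normal((6, 2))
A1 = rng.standard_normal((6, 2))

# omega follows a Markov process over {-0.1*pi, 0, 0.1*pi} [rad/s],
# switching with probability 1/8820 per step (dt = 1/4410 s)
omegas, dt, p_switch = np.array([-0.1, 0.0, 0.1]) * np.pi, 1 / 4410, 1 / 8820
theta, omega = 0.0, omegas[rng.integers(3)]
for t in range(1000):
    if rng.random() < p_switch:
        omega = omegas[rng.integers(3)]   # assumed: uniform re-draw
    theta += omega * dt
    A_t = mixing(theta, A0, A1)           # 6 x 2 mixture at step t
```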
For Figure 6: ten birdsongs were downloaded from Xeno-canto (the URLs are https://www.xenocanto.org/****** where ****** is replaced by the following numbers: 27060, 64735, 67307, 110303, 121326, 121691, 126481, 132149, 133054, 133862). Ten hidden sources were created in the same manner as described above. The mixing matrix was defined by 100 × 10 random matrices A^{(0)}, A^{(1)}, A^{(2)}, A^{(3)}, A^{(4)}, where \({\hat{A}}^{(1)},\ldots ,{\hat{A}}^{(4)}\) were randomly generated, and A^{(0)}, …, A^{(4)} were defined by \({A}^{(0)}=({\hat{A}}^{(1)}+{\hat{A}}^{(2)}+{\hat{A}}^{(3)}+{\hat{A}}^{(4)})/4\) and \({A}^{(k)}={\hat{A}}^{(k)}-{A}^{(0)}\) for k = 1, …, 4. This treatment served to ensure that A^{(1)}, …, A^{(4)} do not involve features common across contexts. The training comprised 120 sessions, each continuing for T = 4410 × 600 [steps]. The context vector v was randomly chosen from the following ten vectors at the beginning of each session and maintained during the session: v = (1,0,0,0), (½,½,0,0), (0,1,0,0), (0,½,½,0), (0,0,1,0), (0,0,½,½), (0,0,0,1), (½,0,0,½), (½,0,½,0), (0,½,0,½). The learning rate was η = 2 × 10^{−7}. For the test, 20 randomly generated vectors were used; their elements were randomly sampled from [0,1] and then normalized to satisfy v_{1} + v_{2} + v_{3} + v_{4} = 1.
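The decomposition into a common feature and context-specific parts can be reproduced in a few lines. The check below confirms that the deviations A^{(1)}, …, A^{(4)} sum to zero, so no feature is shared by all contexts; the final convex-combination mixture A(v) is our illustrative assumption about how the context vector v enters:

```python
import numpy as np

rng = np.random.default_rng(5)
Nx, Ns, C = 100, 10, 4
A_hat = [rng.standard_normal((Nx, Ns)) for _ in range(C)]

A0 = sum(A_hat) / C                 # common feature: the mean mixing matrix
A_ctx = [Ah - A0 for Ah in A_hat]   # context-specific deviations

# Deviations average to zero: no component is common to all contexts
print(np.allclose(sum(A_ctx), 0))   # → True

# Assumed mixture under context vector v (v1 + ... + v4 = 1):
v = np.array([0.5, 0.5, 0.0, 0.0])
A_v = A0 + sum(vk * Ak for vk, Ak in zip(v, A_ctx))
```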
References
Helmholtz, H. Treatise on physiological optics Vol. III (Dover Publications, 1925).
Knill, D. C. & Pouget, A. The Bayesian brain: the role of uncertainty in neural coding and computation. Trends Neurosci. 27, 712–719 (2004).
DiCarlo, J. J., Zoccolan, D. & Rust, N. C. How does the brain solve visual object recognition? Neuron 73, 415–434 (2012).
Brown, G. D., Yamada, S. & Sejnowski, T. J. Independent component analysis at the neural cocktail party. Trends Neurosci. 24, 54–63 (2001).
Mesgarani, N. & Chang, E. F. Selective cortical representation of attended speaker in multi-talker speech perception. Nature 485, 233–236 (2012).
Belouchrani, A., Abed-Meraim, K., Cardoso, J. F. & Moulines, E. A blind source separation technique using second-order statistics. IEEE Trans. Signal Process. 45, 434–444 (1997).
Cichocki, A., Zdunek, R., Phan, A. H. & Amari, S. I. Nonnegative matrix and tensor factorizations: applications to exploratory multiway data analysis and blind source separation. (John Wiley & Sons, West Sussex, UK, 2009).
Comon, P. Independent component analysis, a new concept? Signal Process. 36, 287–314 (1994).
Comon, P. & Jutten, C. In Comon, P. & Jutten, C. (Eds), Handbook of Blind Source Separation: Independent Component Analysis and Applications. (Orlando, FL: Academic Press, 2010).
Bell, A. J. & Sejnowski, T. J. An information-maximization approach to blind separation and blind deconvolution. Neural Comput. 7, 1129–1159 (1995).
Bell, A. J. & Sejnowski, T. J. The “independent components” of natural scenes are edge filters. Vision Res. 37, 3327–3338 (1997).
Amari, S. I., Cichocki, A. & Yang, H. H. A new learning algorithm for blind signal separation. Adv. Neural Inf. Process. Syst. 8, 757–763 (1996).
Hyvärinen, A. & Oja, E. A fast fixed-point algorithm for independent component analysis. Neural Comput. 9, 1483–1492 (1997).
Savin, C., Joshi, P. & Triesch, J. Independent component analysis in spiking neurons. PLoS Comput. Biol. 6, e1000757 (2010).
Isomura, T. & Toyoizumi, T. A local learning rule for independent component analysis. Sci. Rep. 6, 28073 (2016).
Lee, T. W., Girolami, M., Bell, A. J. & Sejnowski, T. J. A unifying informationtheoretic framework for independent component analysis. Comput. Math. Appl. 39, 1–21 (2000).
Isomura, T. & Toyoizumi, T. Error-gated Hebbian rule: A local learning rule for principal and independent component analysis. Sci. Rep. 8, 1835 (2018).
Pearson, K. On lines and planes of closest fit to systems of points in space. Philos. Mag. 2, 559–572 (1901).
Oja, E. Neural networks, principal components, and subspaces. Int. J. Neural Syst. 1, 61–68 (1989).
Kuśmierz, Ł., Isomura, T. & Toyoizumi, T. Learning with three factors: modulating Hebbian plasticity with errors. Curr. Opin. Neurobiol. 46, 170–177 (2017).
Avitan, L. & Goodhill, G. J. Code under construction: neural coding over development. Trends Neurosci. 41, 599–609 (2018).
Goodhill, G. J. Theoretical models of neural development. iScience 8, 183–199 (2018).
Neftci, E. Data and power efficient intelligence with neuromorphic learning machines. iScience 5, 52–68 (2018).
Fouda, M., Neftci, E., Eltawil, A. M. & Kurdahi, F. Independent component analysis using RRAMs. IEEE Trans. Nanotech.; https://doi.org/10.1109/TNANO.2018.2880734 (2018).
Dajani, D. R. & Uddin, L. Q. Demystifying cognitive flexibility: Implications for clinical and developmental neuroscience. Trends Neurosci. 38, 571–578 (2015).
Dehaene, S. & Changeux, J. P. The Wisconsin Card Sorting Test: Theoretical analysis and modeling in a neuronal network. Cereb. Cortex 1, 62–79 (1991).
Gilbert, C. D. & Sigman, M. Brain states: top-down influences in sensory processing. Neuron 54, 677–696 (2007).
Mante, V., Sussillo, D., Shenoy, K. V. & Newsome, W. T. Context-dependent computation by recurrent dynamics in prefrontal cortex. Nature 503, 78–84 (2013).
Song, H. F., Yang, G. R. & Wang, X.J. Training excitatory-inhibitory recurrent neural networks for cognitive tasks: A simple and flexible framework. PLoS Comput. Biol. 12, e1004792 (2016).
Song, H. F., Yang, G. R. & Wang, X.J. Reward-based training of recurrent neural networks for cognitive and value-based tasks. eLife 6, 679–684 (2017).
Miconi, T. Biologically plausible learning in recurrent neural networks reproduces neural dynamics observed during cognitive tasks. eLife 6, 229–256 (2017).
Chaisangmongkon, W., Swaminathan, S. K., Freedman, D. J. & Wang, X. J. Computing by robust transience: how the frontoparietal network performs sequential, categorybased decisions. Neuron 93, 1504–1517 (2017).
Ahrens, M. B., Linden, J. F. & Sahani, M. Nonlinearities and contextual influences in auditory cortical responses modeled with multilinear spectrotemporal methods. J. Neurosci. 28, 1929–1942 (2008).
Yu, D., Deng, L. & Dahl, G. Roles of pre-training and fine-tuning in context-dependent DBN-HMMs for real-world speech recognition. In Proceedings of the NIPS Workshop on Deep Learning and Unsupervised Feature Learning (2010).
Kirkpatrick, J. et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 114, 3521–3526 (2017).
Zenke, F., Poole, B. & Ganguli, S. Continual learning through synaptic intelligence. International Conference on Machine Learning, 3987–3995; https://arxiv.org/abs/1703.04200 (2017).
Földiák, P. Forming sparse representations by local anti-Hebbian learning. Biol. Cybern. 64, 165–170 (1990).
Linsker, R. A local learning rule that enables information maximization for arbitrary input distributions. Neural Comput. 9, 1661–1665 (1997).
Amari, S. I., Chen, T. & Cichocki, A. Nonholonomic orthogonal learning algorithms for blind source separation. Neural Comput. 12, 1463–1484 (2000).
Lee, T. W., Lewicki, M. S. & Sejnowski, T. J. ICA mixture models for unsupervised classification of non-Gaussian classes and automatic context switching in blind signal separation. IEEE Trans. Pattern Anal. Mach. Intell. 22, 1078–1089 (2000).
Hirayama, J. I., Ogawa, T. & Hyvärinen, A. Unifying blind separation and clustering for resting-state EEG/MEG functional connectivity analysis. Neural Comput. 27, 1373–1404 (2015).
Cunningham, J. P. & Ghahramani, Z. Linear dimensionality reduction: Survey, insights, and generalizations. J. Mach. Learn. Res. 16, 2859–2900 (2015).
Hebb, D. O. The Organization of Behavior: A Neuropsychological Theory. (Wiley, New York, 1949).
Bliss, T. V. & Lømo, T. Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. J. Physiol. 232, 331–356 (1973).
Reynolds, J. N. J., Hyland, B. I. & Wickens, J. R. A cellular mechanism of reward-related learning. Nature 413, 67–70 (2001).
Zhang, J. C., Lau, P. M. & Bi, G. Q. Gain in sensitivity and loss in temporal contrast of STDP by dopaminergic modulation at hippocampal synapses. Proc. Natl. Acad. Sci. USA 106, 13028–13033 (2009).
Salgado, H., Köhr, G. & Treviño, M. Noradrenergic "tone" determines dichotomous control of cortical spike-timing-dependent plasticity. Sci. Rep. 2, 417 (2012).
Yagishita, S. et al. A critical time window for dopamine actions on the structural plasticity of dendritic spines. Science 345, 1616–1620 (2014).
Johansen, J. P. et al. Hebbian and neuromodulatory mechanisms interact to trigger associative memory formation. Proc. Natl. Acad. Sci. USA 111, E5584–92 (2014).
Paille, V. et al. GABAergic circuits control spiketimingdependent plasticity. J. Neurosci. 33, 9353–9363 (2013).
Hayama, T. et al. GABA promotes the competitive selection of dendritic spines by controlling local Ca^{2+} signaling. Nat. Neurosci. 16, 1409–1416 (2013).
Ben Achour, S. & Pascual, O. Glia: the many ways to modulate synaptic plasticity. Neurochem. Int. 57, 440–445 (2010).
Porrill, J. & Stone, J. V. Undercomplete independent component analysis for signal separation and dimension reduction. Technical report, University of Sheffield, Department of Psychology. (1998).
Tchernichovski, O., Mitra, P. P., Lints, T. & Nottebohm, F. Dynamics of the vocal imitation process: how a zebra finch learns its song. Science 291, 2564–2569 (2001).
Woolley, S. Early experience shapes vocal neural coding and perception in songbirds. Dev. Psychobiol. 54, 612–631 (2012).
Lipkind, D. et al. Stepwise acquisition of vocal combinatorial capacity in songbirds and human infants. Nature 498, 104–108 (2013).
Lipkind, D. et al. Songbirds work around computational complexity by learning song vocabulary independently of sequence. Nat. Commun. 8, 1247 (2017).
Yanagihara, S. & Yazaki-Sugiyama, Y. Auditory experience-dependent cortical circuit shaping for memory formation in bird song learning. Nat. Commun. 7, 11946 (2016).
Dudek, S. M. & Bear, M. F. Homosynaptic long-term depression in area CA1 of hippocampus and effects of N-methyl-D-aspartate receptor blockade. Proc. Natl. Acad. Sci. USA 89, 4363–4367 (1992).
Markram, H., Lübke, J., Frotscher, M. & Sakmann, B. Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science 275, 213–215 (1997).
Bi, G. Q. & Poo, M. M. Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci. 18, 10464–10472 (1998).
Zhang, L. I., Tao, H. W., Holt, C. E., Harris, W. A. & Poo, M. M. A critical window for cooperation and competition among developing retinotectal synapses. Nature 395, 37–44 (1998).
Feldman, D. E. The spike-timing dependence of plasticity. Neuron 75, 556–571 (2012).
Butts, D. A., Kanold, P. O. & Shatz, C. J. A burst-based "Hebbian" learning rule at retinogeniculate synapses links retinal waves to activity-dependent refinement. PLoS Biol. 5, e61 (2007).
Pawlak, V., Wickens, J. R., Kirkwood, A. & Kerr, J. N. Timing is not everything: neuromodulation opens the STDP gate. Front. Synaptic Neurosci. 2, 146 (2010).
Frémaux, N. & Gerstner, W. Neuromodulated spike-timing-dependent plasticity, and theory of three-factor learning rules. Front. Neural Circuits 9, 85 (2016).
Seol, G. H. et al. Neuromodulators control the polarity of spike-timing-dependent synaptic plasticity. Neuron 55, 919–929 (2007).
Izhikevich, E. M. Solving the distal reward problem through linkage of STDP and dopamine signaling. Cereb. Cortex 17, 2443–2452 (2007).
Florian, R. V. Reinforcement learning through modulation of spike-timing-dependent synaptic plasticity. Neural Comput. 19, 1468–1502 (2007).
Legenstein, R., Pecevski, D. & Maass, W. A learning theory for reward-modulated spike-timing-dependent plasticity with application to biofeedback. PLoS Comput. Biol. 4, e1000180 (2008).
Urbanczik, R. & Senn, W. Reinforcement learning in populations of spiking neurons. Nat. Neurosci. 12, 250–252 (2009).
Frémaux, N., Sprekeler, H. & Gerstner, W. Functional requirements for reward-modulated spike-timing-dependent plasticity. J. Neurosci. 30, 13326–13337 (2010).
Brea, J., Senn, W. & Pfister, J. P. Matching recall and storage in sequence learning with spiking neural networks. J. Neurosci. 33, 9565–9575 (2013).
Rezende, D. J. & Gerstner, W. Stochastic variational learning in recurrent spiking networks. Front. Comput. Neurosci. 8, 38 (2014).
Isomura, T., Kotani, K. & Jimbo, Y. Cultured cortical neurons can perform blind source separation according to the free-energy principle. PLoS Comput. Biol. 11, e1004643 (2015).
Isomura, T. & Friston, K. In vitro neural networks minimise variational free energy. Sci. Rep. 8, 16926 (2018).
Harris, K. D. & MrsicFlogel, T. D. Cortical connectivity and sensory coding. Nature 503, 51–58 (2013).
Hofer, S. B. et al. Differential connectivity and response dynamics of excitatory and inhibitory neurons in visual cortex. Nat. Neurosci. 14, 1045–1052 (2011).
Merolla, P. A. et al. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science 345, 668–673 (2014).
Chicca, E., Stefanini, F., Bartolozzi, C. & Indiveri, G. Neuromorphic electronic circuits for building autonomous cognitive systems. Proc. IEEE 102, 1367–1388 (2014).
Lappalainen, H. & Honkela, A. Bayesian nonlinear independent component analysis by multilayer perceptrons. In Advances in independent component analysis (pp. 93–121) (London, UK: Springer, 2000).
Karhunen, J. Nonlinear independent component analysis. In Roberts, S. & Everson, R. (Eds), Independent component analysis: principles and practice (pp. 113–134) (Cambridge, UK: Cambridge University Press, 2001).
Isomura, T. & Toyoizumi, T. On the achievability of blind source separation for high-dimensional nonlinear source mixtures. Preprint at, https://arxiv.org/abs/1808.00668 (2018).
Acknowledgements
This work was supported by RIKEN Center for Brain Science (T.I. and T.T.), Brain/MINDS from AMED under Grant Number JP19dm020700 (T.T.), and JSPS KAKENHI Grant Number JP18H05432 (T.T.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Author information
Contributions
Conceived the idea: T.I. and T.T. Performed the analyses: T.I. and T.T. Wrote the paper: T.I. and T.T.
Ethics declarations
Competing Interests
The authors declare no competing interests.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Isomura, T., Toyoizumi, T. Multi-context blind source separation by error-gated Hebbian rule. Sci Rep 9, 7127 (2019). https://doi.org/10.1038/s41598-019-43423-z