Error-Gated Hebbian Rule: A Local Learning Rule for Principal and Independent Component Analysis

We developed a biologically plausible unsupervised learning algorithm, the error-gated Hebbian rule (EGHR)-β, that performs principal component analysis (PCA) and independent component analysis (ICA) in a single-layer feedforward neural network. If the parameter β = 1, it extracts the subspace spanned by the major principal components, similarly to Oja's subspace rule for PCA. If β = 0, it separates independent sources, similarly to Bell-Sejnowski's ICA rule, but without requiring the same number of input and output neurons. Unlike these engineering rules, the EGHR-β can be easily implemented in a biological or neuromorphic circuit because it uses only local information available at each synapse. We analytically and numerically demonstrate the reliability of the EGHR-β in extracting and separating major sources given high-dimensional input. By adjusting β, the EGHR-β can extract sources that are missed by the conventional engineering approach that first applies PCA and then ICA. Namely, the proposed rule can successfully extract hidden natural images even in the presence of dominant or non-Gaussian noise components. The results highlight the reliability and utility of the EGHR-β for large-scale parallel computation of PCA and ICA and its future implementation in neuromorphic hardware.

Ẇ ∝ (1 − β)〈(E_0 − E(u)) g(u) x^T〉 − β〈(E_u(u) − E_x(x)) u x^T〉, (3)
where the first term is the ICA term and the second term is the PCA term, the dot over W denotes a temporal derivative, g(u) ≡ dE(u)/du is a nonlinear activation function, and E_0 ≡ 1 + 〈E(u)〉 = 1 − 〈log p_0(u)〉. We will refer to Eq. (3) as the EGHR-β. Note that this definition of E_0 is slightly different from the original definition, 1 − 〈log p_0(s)〉 4, but the resulting behavior turns out to be quite similar (see below for comparison). In the following, E(u), E_u(u), and E_x(x) are referred to as global factors (global signals) that represent neuron-nonspecific error signals. The cost function of the EGHR-β consists of the ICA and PCA terms weighted by 1 − β and β, respectively, and its derivative provides a local learning rule for PCA and ICA. As we will see, the ICA term (the first term of Eq. (3)) makes the outputs independent of each other, while the PCA term (the second term) increases the correlation between the output and input squared-norms by driving (E_u(u) − E_x(x)) close to zero. Importantly, the EGHR-β can be implemented using only local connections because W is updated according to the product of pre- and post-neurons' activities and a global signal (Fig. 1). This property is highly desirable for parallel computing and neuromorphic engineering (see Discussion). The EGHR-β becomes a local learning rule for ICA when β = 0 and for PCA when β = 1. More generally, for 0 < β < 1, it can extract principal components from high-dimensional inputs while separating signals into individual sources.
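To make the locality of Eq. (3) concrete, the following is a minimal online sketch in NumPy. The specific factor choices are illustrative assumptions rather than the definitions used in this paper: a sub-Gaussian-preferring E(u) = Σ_i u_i^4/4 (so that g(u) = u^3) and squared-norm factors E_u(u) = |u|^2/2 and E_x(x) = |x|^2/2, with E_0 = 1 + 〈E(u)〉 tracked by a slow running average.

```python
import numpy as np

def eghr_beta_step(W, x, E0, beta, eta=1e-4, tau=1e-3):
    """One online EGHR-beta update (sketch of Eq. (3)).

    Assumed factors (illustrative only): E(u) = sum(u**4)/4, hence g(u) = u**3;
    E_u(u) = |u|^2/2 and E_x(x) = |x|^2/2 for the PCA term.
    """
    u = W @ x                            # output activities u = Wx
    E = np.sum(u**4) / 4.0               # E(u): ICA global factor
    g = u**3                             # g(u) = dE(u)/du
    Eu = 0.5 * np.sum(u**2)              # E_u(u): output squared-norm factor
    Ex = 0.5 * np.sum(x**2)              # E_x(x): input squared-norm factor

    # The ICA term drives the outputs toward independence; the PCA term drives
    # E_u(u) toward E_x(x), i.e., |u|^2 toward |x|^2.
    dW = (1.0 - beta) * (E0 - E) * np.outer(g, x) - beta * (Eu - Ex) * np.outer(u, x)
    W = W + eta * dW

    # E_0 = 1 + <E(u)> is tracked online by a slow running average.
    E0 = (1.0 - tau) * E0 + tau * (1.0 + E)
    return W, E0
```

Each element W_ij is updated from the presynaptic input x_j, the postsynaptic activity u_i (or g(u_i)), and scalar global signals only, which is the locality property emphasized above.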
The features of the EGHR-β. We start by investigating how the PCA and ICA terms of the EGHR-β are related to previously proposed non-local learning rules: Oja's subspace rule for PCA 11 and Bell-Sejnowski's ICA rule 5,6, respectively. A simple analysis shows that the PCA term of the EGHR-β is equal to Oja's subspace rule for PCA 11 up to a multiplication with a positive definite matrix when the sources independently follow Gaussian distributions (see Eq. (8) in Methods). Next, the ICA term of the EGHR-β is equivalent to Bell-Sejnowski's ICA rule around the neighborhood of ICA solutions when the numbers of input and output neurons are equal (M = N) and the source distribution is given by p(s) ∝ Π_i exp(−b|s_i|^a) with positive constants a and b (see Eqs (12)-(13) in Methods). Note that the definition of E_0 in this paper is slightly altered from the original one 4 to straightforwardly demonstrate the relationship with Bell-Sejnowski's rule. However, the resulting ICA performance is similar to that of the original version: mathematical analyses give the same linear stability condition for ICA solutions (see Methods and Supplementary Information), and numerical simulations show the absence of major spurious solutions when random mixing matrices with up to 20-dimensional sources are studied (Fig. S1), as well as the robustness of the outcome to the choice of the nonlinear function g(u) derived from priors within the sub- or super-Gaussian family (Fig. S2).
Unlike the classical learning rules, the EGHR-β can perform these computations using only local information available at each synapse. Moreover, unlike Bell-Sejnowski's rule, its ICA term can handle a greater number of inputs than output neurons, which makes the EGHR-β a strong candidate for performing both dimensionality reduction and separation of independent sources. Notably, beyond the above conditions, the EGHR-β can outperform Oja's subspace rule and/or Bell-Sejnowski's ICA rule, as we show analytically and numerically in the following. Throughout the Results section, we use a uniform prior distribution (p_0(s_i) = 1/(2√3) for |s_i| < √3, and 0 otherwise) to preferentially extract sub-Gaussian sources with negative kurtosis.
We analytically study the existence and stability of the solutions of the EGHR-β (see Eqs (14)-(18) in Methods for details) and find that the EGHR-β can perform PCA without assuming Gaussian sources and ICA without assuming an equal number of input and output neurons. Namely, (1) if β ≈ 1, the only stable fixed point of the EGHR-β is such that the outputs are spanned by the major principal components; hence, the EGHR-β with β ≈ 1 performs PCA (see Case 1 in Methods and Supplementary Methods S2, S3); and (2) if β ≈ 0, the only stable fixed point is such that the outputs represent sub-Gaussian independent sources; hence, the EGHR-β with β ≈ 0 performs ICA (see Case 2 in Methods and Supplementary Methods S2, S3). These properties are also confirmed by numerical simulations, in which four independent Gaussian sources and four independent uniformly distributed sources with different variances are mixed as inputs (Fig. 2). Typical outputs with β = 0 and 0.8 are illustrated in Fig. 2A. With β = 0.8, the EGHR-β succeeded in extracting the subspace of the four major principal components from the eight-dimensional data (PCA-like condition), while with β = 0, it succeeded in extracting the sub-Gaussian sources (ICA-like condition). Note that we show the result for β = 0.8 here (rather than β = 1) because, with β = 0.8, the EGHR-β separates independent sub-Gaussian sources in addition to performing PCA. The preference gradually shifts from the ICA-like to the PCA-like regime as β increases (Fig. 2B).
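The two regimes can be checked directly from the trained weights; the following sketch uses illustrative metrics that are not taken from the paper: for the ICA-like regime, each normalized row of K = WA should concentrate on a single source, and for the PCA-like regime, the row space of W should coincide with the span of the top-N principal eigenvectors of the input covariance.

```python
import numpy as np

def ica_quality(K):
    """For K = W A, return the largest absolute entry of each normalized row;
    values near 1 mean each output represents a single source (up to sign and permutation)."""
    R = np.abs(K) / np.linalg.norm(K, axis=1, keepdims=True)
    return R.max(axis=1)

def pca_subspace_overlap(W, cov_x, n_components):
    """Mean squared cosine of the principal angles between the row space of W and
    the span of the leading eigenvectors of the input covariance (1.0 = identical subspaces)."""
    eigval, eigvec = np.linalg.eigh(cov_x)
    top = eigvec[:, np.argsort(eigval)[::-1][:n_components]]
    Q, _ = np.linalg.qr(W.T)                      # orthonormal basis of W's row space
    cosines = np.linalg.svd(Q.T @ top, compute_uv=False)
    return float(np.mean(cosines**2))
```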
For comparison with the EGHR-β, we consider three non-local algorithms: Oja's subspace rule for PCA 11, Amari's ICA rule 7, and the cascade of the Oja and Amari rules (see Eqs (5) and (11) in Methods). Note that the results of Bell-Sejnowski's ICA rule 5,6 are the same as those of Amari's ICA rule. The PCA 13 and ICA 7 cost functions are used as measures (see Eqs (6) and (9) in Methods for their details) and are plotted as the spread of eigenvalues is continuously changed. As expected from the mathematical analyses (see Methods and Supplementary Methods S2-S4), both the EGHR-β (β = 0.8) and Oja's subspace rule can extract a subspace of major principal components, reducing the normalized PCA cost by a similar amount (Fig. 2C). Note that Oja's subspace rule achieves the theoretical optimum. Next, we explore how these different methods reduce the ICA cost that assumes a uniform prior distribution p_0 (Fig. 2D). This cost function is minimized if independent uniform sources are extracted as outputs. Interestingly, reducing this cost function is not trivial for conventional learning rules. Amari's ICA rule alone cannot separate the sources because it works only when the number of neurons matches the number of unknown sources 4. A common strategy in this scenario is to first apply PCA and then apply ICA to its output. However, this PCA-to-ICA cascade fails to reduce the cost function because the first PCA step discards the minor uniformly distributed sources. Only the EGHR-β (β = 0) can extract and separate the minor sub-Gaussian independent sources (Fig. 2D).
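For reference, here is a minimal sketch of the two non-local baselines and their cascade, using their standard textbook forms (Oja's subspace rule ΔW ∝ u x^T − u u^T W; Amari's natural-gradient rule ΔV ∝ (I − g(y) y^T) V). The cubic nonlinearity for sub-Gaussian sources and the learning rates are illustrative assumptions.

```python
import numpy as np

def oja_subspace_step(W, x, eta=1e-3):
    """Oja's subspace rule: W <- W + eta * (u x^T - u u^T W), with u = W x."""
    u = W @ x
    return W + eta * (np.outer(u, x) - np.outer(u, u) @ W)

def amari_ica_step(V, y_in, eta=1e-3, g=lambda y: y**3):
    """Amari's natural-gradient ICA rule: V <- V + eta * (I - g(y) y^T) V, with y = V y_in."""
    y = V @ y_in
    return V + eta * (np.eye(len(y)) - np.outer(g(y), y)) @ V

def cascade_step(W, V, x, eta=1e-3):
    """PCA-to-ICA cascade: Oja's rule on the first layer, Amari's rule on its output."""
    W = oja_subspace_step(W, x, eta)
    V = amari_ica_step(V, W @ x, eta)
    return W, V
```

Because the first layer keeps only the major principal components, any minor sub-Gaussian sources are already discarded before the second layer sees them, which is the failure mode described above.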
An application to extract natural and artificial images. We demonstrate the performance of the EGHR-β using mixtures of natural and artificial images as inputs. Twelve high-variance colored noise images with zero kurtosis, four pictures of distinct hedgehogs with negative kurtosis, and 84 low-variance white noise images with negative kurtosis were used as sources (Fig. 3A). The color intensities of the individual pixels were converted to real numbers and then centered to zero mean following refs 4,14 (see also Methods). The 100 images were superposed to produce 100 mixed images using a random but fixed 100 × 100 rotation matrix (Fig. 3B). At each time step, one pixel was randomly sampled from the identical position in the 100 mixed images and fed as input into a one-layer feed-forward neural network with four output neurons (as in Fig. 1). The synaptic strength matrix W of the model was updated according to the EGHR-β with β = 0, 0.02, or 0.8 (Eq. (3)). Note that we used the uniform distribution as the prior p_0 because natural images tend to follow a sub-Gaussian distribution with negative kurtosis 4. For comparison, we introduced the same input into a two-layer feed-forward neural network in which the first and second layers were trained by Oja's subspace rule and Amari's ICA rule, respectively (see Methods). Pictures reconstructed from the neural outputs after training are displayed in Fig. 3C (see also Supplementary Movie S1 for the learning process). We found that the cascade of the Oja and Amari rules (Fig. 3C, bottom row) extracted mixtures of colored noise images and natural images. These images were extracted because ICA rules generally cannot separate mixed Gaussian sources and because the mixed hedgehog image represents the primary principal component of the input, owing to the small but non-negligible correlation between the four hedgehog images. Hence, the Oja rule extracted the subspace spanned by the three colored noise images and the mixed hedgehog image as major components, and the following Amari rule simply segregated this non-Gaussian hedgehog image from the rest. Next, the EGHR-β with β = 0.8 extracted four high-variance colored noise components (Fig. 3C, third row). The reason that the EGHR-β dropped the primary principal component (the mixed hedgehog image in Fig. 3C, bottom row) can be understood from the stability analysis (see Eq. (16) in Methods for details), which shows that if the eigenvalues are similar to each other, the EGHR-β solution for β ≈ 1 becomes more stable when the outputs extract sources with large positive kurtosis. Accordingly, the EGHR-β with β = 0.8 extracted Gaussian colored noise images rather than the (barely) primary sub-Gaussian principal component (i.e., the mixed hedgehog image). By contrast, the EGHR-β with β = 0.02 successfully extracted and separated all minor hedgehog images even in the presence of large Gaussian noise (Fig. 3C, second row). This parameter setting preferentially extracted images with negative kurtosis while discarding low-variance noise. Finally, the result of the EGHR-β with β = 0 varied depending on the initial synaptic weights because it does not efficiently utilize the variance of the images. It tended to extract some minor hedgehog images and some mixtures of noise images. Figure 3C, top row, shows an example in which three hedgehog images and one mixed noise image are extracted. Because the β = 0 setting preferentially extracts independent components with negative kurtosis, the extracted noise image included the low-variance sub-Gaussian noise, although it tolerated contamination by the colored noise images (see the top right inset panel in Fig. 3C for the magnified image).
Therefore, the EGHR-β can flexibly extract either high-variance images or minor natural images with large negative kurtosis depending on the tuning of β, purely in an unsupervised manner. Furthermore, only the EGHR-β with β slightly larger than 0, but not the cascade of PCA and ICA algorithms, can extract sources with intermediate variance and negative kurtosis while discarding both the high-variance Gaussian noise and the low-variance sub-Gaussian noise. (The three inset panels on the right of Fig. 3C display magnified images, showing that only the result of the EGHR-β with β = 0, but not the others, includes low-variance white noise images.) This result demonstrates the benefit of performing both PCA and ICA with the same set of neurons.

Discussion
In this study, we developed a novel learning rule for PCA and ICA, the EGHR-β. The EGHR-β can compress data by removing minor components and extracting either principal components or sub-Gaussian sources from a high-dimensional dataset by adjusting the parameter β. The learning rule updates each synaptic strength in a single-layer linear feedforward network based on the sum of PCA and ICA terms, where each term is given by a simple product of pre- and postsynaptic neurons' activities and a global scalar factor. Hence, the proposed scheme is much simpler than conventional ICA methods that require non-local information 5-7,15, dense and plastic lateral inhibition between output neurons 16-18, or an additional preprocessing stage for PCA to remove background noise 11,19. This simplicity is a great advantage of the EGHR-β because it reduces the number of processing layers and connections, and the related energy costs, making its implementation in a neuromorphic chip 20 significantly easier.
If sources follow a Gaussian distribution, we showed that the EGHR-β can extract the subspace spanned by the principal components in a way that is mathematically equivalent to the well-known Oja's subspace rule (see Eq. (8) in Methods). In contrast, if sources follow non-Gaussian distributions, the fixed point and its linear stability are influenced by the kurtosis of the discarded components. Because of this property, the EGHR-β can robustly perform BSS even in the presence of large Gaussian noise, where a standard cascade of PCA-to-ICA processing cannot. While the EGHR-β generally consists of a sum of PCA and ICA terms, we can approximately express it as a single-term three-factor rule when the source distributions are close to Gaussian. In this case, the postsynaptic factor of the ICA term, g(u), becomes identical to that of the PCA term, u, and, hence, the net global error signal becomes the weighted sum of those for the PCA and ICA terms. Note that an additional mechanism may be required to extract minor sources with positive kurtosis (i.e., super-Gaussian sources) because a solution that extracts super-Gaussian sources can be unstable in the presence of large noise.
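Concretely, with g(u) ≈ u in the near-Gaussian regime, the two terms of Eq. (3), in the form reconstructed above (a sketch, not a verbatim quotation of the original equation), share the Hebbian product u x^T and collapse into a single three-factor rule,
Ẇ ∝ 〈[(1 − β)(E_0 − E(u)) − β(E_u(u) − E_x(x))] u x^T〉,
in which one global error signal, the weighted sum of the ICA and PCA error signals, gates the same Hebbian term.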
In biological neural networks, associative (Hebbian) plasticity occurs depending on the timing of pre- and post-neurons' activity (i.e., a two-factor learning rule) 21-23. However, recent studies show that third factors, such as neuromodulators 24-29, GABAergic inputs 30,31, and glial factors 32, can modulate the original associative plasticity in various ways (the so-called three-factor learning rule 33,34). The EGHR-β is one such three-factor learning rule: each of its PCA and ICA terms updates the synaptic strength by the product of pre- and postsynaptic activities and a global error signal. The global error signals are defined as a non-linear sum of output activities, similarly to inhibitory neurons in the visual cortex 35,36, and they change the learning rate and can even switch plasticity from Hebbian to anti-Hebbian, in a manner similar to what has been reported for GABA 31. Note that the PCA and ICA learning could occur at non-overlapping times in a biological setup, such as in wake and sleep conditions 37. Importantly, this process uses only information that actual neurons can access via their synaptic connections to achieve PCA and ICA. Thus, the EGHR-β is a local rule, while conventional methods, such as the Oja and Amari rules 7,11, use non-local information (synaptic strengths of non-connected neurons) to update synaptic strengths. This demonstrates the utility of the EGHR-β also as a model of learning processes in biological neural networks.
In summary, we developed the EGHR-β by enhancing the original EGHR to handle high-dimensional inputs in a biologically plausible manner. The EGHR-β would be useful in engineering for improving object recognition accuracy against noisy backgrounds. Because the EGHR-β is easily implemented with recently advanced neuromorphic chips and can process big data in parallel with high energy efficiency, it is expected to have an impact in various fields such as engineering and the life sciences.

Methods
First, we describe the relationship between the EGHR-β and the original EGHR 4. Next, for comparison with the EGHR-β, we introduce non-local PCA 11,13 and ICA 5-7 rules. Finally, we analyze the fixed points of the EGHR-β and their linear stability.

Relationship between the EGHR-β and the original EGHR.
In this paper, the definition of the cost function of the ICA part of the EGHR-β is slightly different from that of the original EGHR 4. Their relationship is such that the cost of the original EGHR (the left-hand side) decomposes into the cost of the ICA part of the EGHR-β (the first term on the right-hand side) plus a second term and a constant factor. The constant factor makes no difference. The second term on the right gives additional stability to the original EGHR by minimizing the difference between 〈E(u)〉 and 〈E(s)〉. However, since the first term of the right-hand side (i.e., the ICA part of the EGHR-β) alone has the ICA ability, this second term is not necessary (see Supplementary Figs S1 and S2). Moreover, their linear stability conditions around ICA solutions are the same. Although only the original EGHR has an additional tr(dK)^2 term in its second derivative 4, this does not change the linear stability condition. Indeed, the second derivative of the ICA part of the EGHR-β is more similar to that of the well-known Bell-Sejnowski's ICA rule 5,6 around ICA solutions, as we describe below.
Conventional non-local rules for comparison. In this section, we introduce conventional learning rules that perform either PCA or ICA. Unlike the EGHR-β introduced above, all rules introduced here are non-local. For a comparison of PCA, Oja's subspace rule for PCA 11 is considered,
Ẇ ∝ 〈u x^T − u u^T W〉. (5)
This rule is an enhancement of Oja's original model 38 and can extract, as its N-dimensional neural output, the subspace spanned by the first to the Nth principal components. Importantly, Oja's subspace rule is a non-local rule because it needs to calculate the product of W^T and u (alternatively, it needs to prepare new neurons y = W^T u, but how to extract W^T in a biological setting is open to discussion). While Oja's subspace rule does not have a cost function, Xu proposed a similar learning rule that is derived as a gradient descent rule of a cost function and achieves PCA 13. The cost function is defined as the mean squared reconstruction error, 〈|x − W^T u|^2〉 (Eq. (6)), because the purpose of PCA is to obtain a representation using a small number of output units with the least loss. The dynamics of W are defined by gradient descent on this cost, Ẇ ∝ 〈u(x − W^T u)^T + W(x − W^T u)x^T〉 (Eq. (7)), which is termed the least mean squared error-based PCA 13. Empirically, the second term of Eq. (7) converges to zero quickly. Consequently, the least mean squared error-based PCA finds the same solution as Oja's subspace rule (Eq. (5)). We use this cost function in Fig. 2 to quantify the success of PCA.
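A minimal sketch of this measure, assuming the batch-averaged reconstruction error written above (the averaging convention is an illustrative choice):

```python
import numpy as np

def pca_reconstruction_cost(W, X):
    """Mean squared reconstruction error <|x - W^T u|^2> with u = W x.

    X holds one sample per row (T x M); W is the N x M weight matrix.
    """
    U = X @ W.T          # outputs u for all samples (T x N)
    X_hat = U @ W        # reconstructions W^T u (T x M)
    return float(np.mean(np.sum((X - X_hat)**2, axis=1)))
```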
Indeed, when sources follow a unit Gaussian distribution, the PCA term of the EGHR-β becomes Oja's subspace rule 11 up to a multiplication with a positive definite matrix. Suppose β = 1 and that x follows a Gaussian distribution with zero mean and covariance AA^T. From the Bussgang theorem 39, the EGHR-β reduces to Eq. (8), which is equivalent to Oja's subspace rule up to a multiplication with the positive definite matrix AA^T. For a comparison with non-Gaussian sources, see the fixed-point analysis of Case 1 below, where the fixed points are also similar.
In addition, for a comparison of BSS ability, Amari's ICA rule 7 is considered. The cost function of Amari's ICA rule is defined by the Kullback-Leibler divergence 9 between p(u) and p_0(u),
L_A ≡ KL[p(u) ‖ p_0(u)]. (9)
The gradient of L_A gives Bell-Sejnowski's non-local ICA rule 5,6, Ẇ ∝ (W^T)^{-1} − 〈g(u) x^T〉, while the natural gradient of L_A gives Amari's non-local ICA rule 7, Ẇ ∝ (I − 〈g(u) u^T〉) W (Eq. (11)).
The ICA term of the EGHR-β is close to Bell-Sejnowski's ICA rule 5,6. Suppose M = N, β = 0, and u = Ks with the square matrix K ≡ WA. From Lemma S1.1 in Supplementary Methods S1, the EGHR-β can be rewritten in terms of K and dg ≡ g(u) − Kg(s) (Eq. (12)). We numerically checked that the second term in the last line of Eq. (12) is smaller than the first term. Furthermore, when W is around ICA solutions (i.e., K = I + dK is close to the identity matrix), it follows from Lemmas S1.1 and S1.3 that the EGHR-β reduces to Eq. (13). The fixed points of the EGHR-β are characterized by Eqs (14)-(15), where P ∈ ℝ^{N×N} is a full-rank orthogonal matrix.
Linear stability of the EGHR-β. Below, we investigate the linear stability of the fixed points described above (Cases 1-3 below are the same cases as those in the fixed-point analysis). See Supplementary Methods S3 for derivation details.

Case 1. The fixed point of Eqs (14)-(15) is linearly stable if and only if the condition given in Eq. (16) holds, which depends on the eigenvalues Λ_ii and on the kurtosis of the sources.
In the special case of β = 1 with s_{N+1}, …, s_M following a unit Gaussian distribution, the condition for linear stability is Λ_ii ≥ Λ_jj for 1 ≤ i ≤ N and N + 1 ≤ j ≤ M. Thus, the state is stable when the output u represents the space spanned by the first to Nth principal components, while the state is unstable when u involves other minor components, meaning that the EGHR-β can extract the major principal components. More generally, when s_{N+1}, …, s_M follow non-Gaussian distributions, the linear stability condition also depends on the kurtosis (κ_i ≥ −2), as shown above.
To see how the shape of the source distribution influences the linear stability, let us consider the special case in which s_1, …, s_N follow p_0(s_i) ∝ exp(−b|s_i|^a), where a > 0 is a positive constant and b > 0 is tuned such that 〈s_i^2〉 = 1, and s_{N+1}, …, s_M follow distributions with zero mean and unit variance. In this case, we can straightforwardly show that a > 2 is a necessary and sufficient condition for linear stability. Namely, when s_1, …, s_N follow a sub-Gaussian distribution (a > 2) and s_{N+1}, …, s_M follow Gaussian or super-Gaussian distributions, s_1, …, s_N are chosen as outputs. By contrast, when s_1, …, s_N follow a super-Gaussian distribution (a < 2), s_1, …, s_N may not be extracted simultaneously. Hence, the EGHR-β extracts sub-Gaussian sources. Taken together, the EGHR-β can extract either sub-Gaussian or major sources depending on β.
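To illustrate the role of the exponent a, the following sketch samples from p(s) ∝ exp(−|s/α|^a) using the standard Gamma-based construction (not taken from the paper) and estimates the excess kurtosis: a < 2 gives super-Gaussian (positive), a = 2 gives approximately zero, and a > 2 gives sub-Gaussian (negative) kurtosis.

```python
import numpy as np
from math import gamma as gamma_fn

def sample_generalized_gaussian(a, size, rng):
    """Samples s with density proportional to exp(-|s/alpha|^a), with alpha chosen so that var(s) = 1."""
    alpha = np.sqrt(gamma_fn(1.0 / a) / gamma_fn(3.0 / a))          # unit-variance scale
    magnitude = alpha * rng.gamma(1.0 / a, 1.0, size) ** (1.0 / a)  # |s/alpha|^a ~ Gamma(1/a, 1)
    sign = rng.choice([-1.0, 1.0], size=size)
    return sign * magnitude

def excess_kurtosis(s):
    s = s - s.mean()
    return np.mean(s**4) / np.mean(s**2) ** 2 - 3.0

rng = np.random.default_rng(0)
for a in (1.0, 2.0, 4.0):
    s = sample_generalized_gaussian(a, 200_000, rng)
    print(a, round(excess_kurtosis(s), 2))   # positive, near zero, and negative, respectively
```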
For Fig. 2: In the simulations, M = dim(s) = dim(x) = 8 and N = dim(u) = 4 are used. The 8 × 8 mixing matrix A = RΛ^{1/2} consists of a rotation matrix R and a diagonal matrix of eigenvalues Λ. We suppose that the amplitudes of the sources satisfy Λ_11 = Λ_22 = 4, Λ_33 = Λ_44 = 2, Λ_55 = Λ_66 = 1, and Λ_77 = Λ_88 = 0.5 in Fig. 2A, and Λ_77 = Λ_88 = 1 in Fig. 2C and D. Moreover, we suppose that the odd-numbered sources (s_1, s_3, s_5, s_7) follow a unit Gaussian distribution, while the even-numbered sources (s_2, s_4, s_6, s_8) follow a unit-variance uniform distribution (p(s_i) = 1/(2√3) for |s_i| < √3, and 0 otherwise). The training time and the learning rate are T = 2 × 10^7 and η = 8 × 10^−6, respectively. In all cases, R is a random rotation matrix, and W starts from a random matrix in which each element W_ij follows a Gaussian distribution with zero mean and a variance of 0.25. Note that the source codes of the EGHR-β are appended as Supplementary Source Codes.
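A sketch of this protocol, reusing the eghr_beta_step function sketched earlier; the training time is shortened here for illustration, and the QR-based random-rotation construction is an assumed convenience.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 8, 4
eigs = np.array([4.0, 4.0, 2.0, 2.0, 1.0, 1.0, 0.5, 0.5])   # Lambda_ii as in Fig. 2A
R, _ = np.linalg.qr(rng.standard_normal((M, M)))             # random orthogonal (rotation-like) matrix
A = R @ np.diag(np.sqrt(eigs))                               # mixing matrix A = R Lambda^{1/2}

def sample_sources():
    s = np.empty(M)
    s[0::2] = rng.standard_normal(M // 2)                        # s1, s3, s5, s7: unit Gaussian
    s[1::2] = rng.uniform(-np.sqrt(3), np.sqrt(3), M // 2)       # s2, s4, s6, s8: unit-variance uniform
    return s

W = 0.5 * rng.standard_normal((N, M))    # W_ij ~ N(0, 0.25)
E0, beta = 1.0, 0.8                      # beta = 0.8 (PCA-like); use beta = 0 for the ICA-like regime
for t in range(200_000):                 # shortened from T = 2e7 for illustration
    x = A @ sample_sources()
    W, E0 = eghr_beta_step(W, x, E0, beta=beta, eta=8e-6)
```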
For Fig. 3: The performance of the EGHR-β is demonstrated using a natural image dataset. We prepare a total of 100 sources (M = 100): 12 high-variance colored-noise images with zero kurtosis, four low-variance natural images (hedgehogs), and 84 low-variance white-noise images with negative kurtosis. These sources consist of 200 × 200 pixels with RGB color. The natural images were retrieved from the Caltech101 dataset (http://www.vision.caltech.edu/Image_Datasets/Caltech101/) 40, rescaled between 0 and 1, and adjusted to have a variance of 0.02. The high-variance colored-noise images are created by enlarging 50 × 50 white-noise images by a factor of four, where the original small-size images are produced by linearly summing truncated Gaussian noise (in the 0-1 range) and Laplace noise so as to have a mean of 0.5, a variance of 0.023, and a kurtosis of 0. The low-variance white-noise images are generated from a uniform distribution with a mean of 0.5 and a variance of 0.002. We use these natural and noise images according to the protocol explained in refs 4,14 by first subtracting the constant mean of 0.5 (i.e., the gray background). Each of the 100 images (200 × 200 pixels, RGB) is treated as a vector (40,000 pixels × 3 colors = 120,000 dimensions). The source data composed of these 100 vectors (a 100 × 120,000 matrix) are mixed by a 100 × 100 rotation matrix R. A column of the resulting input data is randomly sampled at each step for training (T = 3 × 10^7 steps in total). The mixed signals are introduced as input into a one-layer feed-forward neural network (as in Fig. 1) to obtain the four-dimensional output (N = 4). The synaptic strength matrix W is updated by the EGHR-β with β = 0, 0.02, or 0.8. Hence, the input-to-output dimensions are 100 → 4. For comparison, a two-layer feed-forward network is considered in which the synaptic strengths in the first and second layers are updated by Oja's subspace rule and Amari's ICA rule, respectively. In this case, the input, intermediate representation, and output dimensions are 100 → 4 → 4.

The learning rate is η = 2 × 10^−3. For all algorithms, R is a common random rotation matrix, and W starts from the identity matrix.
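Finally, a compact sketch of the Fig. 3 mixing-and-sampling pipeline described above, again reusing the eghr_beta_step sketch. The image loading and preprocessing are stubbed with random placeholders, and the identity-like initialization of the rectangular W is an assumed interpretation.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, D = 100, 4, 120_000        # 100 sources, 4 outputs, 200 x 200 RGB pixels flattened

# S: 100 x 120000 matrix of centered source images (natural + noise images),
# prepared as described above; a random placeholder is used here.
S = 0.05 * rng.standard_normal((M, D))
R_mix, _ = np.linalg.qr(rng.standard_normal((M, M)))   # 100 x 100 random orthogonal matrix
X = R_mix @ S                                          # mixed images, 100 x 120000

W = np.eye(N, M)                                       # identity-like initialization (assumption)
E0 = 1.0
for t in range(100_000):                               # shortened from T = 3e7 for illustration
    j = rng.integers(D)                                # sample one pixel position
    x = X[:, j]                                        # 100-dimensional input vector
    W, E0 = eghr_beta_step(W, x, E0, beta=0.02, eta=2e-3)

U = W @ X                                              # 4 x 120000 output traces, reshaped into images for Fig. 3C
```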