A Local Learning Rule for Independent Component Analysis

Humans can separately recognize independent sources when they sense their superposition. This decomposition is mathematically formulated as independent component analysis (ICA). While a few biologically plausible learning rules, so-called local learning rules, have been proposed to achieve ICA, their performance varies depending on the parameters characterizing the mixed signals. Here, we propose a new learning rule that is both easy to implement and reliable. Both mathematical and numerical analyses confirm that the proposed rule outperforms other local learning rules over a wide range of parameters. Notably, unlike other rules, the proposed rule can separate independent sources without any preprocessing, even if the number of sources is unknown. The successful performance of the proposed rule is then demonstrated using natural images and movies. We discuss the implications of this finding for our understanding of neuronal information processing and its promising applications to neuromorphic engineering.

Figure S1. The absence of spurious solutions and relaxation time of the EGHR with sources conforming to Laplace (A) and uniform (B) distributions. We numerically confirmed that there was no local minimum. We considered a two-dimensional source (dim s = 2) and a transform matrix A defined as a rotation and scaling matrix A = (A_11, A_12; -A_12, A_11). We let W learn according to the EGHR so that the elements of u = Wx became independent of each other. The learning time constant was τ_W = 2 × 10^2. We defined the relaxation time as the time needed by the rule to perform ICA. ICA ability was evaluated using the ratio of the second maximum to the maximum absolute value in each row and column of the matrix K = WA. Thus, for the ith row, we compared |K_il/K_ik| with threshold e_th = 0.01, where |K_ik| and |K_il| are the maximum and second maximum absolute values in the ith row; similarly, for the jth column, we compared |K_lj/K_kj| with e_th, where |K_kj| and |K_lj| are the maximum and second maximum absolute values in the jth column. We defined the relaxation time as the time at which these ratios for all i, j first became smaller than e_th; this was evaluated once every 100 steps. Parameters A_11 and A_12 were varied within -2 ≤ A_11 ≤ 2 and -2 ≤ A_12 ≤ 2 in increments of 0.05. Relaxation times were calculated 100 times for each parameter set, and the means are shown in the graphs. The upper bound of the simulation time was T = 4 × 10^7. In all cases, W started from an identity matrix and converged to one of the ICA solutions before the T-th step.

Figure S2. Dimension dependency of the relaxation time of the EGHR. (A) Relaxation time with a rotation matrix A. We assumed the transform matrix A to be a rotation matrix rotated around all possible axes (N(N - 1)/2 axes) with angles θ_k randomly selected from -π/4 ≤ θ_k < π/4, where N is the common dimension of sources, inputs, and outputs, and k indexes one of the N(N - 1)/2 axes. The relaxation time was calculated using the same criterion as in Fig. S1. A learning time constant of τ_W = 2 × 10^4 was used. The upper bound of the simulation time was T = 1 × 10^9. Red boxes represent the median of the relaxation time distributions for a Laplace distribution, while blue boxes represent those for a uniform distribution. Further, W was started from an N × N identity matrix. (B) Dimension dependency of the relaxation time for Laplace (red) and uniform (blue) distributions. We assumed A to be a random matrix, with each element randomly selected from a normal Gaussian distribution, A_ij ~ N(0, 1). Note that only matrices A whose determinant was larger than exp(-N/2) were used for the simulations. A learning time constant of τ_W = 2 × 10^3 was used. Other parameters are the same as in (A). The results reveal that although the relaxation time increased with source dimension for both distributions, it remained finite and grew more slowly than exponentially.

Figure S3. Robustness of the EGHR to the choice of nonlinear function g. We generated a two-dimensional source obeying p(s_i) ∝ exp(-β|s_i|^α) (α > 0, β > 0) and assumed A to be a rotation matrix, A = (cos π/6, -sin π/6; sin π/6, cos π/6). Note that β was chosen so that the variance of s_i was one. We investigated how the choice of g (the functions designed for α = 1 and α = ∞) influences the relaxation time of the EGHR to one of the ICA solutions over a range of α. The relaxation time was calculated using the same criterion as in Fig. S1. A learning time constant of τ_W = 1 × 10^5 was used. The upper bound of the simulation time was T = 1 × 10^8, and W was started from an identity matrix. The red circles represent relaxation times when we used a nonlinear function g_L(u_i) optimized for a Laplace distribution (thus α = 1; see the red arrow in the figure). Blue circles represent relaxation times when we used a nonlinear function g_U(u_i) optimized for a uniform distribution (α = ∞; blue arrow). Filled circles indicate that ICA was achieved with the nonlinear function before the T-th step, while open circles indicate that it was not. The details of g_L(u_i) and g_U(u_i) are described in Methods. When the source obeys a Gaussian distribution (α = 2; dashed line), both g_L(u_i) and g_U(u_i) fail to achieve ICA, since any rotation matrix W makes the elements of u independent of each other when α = 2.
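The convergence criterion described in the legend of Fig. S1 (the second-largest absolute entry in every row and column of K = WA must fall below e_th times the largest, so that K is close to a scaled permutation matrix) can be sketched as follows; the function name and interface are illustrative, not from the original code.

```python
import numpy as np

def ica_converged(K, e_th=0.01):
    """Fig. S1 criterion: ICA is considered achieved when, in every row and
    every column of K = W A, the second-largest absolute entry is smaller
    than e_th times the largest entry."""
    K = np.abs(np.asarray(K, dtype=float))
    for M in (K, K.T):                        # rows first, then columns
        for vec in M:
            first, second = np.sort(vec)[::-1][:2]
            if first == 0 or second / first >= e_th:
                return False
    return True

# A scaled permutation matrix passes the criterion; a mixing K does not.
print(ica_converged(np.array([[0.0, 2.0], [-1.5, 0.001]])))  # True
print(ica_converged(np.array([[1.0, 0.5], [0.4, 1.0]])))     # False
```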

S2.2 Linear stability of the EGHR
The aim of this section is to determine the necessary and sufficient condition for the ICA solution W = A^{-1} to become a stable equilibrium point of the EGHR. The ICA solution is stable if and only if all terms in d²L, the second differential form of the cost function of the EGHR, are non-negative. Let us define a matrix K as K = WA, so that u is rewritten as u = Ks. We write z_i := -log p_0(s_i), z_i′ = ∂z_i/∂s_i = g(s_i), and z_i″ = ∂²z_i/∂s_i² = g′(s_i).
First, we consider an analogy between the EGHR and the Amari rule [15]. The cost function of the Amari rule is L_A = ⟨Σ_i z(u_i)⟩ - log|det W|, and the expectations of its first and second differential forms are evaluated at the ICA solution. The necessary and sufficient conditions for the Amari rule to be linearly stable are then ⟨s_i² z_i″⟩ > -1 and ⟨s_i²⟩⟨z_i″⟩ > 1. Similarly, dL, the first differential form of L, is calculated as Eq. (S6), which is the differential form of the EGHR; as described in Methods, taking its expectation yields Eq. (11). The second-order differential form of L, d²L, is then calculated as Eq. (S9), evaluated at the ICA solution with g(0) = 0. Case 2-1. When i = k and j ≠ l, Eq. (S9) becomes zero: either s_j or s_l is independent of the other variables, and (E(s) - E_0)∂g(s_i)/∂s_i is an even function of s_1, …, s_N, so that (g(s_i)² s_j s_l + (E(s) - E_0)(∂g(s_i)/∂s_i) s_j s_l) is an odd function of s_j and s_l and its expectation is zero (see S2.1.2).
Case 2-2. When i = k and j = l, Eq. (S9) reduces to an expression that is an even function of s_1, …, s_N.

S2.3 The absence of spurious solutions of the EGHR with Gaussian sources
In this section, we show the absence of spurious solutions of the EGHR when the sources obey a Gaussian distribution. Note that, while ICA is not defined for Gaussian sources, the EGHR can still perform input whitening. In the next section, we further show that there is no spurious solution as long as the source distribution is nearly Gaussian.
Assuming that s_i obeys a normal Gaussian distribution, p_0(s_i) = N(s_i; µ = 0, σ² = 1), the quantities z_i, z_i′, z_i″, and E_0 are calculated as z_i = u_i²/2 + (1/2)log 2π, z_i′ = g(u_i) = u_i, z_i″ = 1, and E_0 = N⟨s_i²/2 + (1/2)log 2π⟩ + 1 = N/2 + (N/2)log 2π + 1, respectively. Then K̇ can be rewritten as Eq. (S14), whose two terms are given by Eqs. (S15) and (S16). The condition under which Eq. (S15) becomes zero for all i is Σ_k K_ki² = 1 for all i; Eq. (S16) becomes zero for all i under the corresponding decorrelation condition. Taken together, Eq. (S14) is rewritten as Eq. (S17). The first term of Eq. (S17) decorrelates the elements of u, while the second term only scales u; the equilibrium condition on K follows. It is easy to see from Eq. (S17) that K = 0 is an unstable equilibrium point. On the other hand, the K^T K = I solution indicates that K is a rotation matrix R. From an analysis of the linear stability, it turns out that K = R resides at a valley of L, and K = 0 resides at a peak of L (see S2.2).
There is no other peak or valley of L. Therefore, the global minimum of L is attained at K = R, at which input whitening is achieved.
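This result can be illustrated numerically. The sketch below simulates gradient descent on the EGHR cost with Gaussian sources, using the forms given above (g(u) = u, E(u) = Σ_i u_i²/2 + (N/2)log 2π, and E_0 = N/2 + (N/2)log 2π + 1); the mixing matrix, learning rate, and batch size are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2
A = np.array([[1.2, 0.6], [-0.4, 0.9]])       # arbitrary mixing matrix
E0 = N / 2 + (N / 2) * np.log(2 * np.pi) + 1  # E_0 for Gaussian sources
W = np.eye(N)
eta, batch = 0.005, 100

for _ in range(20000):
    s = rng.normal(size=(N, batch))           # Gaussian sources
    x = A @ s                                 # mixed inputs
    u = W @ x                                 # outputs
    E = 0.5 * np.sum(u**2, axis=0) + (N / 2) * np.log(2 * np.pi)
    g = u                                     # g(u) = u for a Gaussian prior
    W += (eta / batch) * ((E0 - E) * g) @ x.T # EGHR update

K = W @ A
print(K.T @ K)  # close to the identity: K has converged to a rotation
```

Starting from W = I, K = WA converges so that K^T K ≈ I, i.e., the outputs are whitened, consistent with the analysis above.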

S2.4 The absence of spurious solutions of the EGHR with weakly non-Gaussian sources
In this section, we propose an approach to calculating the curvature of L when the sources obey a nearly Gaussian probability distribution. We show that in this case there is no spurious solution.
Step 1. We define the general form of a source distribution by p_0(s_i) ∝ exp(γ_1 s_i² + γ_2 s_i⁴ + γ_3 s_i⁶ + ···). Assuming that γ_1 = -1/2, γ_2 = ε, and γ_3, γ_4, … are of order O(ε²) or less, where ε is a small constant, we obtain the first-order approximation of p_0(s_i) in ε as p_0(s_i) ≈ N(s_i; 0, 1)(1 + εs_i⁴)/⟨1 + εs_i⁴⟩_{N(si)}. The integral of N(s_i; 0, 1)(1 + εs_i⁴) can be regarded as the expectation of (1 + εs_i⁴) over N(s_i; 0, 1), which is calculated as ⟨1 + εs_i⁴⟩_{N(si)} = 1 + 3ε, where ⟨•⟩_{N(si)} is the expectation over N(s_i; 0, 1). Thus, p_0(s_i) can be further approximated as p_0(s_i) ≈ N(s_i; 0, 1)(1 + ε(s_i⁴ - 3)). The kurtosis of p_0(s_i) is calculated as ⟨s_i⁴⟩/⟨s_i²⟩² - 3 = 24ε; therefore, p_0(s_i) is a super-Gaussian distribution if ε is positive and a sub-Gaussian distribution if ε is negative.
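The 24ε kurtosis can be checked by numerical integration of p_0(s_i) ∝ N(s_i; 0, 1)(1 + εs_i⁴); the grid and the value ε = 10⁻³ are arbitrary (for larger ε, second-order corrections in ε become visible).

```python
import numpy as np

eps = 1e-3
s = np.linspace(-10.0, 10.0, 200001)
p = np.exp(-s**2 / 2) * (1 + eps * s**4)  # unnormalized p0(s)
p /= p.sum()                              # normalize on the grid
m2 = np.sum(s**2 * p)                     # second moment
m4 = np.sum(s**4 * p)                     # fourth moment
kurtosis = m4 / m2**2 - 3
print(kurtosis, 24 * eps)                 # ≈ 0.023 vs 0.024
```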
We then consider z(u_i) for the general prior p_0(s_i), namely z(u_i) = -log p_0(u_i). We assume that z(u_i) is an even function; its derivative is g(u_i) = z′(u_i). From the relationship s = K^{-1}u, φ(s) is expanded in terms of the elements of K^{-1}. We define a matrix C by C := ⟨(E(u) - E_0)g(u)u^T φ(s)⟩_{N(u)}; K̇ becomes zero if and only if C is zero. Because E(u) is an even function, the diagonal and off-diagonal (i ≠ j) elements of C can be evaluated in closed form. In particular, C_ij becomes zero if and only if Σ_k {c_1 (K^{-1})_ki³ (K^{-1})_kj + c_2 (K^{-1})_kj³ (K^{-1})_ki} = 0 holds for all i ≠ j, where c_1 and c_2 are constants determined by the source distribution. In contrast, if u_i and u_j are correlated, C_ij is not equal to zero; therefore, u must be decorrelated in order for C_ij to be zero. Accordingly, because u should obey a normal Gaussian distribution to reach an equilibrium point, a necessary condition for K̇ = 0 is that K be proportional to a rotation matrix R.
Step 2. In this step, we assume that K is close to a rotation matrix (K = R) and determine the rotations that give K̇ = 0. Note that N(u; 0, I) = N(s; 0, I) holds in this case, because a rotation of a whitened Gaussian distribution remains white. In the following, we first consider the case where A is a rotation matrix and later generalize the results to non-rotational A. Moreover, we assume that E_0 is chosen such that K continues to be a rotation matrix (see S2.3). Since K is supposed to be a rotation matrix, its degrees of freedom number N(N - 1)/2. We define a rotation in the s_i–s_j plane by the angle θ_ij (1 ≤ i < j ≤ N) and the corresponding rotation matrix by R_θij. Here, R_θij is the same as the identity matrix I except that its (i, i)th, (i, j)th, (j, i)th, and (j, j)th elements are cos θ_ij, -sin θ_ij, sin θ_ij, and cos θ_ij, respectively. Therefore, any rotation can be written as the product of such R_θij over pairs of axes. Let us suppose K = R_θij B_θij holds, where B_θij is any rotation matrix except a rotation in the s_i–s_j plane. Then Ṙ_θij can be written as Eq. (S26), which is approximated as Eq. (S27); the elements of Ṙ_θij other than the (i, i), (i, j), (j, i), and (j, j) elements are fixed at zero. Thus, from Eqs. (S27) and (S28), we obtain Eq. (S29). K_ik and K_jk can then be expanded in terms of θ_ij and B_θij, and K_ik³K_jk and K_jk³K_ik can be calculated accordingly. Because θ_ij = 0, π/2 are sufficient conditions for K to be an ICA solution, substituting these values and assuming that θ_ij is independent of B_θij, Eq. (S35) can be approximated as Eq. (S36), whose coefficient is positive definite when K is a rotation matrix. Therefore, θ̇_ij becomes zero if and only if θ_ij = kπ/4 (k = 0, 1, …, 7). Notably, if θ_ij = 0 is stable, then θ_ij = kπ/4 (k = 2, 4, 6) are also stable while θ_ij = kπ/4 (k = 1, 3, 5, 7) are unstable; if θ_ij = 0 is unstable, exactly the opposite holds.
Therefore, if θ_ij = 0 is stable, which depends on the sign of ε(c_1 - c_2), only ICA solutions give equilibrium points of the θ_ij, meaning that there is no spurious solution of the EGHR for source distributions near Gaussian. Furthermore, for any transform matrix A, A^T A is a positive definite matrix, so that even when A is not a rotation matrix, there is no spurious solution of the EGHR for source distributions near Gaussian.
Step 3. Lastly, we evaluate the sign of ε(c_1 - c_2) to determine whether these ICA solutions are stable or unstable. The sign of (c_1 - c_2) follows from the moments computed in Step 1. Accordingly, from Eqs. (S36) and (S39), we obtain the stability of the ICA solutions.

S2.5 Simplification of conventional local ICA rules
In this section, to speed up numerical simulations in Fig. 3B, we analytically simplify the conventional local ICA rules, assuming that A and W are rotation matrices.

S2.5.1 The Linsker rule
In the Linsker rule [20] (see also Eq. (5)), a term of the weight update becomes W^{-T}. Therefore, if (I - Q) rapidly converges to a⟨uu^T⟩ (a > 0), the Linsker rule is equal to the Bell-Sejnowski rule [20]; to achieve such a condition, the learning rule of Q must operate on a much faster timescale than that of W. To compare the EGHR with the Linsker rule, we consider the case where the sources are not much slower than v and the sampling time is discrete. If we assume a discrete sampling time Δt and that Q quickly converges to a fixed point, Eq. (5) can be rewritten as Eq. (S41). From the second and third equations of Eq. (S41), we obtain a solution for v, Eq. (S42). In order for Eq. (S42) to converge, |I - aKK^TΔt/τ_v| < 1 should hold. If K is proportional to a rotation matrix, that is, K = cR (c > 0), this condition simplifies to |1 - ac²Δt/τ_v| < 1. As ac²Δt/τ_v is a positive scalar, the condition on c for v to converge is 1 - ac²Δt/τ_v > -1, or equivalently, c² < 2τ_v/(aΔt). This indicates that a large a is likely to make v unstable if K is started from large initial conditions. The Hebbian term ⟨avx^T⟩ in the first equation of Eq. (S41), with a general K, can then be calculated as Eq. (S44); for a strict solution, we need to evaluate the sum over delays k = 0, 1, …, weighted by ρ((k + 1)Δt) := ⟨s(t + (k + 1)Δt)s(t)^T⟩, the auto-correlation of the signal train generated from Eq. (7). From Eq. (S44), we define Eq. (17) as the reduced Linsker (R-Linsker) rule. The numerical calculation in Fig. 3B uses this R-Linsker rule to speed up simulations.
Next, we qualitatively explore how the ICA solutions disappear depending on parameter values. Approximating the auto-correlation function by ρ(t - t′) ≈ exp(-|t - t′|/τ_s) for analytical simplicity, we obtain Eq. (S45), and the R-Linsker rule is simplified as Eq. (S46). The coefficient matrix of W^{-T} is defined by C_L(K) := {τ_v(exp(Δt/τ_s) - 1)/(aΔt) K^{-T}K^{-1} + I}^{-1} (see Table S1). For small Δt, C_L(K) ≈ {τ_v/(aτ_s) K^{-T}K^{-1} + I}^{-1}, which decreases dramatically as (τ_v/τ_s)/a increases, while ⟨g(u)x^T⟩ does not. This indicates that the ICA solutions (i.e., equilibrium points) disappear if (τ_v/τ_s)/a is too large.
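The suppression of the W^{-T} term can be made concrete by evaluating the coefficient matrix C_L(K) defined above for increasing τ_v; the parameter values below are illustrative, not from the paper.

```python
import numpy as np

def C_L(K, tau_v, tau_s, a, dt):
    """Coefficient of W^{-T} in the R-Linsker rule:
    C_L(K) = {tau_v (exp(dt/tau_s) - 1)/(a dt) K^{-T} K^{-1} + I}^{-1}."""
    n = K.shape[0]
    b = tau_v * (np.exp(dt / tau_s) - 1.0) / (a * dt)
    Kinv = np.linalg.inv(K)
    return np.linalg.inv(b * Kinv.T @ Kinv + np.eye(n))

# As (tau_v/tau_s)/a grows, ||C_L|| shrinks while <g(u) x^T> is unaffected,
# so the equilibrium (ICA) solutions can disappear.
dt, tau_s, a = 0.01, 1.0, 1.0
for tau_v in (0.1, 1.0, 10.0, 100.0):
    print(tau_v, np.linalg.norm(C_L(np.eye(2), tau_v, tau_s, a, dt)))
```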

S2.5.2 The Foldiak rule
The Foldiak rule [19] was originally described as shown in Eq. (6). To speed up the computation of the two-dimensional velocity map in Fig. 3B, we consider a reduced version of the original Foldiak rule under the following assumptions: (1) s, x, u, and v are all two-dimensional vectors and the lateral interaction matrix is Q = [0, q; q, 0] with q ≤ 0; (2) A is a rotation matrix and W is a rotation matrix up to a scaling factor, so that K = WA is a rotation by angle θ scaled by a constant factor c. Expectations of the form ⟨F(u_1)G(u_2)⟩ are evaluated using σ_F, the parity of F, which takes the value 1 if F is an even function and -1 if F is odd. Similar calculations for ⟨F(u_1)G(u_2)⟩_n (n = 3 and 4) together yield the result; hence, setting F = G = f with σ_f = -1, we find that ⟨f(u_1)f(u_2)⟩ = 0. This confirms that h = q = 0 is a solution of the Foldiak rule, i.e., ⟨y_i⟩ = 0 and ⟨y_1 y_2⟩ = 0.
We next show that this h = q = 0 solution is stable. To demonstrate linear stability, we express the derivative of the output, dy, evaluated at h = q = 0. Using this expression, perturbations of h and q, i.e., dh and dq, develop according to the corresponding linearized dynamics and decay. Finally, we show that h and q must both be zero when the Foldiak rule achieves an ICA solution W = A^{-1}. In this case, u_1 = s_1 and u_2 = s_2 are independent inputs, and the outputs are given by y_i = f(s_i + qy_j + qb - h). To keep the outputs independent, q must be zero, because any non-zero q would introduce a dependency between the two outputs. If q = 0, then h = 0 is the unique solution to ⟨y_i⟩ = 0 for monotonically increasing f.
Altogether, h = q = 0 is a stable solution of the Foldiak rule throughout the learning in this case, and h = q = 0 must hold when it eventually achieves the ICA solution.
Although it is possible to initialize h and q to non-zero values, the output y becomes multi-stable if |q| is large. Hence, it is most natural to initially set h = q = 0, following the original proposal [19]. Accordingly, Eq. (6) is simplified under these conditions.

S2.6.1 The Bell-Sejnowski, Amari, and Cichocki rules
If W = A^{-1}, u = Wx becomes u = s and ⟨g(u)x^T⟩_{p(x)} becomes ⟨g(s)s^T⟩_{p0(s)} A^T. Note that ⟨g(s)s^T⟩_{p0(s)} is a diagonal matrix, since s_1, …, s_N are independent of each other, and its diagonal elements are one, i.e., ⟨g(s)s^T⟩_{p0(s)} = I (see S2.1.1). As W = A^{-1} satisfies Eq. (S50), W = A^{-1} is one of the solutions to the Bell-Sejnowski rule.
The equilibrium points of the Amari and Cichocki rules are calculated similarly. The Bell-Sejnowski, Cichocki, and Amari rules differ only by right-multiplication by a regular (invertible) matrix, which does not remove an equilibrium point. Therefore, W = A^{-1} is also an equilibrium state of the Amari and Cichocki rules.
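The claim that ⟨g(s)s^T⟩_{p0(s)} = I for independent unit-variance sources can be illustrated by Monte Carlo sampling. The sketch below uses a Laplace prior as an example, for which p_0(s) ∝ exp(-√2|s|) gives g(s) = z′(s) = √2 sgn(s); the sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
# Unit-variance Laplace sources: scale b = 1/sqrt(2) gives Var = 2 b^2 = 1.
s = rng.laplace(scale=1 / np.sqrt(2), size=(2, n))
g = np.sqrt(2) * np.sign(s)        # g(s) = z'(s) for this Laplace prior
M = (g @ s.T) / n                  # Monte Carlo estimate of <g(s) s^T>
print(M)                           # close to the 2x2 identity matrix
```

The off-diagonal entries vanish because the sources are independent, and the diagonal entries equal one for this normalization of the prior.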

S2.6.2 The Linsker rule
As shown in S2.5.1, the Linsker rule becomes equal to the Bell-Sejnowski rule when the input is much slower than the dynamics of v [20]; therefore, W = A^{-1} is also an equilibrium state of the Linsker rule. However, when sources obey a rapidly changing super-Gaussian distribution (e.g., a Laplace distribution), the Linsker model does not always have a stable equilibrium point, depending on the time constant and distribution shape of the source. Indeed, the existence of ICA solutions for the R-Linsker rule (Eq. (S40)) depends on the parameters when the sources are not much slower than v. If we assume K = cI (c > 0), the second term on the right of Eq. (S46) becomes ⟨g(u)x^T⟩ = ⟨g(cs)s^T⟩A^T. If we assume sources obey a Laplace distribution, we obtain ⟨g(cs)s^T⟩ = ⟨sgn(s)s^T⟩ = I, which is independent of c. Writing b := τ_v(exp(Δt/τ_s) - 1)/(aΔt), the fixed points of c are the roots of c² - c + b = 0. Therefore, in order for K to converge to a non-zero finite value, the radial distance c should be started from c > (1 - √(1 - 4b))/2, and the fixed point of c is c = (1 + √(1 - 4b))/2. Accordingly, in order for the fixed point to exist, b < 1/4 must hold, i.e., τ_v/(aΔt) < 1/(4(exp(Δt/τ_s) - 1)).
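The existence condition τ_v/(aΔt) < 1/(4(exp(Δt/τ_s) - 1)) can be restated as b < 1/4 with b := τ_v(exp(Δt/τ_s) - 1)/(aΔt), and one consistent reading of the elided fixed-point expressions is that c is a root of c² - c + b = 0. The helper below is a sketch under that assumption; the function name and parameter values are illustrative.

```python
import numpy as np

def c_fixed_points(tau_v, tau_s, a, dt):
    """Fixed points of the radial distance c for K = c I under the R-Linsker
    rule, read as the roots of c**2 - c + b = 0 with
    b = tau_v (exp(dt/tau_s) - 1)/(a dt); they exist only when b < 1/4."""
    b = tau_v * (np.exp(dt / tau_s) - 1.0) / (a * dt)
    if b >= 0.25:
        return None                    # no fixed point: ICA solution disappears
    r = np.sqrt(1 - 4 * b)
    return (1 - r) / 2, (1 + r) / 2    # (lower, upper) roots

print(c_fixed_points(tau_v=0.1, tau_s=1.0, a=1.0, dt=0.01))   # two roots
print(c_fixed_points(tau_v=10.0, tau_s=1.0, a=1.0, dt=0.01))  # None
```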
Note that a also controls the stability of the dynamics of v: if a is set too large, the dynamics of v do not converge.

S2.6.3 The Foldiak rule
When the Foldiak rule (Eq. (6)) achieves the ICA solution W = A^{-1}, the output is given by v = f_F(s). Therefore, ⟨vs^T⟩ must be proportional to an identity matrix. In this case, the equilibrium condition is given by 0 = Ẇ = a⟨vs^T⟩A^T - A^{-1}, or equivalently, A^{-1}A^{-T} = a⟨vs^T⟩ ∝ I. Accordingly, in order for the Foldiak rule to achieve ICA, A must be proportional to a rotation matrix. Moreover, if a is chosen so that a⟨f_F(s_i)s_i⟩ = 1, A^{-1} should be equal to A^T; that is, A must be a rotation matrix in order for the Foldiak rule to have the ICA solution.
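The requirement that ⟨vs^T⟩ be proportional to the identity at the ICA solution holds because the s_i are independent and the nonlinearity acts componentwise. The Monte Carlo sketch below uses f_F = tanh as a stand-in nonlinearity (an assumption; the actual f_F is defined in the main text) and unit-variance Laplace sources.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
# Independent unit-variance Laplace sources (scale 1/sqrt(2) gives Var = 1).
s = rng.laplace(scale=1 / np.sqrt(2), size=(2, n))
v = np.tanh(s)            # v = f_F(s) at the ICA solution W = A^{-1}
M = (v @ s.T) / n         # Monte Carlo estimate of <v s^T>
print(M)                  # off-diagonal entries vanish; diagonal entries match
```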

S2.7.1 The Bell-Sejnowski, Amari, and Linsker rules
Here, we repeat the linear stability analysis of the Bell-Sejnowski, Amari, and Linsker rules [32] in order to apply it to the Cichocki and Foldiak rules in the following sections.
Let us define K as K = WA and represent u as u = Ks. Linear stability is confirmed by substituting K = WA = I + εJ into each ICA rule, where the positive constant ε is small enough, and showing that the perturbation εJ decays regardless of the value of J. Specifically, the Amari rule (Eq. (3)) is rewritten as K̇ ∝ ⟨I - g(Ks)s^TK^T⟩K (see Table S1). If we assume that W is near A^{-1}, so that K = I + εJ (ε << 1), the Taylor expansion of g(s + εJs) is given by g(s + εJs) = g(s) + ΛεJs + O(|εJs|²), where Λ = diag(g′(s_1), …, g′(s_N)).
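The first-order expansion g(s + εJs) = g(s) + ΛεJs with Λ = diag(g′(s_1), …, g′(s_N)) can be checked numerically; g = tanh is used here as an example nonlinearity.

```python
import numpy as np

rng = np.random.default_rng(3)
N, eps = 3, 1e-3
s = rng.normal(size=N)
J = rng.normal(size=(N, N))
g = np.tanh
gprime = lambda x: 1 - np.tanh(x) ** 2

exact = g(s + eps * J @ s)
Lam = np.diag(gprime(s))             # Λ = diag(g'(s_1), ..., g'(s_N))
linear = g(s) + Lam @ (eps * J @ s)  # first-order Taylor expansion
print(np.max(np.abs(exact - linear)))  # residual of order eps**2
```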

S2.7.3 The Foldiak rule
We make the same assumptions as in S2.5.2. Thus, the Foldiak rule reduces to the form given there, and the perturbation dynamics take the form J̇_ij ∝ a Σ_k J_ik⟨Λ_ii s_k s_j⟩ - J_ij.