Somatodendritic consistency check for temporal feature segmentation

The brain identifies potentially salient features within continuous information streams to process hierarchical temporal events. This requires the compression of information streams, for which effective computational principles are yet to be explored. Backpropagating action potentials can induce synaptic plasticity in the dendrites of cortical pyramidal neurons. By analogy with this effect, we model a self-supervising process that increases the similarity between dendritic and somatic activities where the somatic activity is normalized by a running average. We further show that a family of networks composed of the two-compartment neurons performs a surprisingly wide variety of complex unsupervised learning tasks, including chunking of temporal sequences and the source separation of mixed correlated signals. Common methods applicable to these temporal feature analyses were previously unknown. Our results suggest the powerful ability of neural networks with dendrites to analyze temporal features. This simple neuron model may also be potentially useful in neural engineering applications.

C ognitive functions of the brain entail modeling of externally or internally driven dynamical processes. For this modeling, the brain has to identify the salient temporal features of continuous information streams. How the brain conducts this time-series analysis remains unknown, but the component processes necessary for the analysis are partly known. The process by which frequently recurring segments of temporal sequences are concatenated into single units that are easy to process is called chunking or bracketing 1 . Chunking underlies sensory scene analyses, motor learning, episodic memory, and language processing 2-6 . In predictive coding [7][8][9] , the brain may chunk information in bottom-up and top-down pathways to identify variables relevant to the hierarchical Bayesian modeling of mental processes. Another important class of temporal feature analysis is blind source separation (BSS: related to the so-called cocktail party effect) in which the brain separates mixed sensory signals (typically auditory) from multiple sources in order to recognize the individual sources 10 . Despite their functional importance, the mechanisms by which neural circuits in the brain analyze and learn temporal features remain largely unclear. Whether different temporal feature analyses require specialized network architectures and learning rules is also unknown.
In this study, we introduce a novel solution to these fundamental problems of brain computing. We show, in a twocompartment neuron model, that the minimization of information loss between dendritic synaptic input and a neuron's own output spike trains enables efficient learning of clustered temporal events in a completely unsupervised manner. This learning proceeds intracellularly and can be viewed as a self-supervising process in which a single neuron (more precisely, the soma) generates an appropriate supervision signal to learn the spatiotemporal firing patterns repeated in upstream neurons (projecting to the dendrites of the neuron). The resultant learning rule conceptually resembles Hebbian learning with backpropagating action potentials, which experimental results [11][12][13][14][15] have demonstrated to be crucial to synaptic plasticity in cortical neurons. Importantly, our learning rule exploits the fact that neuronal adaptation is able to maintain somatic membrane potential in a regime where spiking has high information content [16][17][18][19] . Therefore, the gain and threshold of the somatic transfer function in our model are adapted in a history-dependent manner.
To our surprise, a family of competitive networks of the proposed neuron model can perform a variety of unsupervised learning tasks ranging from chunking to BSS, which were previously performed by specialized, distinct networks and learning rules. Members of this family have the same network architecture but different network parameters (e.g., synaptic weights). We emphasize that some chunking tasks solvable with our model (and also by humans) are difficult for conventional machine learning methods due to uniform transition probabilities between consecutive items 5 . Furthermore, the same network model successfully separates the mixed signals of highly correlated sources, namely musical instruments playing the same note. BSS has been extensively studied in machine learning [20][21][22][23] , but how the brain solves this problem is not fully understood. Our results provide suggestions for computational principles which may underlie the wide range of subconscious temporal feature analyses by cortical networks and the active role of dendrites in these processes.
Our algorithm builds on ideas introduced by the twocompartment learning rule of Urbanczik and Senn 24 , expanding the scope of neural computing towards slow-feature analysis (SFA 25 ) and independent-component analysis (ICA) based on temporal correlations 26 . A central feature of our learning rule is that synaptic weights on the dendrite are changed such that the somatic membrane potential fluctuates with unit variance around a target value. Our formulation is inspired by the observations that neuronal adaptation shifts the neuron always toward a regime of efficient information transmission [16][17][18][19] .

Results
The minimization of regularized information loss. Our model learns temporal features of an input based on a novel learning rule which we call minimization of regularized information loss (MRIL). Suppose the dendrite attempts to predict the responses of soma. In short, MRIL achieves this by minimizing the information loss (within a certain recent period) when the somatic activity is replaced with its model, that is, the dendritic activity driven by given synaptic inputs, the loss can be easily minimized if the somatic responses are well predicted. This will be the case when the neuron learns to selectively respond to temporal patterns recurring in synaptic input. Figure 1a schematically illustrates the present learning rule in a two-compartment spiking neuron model. Mathematically, MRIL minimizes the Kullback-Leibler (KL) divergence between the probability distributions of somatic and dendritic activities (see Methods for mathematical details). Note that in the resultant learning rule the somatic response is fed back to the dendrite to train dendritic synapses. These processes may be regarded as a consistency check between the soma and dendrite. Although the underlying biological mechanisms are not modeled here, backpropagating action potentials may provide such a feedback signal in cortical pyramidal neurons 27 .
The division of labor between the soma and dendrite was previously modeled with a teaching signal given explicitly or implicitly to the soma 24 . Unlike the previous model, our model modulates the gain and threshold of somatic responses according to the recent history of somatic responses. These modulations enable the model to avoid a trivial solution to the learning rule (see Methods), and therefore ensure successful learning of nontrivial temporal features. Differences between the present and previous models will be further discussed later.
Our learning rule (Eq. 16 in Methods) looks similar to maximum likelihood estimation 28 , a well-studied framework of supervised learning. However, there is a conceptual difference between them. In maximum likelihood estimation, the target data distribution (somatic activity) is provided externally as teaching signals. By contrast, our model simultaneously learns the probability distributions of input and output data without teaching signals. The consistency between the two data sets constrains the self-supervised learning, thereby avoiding an overly redundant or an overly simplistic categorization of temporal inputs. We emphasize that MRIL fits particularly well with neurons with dendrites, but the principle is generic and applicable to a broad range of information processing systems.
Learning patterned temporal inputs in single neurons. We first demonstrate that the two-compartment neuron model detects the salient temporal features recurring in synaptic input. Learning to detect and discriminate repeated temporal input patterns is crucial for various cognitive functions such as language acquisition 29,30 and motor sequence learning [2][3][4]31 . In Fig. 1b, presynaptic spike trains intermittently repeated three fixed temporal patterns of 50 ms each with equal probabilities of occurrence. These patterns may be regarded as chunks. As learning of the temporal input proceeds through the consistency check between the soma and dendrite, a single neuron gradually learned to respond selectively to an input pattern (Fig. 1c, d). The neuron learned one of the input patterns with approximately equal probabilities among the trials, although it responded to more than one input pattern in some trials (Fig. 1e). We note that all presynaptic neurons had the same average firing rates, which were constant during the entire task period (Methods). Therefore, the discrimination does not rely on differences in firing rates. Cortical neurons are actually capable of discriminating temporal inputs and generating sequence-selective spike outputs, although the synaptic sequences tested in the experiment were relatively simple 32 .
Automatic chunking with MRIL and inhibitory STDP. Next, we considered a competitive network of the two-compartment model neurons receiving similar presynaptic spike trains (Fig. 2a).
To study whether chunk-specific cell assemblies can be formed, we made recurrent inhibitory connections among these neurons modifiable by inhibitory spike timing-dependent plasticity (iSTDP; Fig. 2b). For near synchronous presynaptic and postsynaptic spikes, changes in inhibitory weights are negative in our iSTDP rule. Consequently, this rule weakens inhibition between two neurons when both of them respond to the same temporal feature, as shown below. The use of this plasticity rule for lateral inhibition is realistic given this type of STDP has been found at cortical excitatory synapses on inhibitory interneurons 33 and at inhibitory synapses in the hippocampus 34 . In either case, inhibitory circuits will exhibit the desired changes. Note that inhibitory weights were restricted in the positive regime (Methods). During learning, each neuron gradually increased coherence between the somatic and dendritic activities (Fig. 2c). The postsynaptic neurons self-organized into three neuron ensembles, each detecting one of the input activity patterns (Fig. 2d), through iSTDP which enabled mutual inhibition between the neural ensembles (Fig. 2e). The strength of lateral inhibition needs to be within an appropriate range, as too strong ( Supplementary  Fig. 1a) or too weak ( Supplementary Fig. 1b)     The dendritic component receives Poisson spike trains, and the somatic membrane potential is given as an attenuated version of the dendritic membrane potential. Output of the soma backpropagates to dendritic synapses as a self-teaching signal. Learning stops when the dendrite minimizes the error between its prediction and the actual somatic firing rate. b Three frozen spatiotemporal patterns (red, blue, and green) were repeated in irregular spike trains from 2000 input neurons. c A two-compartment neuron selectively learned one of the recurring patterns. Examples of the somatic (red) and dendritic (black dashed lines) activities are shown at the initial (top), middle (middle), and final (bottom) stages of learning. d Learning curve is shown, with circles indicating the time points at which the examples were drawn. Instantaneous correlations were calculated between the activities of the dendrite and soma every 15 s during learning. e The fraction of trials in which a single neuron model learned a selective response to one of the three repeated spike patterns is shown. The number of trials was 100. In some trials (Others), the neuron had more than one preferred pattern, i.e., the peak response to the second preferred pattern was greater than 50% of that to the most preferred pattern.
generate chunk-specific cell assemblies. The regularization parameter γ (see Methods) also has to be in an appropriate range, as values which were too large suppressed all neural responses and those which were too small did not generate selective responses to chunks ( Supplementary Fig. 1c).
Weights of mutual inhibition were strengthened rather than weakened when a neuron pair fired synchronously in several previous models 35,36 . We therefore tested whether and how the conventional iSTDP rule works in the above chunk-detection task ( Supplementary Fig. 2a). The conventional iSTDP rule generated variety of complex response patterns ( Supplementary Fig. 2b). A small portion of neurons expressed chunk-specific responses (e.g., neurons 3, 4 and 8). However, some neurons responded to more than one chunk (e.g., neurons 1 and 10) and other neurons to chunks and random inputs almost arbitrarily (e.g., neuron 5). Inhibitory weight matrix also showed no obvious cell-assembly structure ( Supplementary Fig. 2c). Therefore, the iSTDP rule shown in Fig. 2a is thought to be more suitable than the conventional one for the present chunk-detection task.
The ability of the network model to learn recurring input patterns was assessed with various types of biological noise. Background presynaptic spikes degraded the performance as the signal-to-noise ratio decreased (Fig. 2f), whereas learning was optimal at finite noise levels with synaptic transmission failure (Fig. 2g) and with jitters in presynaptic spike timing (Fig. 2h). We speculate that this disparity may reflect the different underlying noise structures. Background spikes were uncorrelated with the recurring input patterns and merely contaminated the signals, whereas transmission failures and timing jitters yielded noise patterns which were correlated with the input and thus enhanced the sampling during training. Therefore, the two types of noise are thought to induce data augmentation. Presynaptic noise may also induce a regularization effect during learning 37 . However, this effect was unlikely to be prominent in our model as not all types of presynaptic noise improved the learning.
The above results may account for the perceptual ability of humans to detect the recurrence of frozen noise patterns embedded in a noisy auditory signal 38 . As in Fig. 1b, both repeated and background auditory signals may be represented by irregular synaptic inputs to the auditory cortex. However, the subjects from this report 38 learned the noise without extensive training, indicating that the learning mechanisms might differ from the method presented here. We may use the present network model in analyzing largescale neural activity data. To show this, we performed similar simulations using synthetic data in which only a small fraction of presynaptic neurons (from a total 500) constituted a recurring pattern (Fig. 3a). (We note it is unlikely a large portion of recorded neurons participate in recurring cell assemblies in real data.) Learning was successful when the fraction of presynaptic neurons constituting the recurring pattern was 10% or 5%, but unsuccessful at 3% (Fig. 3b, c). We then considered the case where the total number of presynaptic neurons was 1,000 and 25 neurons (2.5% of all neurons) belonged to a patterned activity. Interestingly, the network still succeeded to learn the pattern, indicating that successful learning requires a minimal absolute number, but not a minimal fraction, of pattern-encoding presynaptic neurons (Fig. 3d).
Previously, STDP was used to detect repeated spike sequences 39,40 . We compared the detection performance between the present model and a STDP-based model 39 (see Supplementary Fig. 3a, b). Both models exhibited high success rates when recurring cell assemblies made up a large portion of presynaptic neurons. An interesting difference was found when only a small portion of presynaptic neurons participated in the cell assemblies. In such cases, our model outperformed the previous model ( Supplementary Fig. 3c).
We further examined the ability of our network model in learning a variety of information streams. First, we applied random sequences of three chunks comprised of four characters each (Fig. 4a) to a network model with 10 output neurons and 1000 input neurons. Each input neuron generated a 30 ms 10 Hz burst in response to a randomly assigned preferred character (Fig. 4b). This resulted in the formation of three neuron 10% 5% 3% 0%  Other cases with 25 (5%) and 15 (3%) such neurons were also examined. b Average correlations over 40 independent trials are shown between chunk-selective responses and the corresponding reference patterns. Vertical bars are standard errors and green circles show data points. P-values were calculated by two-sided Welch's t-test. c Examples of chunkselective neuronal responses in the 10% (top), 5% (middle), and 3% (bottom) cases. d A recurring firing pattern of 25 neurons was embedded into input spike trains of 1000 presynaptic neurons. The average and standard error of the input-output correlation with overlaid data points (left) and a typical response after learning (right) are shown. P-value (two-sided Welch's t-test) indicates a significant difference from the 0% case in b. As in b, 40 independent simulations were performed. NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-15367-w ARTICLE NATURE COMMUNICATIONS | (2020) 11:1554 | https://doi.org/10.1038/s41467-020-15367-w | www.nature.com/naturecommunications ensembles which selectively responded to the chunks (Fig. 4c). Principal-component analysis of the low-dimensional dynamics of the output neurons revealed the emergence of three chunks after learning (Fig. 4d). Then, we examined whether the model can learn partially overlapping chunks. In this case, some characters were shared between the three chunks ( Fig. 4e) and learning was more difficult than in the previous case. The original model, with fast synaptic current, failed to generate selective responses to the chunks (Fig. 4f). However, making the decay constant of the synaptic current slower (50 ms compared to 5 ms: see Methods) enabled the model to detect temporal inputs on a longer timescale and to successfully learn the overlapping chunks (Fig. 4g). The modified network could also learn chunks even if they were embedded with distractors, which were random sequences of arbitrary English characters (a to z) with variable lengths (3 to 7) ( Supplementary Fig. 4). These results suggest slower synaptic currents such as NMDA receptor-mediated currents may be important for chunking.
Because the word segmentation shown above is also relatively easy for other methods 41 , we tested our model with more complex input sequences generated by a random walk on a graph with a community structure in which the connection of each node to the other four occurred with an equal probability of 0.25 (Fig. 5a). Here, temporal community is clusters of frequently coappearing or mutually predicting stimuli in input sequence. The detection of this community structure is easy for human subjects but has proven difficult for conventional machine learning methods which rely on nonuniform transition probabilities between elements 5 . Like human subjects, output neurons in our model easily learned selective responses to members of a temporal community (Fig. 5b).
The network model could also learn feature detection maps from continuous sensory streams. All sensory features, either static or dynamic, arrive at the brain essentially in sequence. Therefore, we asked whether MRIL enables neural networks to learn the static features of an input when repeatedly presented in a temporal sequence. To examine this, we applied a random sequence of noisy images of oriented bars presented for 40 ms every 70 ms (Fig. 6a). The output neurons, which initially had no preferred orientations (Fig. 6b), developed well-defined preferences for specific orientations after learning (Fig. 6c), resembling a visual orientation map (Fig. 6d) 42,43 .
BSS of mutually correlated signals. The results shown above demonstrate that MRIL successfully chunks a variety of temporal inputs by detecting repeated temporal features. The question then arises whether this ability of the MRIL enables learning of other types of sequence processing tasks. One such task of cognitive and ecological importance is the so-called cocktail party problem 10 . We therefore examined the performance of our network model in the blind separation of mixed signals from multiple sources. BSS is an extensively studied problem in auditory processing [20][21][22] , and various methods have been proposed for mixtures of mutually independent signals. However, methods are limited when the original signals are comprised of mutually correlated signals 23 .
We applied MRIL to sound mixtures from two musical instruments, a bassoon and a clarinet (Bach10 Dataset) 44 , playing their respective parts of the same score (Fig. 7a) (thus the two sound sources are correlated). A mixed sound followed by the original sounds of the two instruments are presented in Supplementary Audio 1. These mixtures of signals were encoded into irregular spike trains (Fig. 7b), which in turn were applied to output neurons. After training, these neurons self-organized into two groups, each responding selectively to one of the true sources (Fig. 7c). The original sounds were then decoded from the average firing rates of these subgroups (Supplementary Audios 2 and 3). Although some high-frequency components were lost due to the low-pass filtering effect of the slow membrane dynamics ( Supplementary Fig. 5), the decoded sounds are readily comparable to the original sounds. We compared our model with a naive independent-component analysis (FastICA: Supplementary  Audio 4) 22,45 and temporal ICA (Second Order Blind Identification or SOBI: Supplementary Audios 5 and 6) 26 . We used the open source software of the SOBI for the comparison (Supplementary Methods). When the source signals are mutually independent, all three methods show excellent performance, although the ICA-based methods slightly outperformed our biology-inspired model (Fig. 7d, top). However, when the source signals are dependent on one another (i.e., mutually correlated), SOBI and our model exhibited significantly better performance than FastICA (Fig. 7d, bottom).
Although SOBI slightly outperformed our model in the present examples, SOBI only poorly performed chunking of the previous sequences of English characters, which our model could easily solve (see Fig. 4). In our simulations, SOBI could not generate  Fig. 6a). Rather, most of the units responded to all three chunks in SOBI. We conducted similar analyses for low-pass filtered versions of the input by using different time constants for coarse graining (15, 30 and 50 ms) or the bin width (1 ms or 10 ms), but the essential results remained unchanged (Supplementary Fig. 6b). We also examined SFA, a method known in temporal feature analysis 25 , on a similar task by using a Python toolkit 46 . The algorithm failed to generate any stable output when input sequences involved chunks. Detecting a whole chunk and detecting an arbitrary single character cost equally in the objective function of SFA (Methods). Due to this fact, the minimization algorithm of SFA presumably has too many solutions to chunking. Thus, our results demonstrate a virtue of the present brain-inspired model, which exhibits high levels of task performance in a wide range of temporal feature analysis. In addition, the model does not require highly task-specific network architectures.
Finally, we examined the performance of the model by varying the magnitudes of cross-talk noise between the two mixed signals (Methods). We also tested mixed signals which used the same instrument but playing different notes. In all cases, high performance was attained only at an intermediate level of cross-talk noise, implying that performance drops not only for strong noise but also for weak noise (Fig. 8a, dashed curves). Nevertheless, we could rescue the model from this counterintuitive defect for weak cross-talk noise by including another noise component (see Methods) in the somatodendritic interaction (Fig. 8a, solid curves). We speculate that the additional noise could suppress learning from harmful interferences between the original signals when both signals were weak. However, this point requires further clarification. We also examined whether the improved model trained on the original signals (i.e., vanishing cross-talk noise) exhibit better performance for other mixtures which were not used in the training. The pre-training actually made the decomposition of unexperienced mixtures easier (Fig. 8b).

Discussion
Nonlinear Hebbian and generalized STDP algorithms have been used as unsupervised learning rules to perform receptive field development 42,43 , ICA [47][48][49] , sparse coding 43 , spatio-temporal pattern detection 39,40 , or SFA 50 . Our novel algorithm belongs to the same family of methods and is applicable to some classic problems of receptive field development and ICA as well as to the additional problem of 'chunking' as an example task with specific temporal structure that has traditionally been solved with more specialized algorithms 51,52 .
We proposed a learning principle called minimization of regularized information loss (MRIL) which enables the selfsupervised learning of recurring temporal features in information streams using a family of competitive networks of twocompartment neuron models. Our model not only performs chunking but also achieves BSS from mixtures of mutually correlated signals. Importantly, although different values of parameters were learned in different tasks, the network structure was essentially the same. It is surprising that simple such neural networks with almost identical circuit structures can perform these broadly different tasks. In particular, our brain-inspired model can solve tasks, e.g., the detection of temporal community structure (Fig. 5) and the BSS of mixed correlated signals (Fig. 7), which conventional models have historically struggled with. To our knowledge, this is the first model to achieve such results on this broad collection of learning tasks.
Our learning rule minimizes the information loss between synaptically-driven dendritic activity and somatic output in the presence of neuronal adaptation. This rule uses mutually inhibiting two-compartment neurons to learn the repetition of temporal activity patterns on a slow timescale (typically, several tens to several hundreds of milliseconds). While the aim of many previous methods for chunking is to predict input sequences 53,54 , our model uses a different principle, where system learns to predict its own responses to a given input. MRIL minimizes the discrepancy between input data and output data to produce a predictable low-dimensional representation of high-dimensional input data. This learning continues until an agreement is formed by the somatic output and dendritic input regarding the lowdimensional features (i.e., chunks).
We previously used paired reservoir computing for chunking, where two recurrent networks supervise each other to mimic the partner's responses to a common temporal input 55 . Although that model also learns self-consistency between input data and output data, performance was severely limited since the model required exactly the same number of output neurons as chunks. In contrast, the present model self-organizes output neurons according to the number of temporal features.
Mutual information maximization (MIM) has often been hypothesized to describe the transfer of information between neurons 56 , and Hebbian synaptic plasticity may approximately follow MIM 57 . The aim of MRIL differs from MIM; MRIL attempts to detect recurrence, and hence salient, temporal features without considering the other information, whereas MIM ultimately implies that messages are faithfully copied at all layers of hierarchical processing. In other words, MIM does not account for the compression or abstraction of temporal inputs, whereas MRIL aims to describe how these processes may be executed in the brain and incorporates them into the model. Our results suggest such processes can even occur at the level of single cortical neurons.
Similarly to MRIL, a method called information bottleneck also compresses data streams 58 . The method contains a free parameter to determine the degree of information loss between the original and compressed data. To clarify whether there is a relationship between information bottleneck and the proposed method is an intriguing open question.
A previous model (U-S model) 24 used a learning rule similar to the present one. However, while the somatic response function undergoes activity-history-dependent modulations in our model (see Eqs. 4-7), such modulations were not included in the U-S model. Importantly, our model without these modifications (i.e., the U-S model) could not solve the present unsupervised learning tasks. Networks of the U-S model were shown to perform semiunsupervised learning, for instance, when recurrent synaptic input was configured as an effective teaching signal to the soma. In contrast, our model indicates the recent history of somatic activity is sufficient for self-supervising the learning of temporal features. We note that the somatic response modifications introduced in this model may be achieved in cortical neurons by local inhibitory circuits 59 , the plasticity of intrinsic excitability 60 or neuronal adaptation [16][17][18][19] .
Dendritic computing has been studied from various viewpoints of neural computing. Memmesheimer et al. derived the capacity of leaky integrate-and-fire neurons to implement desired transformations from streams of input spikes into desired output spike sequences 61 . The capacity was estimated by calculating the available volume of state space for generating the desired spike outputs and an error-correcting supervised learning rule was presented to attain the desired input-output associations (which does not require dendrites). Legenstein and Maass studied the role of nonlinear dendritic processing in performing various logic operations 62 . Their model combines the branch-strength potentiation of dendrites and STDP to discriminate spatial activity patterns represented in presynaptic neuron ensembles. Sacramento et al. used dendrites to implement a classical error backpropagation algorithm for supervised learning where deviations between top-down predictive signals and bottom-up sensory signals provided an error signal 63 . Redundant synaptic connections between neuron pairs have also been utilized to implement a Bayesian filtering algorithm to infer input-output associations in single neurons with realistic dendritic morphology 64 . If such a model includes both the Hebbian learning of synaptic weights and structural plasticity on the dendrites, a small number of Performance is compared between the pre-trained (cyan) and untrained (magenta) network on the original signal. Cross-talk noise in these tests was 0.5. Networks were exposed to 10 s long mixed sounds during each learning epoch and correlations were calculated afterwards.
redundant synapses is sufficient for an optimal inference. All of the models of dendritic processing discussed here are biologically more realistic compared to the present model, yet they did not address the ability of neurons with dendrites in analyzing temporal features of information streams. On the other hand, memory-related sequential activities of hippocampal neurons were modeled in terms of nonlinear amplification of synchronous inputs 65 . Furthermore, the discrimination of sequences on behavioral time scales was recently formulated in terms of the reaction-diffusion processes triggered by sequential inputs along dendrites 66 . While these processes were implemented in morphologically realistic neuron models, whether such models can perform complex temporal feature analyses is yet to be clarified. Hawkins and Ahmad modeled sequence processing in a cortical microcircuit model of formal neurons, each of which receives top-down feedback inputs on apical synapses, feedforward inputs on proximal synapses and lateral inputs from nearby neurons on multiple dendritic segments 67 . Through coincidence detection and segment-basis Hebbian learning, the network learns to recognize sparse activity patterns and to predict next spikes in an input sequence. While their model emphasizes the role of dendrites and cortical microcircuit structure in predicting spike sequences, our model demonstrates the ability of single neurons with dendrites to learn recurring temporal input patterns.
Determining which neuron or synapse should be credited for learning a desired output in a hierarchical neural circuit is a difficult problem. Solutions to this 'credit assignment problem' require feedback signals to neurons or synapses. In cortical pyramidal neurons, feedforward sensory data is thought to be received at the basal dendrites while feedback credit information is received at apical dendrites. It was recently argued that the spatial separation between the two pathways enables these neurons to solve the credit assignment problem through dendritic computing 68 . The current version of our model does not solve the credit assignment problem, and this problem arises on multiple timescales in hierarchical brain computation. How morphologically complex neurons implement the proposed temporal feature analysis and how this analysis helps the brain to solve hierarchically organized credit assignment problems are intriguing open questions.

Methods
Neural network model. Each output neuron has two compartments-somatic and dendritic. The dendritic membrane potential of output neuron i 2 1; 2; ; N out f g is calculated as where w ij is the synaptic weight between output neuron i and input neuron j. The variable e j stands for the unit postsynaptic potential induced by neuron j and is described later. The somatic activity integrates the dendritic potential, and it evolves as where τ = 15 ms and the conductance between the two compartments is g D = 0.7. The last term describes lateral inhibition with synaptic weights G ij (≥0). We calculated the inhibitory input in terms of the firing rates of output neurons. However, as explained below, spike trains of these neurons were also generated for simulating modifications of G ij by spike-timing-dependent plasticity. We assume that the soma of neuron i generates a Poisson spike train with the instantaneous firing rate ϕ som Þin terms of the nonlinear response function The parameters β i and θ i are defined as follows: where μ i (t) and σ i (t) are the mean and variance of the membrane potential, respectively, over a sufficiently long period t 0 : We set β 0 = 5 throughout this study, but the values of ϕ 0 and θ 0 are taskdepend (Supplementary Methods).
We note that the slope of nonlinearity β i and the threshold value θ i are modified as the values of μ i and σ i change during learning. As described below, the online modifications of the somatic response function maintain the dynamic range of output firing rate within a certain range. To see this, we use Eqs. (4) and (5) to obtain As the mean ofû i t ð Þ is constrained to be zero, the above equation implies that ϕ som Thus, the somatic activity does not saturate.
In our model, sensory information given to the network is first encoded into Poisson spike trains. Input neuron i 2 1; 2; ; N in f g generates a Poisson spike train where δ is the Dirac' delta function and t i,q denotes the time of the q-th spike of input neuron i. The presynaptic spikes induce the following synaptic current I i (t): where the synaptic time constant τ syn = 5 ms (τ syn = 50 ms in Fig. 4g and Supplementary Fig. 4c). The synaptic currents in turn evoke a postsynaptic potential e i (t) as The unit amplitude of postsynaptic potentials is given as e 0 = 25 in all simulations.
Optimal learning rule for MRIL. To extract the characteristic features of the temporal input, our model compresses the high dimensional data carried by the input sequence onto a low dimensional manifold of neural dynamics. The model performs this by modifying the weights of dendritic synapses to minimize the timeaveraged mismatch between the somatic and dendritic activities over a certain interval [0,T]. In a stationary state, the somatic membrane potential u i (t) of a twocompartment model can be described as an attenuated version v * i t ð Þ of the dendritic membrane potential with an attenuation factor α = g D /(g D + g L ), where g L = τ −124 . Though we deal with time-dependent stimuli in our model, we compare the attenuated dendritic membrane potential with the somatic membrane potential at each time point. This comparison, however, is not drawn directly on the level of the membrane potentials but on the level of the two Poissonian spike distributions with rates ϕ som i ðuðtÞÞ andφ v * i t ð Þ À Á , respectively, which would be generated if both soma and dendrite were able to emit spikes independently. The functionφ v * i t ð Þ À Á can also be regarded as a nonlinear-filtered version of the attenuated dendritic membrane potential 69 .
Explicitly representing the dependency of u i and v * i on X, we define the cost function for synaptic weights w as where P * (X) stands for the true distribution of input spike trains, Ω X for the space spanned by all possible combinations of input spike trains, and D KL for the KLdivergence between the two Poisson distributions: Note that unlike the somatic response function ϕ som i , of which the values of β i and θ i are neuron-dependent, the function ϕ dend is common to all neurons.
We minimize the cost function (i.e., the averaged KL-divergence) with respect to w such that the responses of the two compartments become consistent with each other. Thus, the unsupervised learning rule of somatodendritic consistency check resolves the discrepancy between the somatic and dendritic responses to temporal input. Similar to reference 24 , we search for the optimal weight matrix by gradient descent as Note that the identity dϕ dend x ð Þ=dx ¼ ϕ dend x ð Þdlogϕ dend x ð Þ=dx was used in deriving the last expression. Since v * i t ð Þ ¼ α P j w ij e j t ð Þ, the local learning rule is written in a vector form as where Note that the i-dependence of Δw i arises in our network model from activitydependent modifications of recurrent inhibitory connections among output neurons (see Eq. 2). The inhibitory connections are modifiable by STDP (see Fig. 2b).
In all simulations, we added the regularization term −γw i to Eq. (14) to prevent the diverging growth of synaptic weights. Thus, the following online learning rule was used: where η is the learning rate. The parameter γ controls the strength of regularization and was adjusted in a task-dependent manner. The initial values of w were generated by a Gaussian distribution with mean zero and standard deviation 1= ffiffiffiffiffiffi ffi N in p : Note that the above learning rule coincides with the Bienenstock-Cooper-Munro (BCM) theory except for a sign difference 70 . In BCM theory, the threshold between potentiation and depression is an unstable fixed point while in our model this point is a stable fixed point. However, as shown previously, the online modifications given in Eqs. (4)-(7) prevent the function ϕ som i u i t ð Þ ð Þfrom coinciding with ϕ dend v * i t ð Þ À Á . This in turn prevents a trivial fixed point w = 0 of Eq. (16). We note that the online modifications of somatic response function play a similar role to the standardization method to avoid a trivial solution in the SFA of temporal input 25 .
A similar learning rule to Eq. (16) was previously considered in a supervised learning model in which the average surprise of somatic spike output driven by dendritic synaptic input and a teaching signal given to the soma was minimized 24 . In this analogy, our learning rule may be interpreted as self-consistent surprise minimization in which the teaching signal itself is provided by the somatic response to make the learning rule for two-compartment neurons unsupervised. This summarizes the essential difference between our model and the previous model. Improved learning rule with additional noise. In Fig. 8, we included an additional noise term at each time step of learning as follows: where ξ i is a random variable obeying a normal distribution. The parameter g controls the strength of the noise, and we set g = 0.6 in Fig. 8. The piecewise linear function f is defined as Negative signals should be eliminated to suppress the learning during noisedominant epochs.
Inhibitory plasticity. We modified lateral inhibitory connections through a symmetric anti-Hebbian STDP: if a pair of presynaptic and postsynaptic spikes occur at the times t pre and t post , respectively, the weight changes were calculated as where τ p and τ d are the decay constants of LTP and LTD, respectively 35,36 .
Evaluation of the degree of independency between signals. ICA was not valid for the auditory signals used for the simulations of BSS. This was because the signals were not independent. In addition to the standard correlations between two analog signals, negentropy (≥0) was used to evaluate the independency of signals. Negentropy measures the deviation of a target distribution from a Gaussian distribution: negentropy vanishes if the target distribution is Gaussian but otherwise takes a positive value; the larger the deviation is, the larger the value of negentropy is. The calculation of negentropy J(Y) for the statistical variable Y requires the true distribution, but it is unknown in the present study. Therefore, we made the following approximation in the evaluation of J(Y) using a function Q: where E(x) refers to the expectation value of x and ρ obeys a Gaussian distribution. Typically, the logarithm of hyperbolic cosine function is used for Q 49 : where 1 ≤ a ≤ 2. In this study, we set as a = 1.
Cross-talk noise. In Fig. 8, we mixed the original signals X 1 (t) and X 2 (t) as follows: cosθ sinθ sinθ cosθ X 1 t ð Þ Then, the cross-talk noise between the two mixed signals was defined as tanθ. The mixed signals coincide with the original signals at θ = 0, while the two mixtures are identical at θ ¼ π 4 and BSS is a single-source separation problem.
Chunking of character sequences by SOBI and SFA. In Supplementary Fig. 6a, we applied SOBI to the same sequential input as used in Fig. 5a. In the simulations of SOBI, the number of input units was set equal to the number of characters in the sequence, and each unit takes the value 1 when the corresponding character appears in input and 0 otherwise. In (B), we low-pass filtered a raw input with the time constant of 50 ms, and then resampled the filtered input with the time step of 10 ms priori to the application of SOBI. We also employed different values of the time constant (15 and 30 ms) and time step (1 ms), but these modifications did not change the essential results. Denoting the observed time-series data at time t and the output of the j-th unit as X t and y j;t ¼ g j ðX t Þ, respectively, we can describe the outline of SFA as follows. The objective of SFA is to minimize the following quantity (Δ-value): where Á h i t denotes the averaging over time, under the following three constraints: In other words, we should find out the scalar function g j (X t ) that minimizes the time derivative of the latent variable y j,t . Then, the latent variable y j,t that minimizes the Δ-value is called the slow feature of X t . Equations (24) and (25) prevent a trivial solution as in our model, and Eq. (26) deccorelates the outputs of different units. We applied SFA 25 to the same input sequence as used in Fig. 5a. However, the results are not shown as the algorithm failed to generate outputs within a reasonably long simulation time. Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
All numerical datasets necessary to replicate the results shown in this article can easily be generated by numerical simulations with the software code provided below. No datasets were generated during this study.

Code availability
All codes were written in Python3 with numpy 1.17.3 and scipy 0.18.1. Example program codes used for the present numerical simulations and data analysis are available at https://github.com/ToshitakeAsabuki/MRIL_codes. Received: 18 April 2019; Accepted: 6 March 2020;