Artificial Dendritic Neurons Enable Self-Supervised Temporal Feature Extraction

The brain identifies potentially salient features within continuous information streams to appropriately process external and internal temporal events. This requires the compression or abstraction of information streams, for which no effective information principles are known. Here, we propose conditional entropy minimization learning as the fundamental principle of such temporal processing. We show that this learning rule resembles Hebbian learning with backpropagating action potentials in dendritic neuron models. Moreover, networks of the dendritic neurons can perform a surprisingly wide variety of complex unsupervised learning tasks. Our model not only accounts for the mechanisms of chunking of temporal inputs in the human brain but also accomplishes blind source separation of correlated mixed signals, which cannot be solved by conventional machine learning methods, such as independent-component analysis.

One Sentence Summary: Neurons use soma-dendrite interactions to self-supervise the learning of characteristic features of various temporal inputs.

learning in an unsupervised fashion. We demonstrate that networks of artificial dendritic neurons can self-supervise the learning of spatiotemporal firing patterns that are repeatedly evoked in upstream neurons. This model enables the learning of a surprisingly wide variety of tasks, including the chunking of temporal inputs, the formation of orientation maps, and even the blind source separation (BSS) of correlated mixed signals. Although BSS has been extensively studied for independent signals (6)(7)(8), no effective methods except for semi-supervised methods are known for the processing of correlated signals (9).
Our model entails the learning of temporal features of an input based on a novel learning rule, which we term "self-conditioned entropy minimization (SCEM)." In short, SCEM categorizes temporal inputs by minimizing variations in neuronal responses to a given set of external inputs. The variation will be minimal when a neuron responds similarly to similar inputs. To achieve this, SCEM learns to self-generate appropriate teaching signals. Figure 1A shows a biology-inspired implementation of SCEM in a two-compartment spiking neuron model (see Supplementary Materials for mathematical details). In this model, activity in the dendritic compartment, driven by external inputs, predicts somatic spike responses. This division of labor between somatic and dendritic compartments has been explored in a neuron model for supervised learning, with a teaching signal given to the soma (10). Unlike the previous model, our neuron model performs unsupervised learning by feeding the somatic response back to the dendrite to train dendritic synapses. Although the underlying biological mechanisms require further clarification, backpropagating action potentials may provide the feedback signal for this self-supervision in cortical pyramidal neurons (11,12). Our learning rule (Eq. 19 in Supplementary Materials) looks similar to maximum likelihood estimation (13), a well-studied framework of supervised learning. However, there is a conceptual difference between them. In maximum likelihood estimation, the target data distribution (somatic activity) is provided externally as teaching signals. By contrast, our model learns the simultaneous distributions of input and output data without teaching signals. The consistency between the two data sets constrains the self-supervised learning, thereby avoiding a redundant or an overly simplistic categorization of temporal inputs.
Although SCEM fits particularly well with dendritic neurons, the principle is generic and applicable to a broad range of information processing systems.
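The self-supervision loop described above can be illustrated with a deliberately simplified, rate-based toy. It is not Eq. 19 of the Supplementary Materials, which is not reproduced here. The class name ToyDendriticNeuron, the sigmoid nonlinearities, the running standardization of the somatic drive, and the values of eta, gamma and tau are all assumptions made only for this sketch; what it keeps from the text is that the dendrite predicts the somatic response and the somatic response is fed back as the teaching signal.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class ToyDendriticNeuron:
    """Rate-based toy: the dendrite learns to predict the soma's own output.

    Hypothetical illustration only; not the spiking model of the paper.
    """

    def __init__(self, n_inputs, eta=0.01, gamma=1e-3, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(0.0, 1.0 / np.sqrt(n_inputs), n_inputs)
        self.eta = eta        # learning rate (assumed value)
        self.gamma = gamma    # weight-decay regularizer (assumed form)
        self.mean = 0.0       # running mean of the dendritic potential
        self.var = 1.0        # running variance of the dendritic potential

    def step(self, x, tau=0.01):
        v = float(self.w @ x)  # dendritic potential driven by the input
        # Assumption: the soma standardizes its drive with running statistics,
        # so its output is not a plain copy of the dendritic prediction.
        self.mean += tau * (v - self.mean)
        self.var += tau * ((v - self.mean) ** 2 - self.var)
        soma_rate = sigmoid((v - self.mean) / np.sqrt(self.var + 1e-8))
        prediction = sigmoid(v)  # dendritic prediction of the somatic rate
        # Delta-rule update: self-generated target minus prediction, plus decay.
        self.w += self.eta * ((soma_rate - prediction) * x - self.gamma * self.w)
        return soma_rate, prediction
```

Calling step repeatedly on input vectors updates the weights in place; no external teaching signal enters the loop at any point, which is the only property this toy is meant to show.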
As shown in Fig. 1B (top), presynaptic spike trains intermittently repeated three fixed spatiotemporal patterns with equal probabilities of occurrence. The learning of repeated temporal input patterns is crucial for various cognitive functions such as language acquisition (14,15) and motor sequence learning (1)(2)(3)(16). A single neuron learned to respond selectively to one of the input patterns (Fig. 1B, bottom), with approximately equal probabilities for the patterns among the trials, although it responded to more than one input pattern in some cases (Fig. S1). Cortical neurons are indeed able to discriminate simple temporal inputs (17). Next, we considered a competitive network of two-compartment model neurons receiving similar presynaptic spike trains (Fig. 1C). Recurrent inhibitory connections among these neurons were modifiable by inhibitory spike timing-dependent plasticity (iSTDP; Fig. S2A). The postsynaptic neurons self-organized into three neuron ensembles, each detecting one of the input activity patterns (Fig. S2B). iSTDP enabled mutual inhibition between the neural ensembles (Fig. S2C). The strength of lateral inhibition required adjustment, as inhibition that was too strong (Fig. S3A, B) or too weak (Fig. S3C, D) eliminated chunk-specific cell assemblies. These results may explain how humans can detect frozen noise patterns repeated within noisy auditory signals. The regularization parameter γ (see Materials and Methods) must also be in an appropriate range to enable the unsupervised learning of chunk-specific cell assemblies, as values that are too large suppress all neural responses and those that are too small do not generate selective responses to chunks (Fig. S4).
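For readers who wish to experiment with this kind of stimulus, the following sketch generates input spike trains in which a small number of frozen patterns recur, with equal probability, between stretches of fresh background activity. The function name make_input, the Bernoulli approximation of Poisson spiking, and all numerical defaults other than the 2,000 input neurons and three patterns (firing rate, pattern and gap durations, time step) are assumptions chosen for illustration, not values taken from the paper.

```python
import numpy as np


def make_input(n_neurons=2000, n_patterns=3, pattern_ms=100, gap_ms=100,
               n_presentations=20, rate_hz=5.0, dt_ms=1.0, seed=0):
    """Return (spikes, labels, patterns) for a stream with repeated frozen patterns.

    spikes:   bool array (n_neurons, n_bins), True where a spike occurs
    labels:   int array (n_bins,), pattern index per bin, -1 during gaps
    patterns: bool array (n_patterns, n_neurons, pattern_bins)
    """
    rng = np.random.default_rng(seed)
    p_spike = rate_hz * dt_ms / 1000.0   # spike probability per bin
    pat_bins = int(pattern_ms / dt_ms)
    gap_bins = int(gap_ms / dt_ms)

    # Frozen patterns: sampled once, then reused verbatim at every repetition.
    patterns = rng.random((n_patterns, n_neurons, pat_bins)) < p_spike

    segments, labels = [], []
    for _ in range(n_presentations):
        # Fresh, unrepeated background spikes between presentations.
        segments.append(rng.random((n_neurons, gap_bins)) < p_spike)
        labels.extend([-1] * gap_bins)
        # Choose one of the patterns with equal probability.
        k = int(rng.integers(n_patterns))
        segments.append(patterns[k])
        labels.extend([k] * pat_bins)

    return np.concatenate(segments, axis=1), np.array(labels), patterns
```

The labels array is not available to a learner in the unsupervised setting; it is returned only so that the selectivity of output neurons can be scored afterwards.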
The ability of the network model to learn was assessed with various types of biological noise. Background presynaptic spikes degraded the performance as the signal-to-noise ratio decreased (Fig. S5A), whereas learning was optimal at finite noise levels with synaptic transmission failure (Fig. S5B) and with jitter in presynaptic spike timing (Fig. S5C). We speculate that this disparity may reflect the different noise structures. Background spikes were not correlated with the repeated input patterns and merely contaminated the signals, whereas transmission failures and timing jitter yielded noise that was correlated with the patterns and thus enhanced the sampling for learning. Although presynaptic noise may induce a regularization effect during learning (18), this likely did not occur in our model network, as not all types of presynaptic noise improved the learning.
The network model is capable of learning repeated patterns in various information streams. To show this, we applied random sequences of three chunks comprising four characters each (Fig. 2A) to a network model with 10 output neurons and 1,000 input neurons. Each input neuron generated a 30 ms 10 Hz burst in response to a randomly assigned preferred character (Fig. 2B). This resulted in the formation of three neuron ensembles that selectively responded to the chunks (Fig. 2C). We conducted a principal-component analysis to study the low-dimensional dynamics of output neurons, which revealed the emergence of the three chunks after learning (Fig. 3D). However, the word segmentation shown above is not difficult for other methods either (19). Therefore, we next tested the same model with more complex input sequences generated by a random walk on a graph with a community structure, where the connection of each node to the other four occurred with an equal probability of 0.25 (Fig. 2E). The detection of this community structure is easy for human subjects but difficult by construction for conventional machine learning methods that rely on surprise signals, such as those with nonuniform transition probabilities (4). To our surprise, each output neuron easily learned to respond selectively to members within its community (Fig. 2F).
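The random-walk stimulus can be made concrete with a short sketch. The text specifies only that every node connects to four others with equal transition probability 0.25; the particular layout used below (15 nodes in three communities of five, arranged in a ring, with the two boundary nodes of each community linked to the neighbouring communities instead of to each other) is an assumption that follows a commonly used design and may differ from the graph in Fig. 2E. The function names are hypothetical.

```python
import random


def build_community_graph(n_communities=3, size=5):
    """Return {node: sorted neighbours}; every node has exactly four neighbours."""
    n = n_communities * size
    adj = {}
    for c in range(n_communities):
        base = c * size
        first, last = base, base + size - 1   # boundary nodes of community c
        for i in range(base, base + size):
            # Start from full connectivity within the community.
            nbrs = {j for j in range(base, base + size) if j != i}
            if i == first:
                nbrs.discard(last)             # boundary nodes are not linked
                nbrs.add((first - 1) % n)      # link to the previous community
            elif i == last:
                nbrs.discard(first)
                nbrs.add((last + 1) % n)       # link to the next community
            adj[i] = sorted(nbrs)
    return adj


def random_walk(adj, n_steps, seed=0):
    """Uniform random walk: each neighbour is chosen with equal probability."""
    rng = random.Random(seed)
    node = rng.choice(sorted(adj))
    walk = [node]
    for _ in range(n_steps - 1):
        node = rng.choice(adj[node])
        walk.append(node)
    return walk


if __name__ == "__main__":
    graph = build_community_graph()
    print(random_walk(graph, 20))
```

Because every transition has the same probability, no single step is more surprising than another; the community structure is visible only in the longer-range statistics of the walk, which is why the task is hard for methods driven by surprise signals.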
The network model also learns the static features of input when they are repeatedly shown in a temporal sequence. A random sequence of noisy images of oriented bars presented for 40 ms at 30 ms intervals was applied to the model (Fig. 3A). The output neurons, which initially had no preferred orientation (Fig. 3B), developed well-defined preferences for specific orientations after learning (Fig. 3C), resembling a visual orientation map (Fig. 3D) (20,21). Because all sensory features, either static or dynamic, arrive at the brain in sequence, temporal processing is potentially important for the formation of feature detection maps from continuous sensory streams.
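A noisy oriented-bar image of the kind described above could be produced as in the following sketch. The image size, bar half-width, and Gaussian pixel noise are assumptions made for illustration; the paper's actual stimulus parameters and the encoding of pixels into input spikes are not specified in this section.

```python
import numpy as np


def oriented_bar(theta_deg, size=16, half_width=1.0, noise_sd=0.2, rng=None):
    """Return a (size, size) float image of a bar through the centre plus noise."""
    rng = np.random.default_rng() if rng is None else rng
    centre = (size - 1) / 2.0
    # Pixel coordinates relative to the image centre (rows = y, columns = x).
    y, x = np.mgrid[0:size, 0:size] - centre
    theta = np.deg2rad(theta_deg)
    # Perpendicular distance of each pixel from the line at angle theta.
    dist = np.abs(-x * np.sin(theta) + y * np.cos(theta))
    img = (dist <= half_width).astype(float)
    return img + rng.normal(0.0, noise_sd, img.shape)


def bar_sequence(n_images, n_orientations=8, seed=0, **kwargs):
    """Random sequence of bars; returns (images, orientation in degrees)."""
    rng = np.random.default_rng(seed)
    angles = rng.integers(n_orientations, size=n_images) * (180.0 / n_orientations)
    images = np.stack([oriented_bar(a, rng=rng, **kwargs) for a in angles])
    return images, angles
```

Each image would then be held for the presentation duration and converted to presynaptic firing rates; that conversion is left out because it depends on details given only in the paper's methods.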
These results demonstrate that the SCEM successfully chunks a variety of temporal inputs by automatically identifying repeated temporal input patterns. The question then arises whether this ability of the SCEM enables learning of other types of sequence processing tasks. Sequence processing also involves the blind separation of signals within mixtures from multiple sources.
BSS is an extensively studied problem in auditory processing (6)(7)(8), but the various methods that have been proposed are effective only if individual signals are independent. To our knowledge, there are no effective methods for separating mixtures of dependent or correlated signals. We applied SCEM to sound mixtures from two musical instruments (Audio S1), i.e., a bassoon and a clarinet (Bach10 Dataset (22); Audios S2 and S3), playing their respective parts of the same score (Fig. 4A); thus, the two sound sources are correlated. These mixtures of signals were encoded as irregular spike trains (Fig. 4B), which in turn were applied to output neurons. After training, these neurons self-organized into two subgroups, each responding to one of the true sources (Fig. 4C). The original sounds were then decoded from the average firing rates of these subgroups (Audios S4 and S5), though some high-frequency components were lost due to the low-pass filtering effect corresponding to membrane dynamics (Fig. S6). By contrast, independent-component analysis separated the sources poorly (Audio S6).
Mutual information maximization (MIM) has often been hypothesized to describe the transfer of information between neurons (23), and Hebbian synaptic plasticity may approximately follow MIM (24). However, the MIM principle ultimately implies that messages are faithfully copied at all layers of hierarchical processing. Furthermore, MIM does not account for the compression or abstraction of sensory input to the brain.
Our learning rule minimizes the entropy associated with the conditional probability of neuronal output for a given input. The rule enabled mutually inhibiting dendritic neurons to learn the repetition of spatiotemporal activity patterns on a slow timescale (typically, several tens to several hundreds of milliseconds). While the aim of many previous methods for chunking is to predict the input sequence (25,26), our model entails a novel principle in which a neural system learns to predict its own responses to input. To this end, the SCEM minimizes the conditional entropy of output data to produce a predictable low-dimensional representation of high-dimensional input data. This learning continues until there is agreement between the somatic output and dendritic input regarding the low-dimensional features (i.e., chunks). We previously used paired reservoir computing for chunking (but not for BSS), in which two recurrent networks supervise each other to mimic the partner's responses to a common temporal input (27). The present model outperforms the previous one, but the two models share a fundamental computational principle, namely, self-consistency between input and output data. The SCEM, on the other hand, differs from autoencoders that compress input information in hidden layers. The compression rate is much higher in our model than in autoencoders, because input sequences cannot be faithfully reconstructed from chunked pieces. Although the SCEM resembles methods for learning the probabilistic structure of input data (10,13), and although the information bottleneck compresses data while maintaining mutual information to some degree (28), the SCEM differs from these and other methods aimed at learning the likelihood of the input data distribution.
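The quantity being minimized can be illustrated on a toy discrete example. The sketch below estimates the conditional entropy H(Y|X) = -sum over (x, y) of p(x, y) log2 p(y|x) from a list of (input, response) observations. The function name, the symbolic inputs "A" and "B", and the binary responses are assumptions made only for illustration; the actual model works with spike trains and the learning rule given in the Supplementary Materials.

```python
import math
from collections import Counter


def conditional_entropy(pairs):
    """Estimate H(Y|X) in bits from a list of (x, y) observations."""
    n = len(pairs)
    joint = Counter(pairs)                       # counts of each (x, y)
    marginal_x = Counter(x for x, _ in pairs)    # counts of each x
    h = 0.0
    for (x, _y), n_xy in joint.items():
        p_xy = n_xy / n                          # joint probability p(x, y)
        p_y_given_x = n_xy / marginal_x[x]       # conditional probability p(y|x)
        h -= p_xy * math.log2(p_y_given_x)
    return h


# A neuron that always responds the same way to the same input.
consistent = [("A", 1), ("A", 1), ("B", 0), ("B", 0)]
# A neuron whose response is unrelated to the input.
inconsistent = [("A", 1), ("A", 0), ("B", 1), ("B", 0)]

if __name__ == "__main__":
    print(conditional_entropy(consistent))     # 0.0 bits
    print(conditional_entropy(inconsistent))   # 1.0 bit
```

A response that is identical for every input would also give zero conditional entropy, so this quantity alone does not rule out trivial solutions; the text attributes the avoidance of such overly simplistic categorizations to the consistency constraint between input and output data.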
In sum, our model not only performs chunking but also achieves BSS from mixtures of correlated signals. It is surprising that simple neural networks with identical circuit structures can perform these seemingly different tasks. Such a multifunctional model was previously unknown in learning information streams.
its prediction and the actual somatic firing rate. (B) Three frozen spatiotemporal patterns (red, blue, and green) were repeated as irregular spike trains from 2,000 input neurons (top). Three dendritic neurons selectively responded to one of the repeated patterns after learning (bottom). (C) A competitive network used in all of the present tasks. The input layer consists of Poisson spiking neurons, and the output layer comprises the dendritic neuron models. Ten output neurons were connected with all-to-all inhibitory synapses modifiable by iSTDP.