Control of criticality and computation in spiking neuromorphic networks with plasticity

The critical state is assumed to be optimal for any computation in recurrent neural networks, because criticality maximizes a number of abstract computational properties. We challenge this assumption by evaluating the performance of a spiking recurrent neural network on a set of tasks of varying complexity at - and away from critical network dynamics. To that end, we developed a plastic spiking network on a neuromorphic chip. We show that the distance to criticality can be easily adapted by changing the input strength, and then demonstrate a clear relation between criticality, task-performance and information-theoretic fingerprint. Whereas the information-theoretic measures all show that network capacity is maximal at criticality, only the complex tasks profit from criticality, whereas simple tasks suffer. Thereby, we challenge the general assumption that criticality would be beneficial for any task, and provide instead an understanding of how the collective network state should be tuned to task requirement.


I. INTRODUCTION
A central challenge in the design of an artificial network is to initialize it such that it quickly reaches optimal performance for a given task.For recurrent networks, the concept of criticality presents such a guiding design principle [1][2][3][4][5][6][7] (for feed-forward networks see e.g.[8][9][10]).At a critical point, typically realized as a second order phase transition between order and chaos or stability and instability, a number of basic processing properties are maximized, including sensitivity, dynamic range, correlation length, information transfer, and susceptibility [11][12][13][14][15].Because all these basic properties are maximized, it is widely believed that criticality is optimal for task performance [1, 2, 4-7, 12, 16].
Tuning a system precisely to a critical point can be challenging.Thus ideally, the system self-organizes to a critical point autonomously via local learning rules.This is indeed feasible in various manners by modifying the synaptic strength depending on the pre-and postsynaptic neurons' activity only [6,12,[17][18][19][20][21][22][23][24].The locality of the learning rules is key for biological and artificial networks where global information (e.g.task performance error or activity of distant neurons) may be unavailable or costly to distribute.Recently, it has been shown that specific local learning rules can even be harnessed more flexibly: A theoretical study suggests that recurrent networks with local, homeostatic learning rules can be tuned towards and away from criticality by simply adjusting the input strength [22].This would enable one to sweep the entire range of collective dynamics from subcritical to critical to bursty, and assess the respective task performance.
Complementary to tuning collective network properties like the distance to criticality, local learning also enables networks to "learn" specific patterns or sequences [25][26][27][28].For example, spike-timing dependent plasticity (STDP) shapes the connectivity, depending only on the timing of the activity of the pre-and postsynaptic neuron.STDP is central for any sequence learning -a central ingredient in language and motor learning [27,28].Such learning could strongly speed up convergence, and enables a preshaping of the artificial network -akin to the shaping of biological networks during development by spontaneous activity [29].
Given diverse learning rules and task requirements, it may be questioned whether criticality is always optimal for processing, or whether each task may profit from a different state, as hypothesized in [13].One could speculate that e.g. the long correlation time at criticality on the one hand enables long memory retrieval, but on the other hand could be bad if a task requires only little memory.However, the precise relation between the collective state, and specific task requirements is unknown.
When testing networks, the observed network performance is expected to depend crucially on the choice of the task.How can one then characterize performance independently of a specific task, like e.g.classification or sequence memory?A natural framework to characterize and quantify processing of any local circuit in a taskindependent manner builds on information theory [30]: Classical information theory enables us to quantify the transfer of information between neurons, the information about the past input, as well as the storage of information [30][31][32].The storage of information can be measured within the network or as read out from one neuron.In addition, most recently classical mutual information is being generalized to more than two variables within the framework of partial information decomposition (PID) [30,[33][34][35].PID quantifies the unique and redundant contribution of each source variable to a target, but most importantly also enables a rigorous quantification of synergistic computation, a key contributor for any information integration [30,33,34,36,37].Thereby information theory is a key stepping stone when linking local computation within a network, with global task performance.
Simulations of recurrent networks with plasticity become very slow with increasing size, because every membrane voltage and every synaptic strength has to be updated.To achieve an efficient implementation, physical emulation of synapses and neurons in electrical circuitry are very promising [38,39].In such "neuromorphic chips", all neurons operate in parallel, and thus the speed of computation is largely independent of the system size, and is instead determined by the time constants of the underlying physical neuron and synapse models -like in the brain.Realizing such an implementation technically remains challenging, especially when using spiking neurons and flexible synaptic plasticity.The BrainScaleS 2 prototype system combines physical models of neurons and synapses [40] with a general purpose processor carrying out plasticity [41].In this system, the analog elements provide a speedup, energy efficiency, and and enables scaling to very large systems, whereas the general purpose processor enables to set the desired learning rules flexibly.Thus with this neuromorphic chip, we can run the long-term learning experiments -required to study the network self-organization -within very short compute-time.
In the following, we investigate the relation between criticality, task-performance and information theoretic fingerprint.To that end, we show that a spiking neuromorphic network with synaptic plasticity can be tuned towards and away from criticality by adjusting the input strength.We show that criticality is beneficial for solving a complex non-linear task, but not the simple ones -challenging the common notion that criticality in general is optimal for computation.The relation between criticality (as set by the input strength) and task-performance is underpinned using methods from classical information theory as well as the novel framework of partial information decomposition (PID).This provides an understanding how basic computational properties shape task performance.

II. RESULTS
Model overview.We emulate N = 32 leaky integrate-and-fire (LIF) neurons on the mixed-signal neuromorphic prototype system BrainScales 2 (figure 1a and table I).We use the term emulation in order to clearly distinguish between the physical implementation, where each observable has a measurable counterpart on the neuromorphic chip, and standard software simulations on conventional hardware.The system features an array of 32 × 32 current-based synapses, where 20% of the synapses are programmed to be inhibitory.Synaptic plasticity acts equally on all synapses and is composed of a positive drift and a negative anticausal STDP term.In conjunction both terms lead to homeostatic regulation and thus stable network activity of about 20 Hz per neuron (see figure 1a of Supplemental Material [42]).
Plasticity is executed by an on-chip general purpose processor alongside to the analog emulation of neurons and synapses.This allows for an uninterrupted and fast data acquisition.
Neurons are potentially all-to-all connected, but K ext out of the N synapses per neuron are used to inject external Poisson or pattern input.Effectively, K ext quantifies the input strength with the extreme cases of K ext /N = 1 for a feed-forward network and K ext /N = 0 for a fully connected recurrent network, which is completely decoupled from the input.Depending on the degree of external input K ext , the network shows diverse dynamics (figures 1b and 1c).As expected [22], K ext shapes the Fitting a truncated power law, (b) the exponential cutoff scut peaks, and (c) critical exponents αs approximate 1.5, as expected for critical branching processes.(d) A maximumlikelihood comparison decides for a power-law compared to an exponential fit in the majority of cases.Dashed vertical lines indicate the set of Kext/N values that have been selected in (a).In this and all following figures, the median over runs and (if acquired) trials is shown, and the errorbars show the 5%-95% confidence intervals.
collective dynamics of the network from synchronized for low K ext to more asynchronous-irregular for high K ext .
Critical dynamics arises under low input K ext .The transition to burstiness for low K ext suggests the emergence of critical dynamics, i.e. dynamics expected at a non-equilibrium second order phase transition.Indeed, as detailed in the following, we find signatures of criticality in the classical avalanche distributions (figure 2a) as well as in the branching ratio (figure 4a), the autocorrelation time (figure 4b), the susceptibility and in trial-totrial variations (figure 4d).
To test whether the network indeed approaches criticality, we assume the established framework of a branching process [11,[43][44][45].In branching processes, a spike at time t triggers on average m postsynaptic spikes at time t + 1, where m is called the branching parameter.For m = 1 the process is critical, and the dynamics give rise to large cascades of activity, called avalanches [11,46].The size s of an avalanche is the total number of spikes in a cluster and is power-law distributed at criticality.Indeed, our network shows power-law distributed avalanche sizes s over two orders of magnitude for low K ext (figure 2a).For almost any K ext , the distribution is better fitted by a power-law than by an exponential distribution [47] (figure 2d).However, only for low K ext the exponent of the avalanche distribution is close to the ex- pected one, α s ≈ 1.5 (figure 2c), and the power-law shows the largest cutoff s cut (figure 2b).Together, all the quantitative assessment of the avalanches indicate that a low degree of input K ext produces critical-like behavior.
As the physical system at hand is of limited size, we utilize software simulations to investigate finite-size scaling.Therefore, a network with the same topology, plasticity rules and single neuron dynamics (though without parameter noise and hardware constraints) is simulated for various system sizes N .The resulting avalanche distributions show power-laws for any system size (figure 3a), and the cutoff s cut scales with N as expected at criticality (figure 3b).The scaling exponent is 1.6 ± 0.2.Together, these numerical results confirm the hypothesis that for low degrees of input K ext , the small network that is emulated on the chip self-organizes as close to a critical state as possible.
The assumption that the critical state of the network corresponds to the universality class of critical branching processes is tested further by inferring the branching parameter m (equation ( 18)) proper, the autocorrelation times and the response to perturbations.First, the branching parameter m characterizes the spread of activity and is smaller (larger) than unity for sub-critical (super-critical) processes.For our model, it is always in the sub-critical regime, but tends towards unity for low K ext (figure 4a).Second, the autocorrelation time τ corr is expected to diverge at criticality as . Indeed, τ corr as estimated directly from the autocorrelation of the population activity is maximal for low K ext (figure 4b).Third, the estimates of m and τ corr are in theory related via the analytical relation τ branch (m) ∼ −1/ log (m).This relation holds very precisely in the model (figure 4c, correlation coefficient ρ = 0.998, p < 10 −10 ).Fourth, towards criticality, the response to any perturbation increases.The impact of a small perturbation is quantified by a variant of the van-Rossum distance ∆ VRD (equation ( 16)).It peaks for low degrees of the external input K ext (fig- ure 4d).Last, one advantage of operating in the vicinity of a critical point is the ability to enhance stimulus differences by the system response.This is reflected in a divergence of the susceptibility at the critical point.The susceptibility χ (equation ( 17)), quantified here as the change in the population firing rate in response to a burst of N pert = 6 additional spikes, is highest for low K ext (figure 4d).Thus overall, the avalanche distributions as well as the dynamic properties of the network all indicate that it self-organizes to a critical point under low degree of input K ext .
Network properties have to be tuned to task requirements for optimal performance.It is widely assumed that criticality optimizes task performance.However, we found that one has to phrase this statement more carefully.While certain abstract computational properties, like the susceptibility, sensitivity or memory time span are indeed maximal or even approaching infinity at a critical state, this is not necessary for task performance in general [5,14,45,48].We find that it can even be detrimental.We find that for every single task complexity, a different distance to criticality is optimal, as outlined in the following.
We study the performance of our recurrent neural network in the framework of liquid state machines [49][50][51].A liquid state machine is composed of a recurrent neuronal network, and its performance is quantified by the ability of linear readout neurons to separate different sequences.To that end, it is often necessary to maintain information about past input for long time spans.To test performance, we specifically use a n-bit parity task MC(a j : because it is fully non-linear and the task complexity increases with n: to solve the task, the network has to memorize all the input from the n past steps.As liquid state machines close to a critical point have longer memory as quantified by the lagged mutual information (I τ , figure 4b), one expects that particularly the complex tasks profit from criticality (tasks with high n are better at low degrees of input K ext ).In contrast, simple tasks (low n) might suffer from criticality because of the maintenance of memory about unnecessary input.Thus, depending on the task complexity, there should be an ideal K ext , leading to maximal performance.
For our network, we find indeed that maximal task performance depends on both, task complexity and distance to criticality: simple tasks (n = 2) are optimally solved away from criticality, whereas complex tasks (n = 5) profit from the long timescales arising at criticality (figure 5b).Hence, we are capable of adapting the networks computational properties to task complexity by fine-tuning the strength of the input.
Information theory enables task-independent quantification of computational properties.While task performance is the standard bench-mark for any model, such bench-mark tasks have two disadvantages: In many biological systems, like higher brain areas or in vitro preparations, such tasks cannot be applied.Even if tasks can be applied, the outcome will always depend on the chosen task.To quantify computational properties in a task-independent manner, information theory offers powerful tools [30].Using the Poisson noise input, we find that the lagged mutual information I τ between the input s i and the activity of a neuron after a time lag τ , a j predicts the performance on the parity task.Here, at high K ext (away from criticality) information about the input is maximal for very short τ , but decays very quickly (figure 5c).This fast-forgetting is important to irradiate past, task-irrelevant input that would interfere with novel, task-relevant input.At small K ext , the recurrence is stronger and input can be read out for much longer delays (20 ms vs. 60 ms).This active storage of information is required in a liquid state machine to solve any task that combines past and present input, and hence the more complex parity task also profits from it.However, the representation of input in every single neuron becomes less reliable (i.e.I τ is smaller).A measure for the representation of the input in the network could be obtained by integrating I τ over τ .Interestingly, this memory capacity (MC) stays fairly constant (figure 5d).Note that we only quantified the representation of the input in a single neuron, a measure very easily accessible in experiments; obviously the readout can draw on the distributed memory across all neurons, which jointly provide a much better readout.
The memory maintenance for task processing has to be realized mainly by activity propagating on the recurrent connections in the network.Therefore, it is often termed active information storage (AIS) [52,53].The recurrent connections become stronger closer to criticality, and as a consequence we find that the lagged mutual information between pairs of neurons in the liquid also increases (figure 5e).As a result the MC of the liquid increases over almost two orders of magnitude when approximating criticality (lower K ext , figure 5f).This increase in internal MC carries the performance on the more complex parity tasks.
When assessing computational capacities, information theory enables us to quantify not only the entropy (H) and mutual information (I) between units, but also to disentangle transfer and storage of information, as well as unique, redundant and synergistic contributions of different source neurons [30,33,34,36,54].We find that all these quantities increase with approaching criticality (smaller K ext , figures 6b and 7c).This indicates that the overall computational capacity of the model increases, as predicted for the vicinity of the critical state [1,2,7,14].
In more detail, the AIS of a neuron, as well as the I and the transfer entropy (TE) between pairs of neurons increase with lower K ext (figure 6b).In our analysis, these increases reflect memory that is realized as activity propagation on the network, and not storage within a single neuron, because the bin-size used for analysis is larger than the refractory period τ ref , synaptic-τ syn and membrane-timescales τ m .Information theory here enables us to show that active transfer and storage of information within the network strongly increases towards criticality.A similar increase in I, AIS and TE has been observed for the Ising model and liquids at criticality [1,14], and hence supports the notion that criticality maximizes information processing capacity.Note however, that this maximal capacity is typically not necessary; as shown here, it can even be unfavorable when solving simple tasks.
Very recently, it has become possible to dissect further the contributions of different neurons to processing, using PID [33].PID enables us to disentangle for a target neuron a i , how much unique information it obtains from its own past activity a − i , or the past activity of a second neuron a − j ; and how much information is redundant or even synergistic from the two (figure 7a).Synergistic information is that part of information that can only be computed if both input variables are known, whereas redundant information can be obtained from one or the other.
(a) The two input variables for PID correspond to the spiking histories a − i and a − j of two neurons, and the output variable to the present state aj.PID enables to quantify the unique contribution of each source to the firing of the target neuron, as well as the shared (also called redundant), and synergistic contributions.(b) The joint mutual information (I) increases with decreasing Kext.(c) All PID components increase with approaching criticality.Interestingly, the synergistic and shared contributions are always much larger than the unique contributions (note the logarithmic axis).This highlights the collective nature of processing in recurrent neural networks.
All the PID components increase when approaching the critical point (low K ext , figure 7c).Quantitatively, the redundant and the synergistic information are always stronger than the unique ones which are about ten times less.The shared information dominates closer to criticality, mirroring the increased network synchrony and redundancy between neurons.Further, the synergistic contribution, i.e. the contributions that rely on the past of both neurons slightly increases, and is indeed the largest contribution for high K ext .This reflects that typically the joint activity of both neurons is required to activate a LIF neuron.Interestingly, the strong increase in shared information (i.e.redundancy) does not seem to impede the performance at criticality (small K ext ).However, for even higher synchrony, as expected beyond this critical transition, the shared information might increase too much and thereby decrease performance.

III. DISCUSSION
In this study, we used a neuromorphic chip to emulate a network, subject to plasticity, and showed a clear relation between criticality, task-performance and information theoretic fingerprint.Most interestingly, simple tasks do not profit from criticality while complex ones do, showing that every task requires its own network state.
The state and hence computational properties can readily be tuned by changing the input strength, and thus a critical state can be reached without any parameter fine-tuning within the network.This robust mechanism to adapt a network to task requirements is highly promising, especially for large networks where many parameters have to be tuned and in analog neuromorphic devices that are subject to noise in parameters and dynamics.
It has been generally suggested that criticality optimizes task performance [1,16].We show that this statement has to be specified: indeed, criticality maximizes a number of properties, like the autocorrelation time (figure 4b), the susceptibility (figure 4d), as well as information theoretic measures (figures 5 to 7).However, this maximization is apparently not at all necessary, potentially even detrimental, when dealing with simple tasks.For our simple task, high network capacity results in maintenance of task-irrelevant information, and thereby harms performance.This is underlined by our results that clearly show that all abstract computational properties are maximized at criticality, but only the complex task profits from criticality.Hence, every task needs its own state and therefore a specific distance to critical dynamics.
The input strength could not only be controlled by changing K ext , the number of synapses of a neuron that were coupled to the input.An equally valid choice is a change of external input rate to each neuron.In fact, we showed that changing the input rate has the same effects on the relation between criticality (figure 2 and 3 of Supplemental Material [42]), task performance, and information measures (figure 5 of Supplemental Material [42]) as changing K ext .Moreover, in this framework the lowest input rates even allow to cross the critical point (figure 3 and 4 of Supplemental Material [42]).Thus for both control mechanisms or a combination, there exists an optimal input strength, where the homeostatic mechanisms bring the network closest to critical.This optimal input strength has been derived analytically for a mean-field network by Johannes Zierenberg [22], and could potentially be used to predict the optimal input strength for other networks and tasks as well.
Not only the input strength, but also the strength of inhibition can act as a control parameter.Inhibition plays a role in shaping collective dynamics and is known to generate oscillations [55,56].For a specific ratio of excitation and inhibition, criticality has been observed in neural networks [23,24,57,58].Likewise, our networks has 20 % inhibitory neurons.However, inhibition would not be necessary for criticality [17,45].Nevertheless, the existence of more than one control parameters (degree of input, input rate, and inhibition) allows for flexible adjustment even in cases where only one of them could be freely set without perturbing input coding.
Plasticity plays a central role in self-organization of network dynamics and computational properties.In our model, the plasticity, neuron and synapse dynamics feature quite some level of biological detail (table I), and thus results could potentially depend on them.All synaptic weights are determined by the synaptic plasticity.Here, we showed results for homeostasis and STDP that implement the negative (anticausal) arm only.When implementing the positive (causal) arm of STDP in addition, the network destabilised, despite counteracting homeostasis.This is a well known problem [59].Our implementation is still similar to full STDP, because anticaulsal correlations are weakened and the causal ones are indirectly strengthened by homeostasis.With its similarity to STDP and its inherent stability, our reduced implementation may be useful for future studies.
The characterization of the network in a taskdependent as well as in an task-independent manner is essential for understanding the impact of criticality on computation.The computational properties in the vicinity of a critical point have been investigated by the classical measures AIS, MI and TE alone [5,60], or by PID alone [36,61].In this paper, the direct link of the information theoretic measures to task performance has been established.This enables us to understand how task complexity and the information-theoretic fingerprint are related.Such understanding is the basis for well-founded design decisions of future artificial architectures.
The presented framework is particularly useful for analog neuromorphic devices as analog components have inherent parameter noise as well as thermal noise, which potentially destabilize the network.Here, the synaptic plasticity plays a key role in equalizing out particularly the parameter noise, as also demonstrated for short-term plasticity [62], and thus makes knowledge about parameter variations, as well as specific calibration to some extend unnecessary.
Despite of the small system size (N = 32 neurons only), the network not only showed signatures of criticality, but also developed quite complex computational capabilities, reflected in both, the task performance and the abstract information-theoretic quantities.We expect that a scale-up of the system size would open even richer possibilities.Such a scale-up would not even require finetuning of parameters, as the network self-tunes owing to the local-learning rules.As soon as larger chips are available, we expect that the abilities of neuromorphic hardware could be exhausted in terms of speed and energy efficiency allowing for long, large-scale and powerful emulations.
Overall, we found a clear relation between criticality, task-performance and information theoretic fingerprint.Our result contradicts the widespread statement that criticality is optimal for information processing in general: While the distance to criticality clearly impacts performance on the liquid task, we showed that only the complex tasks profit from criticality; for simple ones, criticality is detrimental.Mechanistically, the optimal working point for each task can be set very easily under homeostasis by adapting the mean input strength.This shows how critical phenomena can be harnessed in the design and optimization of artificial networks, and may explain why biological neural networks operate not necessarily at criticality, but in the dynamically rich vicinity of a critical point, where they can tune their computational properties to task requirements [13,63].

IV. METHODS
We start with a description of the implemented network model, followed by a summary of the analysis techniques.All parameters are listed in table I and all variables in table 1 and 2 of the Supplemental Material [42].

A. Model
The results shown in this article are acquired on the mixed-signal neuromorphic hardware system described in [41] (figure 1a).In the following a brief overview of the model, which is approximated by the physical implementation on the hardware, and the programmed plasticity rule is given.Since the neuromorphic hardware system comprises analog electric circuits, transistor mismatch causes parameter fluctuations which can be compensated by calibration [64][65][66][67].Here, no explicit calibration on the basis of single neurons and synapses is applied.Instead, only parameters common to all neurons/synapses are set such that all parts behave according to the listed equations, especially that all parts are sensible to input but silent in the absence of input.This choice leads to uncertainties in the model parameters as reported in table I.
Neurons: Implemented in analog circuitry, the neurons approximate current-based LIF neurons.The membrane potential u j of the j-th neuron obeys: with the membrane time constant τ mem , the leak conductance g leak = C m /τ mem , the leak potential u leak and the input current I j (t).The k-th firing time of neuron j, t k j , is defined by a threshold criterion: Immediately after t k j , the membrane potential is clamped to the reset potential u j (t) = u reset for t ∈ t k j , t with the refractory period τ ref .
The neuromorphic hardware system comprises N = 32 neurons, operating in continuous time due to the analog implementation.

Synapses:
Like the membrane dynamics, the synapses are implemented in electrical circuits.Each neuron features N = 32 presynaptic partners (in-degree is 32).The synaptic input currents onto the j-th neuron enter the neuronal dynamics in equation ( 1) as the sum of the input currents of all presynaptic partners i, I j (t) = N i=1 I ij (t), where I ij (t) is given by: with the excitatory and the inhibitory synaptic time constants τ exc syn and τ inh syn .N inh synapses of every neuron j are randomly selected to be inhibitory.The external synaptic current I ext ij (t) depends on the l-th spike time of an external stimulus i, s l i , whereas the recurrent synaptic current I rec ij (t) depends on the k-th spike time of neuron i, t k i , each of which transmitted to neuron j: with the synaptic delay d syn and the weight conversion factor γ. The synaptic weight from an external spike source i to neuron j is denoted by w ext ij , and w rec ij is the synaptic weight from neuron i to neuron j.Every synapse either transmits external events s l i or recurrent spikes t k i , i.e. if w stim ij ≥ 0 then w rec ij = 0 and vice versa.Network: The LIF neurons are potentially connected in an all-to-all fashion.A randomly selected set of K ext synapses of every neuron is chosen to be connected to the spike sources.As every synapse could either transmit recurrent or input spikes, the K ext synapses do not transmit recurrent spikes.
Plasticity: In the network, all synapses are plastic, the recurrent and the ones linked to the external input.Therefore, we skip the superscript of the synaptic weight and drop the distinction of t k i and s k i in the following description.Weights are subject to three contributions: A weight drift controlled by the parameter λ drift , a correlation sensitive part controlled by λ stdp and positively biased noise contributions.A specialized processor on the neuromorphic chip is programmed to update synaptic weights to w ij (t + T ) = w ij (t) + ∆w ij according to: The STDP-kernel function f depends on the pre-and postsynaptic spike times in the time interval [t − T, t): with t k i > t l j , and t k i , t l j ∈ [t − T, t), and only nearestneighbor spike times are considered in the sum [68].η stdp and τ stdp denote the amplitude and the time constant of the STDP-kernel.The term n ij (t) adds a uniformly distributed, biased random variable: where n amp specifies the range, while n is the bias of the random numbers.
In the following, we consider the case λ stdp < 0 and λ drift < 0. λ stdp and λ drift are chosen such that the average combined force of the drift and the stochastic term is positive.Thus, only the negative arm of STDP is implemented.
Initialization: The synaptic weights are initialized to w ij = 0 µA.Afterwards, the network is stimulated by N Poisson-distributed spiketrains of rate ν by the K ext synapses of every neuron.By applying equation (8) for the total duration T burnin weights w ij = 0 µA develop.For every K ext , the network is run 100 times, each with a different random seed.If not stated otherwise, the resulting weight matrices are used as initial conditions for experiments with frozen weights (∆w ij = 0) for a duration of T exp on which the analysis is performed.
Simulations: To complement the hardware emulations, an idealized version of the network is implemented in Brian 2 [69].Specifically, no parameter or temporal noise is considered, and weights are not discretized as it is the case for the neuromorphic chip.For simplicity, the degree of the input is implemented probabilistically by connecting each neuron-input pair with probability K ext /N and each pair of neurons with probability (N − K ext )/N .

B. Evaluation
Binning: The following measures rely on an estimate of activity, therefore we apply temporal binning: where δt corresponds to the binwidth, and 1 is the indicator function.With this definition, we are able to define the binarized activity for a single process i: The variable x i (t) can represent either activity of a neuron in the network a i (t), or of a stimulus spike train s i (t), and correspondingly the spike times x k i represent spikes of network neurons or stimulus (input) spike trains.
The population activity a(t) of the network is defined as: Neural avalanches: A neural avalanche is a cascade of spikes in neural networks.We extract avalanches from the population activity a(t), obtained by binning the spike data with δt corresponding to the mean inter-event interval, following established definitions.In detail, one avalanche is separated from the subsequent one by at least one empty time bin [46].The size s of an avalanche is defined as the number of spikes in consecutive nonempty time bins.At criticality, the size distribution P (s) is expected to follow a power-law.
To test for criticality, we compare whether a powerlaw or an exponential distributions fits the acquired avalanche distribution P (s) better [47].For the fitting, first the best matching distribution is determined based on the fit-likelihood.The fit-range is fixed to s ∈ {4, 3•N } as the system is of finite-size.An estimation of the critical exponent α s and an exponential cutoff s cut is obtained by fitting a truncated power-law: for s ≥ 1.Power law fits are performed with the Python package power-law described in [70].
Fano-factor: The variability of the population activity is quantified by the Fano factor F = σ 2 a /µ a where σ 2 a is the variance and µ a is the mean of the population activity a(t), binned with δt = τ ref .
Trial-to-trial variability and susceptibility: The trial-to-trial distance ∆ VRD is obtained by stimulating the same network twice with the same Poisson spike trains, leading to two different trials m and n influenced by variations caused by the physical implementation.The resulting spike times in trial m emitted by neuron i, termed t j i,m , are convolved with a Gaussian: and likewise for trial n.The width is chosen to be σ VRD = τ ref and the temporal resolution for the integration is chosen to be 0.1 ms.From different trials m and n the distance is calculated: To obtain an estimate of the networks sensitivity χ to external perturbations, a pulse of N pert additional spikes is embedded in the stimulating Poisson spike trains at time t pert : normalized to the number of external connection K 2 ext to compensate for the decoupling from external input with decreasing K ext .The population activity is estimated with binsize δt = d syn .By evaluating χ immediately after the perturbation, only the effect of the perturbation is captured by minimizing the impact of trial-to-trial variations.
To calculate χ and ∆ VRD , each weight matrix, obtained by the application of the plasticity rule, is used as initial condition for 10 emulations with frozen weights and fixed seeds for the Poisson-distributed spike trains of duration T pert and T static .Additionally, a perturbation of size N pert at t pert = T pert /2 is embedded for the estimation of χ.
Autoregressive model: Mathematically, the evolution of spiking neural networks is often approximated by a first-order autoregressive representation.To assess the branching parameter m of the network in analogy to [45,46], we make use of the following ansatz: where the population activity in the next time step, a(t + 1) is determined by internal propagation within the network (m), and by external input h.Here, .|. denotes the conditional expectation and m corresponds to the branching ratio.For m = 1, the system is critical, for m > 1 the system is supercritical and activity grows exponentially on expectation (if not limited by finite size effects), whereas for m < 1 the activity is stationary.
The branching parameter m is linked to the autocorrelation time constant by τ branch = −δt/ ln (m).To obtain the activity a(t) the binwidth δt is set to the refractory time τ ref .
Estimating m is straight forward here, as subsampling [71,72] does not impact the estimate.Thus a classical estimator [73] can be used, i.e. m is equal to the linear regression between a(t) and a(t + 1).For model validation purposes, the autocorrelation function ρ a,a is calculated on the population activity a(t) binned with δt = τ ref : where σ a is the standard deviation, and µ a the mean of the population activity.Subsequently, ρ a,a is fitted by an exponential to yield the time constant τ corr .
Information theory: We use notation, concepts and definitions as outlined in the review [30].In brief, the time series produced by two neurons represent two stationary random processes X 1 and X 2 , composed of random variables X 1 (t) and X 2 (t), t = 1, ..., n, with realizations x 1 (t) and x 2 (t).The corresponding embedding vectors are given in bold font, e.g X l 1 (t) = {X 1 (t), X 1 (t− 1), ..., X 1 (t − l + 1)}.The embedding vector X l 1 (t) is constructed such that it renders the variable X 1 (t + 1) conditionally independent of all random variables X 1 (t ) with t < t − l + 1, i.e. p(X 1 (t + 1)|X l 1 (t), X 1 (t )) = p(X 1 (t + 1)|X l 1 (t)).Here, (•|•) denotes the conditional.The entropy (H) and mutual information (I) are calculated for the random variables X 1 , X 2 , if not denoted otherwise.This is equivalent to using l = 1 above, e.g.H(X 1 ) and I(X 1 : X 2 ) = H(X 1 ) − H(X 1 |X 2 ).We abbreviate the past state of spike train 1 by X The current value of the spike train is denoted by X 1 .With this notation the active information storage (AIS) of e.g.X 1 is given by: AIS(X 1 ) = I(X 1 : X − 1 ) .
To access the information modification the novel concept of PID is applied [33,34,36].Intuitively, information modification in a pairwise consideration should correspond to the information about the present state of a process only available when considering both, the own process past and the past of a source process.Therefore, the joint mutual information I(X 1 : X − 1 , X − 2 ) is decomposed by PID into the unique, shared (redundant), and synergistic contributions to the future spiking of one neuron, X 1 , from its own past X − 1 , and the past of a second neuron or an input stimulus X − 2 : In more detail, we quantify: 1.The unique information I unq (X 1 : X − 1 \ X − 2 ) which is contributed from the neurons own past.
2. The unique information I unq (X 1 : X − 2 \ X − 1 ) that is contributed from a different spike train (neuron or stimulus).
4. The synergistic information I syn (X 1 : X − 2 ; X − 1 ), i.e. the information that can only be obtained when having knowledge about both past states.
I syn is what we consider to be a suitable measure for information modification [36].
The joint mutual information as defined here is the sum of the AIS and the TE: I(X 1 : X − 1 , X − 2 ) = I(X 1 : X − 1 ) + I(X 1 : We calculated H, AIS, I and TE with the toolbox JIDT [74], whereas the PID was estimated with the BROJA-2PID estimator [75].The activity is obtained by binning the spike data with δt = τ ref and setting l to 4 to incorporate sufficient history.I and TE as well as the PID were calculated pairwise between all possible combinations of processes.Results are typically normalized by H to compensate for potential changes in the firing rate for changing values of K ext (figure 1 of Supplemental Material [42]).For the pairwise measures, H of the target neuron is used for normalization.
Parity task: The performance of the neural network as liquid state machine [49,50] is quantified using a variant of the n-bit parity task.The network weights are frozen (i.e.plasticity is disabled) to ensure that the network state is not changed by the input.
To solve the parity requires to classify from the network activity a j (t), whether the last n bits of input carried an odd or even number of spikes.The network is stimulated with a single Poisson-distributed spike train of frequency ν acting equally on all external synapses, i.e. the input spike times are s k i = s k ∀i.Spike times are binned according to equation (12) with binwidth δt to get a measure of the n past bits.The resulting stimulus activity s(t) is used to calculate the n-bit parity function according to: p n [s(t)] = s(t) ⊕ s(t − 1) ⊕ ... ⊕ s(t − n + 1) , (24) with p n [s(t)] ∈ {0, 1} and the modulus 2 addition ⊕, i.e. whether an odd or even number of spikes occurred in the n past time steps of duration δt.
On the network activity a j (t) a classifier is trained: where Θ(•) is the Heaviside function, and v(t) is the predicted label.The weight vector w j of the classifier is determined using linear regression on a set of training data s train of duration T train :

FIG. 1 .
FIG. 1.The degree of external input Kext shapes the collective dynamics of the network.(a) The neural network is implemented on the prototype neuromorphic hardware system BrainScaleS 2. This system features an analog-neural network core as well as an on-chip general purpose processor that allows for flexible plasticity implementation.(b) For low degree of input Kext, strong recurrent connections develop, and the activity shows irregular bursts, resembling a critical state.(c) For high degrees of input, firing becomes more irregular and asynchronous (Kext/N ∈ {0.25, 0.56}).

FIG. 2 .
FIG.2.Under low degree of input Kext, the network selforganizes towards a critical state, and shows long-tailed avalanche distributions.(a) Distributions of avalanche sizes s show power-laws over two orders of magnitude for low Kext.Fitting a truncated power law, (b) the exponential cutoff scut peaks, and (c) critical exponents αs approximate 1.5, as expected for critical branching processes.(d) A maximumlikelihood comparison decides for a power-law compared to an exponential fit in the majority of cases.Dashed vertical lines indicate the set of Kext/N values that have been selected in (a).In this and all following figures, the median over runs and (if acquired) trials is shown, and the errorbars show the 5%-95% confidence intervals.

FIG. 3 .
FIG.3.Finite-size scaling is assessed using a software implementation with varying system size N .(a) Exemplary avalanche size distributions follow a power-law for any tested N (degree of input Kext/N = 1/4).(b) As expected for critical systems, the cutoff scut scales with the system size.The scaling exponent is 1.6 ± 0.2.

FIG. 4 .
FIG. 4. For low degree of input Kext, the network shows clear signatures of criticality beyond power-laws.Only for low values of Kext, (a) the estimated branching ratio m tends towards unity, and (b) the estimated autocorrelation time τcorr peaks.(c) The clear match of the τcorr, and the τ branch ∼ −1/ log (m) as inferred from m supports the criticality hypothesis (correlation coefficient of ρ = 0.998, p < 10 −10 ).(d) Trial-to-trial ∆VRD variations as well as the susceptibility χ increase for low Kext.

FIG. 5 .
FIG.5.Computational challenging tasks profit from critical network dynamics -simple tasks do not.(a) The network is used to solve a n-bit parity task.Here, task complexity increases with n, the number of past inputs that need to be memorized.(b) For high n task performance profits from criticality (low Kext), whereas simple tasks suffer from criticality.The performance is quantified by the normalised mutual information Ĩ between the parity of the input and the vote of a linear classifier.(c) Memory about the input si as read out from neuron aj after a time lag τ is quantified by the mutual information Iτ (aj, si).Here, high degrees of the external input Kext are favorable for memory on short timescales, whereas low Kext is favorable on larger timescales.(d) The memory capacity (MC) stays fairly constant, despite of a decreased coupling to the stimulus for low Kext.(e) The lagged I between the activity of pairs of neurons indicates increasing memory for decreasing Kext, also visible in the MC (f).The selection of Kext/N in (c) and (e) is marked by dashed vertical lines in (d) and (f).

FIG. 6 .
FIG.6.The information fingerprint changes with the degree of input Kext, thus with distance to criticality.(a) The entropy (H) of the spiking activity of a single neuron, aj stays fairly constant, except for low Kext as a consequence of decreasing firing rates.(b) The mutual information (I) between the activity of two units ai, aj increases with lower Kext (i.e.closer to critical).The network intrinsic memory also increases, indicated by the active information storage (AIS) I(aj : a − j ).Likewise, the information transfer within the network increases with lower Kext.Information transfer is measured as transfer entropy (TE) between pairs of neurons aj and ai, I(aj : a − i |a − j ).

FIG. 7 .
FIG. 7. Partial information decomposition (PID) components increase towards criticality (i.e. with smaller input Kext).(a)The two input variables for PID correspond to the spiking histories a − i and a − j of two neurons, and the output variable to the present state aj.PID enables to quantify the unique contribution of each source to the firing of the target neuron, as well as the shared (also called redundant), and synergistic contributions.(b) The joint mutual information (I) increases with decreasing Kext.(c) All PID components increase with approaching criticality.Interestingly, the synergistic and shared contributions are always much larger than the unique contributions (note the logarithmic axis).This highlights the collective nature of processing in recurrent neural networks.
The network's performance on the parity task is quantified by I (p n [s test (t)] , v(t)) on a test data set s test of duration T test .Temporal binning with δt = τ ref is applied to s train , s test as well as a j (t).