A canonical neural mechanism for behavioral variability

The ability to generate variable movements is essential for learning and adjusting complex behaviours. This variability has been linked to the temporal irregularity of neuronal activity in the central nervous system. However, how neuronal irregularity actually translates into behavioural variability is unclear. Here we combine modelling, electrophysiological and behavioural studies to address this issue. We demonstrate that a model circuit comprising topographically organized and strongly recurrent neural networks can autonomously generate irregular motor behaviours. Simultaneous recordings of neurons in singing finches reveal that neural correlations increase across the circuit driving song variability, in agreement with the model predictions. Analysing behavioural data, we find remarkable similarities in the babbling statistics of 5–6-month-old human infants and juveniles from three songbird species and show that our model naturally accounts for these ‘universal' statistics.

B ehavioural variability is a pivotal component of motor learning and adaptation 1,2 . While young individuals can usually produce non-stereotyped disorganized behaviours, motor exploration is more often expressed as movement variability around a stereotyped motor pattern. Highly irregular patterns of activity, which are ubiquitous in the brain 3 , are thought to underlie variable motor behaviours 4,5 . Specifically in songbirds, a neural circuit necessary for song learning in juveniles 6,7 has been recently shown to be responsible for vocal variability both in adults 8 and throughout development 7,9,10 . This circuit includes two cortical-like areas: a premotor nucleus, the lateral magnocellular nucleus of the anterior nidopallium (LMAN), and its efferent motor nucleus, the robust nucleus of the arcopalium (RA). While RA is essential in driving the effectors (muscles or muscle synergies) producing the song 11 , LMAN is not necessary for song production in adults 6,7 but has a key role throughout development and in adults in driving variability in the song 9,10 and in the activity of RA neurons 12 .
The idea that temporally irregular activity of neurons in the central nervous system (CNS) is capable of generating behavioural variability may seem obvious. A careful examination, however, reveals that the link between irregularity in neural activity and behavioural variability is far from being straightforward. This is because to impact the behaviour, patterns of activity generated in the CNS must also be spatially correlated (that is, correlated across neurons). For example, consider the minimal model of a cortical network driving motor behaviour depicted in Fig. 1a. It consists of many neurons randomly connected recurrently, divided into D functional groups; each group is composed of M neurons (larger than D by over an order of magnitude) that project to one effector of the motor behaviour. The collective dynamics of the network give rise to highly irregular firing patterns as a consequence of the interplay between excitation and inhibition 13 (Fig. 1b, Supplementary Fig. 1a,b). Despite this large variability in their activity, unless the number of neurons in a group is very small, fluctuations in the effectors are negligible (the coefficient of variation of the input to the effector, CV eff , see Methods section, is very small; Fig. 1a, Supplementary  Fig. 1e). This stems from the fact that the network activity is only weakly correlated across neurons (on the order of 1/N, where N is the number of neurons in the network, Fig. 1b right, Supplementary Fig. 1d) and thus, by virtue of the law of large numbers, the fluctuations they induce in the net input to an effector 'average out'. This example emphasizes the fact that, for the fluctuations to be transferred robustly from the CNS to the effectors, neuronal firing in the motor network must be sufficiently correlated within a neural population projecting to the same effector.
While the mechanism underlying asynchronous irregular spiking activity in recurrent networks of excitatory and inhibitory neurons is well understood 13-16 , how the CNS autonomously generates patterns of activity, which are both temporally irregular and correlated across neurons, remains an open fundamental question [16][17][18][19] . A key result of our theoretical work is that the activity of neurons in the motor network driving the effectors will be highly irregular and also spatially correlated if this network receives topographically organized excitatory projections from another upstream strongly recurrent network, hereafter premotor network. In the context of the circuit driving song variability in songbirds, our theory predicts that correlations emerge along the LMAN-RA circuit, namely, that correlations across neurons are very weak in LMAN but substantial in RA. We validate this prediction with simultaneous extracellular recordings of neurons in singing finches. Our theory also suggests that vocal variability in different species of juvenile vocal learners should exhibit very similar statistics, as a consequence of universal statistical properties of the circuit dynamics. We verify this prediction by comparing the statistics of the song produced during the babbling phase of three species of songbirds as well as of human infants. Preliminary report of this work previously appeared in an abstract form 20 .

Results
We first show that temporally irregular and spatially correlated patterns of spiking activity can robustly emerge in a circuit of topographically organized and strongly recurrent networks. To this end, we consider the circuit depicted in Fig. 2a. Neurons in the motor network that project to the same effector share a fraction, f, of their premotor inputs, and this shared component is different from one group to the other (see Methods section for a detailed description of the architecture). With this architecture, the spiking activities of the neurons in the premotor, as well as in the motor network, are highly irregular, as a result of their recurrent dynamics. There is, however, an important difference between the networks in the spatial structure of their activities. In the premotor network, correlations across neurons are typically extremely weak (Figs 2b,c and 4a). In contrast, in the motor network pairs of neurons projecting to the same effector are substantially and positively correlated, whereas correlations are weak (and possibly negative) for neurons projecting to different effectors (Figs 2d-f and 4b and Supplementary . These functional correlations are highly robust and only weakly influenced by the model parameters (Supplementary Figs 3d,e and 5d-f and Supplementary Note 1). As a result, fluctuations are amplified along the circuit (Fig. 2g) and the variability is robustly transferred to the effectors.
Importantly, the correlations in the motor network are substantial only if the footprint of the recurrent interactions in that network is sufficiently wider than the footprint of the premotor-to-motor projections (Fig. 2h). Indeed, when the recurrent interactions are too local, correlations in the motor network are weak (Fig. 2h,i). Thus temporally irregular and spatially correlated patterns of activity naturally emerge from the interplay between topographic feedforward (FF) projections from the premotor to the motor network and recurrent interactions within the motor network ( Supplementary Fig. 10).
The emergence of spatial correlations in the motor network can be intuitively understood as follows. The FF input to a neuron in the motor network consists of one structured component, shared by all neurons belonging to the same functional group, and another one which is unstructured. Since the neurons in the premotor network are firing asynchronously, both components are the sum of a large number of uncorrelated contributions (on average fK and (1Àf)K, respectively; K being the average number of synapses per neuron; see Supplementary Fig. 3i) and thus their temporal fluctuations are smaller than their temporal average by a factor on the order of 1= ffiffiffi ffi K p (Supplementary Fig. 3f: blue curve). Neural activity in the motor network will be spatially correlated if the amplitude of the fluctuations in the structured component to the network is on the order of the neuronal threshold. This implies that the temporally averaged FF input must be on the order of ffiffiffi ffi K p . This will happen if the strength of the FF connections are on the order of 1= ffiffiffi ffi K p . To prevent the neurons in the motor network to fire regularly at a very high rate, the inhibitory recurrent inputs in the motor network must compensate for most of this averaged FF input. Such a compensation occurs naturally if the motor network is strongly recurrent and operates in the 'balanced excitationinhibition' regime 13 (see Supplementary Note 1 for more details on the mechanism).
The fluctuations in the component common to all neurons in the same functional group give rise to the correlations in the activity in the motor network on a spatial scale on the order of the size of a group. Moreover, the groups also compete with each other provided that the recurrent interactions extend over a distance larger than the size of a group. As a result, the network dynamics self-organize such that the average instantaneous rates of the excitatory and inhibitory populations are essentially constant in time (Fig. 2d). This guarantees that the network operates in the balanced excitation-inhibition regime in a robust manner (see Supplementary Note 1).
Correlation structure in circuit driving vocal variability. Songbirds, with their well-identified and segregated circuit devoted to song learning, including a minimal circuit driving song variability (see Introduction), offer an ideal opportunity to test predictions of our theory. In songbirds, LMAN controls the trial-to-trial fluctuations across repetitions of the temporally structured song 8,9 . These fluctuations are important for adapting the song upon perturbations [21][22][23] . Moreover, anatomical studies in the circuit driving the song indicate that the projections from LMAN to RA are topographically organized, as our model posits for the projections in the premotor-to-motor pathway 24,25 . We therefore hypothesized that song variability stems from essentially uncorrelated fluctuations produced in LMAN, which by virtue of the topographic projections from LMAN to RA induce spatially correlated fluctuations in RA activity. To further test this hypothesis, we extended the two-area circuit considered above to take also into account the temporally structured inputs from nucleus HVC (used as a proper name) into RA neurons 12,26 . To this end, we included an additional FF excitation to the motor network in our model, representing the latter input (Fig. 3b, see Methods section). The responses of the neurons in the motor network are then locked to this input in a way that is reminiscent to the locking of RA neurons to the song 27,28 (compare Fig. 3a with Fig. 3c). However, these responses still exhibit trial-to-trial variability. By analysing the spatiotemporal patterns of these trial-to-trial fluctuations, we found substantial noise correlations (see Methods section) for neurons in the motor network belonging to the same group but almost none in the upstream premotor network (Fig. 4). In the motor network, noise correlations were positive for pairs in the same group. They were typically weaker and negative for pairs in different groups. The averaged correlation over all pairs of excitatory neurons was very small due to the compensation between positive and negative correlations. (Fig. 4b and also Supplementary Fig. 4). Our model thus predicts a build-up of noise correlations along the circuit generating behavioural variability in singing birds.
To test this prediction, we recorded pairs of LMAN or RA neurons during singing in zebra finches. In LMAN, we found that spike-triggered average (STA) of the local field potential (LFP), as well as STA of the multi-unit activity were weak (Fig. 5b, Supplementary Fig. 6, see Methods section). We also found that noise crosscorrelograms were flat (Fig. 5c,d) and that correlation coefficients were tightly distributed around zero (Fig. 5e) in LMAN. By contrast, in RA neurons displayed substantial noise correlations during singing, as revealed by the shape of their crosscorrelograms (Fig. 5h,i; one-tailed two-sample t-test; Po0.01 for single-units pairs, n ¼ 4 pairs in LMAN and n ¼ 5 pairs in RA; Po0.001 for single-versus multi-units pairs, n ¼ 6 pairs in LMAN and n ¼ 25 pairs in RA; and Po0.001 for multiunits pairs, n ¼ 21 pairs in LMAN and n ¼ 21 pairs in RA; see Methods section) and large values of noise correlation coefficients (compare Fig. 5j with Fig. 5e). The fact that correlations in RA were stronger than in LMAN is consistent with the model prediction, since the recorded units were likely to be located in the same functional group given the small distance between electrodes compared to RA diameter (see also Supplementary Fig. 6 and Supplementary Note 2). Multi-unit-STA and LFP-STA also consistently displayed high noise-related activity around the recorded spikes in RA, in contrast to LMAN (compare Fig. 5g with Fig. 5b and Supplementary Fig. 6a). Finally, we found that noise crosscorrelations (CCs) between LFPs recorded from evenly spaced electrodes decreased with the distance between the electrodes and became negative when they were far apart (Supplementary Fig. 6b and Supplementary Note 2). Therefore, as predicted by our model, noise correlations during singing are strong in RA, while they are extremely weak in LMAN.
Our electrophysiological recordings also reveal that the decays of the autocorrelations (ACs) and of the CCs of the spiking activity last for hundred of milliseconds in RA neurons (Figs 5h,i and 6b,c) and that these decays are substantially faster in LMAN (Fig. 6a,c; two-sample t-test, Po0.01 for n ¼ 10 single units in LMAN and n ¼ 14 single units in RA). What is the source of the relatively slow decorrelations in the activity of RA neurons? In our theory, synchronous temporal fluctuations in RA activity will be slow if the shared FF drive of the neurons in the motor network slowly fluctuates (see Supplementary Note 1 and Supplementary Fig. 7). If the synaptic dynamics in the premotorto-motor pathway are slow, they give rise to autocorrelograms and crosscorrelograms in the motor network that can be as broad as in the data (compare   stage, the inputs from HVC to RA are not yet functional 33 and the song is mostly driven by LMAN-RA circuit 10,12 . We therefore asked whether the neuronal circuit depicted in Fig. 2a can drive behavioural variability with statistics similar to those observed during the babbling stage of juvenile birds. To this end, we combined the circuit with a mechanical model of the vocal production organ 34 . The latter was developed for zebra finches. To characterize the statistics of the output signal of the model, we computed the distribution of gesture durations (vocal elements) and the autocovariance of the envelope signal (ACE; see Methods section), which quantifies high-order correlations between consecutive gestures and intergesture intervals. The gestures durations had an exponential distribution in the model (Fig. 7a, bottom left; see Methods section). As for the ACE (Fig. 7a, bottom right), it monotonically decays over a duration of several tens to hundreds of milliseconds. Quantitatively, the decay time constant of the ACE and the scale parameter of gesture duration distributions depend on the synaptic time constant of the premotor-to-motor projections (compare the blue and red lines in Fig. 7a).
To what extent do these statistics depend on the details of the model architecture and connectivity, the neuronal dynamics and the nonlinearities in the input-output transduction by the effectors? Fluctuations in the FF input to the neurons in the motor network in our circuit consist of many uncorrelated fluctuating contributions and their statistics are thus close to Gaussian ( Supplementary Fig. 8a). This is the case also for the net (FF þ recurrent) input to these neurons. (Supplementary Fig. 8b). Hence, the fluctuating activity of the neurons in the motor network can be approximately described as a wideband Gaussian process that is rectified ( Supplementary Fig. 8b-d), resulting in the tendency of the neurons to fire bursts of spikes with an approximately exponential distribution of durations (Supplementary Fig. 8f). Moreover, because neurons in the motor network are correlated, the temporal statistics of the input to an effector and of the neuron activity are similar. While the babbling behaviour generated by the circuit is a complex transformation 35 of the inputs to the effectors, the ACE or the distribution of gesture durations (but not finer structures of the gestures) are expected to be qualitatively independent on the details of this transformation. For example, for rectified power-law transformations the distribution of gesture durations is close to exponential ( Supplementary Fig. 8g) and the ACE barely depends on the nonlinearity ( Supplementary Fig. 8e). This is also true when combining the circuit with a mechanical model of the vocal production organ 34 . Therefore, these features reflect universal statistical properties of the circuit dynamics and are, to a large extent, insensitive to the circuit parameters and to the details of the transformation from the input to the effectors to the vocal behaviour.
Given the universality of the statistics of the features analysed above in our model, we then compared the 'babbling behaviour' generated by the model with babbling behaviour in different vocal learners. To this end, we analysed babbling vocalizations of juveniles from three different songbird species with completely different adult repertoires (zebra finches: single song of 3-8 syllables per individual; swamp sparrow: 2-5 stereotyped songs per individual gathering 5-10 syllable types; canaries: complex song sequences based on a repertoire of 20-40 syllables per individual), as well as vocalizations of 5-6-month-old human infants (adult repertoire: complex sentences based on 10-100 phonemes grouped in 410,000 words).
Remarkably, we found that the statistics of the vocalizations produced during the early period of babbling (but not later in development, Supplementary Fig. 9a-c,f) had a large degree of similarity in the four species we analysed. In all four species, as in the model, the distribution of vocal gesture durations could be well fitted with a single exponential (Fig. 7b-e, left and insets; see Methods section, see also 10,36 ). In addition, the ACE lacks a clear temporal structure in all babbling vocalizations. The scale parameter of the gesture duration distributions, as well as the decorrelation times of ACE (that is, the typical time constant of the ACE) varied across individuals and species from several tens to a few hundreds of milliseconds (Fig. 7b-e). However, as the distributions were close to exponential and variability within species was small (Fig. 7f,h), interspecies and intraspecies differences in gesture duration distributions became comparable after normalizing each individual distribution by its speciesaveraged scale parameter (Fig. 7g-  to species differences compared to 87% before rescaling; one-way analysis of variance (ANOVA)). A similar uniformity was observed in the inter-gesture interval distributions after normalization by the species average ( Supplementary Fig. 9d.; P ¼ 0.23, only 10% of total variance among distributions was attributed to species differences compared to 66% before normalization; one-way ANOVA). Interspecies variations in ACE were mostly due to differences in the species-averaged decorrelation time (P ¼ 0.16; only 17% of the total variance among ACEs was attributed to species differences, compared to 81% before normalization; one-way ANOVA). Finally, correlations between consecutive gestures and inter-gestures were small and comparable among species ( Supplementary  Fig. 9e). Together, these results show that in the four species we studied the statistics of the babbling-like vocalizations are very similar and can be naturally accounted by our minimal circuit.

Discussion
Our paper addresses the extent to which the intrinsic temporal irregularity of neuronal activity in the CNS can drive motor variability. This is a fundamental non-trivial question since, as to impact the behaviour, patterns of activity generated in the CNS must also be spatially correlated (that is, correlated across neurons; see also Supplementary Note 3. for alternative mechanisms). Although the emergence of asynchronous irregular activity in recurrent networks is well understood 13,16 , much less is known regarding the possible mechanisms giving rise to irregular spiking in which fluctuations of the activity are both temporally irregular and correlated across neurons. As a matter of fact, in virtually all network models of irregular spiking previously investigated, the activity is either asynchronous 16 or the synchronous component of the temporal fluctuations in neuronal spiking is strongly rhythmic 37 .
In particular, previous theoretical studies 16,38 concluded that correlations should be very weak in strongly recurrent cortical circuits (on the order of 1/N, where N is the network size). However, these studies assumed a completely random connectivity, without structure (with an Erdös-Renyi graph).
Here we showed that substantial correlations emerge naturally in a circuit with topography. With such an architecture, the dynamics self-organize in groups of neurons that are positively correlated within a group but negatively correlated between groups. In this spatial pattern of correlations, the balance between excitation and inhibition is maintained over the whole network. As a result, the circuit can eventually produce robust variable behaviour with 'universal' statistics. In fact, we showed that this mechanism does not require any fine-tuning of parameters. In particular, it is robust to the number of neurons and the average number of connections, as well as to the connectivity in the topographic pathway to the effectors (and the number of neurons projecting to an effector; see also Supplementary Note 1).
In songbirds, the organization of the LMAN-to-RA pathway becomes clearly topographic during the early sensory period of song learning 32 . Thus it is already present when juvenile finches start to babble (35-40 days post hatch (DPH)). Neurons in RA also send topographic projections to the hypoglossal nucleus (nXII) as well as to the respiratory motor nuclei 25 . The projections of the hypoglossal nucleus to syringeal muscles are also topographic 39 . Thus the pathway from RA to syringeal muscles (and likely similarly to respiratory muscles) is topographic, as required by our mechanism. Applied to the LMAN-RA circuit, this mechanism predicts that noise correlations are weak in LMAN but substantial in RA. We reported experimental evidence in line with this prediction in the adult zebra finch.
In juvenile and adult zebra finches, the inputs from LMAN to RA are dominated by NMDA receptors with slow kinetics of time constant on the order of B100 ms [29][30][31] . Moreover, recurrent excitation in LMAN is largely dominated by NMDA receptors and the kinetics of these receptors is faster in adults than in young juveniles 40 , with typical time constants of B30 ms in adults and B120 ms in juveniles 41 . Therefore, slow synapses in LMAN as well in LMAN to RA projections can underlie the relative slowness of the dynamics of the babbling behaviour we reported in juvenile finches (see also Supplementary Fig. 7 and Supplementary Note 1). In agreement with this view, localized mild cooling of LMAN in zebra finches results in an increase in the time constant of the exponential gesture distribution during babbling-like behaviour and in a longer tail in the distribution in older juveniles 36 .
Our behavioural data show substantial differences in the timescale of the babbling behaviour between zebra finches, canaries, swamp sparrows and humans. Our model suggests that this may be due to differences in the kinetics of NMDA receptors in these species. Revealing a direct correlation between these differences and NMDA receptors kinetics requires data on the latter. To the best of our knowledge, there is no such data available for canaries, swamp sparrows or human infants. However, the range spanned by the babbling timescales in our behavioural data is compatible with the diversity of kinetics reported in NMDA receptors of different subunit composition 42,43 .
In adult subjects, motor variability is expressed as fluctuations around a stereotyped motor pattern, which despite their relatively small amplitude, can contribute significantly to motor learning 2 . At early stage of development, young animals, as well as human infants, produce spontaneous exploratory gestures referred to as 'motor babbling' that do not rely on any stereotyped or goaloriented movement and rather appear to express pure motor variability 10,44,45 . Such exploratory movements may allow the self-organization 46 and the adaptation of sensory-motor networks through correlation-based (Hebbian learning) and reinforcement learning mechanisms 1,47-49 . These mechanisms posit that synaptic neural correlates of exploratory behaviour must persist for tens of milliseconds in the learning circuit. Our work suggests that the wide presence of NMDA receptors in the LMAN-to-RA projections 30,31 is a key component in the emergence of such eligibility trace in the overall dynamics of the circuit that generates behavioural variability in birds.
To conclude, we showed that a circuit comprising strongly recurrent neural networks, which is organized in a topographic manner, is capable of driving variable motor behaviours. This mechanism relies on only a few architectural constraints and is thus likely to be a general operating principle by which the brain acquires motor skills and adapt behaviour in a changing environment. During the review process, we became aware of a manuscript 50 which partially overlaps our work. It addresses the origin of spatially correlated activity in macaque V1. The mechanism proposed in this paper relies, similar to ours, on structured FF excitation and broad recurrent interactions.

Methods
Subjects. Seven human infants (3 males and 4 females) were recorded in their natural environment. Their parents gave written informed consent for participation in this study. Nine zebra finches and five canaries were obtained from our breeding facilities (Paris Descartes and Paris Sud Universities). Seven swamp sparrows were collected as nestlings and hand-reared in the laboratory (see ref. 51 for details). Birds were housed under natural light/dark conditions and provided with food and water ad libitum. Animal care and experiments were performed in accordance with European directives (86/609/CEE and 2010-63-UE) and the French legislation. Experiments were approved by Paris Descartes University ethics committee.
Human infants. We recorded spontaneous vocalizations in six infants in their natural environment starting from 5 to 7 months after birth (denoted as the 'babbling period'). The parents were instructed to place a recorder (digital dictation machine with stereo microphone, ICD-PX333M SONY) near the baby's head for B30 min at least 5 days a week for several weeks (4-20 weeks). The data presented include babies for which vocalizations were collected from at least 20 days during this recording period. Additionally, one 10-month-old infant was recorded for repetitive babbling.
Zebra finches. Juvenile zebra finches were raised in single cages with their parents and siblings. At age 26-41 DPH ('babbling period'), 9 male zebra finches were removed and placed in custom-made sound isolation chambers. Vocalizations were recorded for 10-30 days continuously with Sound Analysis 52 , which was configured to ensure that recordings were triggered on all quiet vocalizations of young birds. Five of the nine birds were continuously recorded until song crystallization (B3 months), with episodic access to their father.
Swamp sparrows. Seven swamp sparrow males were recorded in individual sound isolation chambers (Industrial Acoustics AC-1) once per week, starting in February of their first year when they were about 250 DPH. The onset of song development was first detected at 262-296 DPH ('babbling period'), and recording continued up to 366-386 DPH, when the males were singing crystallized adult song. Subsong was sampled for 30 min (Marantz PMD221 cassette tape recorder, Realistic Omnidirectional microphone, Yamaha Mike to Line Amplifier). An automated system was introduced to detect and record song during late plastic and crystallized song using a voice-activated switch (modified UherAkusomat) and a Digital Delay System (Digitech).
Canaries. Juvenile canaries were raised in our breeding facility at Paris Sud University in single cages with their parents and siblings. At age 75-150 DPH ('babbling period'), as they started to produce their first vocalizations, five male canaries were removed and placed in custom-made sound isolation chambers. Vocalizations were recorded continuously for 3 months (September-December) during the fall following their birth with Sound Analysis Pro, which was configured to ensure that recordings were triggered on all quiet vocalizations of young birds. Four of the five birds were also recorded 3 months later (early spring) for 5-10 more days.
Vocalizations. Songs and infant vocalizations were manually sorted. For subsongs, we took the first recorded song vocalizations of the bird. Recordings were from 1 day of vocalizations, except for zebra finches, where in some individuals subsongs from 1 to 3 recording days were combined to get enough gestures.
Spectrograms. Spectrograms were estimated using the multitaper method with two slepian tapers.
Envelope signal. We extracted the envelope of the signal (termed also 'amplitude' in the literature) by band passing the sound signal in the frequency ranges of the vocalizations (from 800 Hz and up to 4,000-10,000 Hz, depending on the species, with order-80 linear-phase finite-duration impulse response filter), taking the absolute value of the signal and low passing it at 1-200 Hz with a linear filter of order-200 linear-phase filter finite-duration impulse response.
Averaged ACs of envelope (ACE). The ACE was estimated for each recording and then normalized to the zero lag. The ACE signal was then estimated by averaging this signal over 1 day of recording sessions.
Gesture and inter-gesture segmentation. We used a local method for gesture and inter-gesture detection. We calculated the peaks of the derivative of the log-envelope signal (after band passing the signal; see above) that was smoothed     by a 5-30 ms sliding window, depending on the noise level of the signal and the species (using fpeaks in Matlab and a filter of ( À 1 0 1)) and defined sound onsets and offsets as the closest points to these crossings. We defined the threshold for each file by the x percentile of the peaks, where x was in the range of 85-98. The percentile threshold, x, as well as other relevant parameters, for segmentation were fixed after manually examining a subset of the data for each recording day. When the signal was too noisy to use the local method (mainly for infant babbling and several swamp sparrows), we used a global threshold: for each recording, we calculated a sound threshold by fitting a two-Gaussian mixture model (corresponding to noise and sound) using an expectation-maximization algorithm for the log-envelope signal. We then detected crossings of this threshold and defined sound onsets and offsets as the closest points to these crossings where the envelope deviated from the noise by 4 s.d. Using a global method instead of a local one on all the juvenile recordings yielded similar results; however, we preferred to use the local method whenever possible. For both methods, sounds separated by a duration of o7 ms of silence were merged into a single gesture, and segments of overly long (Zf: 4800 ms; Sw: 4400 ms; Ca: 4900 ms; Bab: 43,500 ms) or short durations (Zf, Sw, Ca: 7 ms; Bab:o30 ms) were eliminated.
Fitting exponential decay. We fit an exponential function to the gesture duration distribution using maximum-likelihood estimation on a finite interval 36 (based on a median of 1,870 gestures per day for a songbird and 700 for about 1 month of human infant recordings; interval duration: Zf: 50-800 ms; Sw:10-200 ms; Ca: 50-600 ms; Bab: 50-1,500 ms). Distributions that were well fit by the exponential function usually had a high goodness-of-fit metric 36 (adjusted R 2 40.7 and Lilliefores statistic o3.5). To extract decorrelation timescales from the ACE, we fit an exponential decay to the ACE.
Microdrive implantation. For single-and multi-unit recordings, a custom-built motorized microdrive (RP Metrix) was modified to accept 2-4 tungsten microelectrodes (8-20 MO, FHC), as well as lateral positioner. It was implanted in LMAN or RA as follows: Young adult male zebra finches (o180 DPH, 3 for nucleus LMAN and 2 for nucleus RA) were anaesthetized with 5% isoflurane (induction) and placed in a stereotaxic apparatus with a head angle of 30-50°( for LMAN implantation) or À 5°(for RA). Anaesthesia was maintained with 0.5-1% isoflurane for the duration of the surgery. LMAN/RA was located using previously established stereotaxic coordinates and identified based on its characteristic neural activity patterns. The electrodes and exposed brain were surrounded with Kwik-Cast (WPI), and the microdrive was secured to the skull using dental cement (Superbond, Phymep). A silver wire implanted under the skull acted as a ground, and a low impedance fixed tungsten electrode served as the reference. We also conducted LFP recordings in nucleus RA, using 3 Â 3 microelectrode arrays (AlphaOmega) with an impedance of 1-2 MOhm. Electrode arrays were implanted in young male zebra finches under isoflurane anaesthesia (as specified above), relying on the recorded signals to locate nucleus RA, and then fixed onto the skull using dental cement (Superbond, Phymep). A silver wire implanted under the skull was used as a ground, and one of the contact points of the array served as the reference. In both types of experiments, subjects were allowed to recover and habituate to the weight of the recording apparatus for a few days. They were then transferred to the recording cage and connected through a commercial tether and head stage (Neuralynx or AlphaOmega) and the implanted microdrive to a mercury commutator on the roof of the cage (Dragonfly systems). An elastic thread built into the tether helped to support the weight of the implant. Subjects remained tethered both during and between experiments.
Chronic recordings. Neural signals and vocalizations were collected using a commercial head stage and acquisition system (Neuralynx or AlphaOmega). Signals were amplified, digitized and filtered either o300 Hz (LFP signal) or between 300 and 30 kHz (spike signal).
Data analysis. Spike signals were analysed using Spike2 software (Cambridge Electronic Design) and custom-written software in Matlab (MathWorks). Single-and multi-unit signals were isolated using Spike2, and spike times were then exported to Matlab. Motif onset times were extracted from sound recordings using custom programs. We calculated the autocorrelograms (AC; 5 s window, 5 ms bin) of single spike trains and crosscorrelograms (CC; 5 s window, 5 ms bin) of all pairs of spike trains recorded simultaneously. Activity was first aligned to motif onsets and then averaged over all motifs produced during each recording session (peristimulus time histogram (PSTH) analysis), using a time window limited to the duration of a single song motif, and 5 ms bins. To eliminate temporal variability due to fluctuations in the duration of single syllables, spike trains were aligned and stretched using piecewise linear time warping with each syllable onset as a time reference 28 . Signal correlations measure the similarity of the activity of two neurons during singing. Noise correlations, on the other hand, is a measure of the similarity of the trial-to-trial variability (around the motif-related PSTH) of two neurons. We computed the spike counts in 5 ms bins during song production and subtracted the mean motif-related PSTH of a song motif for each neuron for all motifs produced. For each pair of simultaneously recorded neurons (by definition, the noise correlations can only be calculated for simultaneously recorded neurons), we computed the noise correlations by calculating the correlation coefficient of these two vectors. To compare the shapes of the crosscorrelograms from units recorded in two considered brain nuclei (LMAN and RA), we measured the mean deviation from zero in a crosscorrelogram according to the following procedure. The absolute value of the CC function was first averaged over the ( À 50 ms þ 50 ms) window (appropriate for the behavioural output given the integration time-constant). The absolute value of the AC functions of the two corresponding units was also averaged over the same time window. The average absolute CC was then normalized to the square root of the product of the average absolute AC functions to provide the numbers used in the statistical tests in the main text. This equation is given by: Shape of crosscorrelograms ¼ P t¼50 ms t¼ À 50 ms c 12 t ð Þ j j ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi P t¼50 ms t¼ À 50ms c 11 t ð Þ j j P t¼50 ms t¼ À 50 ms c 22 t ð Þ j j q where c 12 (t) is the crosscorrelogram across the two neurons at time lag t and c 11 (t) and c 22 (t) are the ACs of the two neurons.
LFP signals were aligned to motif onsets and averaged over all motif renditions produced during a recording session. To calculate the 'noise STA LFP', we first subtracted motif-aligned average LFP from LFP signals recorded during each single motif. We then computed the average of the residual-LFP signals cut in 600 ms window around each spike over all spikes produced during all motif renditions in the recording session. Additionally, we computed noise STA of the envelope of the background multi-unit spiking activity present behind single-unit recordings. To this end, we first removed spike shapes from the sorted single unit from raw spiking signal, rectified the leftover background signal and convolved it with a 5 ms wide Gaussian function. A similar treatment as for the LFP was applied to this background multi-unit envelop to get its noise STA.
To quantify the possible effect of bad or partial time warping on the level of correlation in our data set, we first compared the level correlation in our data set before and after time warping and then also incorporated artificially wrong syllable timing before the time warping was applied. To this end, we added variable jitter (from 2 to 500 ms) to the syllable onset times and then time-warped the spike times of the two units according to this jittered syllable timing. This manipulation served to artificially introduce a strong misalignment of the spiking activity with real singing behaviour, with a common time jitter for both units, without modifying the average activity of the neurons in a given trial.
Statistics. Numerical values are given as mean ± s.d., unless stated otherwise. Whenever using a statistical test, we report the type of test applied and the associated P value (probability of observing the given result, or one more extreme, by chance if the null hypothesis is true).
Histology. After the last recording session, subjects were killed by intramuscular injection of sodium pentobarbital (Nembutal) and perfused transcardially with 0.9% saline followed by 4% paraformaldehyde as fixative. The brain was then removed, postfixed in 4% paraformaldehyde for 24 h and cryoprotected in 30% sucrose. Sections (60 mm thick) were then cut in the parasagittal plane on a freezing microtome and processed for histological examination to verify the location of the recording electrodes. Tissue was Nissl stained to visualize the electrode tracks.
Spiking network model. Our model consists of two large recurrent networks, both comprising N E excitatory (E) and N I inhibitory (I) neurons. For simplicity, we take N E ¼ N I ¼ N. These two networks represent a premotor and a motor network. The premotor network projects in a FF manner to the motor network, and the latter activates a small number of effectors, consistent with the songbird anatomy 10,24,25,39 .
Single neuron and synaptic dynamics. All the neurons in the circuit are modelled as leaky-integrate-and-fire (LIF) units. The subthreshold dynamics of the membrane potential, V i,a (t), of neuron i in population a(i ¼ 1, y, N a ; a ¼ E,I) obey: where t m is the neuron membrane time constant, I a rec;i ðtÞ is the recurrent input into neuron (i, a), due to its interactions with other neurons in the same network (premotor or motor), I a FF;i ðtÞ is the total FF input into that neuron. V L is the reversal potential of the leak current (taken to be V L ¼ À 60 mV).
These subthreshold dynamics are supplemented by a reset condition: if at t ¼ t ia the membrane potential of neuron (i, a) crosses the threshold, V i;a t À ia À Á ¼ 10 mV, the neuron fires an action potential and the voltage reset to V i;a t þ ia ð Þ ¼ À60 mV. We model all synaptic inputs as pure currents. The total current into neuron (i, a) due to its recurrent interactions yields: where J ab ij is the strength of the connection from presynaptic neuron (j, b) with neuron (i, a) and S ab j ðtÞ are the synaptic variables, which follow the dynamics: Here t ab s is the synaptic time constant (assumed to depend solely on the natureexcitatory or inhibitory-of the presynaptic and postsynaptic neuron) and the sum is over all spikes emitted at times t jb ot.
Recurrent architecture. The recurrent connectivity of the E and I populations in the premotor network is random (Erdös-Renyi graph). In each network, the connectivity matrix, C ab , between presynaptic population b and postsynaptic population a is therefore a random N Â N matrix such that C ab ij ¼ 1, with probability K/N and zero otherwise, where K is the average number of inputs a neuron receives from population b. We assume that the strength of the synapses depends solely on these populations yielding: J ab ij ¼ J ab C ab ij where J aE 40 (excitation) and J aI o0 (inhibition). When comparing the dynamics of networks with different connectivity, we follow the prescription 13,53 : where the parameters J ab are of order unity and can be different for the premotor and motor networks.
Distance-dependent recurrent architecture. In the motor network, the connectivity is random with probability, which depends on the distance between the neurons. The probability of connections between two neurons is p ij where s rec is the footprint of the recurrent interactions. Here we have assumed for simplicity that the motor network is one dimensional, of size l and periodic boundary conditions. For large values of s rec , distant neurons are as likely to be connected as close ones, while for small values of s rec only neurons which are close have a significant probability to be connected (Fig. 2h). In most of the results depicted in the paper, we assume that the recurrent interactions in the motor network have a wide footprint (s rec -N), except for Fig. 2h,i, where we investigate how the results depend on the value of s rec .
Feed-forward architecture. The premotor network receives external FF inputs, which in the context of the songbird system represent the thalamic (the medial part of the dorsolateral nucleus of the anterior thalamus, DLM) inputs that may tonically activate LMAN during song. The total number of FF inputs to a premotor neuron is modelled as a constant drive 13,53 , ffiffiffi ffi K p I a . Similarly, the motor network receives a FF input from outside the circuit that we model as a constant drive. Importantly, the motor network also receives FF projections from the premotor area that exhibit a topographic organization. To implement this key feature of the architecture of our model, we divide the excitatory population in the motor network into D statistically equivalent functional groups (N/D neurons in each group). For each group, we choose a set of fK neurons (set P l ) in the premotor network projecting to in neurons in the group. For each neuron in the group, additional inputs are chosen by drawing randomly from the premotor network with probability (1Àf)K/N. Each neuron in a group therefore receives on average K projections from the premotor network. Changing f allowed us to easily manipulate the total amount of correlations in the FF inputs by changing the parameter, f, keeping the total average number, K, of premotor inputs per neurons fixed. For f ¼ 1 all the projections are topographic, whereas if f ¼ 0 they are completely random. The total FF premotor input into an excitatory neuron i (i ¼ 1, y, N) in group l ( ¼ 1, y, D) in the motor network is therefore modelled as: In this architecture, the probability that two neurons in group l share premotor inputs is f þ O(K 2 /N 2 ). Note that if K is too large it will be impossible to have different shared inputs for each group in the motor network. However, this does not happen with the model parameters in the simulations described in the paper since we take NZ10,000, and the maximum number of connections is K ¼ 1,000 for 10 clusters (Supplementary Fig. 3d).
The total FF premotor input into an inhibitory neuron i (i ¼ 1, y, N) in the motor network is I I FF;i ðtÞ ¼ J I0 f P j2premotor C I0 ij S I0 j t ð Þg þ ffiffiffi ffi K p I I . The synaptic strength, J a0 , is parameterized as the recurrent synapses: J a0 ¼ J a0 = ffiffiffi ffi K p , with J a0 of order unity and C a0 ij is a random adjacency matrix: C a0 ij ¼ 1 with probability K/N and zero otherwise. Finally, similar to the recurrent interactions, the dynamics of the synaptic variables S a0 i ðtÞ yields: t a0 s dS a0 i ðtÞ dt ¼ À S a0 i ðtÞ þ P ftjE g dðt À t jE Þ with t jE as the spike times of neuron (j, E) in the premotor network. For simplicity, we take K FF ¼ K.
Temporally structured FF input to the motor network. We model the temporally structured inputs to the motor network (HVC to RA input) by including an additional contribution, I struct i t ð Þ, to the FF input received by the neurons in this network. The input, I struct i t ð Þ, to neuron i, lasts for a duration of 600 ms (a typical duration of a zebra finch song motif) repeated 300 times. It consists of a random sequence of On and Off periods, the duration of which are drawn from an exponential distribution with mean 20 ms for the On and 70 ms for the Off periods. The amplitudes of the input during the On periods are drawn randomly from uniform distribution over an interval [0.1, 0.5]. The input sequences are generated independently for neurons in different functional groups. Each group is then divided into 20 (non-overlapping) subgroups such that all the neurons in a subgroup share the sequence. Note that the results depicted in Fig. 4a were obtained by simulating the network without structured input.
Effectors. The pathway from the motor network to the effectors is topographic. Specifically, we assume that: (1) the number of effectors and the number of groups are equal; (2) a given effector is activated by M4 4D neurons in the motor network randomly chosen from one group; and (3) different effectors are activated by different functional groups. We modelled the activation of an effector, E l (t)(l ¼ 1, y, D), as a linearly filtered version of the activity of the neurons in the motor network, namely: where t eff is the effector time constant and the sum is over all spikes emitted by the neurons in the motor network, which activates the l effector at times t jE ot.
The vocal organ. We modelled the vocal tract as in Amador et al. 34 In particular, we did not include the trachea or the Helmholtz filter, as these filters are species specific and in general will not affect gesture and inter-gesture durations. Two variables activate the vocal organ: tension and pressure. We modelled the pressure variable as e Pr ¼ E 1 ½ þ , where [x] þ is a rectified linear function. The tension is modelled as a linear combination of nine effectors:T n ¼ 1 W a E a , where, W l , (l ¼ 2, y, D), are random weights, W l BN(0,1). Tension and pressure are then scaled to fit the dynamic range of the oscillating phase (see ref. 34): where z T is the z-score ofT n and s T , m T and P max are constant parameters that define the dynamic range and b is a bias that ensures that when there is no pressure the system is at a fixed point. We take: m T ¼ 0.6; s T ¼ 0.2; P max ¼ 0.21; b ¼ 0.01. The tension and pressure were then smoothed by a rectangular window of 20 ms and interpolated to a sampling frequency of 44,100 Hz. We then used the tension and pressure parameters to simulate the model by Amador et al. 34 Finally, to reduce transient effects at the boundaries of the gestures (as a result of crossing the bifurcation) generated by the vocal tract model, the sound signal was taken as the product of the output model and the Pr signal.
We would like to stress here once more that the qualitative behaviour of the model is highly robust to changes in all its parameters (see Supplementary Note 1).
Numerical integration. The dynamics of the model circuit were numerically integrated using the Euler method supplemented with an interpolation estimate of the spike times 54 . In all simulations the integration time step was 0.1 ms. We verified the validity of the results by performing complementary simulations with smaller time steps.
Autocovariance and crosscovariance of spike activities. Neuronal spike trains were filtered with an exponential kernel (time constant ¼ 5 ms). ACs and CCs of neuronal activities were estimated from the resulting smoothed signals. Populationaveraged ACs and CCs were computed over all neurons in the corresponding population. The Pearson CC was defined as the crosscovariance normalized by the autocovariance at zero lag.
Measure of synchrony and variability of the effectors. We quantified the degree of synchrony in the activities of the premotor or motor network using the synchrony measure, w(M), defined by 55,56 : In that case, pair-wise correlations are small, of the order of 1/ N and the population average firing rate is constant in time. In contrast, if a converges to a non-zero value for large N, the network is in a synchronous state. To quantify the variability of the inputs to the effectors (receiving inputs from M neurons in the motor network), E l (M, t)(l ¼ 1,y, D), we computed the coefficient of variation, CV eff such that: If the motor network is in an asynchronous state, A $ 1 N , since E l (t) is linearly related to the population averaged activity in a functional group l.
Data availability. All relevant data and computer codes are available from the authors.