Abstract
Songbirds are capable of vocal learning and communication1, 2 and are ideally suited to the study of neural mechanisms of complex sensory and motor processing. Vocal communication in a noisy bird colony and vocal learning of a specific song template both require the ability to monitor auditory feedback3, 4 to distinguish self-generated vocalizations from external sounds and to identify mismatches between the developing song and a memorized template acquired from a tutor5. However, neurons that respond to auditory feedback from vocal output have not been found in song-control areas despite intensive searching6, 7, 8. Here we investigate feedback processing outside the traditional song system, in single auditory forebrain neurons of juvenile zebra finches that were in a late developmental stage of song learning. Overall, we found similarity of spike responses during singing and during playback of the bird's own song, with song responses commonly leading by a few milliseconds. However, brief time-locked acoustic perturbations of auditory feedback revealed complex sensitivity that could not be predicted from passive playback responses. Some neurons that responded to playback perturbations did not respond to song perturbations, which is reminiscent of sensory-motor mirror neurons8, 9. By contrast, some neurons were highly feedback sensitive in that they responded vigorously to song perturbations, but not to unperturbed songs or perturbed playback. These findings suggest that a computational function of forebrain auditory areas may be to detect errors between actual feedback and mirrored feedback deriving from an internal model of the bird's own song or that of its tutor. Such feedback-sensitive spikes could constitute the key signals that trigger adaptive motor responses to song disruptions10, 11 or reinforce exploratory motor gestures for vocal learning12.
The field L region and the caudolateral mesopallium (CLM) are interconnected brain areas not part of the traditional song-control system and are analogous to the auditory cortex in mammals in that they receive the main stream of auditory input from the thalamus, as well as feedback from motor-related areas13, 14, 15, 16, 17. Neurons in field L and CLM of awake and anaesthetized animals respond robustly to a large variety of auditory stimuli such as white noise, the bird's own song (BOS), and conspecific songs18, 19, 20, 21. These features make field L and CLM potential substrates for the integration of self-generated and external sounds and for monitoring singing-related auditory feedback. To explore this hypothesis, we made extracellular recordings from CLM and field L neurons in juvenile male zebra finches using chronically implanted miniature motorized microdrives (Supplementary Fig. 1). Our strategy was first to probe singing-related firing in these neurons for evidence of motor-specific processing beyond passive auditory responses elicited by playback of the BOS, and then to investigate the feedback sensitivity of singing-related spikes by delivering brief acoustic stimuli during singing.
Singing- and playback-related firing was similar in most cells (Fig. 1a), despite the large differences in sound amplitudes in vocal and playback conditions and despite the variable direction of the playback source relative to the bird's head. Average firing rates in vocal and playback conditions were typically above baseline rates (Fig. 1b) and were highly correlated (r = 0.77, P < 10^-10, n = 92; Supplementary Figs 2 and 3). Firing rates were also well matched on a finer timescale. In 69 of 92 cells, the spike-coherency function22, 23 averaged over all pairings of song and playback trials displayed a significant peak (>2 jackknife standard deviations above zero) within 20 ms. The peak of the mean coherency function (averaged over 92 cells) was significant and occurred 6.8 ms after singing-related spikes (Fig. 1c), indicating that, overall, singing-related activity slightly preceded playback-evoked responses (only 0.5 ms of this lag can be explained by the closer proximity of the sound source to the bird's ears during singing). This anticipatory behaviour of singing-related activity suggests that in addition to auditory inputs, cells in field L and CLM also received inputs from a vocal-related, non-auditory source. Consistent with this view, in roughly one-fifth of the cells we observed firing increases before onset of the first introductory note of a song bout (Fig. 1d, e and Supplementary Fig. 4), which demonstrates a source of non-auditory drive in these cells during singing. Hence, some neurons seemed to integrate auditory with non-auditory signals, of which the latter may have reflected information about song-motor activity, for example as part of a motor estimate of auditory feedback.
Figure 1: Comparison of active responses with passive responses.

Active responses were similar to passive responses, but were more stereotyped and slightly anticipatory. a, Activity during song (green rasters) leads activity during BOS playback (red rasters) in this example neuron. Top: spectrogram of an example song motif (high sound amplitudes in red and low amplitudes in blue). Bottom: average firing rate (FR) curves. b, Scatter plot of firing rates (z-scores) in individual cells. Cells were stimulated with one version (black circles) or many versions (red crosses) of the BOS. c, The mean coherency function between singing- and playback-related spike trains peaked 6.8 ms (dotted line) after singing-related spikes. d, The spike rasters in this neuron are anticipatory to song onset (blue line) and delayed relative to playback onset. Significant deviations from baseline firing are marked by green and red horizontal bars, and onset times are indicated by asterisks. Top: mean sound amplitude (solid line) plus/minus standard deviation (dashed line). e, Cumulative distribution of response onset times. f, Median firing stereotypy (black bars) during song (n = 92) is similar to stereotypy during playback of a single BOS (n = 24) but higher than stereotypy during playback of many versions of the BOS (n = 68). Second and third quartiles are shown by coloured boxes.
Motor-specific processing was also evident by analysis of firing stereotypy, which was higher during singing than during playback of different versions of the BOS (Wilcoxon rank-sum (WRS) test, P < 10^-8; Fig. 1f). This stereotypy difference could be attributed neither to intrinsic differences between song and playback stimuli (because the latter were copies of the former) nor to differences in average firing rates (Supplementary Fig. 3). By contrast, the firing stereotypy during singing did not differ significantly from the stereotypy during playback of one unique BOS stimulus (WRS test, P = 0.42). Thus, singing-related firing stereotypy was higher than predicted by passive responses, but was commensurate with intrinsic synaptic noise, suggesting that auditory responses may be partly subsumed during singing by a motor-specific source of stereotyped synaptic input.
Brief acoustic stimuli delivered during singing provide an effective means of operant conditioning of song features11. Such stimuli are thus ideally suited to probing auditory feedback sensitivity. In 50% of song motifs (randomly selected) we applied a brief perturbing stimulus through a second loudspeaker that was time locked to a given syllable (see Methods, Fig. 2a and Supplementary Figs 5 and 6). In agreement with previous reports on adult birds6, we found that feedback perturbations did not induce immediate spectral or temporal changes in vocal output (all analyses were restricted to song motifs with conserved syllable sequences; see Supplementary Figs 7–9 and Methods).
Figure 2: Example perturbation responses.

a, Bottom: spectrogram of a song that was twice perturbed by a long-call stimulus (red shading). Song motifs are delimited by red boxes. Top: extracellular voltage trace of a simultaneously recorded neuron (inset, spike burst). b, Neuron with increased firing during perturbation of song (top) and of BOS playback (bottom). Perturbing stimuli are indicated by red shaded areas; lighter shading corresponds to lower sound amplitudes. Average firing rate curves are shown for cases without perturbation (blue line) and with perturbation (solid red line; dashed and dotted red lines are averages over trials with high and low perturbation amplitudes, respectively). Also shown are average spontaneous firing rate (dashed horizontal lines) and the times of significant perturbation responses (black horizontal bars). b = 0.0, s = 0.42. c, As in b, but for a neuron that responds to perturbations only in the playback, but not the vocal condition. b = -0.95, s = 0.0. d, e, Two feedback-sensitive neurons in the same bird with selective (b = 0.38, s = 1.38; d) and highly selective (b = 0.38, s = 7.11; e) responses for song perturbations (e shows the same cell as in a).
We quantified the propensity of cells to respond to perturbations either in the vocal or the playback condition in terms of a response bias b confined to values between b = -1 (no perturbation response during song) and b = 1 (no perturbation response during playback). Similarly, we quantified the sensitivity of responses to perturbed feedback in terms of a selectivity s that was normalized to s = 0 if the firing did not change during feedback perturbations and to s = 1 if the firing doubled during perturbations (see Methods). Overall, 66 of 67 cells significantly responded to feedback or playback perturbations (54 of 67 to feedback, and 53 of 67 to playback; see Methods). Many cells responded robustly to perturbations in both conditions (Fig. 2b). However, almost 20% of cells had a strong playback bias and small selectivity for distorted feedback (b < -0.5 and |s| < 0.5, n = 12); these cells responded to playback perturbation, but did not respond to even high-intensity song perturbation (Fig. 2c). About 10% of cells showed very high selectivity for distorted feedback (s > 3, n = 8), and many of them tended to be quite unresponsive to perturbed playback (Fig. 2d, e and Supplementary Fig. 10a). Feedback perturbations predominantly induced firing increases (s > 0), although we observed firing suppression in a few cells (Fig. 3a and Supplementary Fig. 10b, c). In both conditions, sound amplitudes during perturbation responses were significantly higher than average (Student's t-test, P < 10^-10), in agreement with a monotonic relation between perturbation amplitude and response selectivity s (Supplementary Fig. 11). However, onset times of perturbation responses were widely distributed across cells (Fig. 3b), revealing a cell-specific temporal modulation of perturbation sensitivity.
Figure 3: Summary of perturbation selectivity.

a, Scatter plot of bias, b, versus selectivity, s (n = 67 cells, circles), the marginal distributions (histograms, showing number of cells), and the median chance selectivity (dashed line) with first and third quartiles (grey shading; see Methods). b, Cumulative distributions of perturbation-response onset latencies during song (solid lines) and playback (dotted lines). Latencies were widely distributed, both across all birds (black lines) and in two individual birds (orange and purple lines). The estimated contribution caused by random fluctuations is shown by the dashed line (labelled 'Chance'; see Methods), and the range of perturbation end points by the grey shaded area.
In six of eight cells with selectivity s > 3, peak firing rates during perturbed song were more than three times higher than peak firing rates during unperturbed song (in the remaining two cells, peak firing rates were more than 40% higher). Thus, responses to perturbed feedback could largely exceed all of unperturbed singing-related activity, suggesting that high selectivity for distorted auditory feedback derives from a precisely timed and strongly coherent synaptic drive.
We also explored whether high selectivity was associated with suppression of neural activity during self-initiated vocalizations, because such suppression is a common gain-control mechanism found in the auditory brain areas of animals as diverse as crickets24, bats1 and marmosets25. The baseline firing in highly selective cells (s > 3) was lower than for all other cells (5.7 Hz versus 19.5 Hz, P = 0.028; Supplementary Fig. 12a); nevertheless, highly selective cells were relatively more suppressed during singing (WRS test of equal median z-scores, P = 0.005; Supplementary Fig. 12b).
In conclusion, field L and CLM responses equally reflect processing of self-generated and external auditory inputs, as made evident by the similarity of average firing rates in active and passive conditions (Fig. 1b and Supplementary Fig. 3), by the uniformly distributed bias index (Fig. 3a) and by similarity of onset-latency curves (Fig. 3b). Hence, the auditory forebrain seems to form an invariant representation of actively and passively perceived songs for integrating and comparing auditory feedback with the songs of other birds. Dissimilarities between active and passive sound processing were evident in terms of a motor-specific drive. Consequently, some neurons showed singing-related activity that resembled playback-evoked activity but was insensitive to perturbed feedback. Such behaviour is reminiscent of auditory-vocal mirroring reported in HVC8 neurons and could arise from corollary discharges elicited by an efference copy of motor commands. On the other hand, neurons that were largely quiescent during singing, except when the auditory signal was perturbed, are reminiscent of some neurons in primate auditory cortex that strongly respond to frequency-shifted auditory feedback26. Vocal-mirror spikes could contribute to the generation of highly perturbation-selective responses, provided that such spikes are able to precisely counterbalance the excitatory drive elicited by sensory feedback. We find some evidence for such counterbalancing in perturbation-selective neurons in terms of their relatively strong firing suppression during song, even though suppression was rare overall, unlike in monkey auditory cortex25 and in an auditory ganglion of crickets24 (note that in crickets, responses are not perturbation selective, despite this suppression).
Relatively few cells specialized into highly selective perturbation detectors, yet their mere existence suggests that auditory feedback is analysed in the auditory forebrain with reference to an internal model (Supplementary Fig. 13). For example, vocal mirror responses could represent predicted auditory feedback, which helps the bird to generate a stable perception of its song in the midst of a noisy colony. Accordingly, highly feedback-sensitive responses would reflect prediction errors of auditory feedback; such errors could signal song disruptions or simplify vocal learning, according to some forward-model theories27. Alternatively, given that the birds in our study were in the process of learning a tutor song, vocal mirroring could constitute online replay of the tutor memory, as evidenced by similar firing stereotypies during song and during playback of a single song template (Fig. 1f). Similarity of vocal and playback-related firing could thus be a reflection of a good match between the actual song and the memorized auditory template, of which the latter may feed into CLM and field L through the caudomedial nidopallium28, 29. According to such a template-replay interpretation, responses in highly perturbation-selective neurons represent performance errors signalling the dissimilarity between the perturbed song and the tutor memory, a property with obvious benefits for song learning12, 27. The ability of birds to correlate perturbations with subtle motor variability11 suggests a functional connection between perturbation-selective neurons and premotor neurons, although a direct link between perturbation selectivity and song learning remains to be observed.
Methods Summary
Subjects and electrophysiology
All experiments were carried out in accordance with protocols approved by the Veterinary Office of the Canton of Zurich, Switzerland. Data were collected from six juvenile male zebra finches (60–92 days old). Microdrives with three independent electrodes were implanted 1.1–2.1 mm anterior and 1.5–2.0 mm lateral of the midsagittal sinus (under a fixed head angle of 70° or 90°) using methods previously described30. After each experiment, the brain was removed for histological examination of unstained slices to verify the location of reference lesions. Neurons were classified as putative CLM or putative field L based on anatomical location15.
Data analysis
Differences between average firing rates in unperturbed and perturbed trials were assessed using a WRS test (P = 0.05) on the number of spikes in 30-ms time windows. Windows were shifted in 5-ms steps and only sequences of at least two subsequent windows with P < 0.05 were considered significantly different.
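The sliding-window procedure described above can be sketched as follows. This is a minimal illustration and not the authors' analysis code: the function names, the trial format (one list of spike times in ms per trial) and the use of a normal approximation without tie correction for the rank-sum P value are all assumptions made here.

```python
import math


def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum P value.

    Assumption: normal approximation with average ranks for ties and no
    tie correction; the source does not state which variant was used.
    """
    n1, n2 = len(x), len(y)
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    mean = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (rank_sum_x - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2.0))


def count_spikes(trials, start, width):
    """Spike count in [start, start + width) for each trial (times in ms)."""
    return [sum(1 for t in trial if start <= t < start + width) for trial in trials]


def significant_windows(unperturbed, perturbed, t_start, t_end,
                        width=30.0, step=5.0, alpha=0.05, min_run=2):
    """Start times of windows lying in runs of >= min_run consecutive
    windows with P < alpha (30-ms windows shifted in 5-ms steps by default)."""
    starts, flags = [], []
    t = t_start
    while t + width <= t_end:
        p = rank_sum_p(count_spikes(unperturbed, t, width),
                       count_spikes(perturbed, t, width))
        starts.append(t)
        flags.append(p < alpha)
        t += step
    keep = []
    i = 0
    while i < len(flags):
        if not flags[i]:
            i += 1
            continue
        j = i
        while j < len(flags) and flags[j]:
            j += 1
        if j - i >= min_run:  # discard isolated significant windows
            keep.extend(starts[i:j])
        i = j
    return keep


# Hypothetical data: 10 silent unperturbed trials, 10 perturbed trials that
# each fire at 100 ms and 110 ms.
unperturbed = [[] for _ in range(10)]
perturbed = [[100.0, 110.0] for _ in range(10)]
windows = significant_windows(unperturbed, perturbed, 0.0, 200.0)
```

Requiring at least two consecutive significant windows guards against single windows crossing the threshold by chance, at the cost of missing responses shorter than the window overlap.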
We defined the response bias as b = (|d_f| - |d_p|)/(|d_f| + |d_p|), where d_f is the feedback perturbation response and d_p is the playback perturbation response, each defined as the difference in average firing rates between perturbed and unperturbed conditions in a time window extending from the time of onset of perturbation up to 50 ms after perturbation ends. We defined the response selectivity as s = d_f/r_s, where r_s is the average firing rate during unperturbed song motifs.
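As a worked illustration of these two definitions, the sketch below computes the bias and selectivity from average firing rates. The function names and the example rates are hypothetical, and returning NaN when a denominator is zero is an assumption; the text does not say how such cells were handled.

```python
def perturbation_response(rate_perturbed, rate_unperturbed):
    """Difference in average firing rate (Hz), perturbed minus unperturbed."""
    return rate_perturbed - rate_unperturbed


def response_bias(d_f, d_p):
    """b = (|d_f| - |d_p|) / (|d_f| + |d_p|), bounded between -1 and 1."""
    denom = abs(d_f) + abs(d_p)
    if denom == 0:
        return float("nan")  # assumption: undefined when neither condition responds
    return (abs(d_f) - abs(d_p)) / denom


def response_selectivity(d_f, r_s):
    """s = d_f / r_s, with r_s the average rate during unperturbed song motifs."""
    if r_s == 0:
        return float("nan")  # assumption: undefined for cells silent during song
    return d_f / r_s


# Hypothetical rates in Hz, chosen only for illustration.
d_f = perturbation_response(rate_perturbed=40.0, rate_unperturbed=20.0)  # song
d_p = perturbation_response(rate_perturbed=25.0, rate_unperturbed=20.0)  # playback
b = response_bias(d_f, d_p)          # (20 - 5) / (20 + 5) = 0.6
s = response_selectivity(d_f, 20.0)  # 20 / 20 = 1.0, i.e. firing doubled
```

A cell that responds only to playback perturbations gives b = -1, and one that responds only to song perturbations gives b = 1, matching the limits stated in the main text.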
The stereotypies of singing- and playback-related spike patterns were assessed using the average coherency of spike rasters at zero time lag (sound traces were aligned by the amplitude onset of the detected syllable). Differences between singing- and playback-related firing stereotypies were detected using a WRS test (same BOS, P = 0.3; different BOS, P < 10^-7).
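The coherency estimator itself is not specified in this summary. As a rough stand-in only, the sketch below scores stereotypy as the mean pairwise Pearson correlation of binned spike counts across aligned trials, which is a time-domain analogue of a zero-lag measure and not the spectral coherency used in the paper. The bin width, function names and trial format (spike times in ms relative to the alignment point) are assumptions.

```python
import math
from itertools import combinations


def bin_spikes(spike_times, duration, bin_width):
    """Spike counts per bin over [0, duration); times in ms."""
    n_bins = int(math.ceil(duration / bin_width))
    counts = [0] * n_bins
    for t in spike_times:
        if 0 <= t < duration:
            counts[int(t // bin_width)] += 1
    return counts


def pearson(x, y):
    """Pearson correlation, or None if either vector has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def stereotypy(trials, duration, bin_width=5.0):
    """Mean zero-lag correlation over all pairs of aligned trials."""
    binned = [bin_spikes(t, duration, bin_width) for t in trials]
    values = [pearson(a, b) for a, b in combinations(binned, 2)]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else float("nan")


# Hypothetical trials: three identical spike trains are maximally stereotyped.
identical = [[10.0, 50.0, 90.0] for _ in range(3)]
score = stereotypy(identical, duration=100.0)
```

With this stand-in, perfectly repeated spike patterns score 1 and trial-to-trial jitter larger than the bin width lowers the score, so the chosen bin width sets the temporal precision being tested.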
Full methods accompany this paper.

