The segregation of vocal circuits solves a credit assignment problem associated with multi-objective reinforcement learning

Motor circuits vary in topographic organization, ranging from a coarse relationship between neuron location and function to highly localized regions controlling specific behaviors. For unclear reasons, vocal learning circuits lie at this second extreme: they repeatedly evolved to be spatially segregated from other parts of the motor system. Here we show that spatially segregated motor circuits can solve a specific problem that arises when an animal tries to learn two things at once. We trained songbirds in vocal and place learning paradigms with brief strobe light flashes and noise bursts. Strobe light negatively reinforced place learning but did not affect song syllable learning. Noise bursts positively reinforced place preference but negatively reinforced syllable learning. These double dissociations indicate that vocalization-related reinforcement signals specifically target the vocal motor system, while place-related reinforcement signals specifically target the navigation system. Non-global, target-specific reinforcement signals have established utility in machine implementation of multi-objective learning. In vocal learners, such signals could enable an animal to practice vocalizing as it does other things such as forage for food or learn to walk.


Introduction
Diverse behaviors can be shaped by primary reinforcement such as reward (e.g. food or water) and punishment (e.g. electric shock), including place preference, lever pressing, action sequencing and timing, reaching, choice tasks, and more 1 . Electrical or optogenetic activation of ascending neuromodulators such as dopamine can also reinforce a wide range of actions coincident with the stimulation 2,3 . The diffuse, nontopographic projection patterns of ascending neuromodulatory systems are well-suited to carry reinforcement signals globally to multiple action-generating modules in basal ganglia and cortex [4][5][6] .
Yet one problem with global reinforcement signals is credit assignment: how does the brain 'know' which action caused a reward and, relatedly, which action-generating neural circuit requires synaptic plasticity and policy updating to improve performance? Superstitious behaviors acquired during reinforcement learning exemplify how a global reinforcement signal can mis-assign credit to a motor act temporally contiguous with, but causally unrelated to, reinforcement 7 . Stereotypic body rotations and arm and leg movements acquired during simple tapping or pecking tasks further demonstrate that motor regions controlling arm, leg, and orientation circuits share common, broadcasted reinforcement signals 8,9 . The credit assignment problem is particularly severe in cases when an agent pursues multiple objectives at once [10][11][12][13] . For example, consider a toddler babbling to herself while stacking blocks. She uses her vocal motor system to speak and her hands and arms to stack. Learning these tasks depends on different types of feedback.
Learning to talk may rely on comparison of sensory feedback to an internal auditory target, while learning to stack blocks may rely on comparison of sensory feedback to an entirely independent visual target.
Whereas standard reinforcement learning (RL) algorithms optimize a single cost function (e.g. maximize cumulative reward) with a scalar reinforcement signal 17 , in multi-objective learning a single agent can be endowed with independent sub-agents which are trained by an equal number of agent-specific reinforcement signals [14][15][16] . In the babbling toddler, for example, auditory error signals would reach the vocal motor system (and not the block-building one) to shape future vocalizations. Meanwhile, errors such as tower collapse would reach the block-building system (and not the vocal motor one) to shape future block-building policy 18 . To our knowledge it remains unknown if a single animal possesses distinct 'agencies' inside its brain which are, by definition, shaped by agent-specific reinforcement signals.
Here we use songbirds to test if an animal can compute behavior-specific reinforcement signals and route them to corresponding behavior-producing parts of the motor system. Songbirds sing and, at the same time, navigate (i.e. hop and fly). An objective of the song system is to produce a target sequence of sounds derived from the memory of a tutor song [19][20][21] . An objective of a navigation system is to avoid aversive stimuli 22 . Song learning can be reinforced with distorted auditory feedback (DAF): if a brief broadband sound is played to a bird as it sings a target syllable a certain way, the bird modifies its song to avoid the feedback 23,24 . A song-relevant reinforcement signal thus derives from auditory error [25][26][27][28] . Navigation policy can be reinforced with a bright strobe light: if a strobe is flashed in a specific place, many animals learn to avoid that place 29 . A navigation-relevant reinforcement signal can thus derive from an aversive visual stimulus.
Songbirds also have a discrete vocal motor 'song system', dedicated to song learning and production, that is spatially segregated from other parts of the motor system 30 . Lesions to song system nuclei impair singing but not other behaviors such as grooming, eating, navigation and flight [30][31][32][33] . In addition, neural activity in song system nuclei is strongly correlated with singing and not other motor behaviors [34][35][36][37] .
The ability of songbirds to simultaneously generate distinct behaviors with distinct objectives, together with the existence of a spatially isolated song system, presents a unique opportunity to test different network architectures for multi-objective learning. To determine if vocal and place learning can be shaped by shared, overlapping, or distinct reinforcers, we built a closed-loop system that provides either strobe light or noise feedback contingent on zebra finch spatial position or pitch of a target song syllable ( Figure 1). As shown in Figure 2, distinct learning algorithms require distinct network architectures that make distinct and specific experimental predictions. In a standard RL network with a scalar, global reinforcement signal, both strobe and noise could similarly reinforce both song pattern and place preference (Figure 2A). In a multi-agent RL architecture where each behavior is independently trained by a behavior-specific reinforcement signal, noise could reinforce song pattern but not place preference, and strobe could reinforce place preference but not song pattern ( Figure 2B). Finally, global and target-specific reinforcement signals might coexist: one of the stimuli could drive a global error signal that reinforces both behaviors, while another could specifically target one behavior ( Figure 2C).
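The predictions of the three candidate architectures can be written down as routing tables from stimulus-driven reinforcement signal to motor module. The following is a minimal sketch, not the model used in this study: the names `ROUTING`, `predict` and `distinguishing_tests` are hypothetical, and the table only records whether a signal reaches a module, not the sign of its effect. The "mixed" entry assumes the Figure 2C example in which strobe drives the global signal.

```python
# Hypothetical encoding (an assumption for illustration): for each
# architecture, the set of motor modules reached by each stimulus.
ROUTING = {
    "global": {"noise": {"song", "place"}, "strobe": {"song", "place"}},
    "multi_agent": {"noise": {"song"}, "strobe": {"place"}},
    "mixed": {"noise": {"song"}, "strobe": {"song", "place"}},
}


def predict(architecture, stimulus, behavior):
    """Return True if the architecture predicts the stimulus shapes the behavior."""
    return behavior in ROUTING[architecture][stimulus]


def distinguishing_tests():
    """List (stimulus, behavior) pairs on which the architectures disagree."""
    pairs = [(s, b) for s in ("noise", "strobe") for b in ("song", "place")]
    return [
        (s, b)
        for s, b in pairs
        if len({predict(a, s, b) for a in ROUTING}) > 1
    ]


if __name__ == "__main__":
    for stimulus, behavior in distinguishing_tests():
        outcomes = {a: predict(a, stimulus, behavior) for a in ROUTING}
        print(stimulus, "->", behavior, outcomes)
```

Under this encoding, only the noise-on-place and strobe-on-song experiments separate the architectures, which is why both cross-modal contingencies need to be tested.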
We find that song pattern and place preference are differentially reinforced by sound and strobe light respectively, consistent with a multi-agent network architecture.
Our identification of behavior-specific reinforcement suggests that auditory feedback has privileged access to songbird vocal motor circuits. More generally, our results provide support for animal implementation of a specific network architecture used in machine learning and provide a logic for the spatial segregation of vocal motor circuits that independently evolved in diverse vocal learning species 16,41 .
We next carried out song syllable pitch-contingent auditory feedback. In each bird, we chose a 'target' harmonic syllable amenable to real-time pitch computation (Methods). After at least three days of obtaining baseline target syllable pitch distributions, we implemented pitch-contingent noise feedback by playing the 75 ms noise burst (used in perch preference experiments) during low pitch target syllable variants ( Figure 4). All birds increased the pitch of their target syllable to avoid the noise (change in pitch per day: 8.2±7.3 Hz, p<0.0001, one-sample t test, n=5 birds), consistent with previous studies 23,24,[42][43][44][45] . Thus the same noise that was positively reinforcing to the navigation system was aversive to the vocal motor system.
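The pitch-contingent rule described above can be sketched as a simple threshold test. This is an illustrative sketch only: the function names are hypothetical, and setting the threshold at the median of the baseline distribution is an assumption made here, since the actual threshold rule is specified in the Methods and not reproduced in this section.

```python
import statistics


def pitch_threshold(baseline_pitches_hz):
    """Assumed rule: use the median baseline pitch as the feedback threshold."""
    return statistics.median(baseline_pitches_hz)


def should_play_feedback(pitch_hz, threshold_hz):
    """Trigger feedback on low-pitch renditions of the target syllable."""
    return pitch_hz < threshold_hz


if __name__ == "__main__":
    # Made-up baseline values for illustration; not data from the study.
    baseline = [600.0, 610.0, 620.0, 630.0, 640.0]
    threshold = pitch_threshold(baseline)
    for pitch in (605.0, 635.0):
        print(pitch, should_play_feedback(pitch, threshold))
```

Because feedback is delivered only below the threshold, a bird can escape it by shifting its pitch distribution upward, which is the direction of learning reported here.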
To test if strobe light is aversive to the vocal motor system, we implemented pitch-contingent strobe feedback, exactly as described above except that the 75 ms sound was replaced with the 75 ms strobe stimulus. Birds did not change the pitch of their target syllables to avoid strobe, even when they were given extended periods of time to allow for potentially slower learning (change in pitch per day: 0.2 ± 3.3 Hz, p>0.7, n = 5 birds, 45 days, one-sample t test). Thus, the light stimulus that was aversive to the navigation system was not detectably aversive to the song system.
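The one-sample t test used for the pitch comparisons asks whether the mean daily pitch change differs from zero. A minimal sketch of the test statistic follows, assuming a list of per-day pitch changes in Hz; the function name `one_sample_t` and the example values are hypothetical, and the p-value lookup (which requires the t distribution) is omitted.

```python
import math
import statistics


def one_sample_t(values, mu0=0.0):
    """Return (t statistic, degrees of freedom) for H0: mean(values) == mu0."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    t = (mean - mu0) / (sd / math.sqrt(n))
    return t, n - 1


if __name__ == "__main__":
    # Made-up daily pitch changes in Hz for illustration; not data from the study.
    daily_changes = [1.0, 2.0, 3.0]
    t_stat, dof = one_sample_t(daily_changes)
    print(f"t = {t_stat:.3f}, df = {dof}")
```

A t statistic near zero, as expected for the strobe condition, indicates no detectable drift in pitch, whereas a large positive value corresponds to a consistent upward shift.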
The routing of error signals to distinct parts of the motor system could in principle be gated by behavioral context. For example, the noise sound could be aversive during singing but not during non-singing periods (Figure 6A). To test this, we separately analyzed perch occupancy patterns for singing and non-singing periods during the perch-contingent noise experiments. Birds preferred the 'noisy' perch during both singing and non-singing periods (Figure 6B-D). Similarly, the strobe light might be globally aversive but only during non-singing periods, for example if birds simply did not attend to light during singing (Figure 6E). To test this, we separately analyzed perch occupancy patterns for singing and non-singing periods during the perch-contingent strobe experiments. Birds avoided the strobed perch during both singing and non-singing periods (Figure 6F-H).
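Splitting perch occupancy by singing state can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code used here: the sample format (one `(perch_id, is_singing)` tuple per time bin) and the function name `occupancy_fraction` are hypothetical.

```python
def occupancy_fraction(samples, target_perch, singing):
    """Fraction of time bins in the given singing state spent on target_perch.

    samples: list of (perch_id, is_singing) tuples, one per time bin
    (an assumed format). Returns None if no bins match the singing state.
    """
    in_state = [perch for perch, is_singing in samples if is_singing == singing]
    if not in_state:
        return None
    return sum(perch == target_perch for perch in in_state) / len(in_state)


if __name__ == "__main__":
    # Made-up time bins for illustration; not data from the study.
    samples = [("A", True), ("A", True), ("B", True), ("A", False), ("B", False)]
    print("singing:", occupancy_fraction(samples, "A", singing=True))
    print("non-singing:", occupancy_fraction(samples, "A", singing=False))
```

With two perches, a fraction of 0.5 in both states corresponds to the null hypothesis of equal occupancy, so context gating would appear as a departure from 0.5 in only one of the two states.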

Discussion
Vocal learning poses unique problems because vocalizations are often produced as animals are doing other things. Toddlers babble even as they learn to walk; birds learn to sing even as they hop and fly around an environment. In machines, one way to solve the credit assignment problem associated with multi-objective reinforcement learning is to endow an agent with independent sub-agents which are trained by an equal number of agent-specific reinforcement signals [14][15][16] . In this view, functionally segregated vocal learning circuits could provide a target for vocalization-specific reinforcement that would not contaminate non-vocal behaviors. We report that song and place learning are driven by distinct reinforcers, demonstrating that action-specific reinforcement signals can be computed and precisely routed to corresponding sub-parts of the motor system. These findings also provide a clear counterexample to general purpose models of learning that rely on global reinforcement 4,46 .
Specific evolutionary histories endow animals with genetic constraints on the associativity of actions with outcomes 47 . For example, dogs struggle to learn to yawn for food, trapped cats readily learn to escape a cage by pressing a lever but not by grooming, rats associate sounds and lights with electric shock but not with nauseating food, and pigeons can learn to peck a key for food and take flight to avoid a shock, but not vice versa 1,48-50 . These studies demonstrate that pairing of specific actions with valent consequences in a laboratory setting may be so unnatural that an animal is unable, or 'contraprepared', to associate them 51 . In our experiments, it was likely natural for a bird to navigate away from a threatening stimulus, but not to avoid eliciting it by singing in a different way. Reinforcing vocalizations based on auditory, but not visual, feedback may also be more natural for song imitation. Finally, it may also be natural for a social animal like a zebra finch to navigate towards noisy places and away from quiet ones, as silence may indicate isolation and an associated increased predation risk.
What are the precise neural circuits that connect an aversive light flash to the navigation system to drive avoidance behavior, and a song-like noise to the vocal motor system to change syllable pitch? First, much like the human speech system, the song system is a discrete neural circuit, embedded in an evolutionarily conserved basal ganglia thalamocortical loop 41,52 . Electrophysiology, brain lesion and immediate early gene studies indicate that the song system is dedicated to singing, and not to other behaviors such as grooming, eating or navigation 53 . The anatomical segregation of vocal circuits might create a discrete spatial target for song-specific error signals. For example, we recently identified song-related auditory error signals in dopaminergic neurons of the songbird ventral tegmental area (VTA) 25 . Using antidromic and anatomical methods we discovered that only a tiny fraction (<15%) of VTA dopamine neurons project to the vocal motor system, yet these were the ones that encoded vocal reinforcement signals. The majority of VTA neurons, which project to other parts of the motor system, did not encode any aspect of song or singing-related error. This specific 'song evaluation channel' embedded inside the ascending mesostriatal dopamine system thus targets auditory performance error signals specifically to vocal motor, and not navigation, circuits.
Each perch was equipped with two 5 mm IR beam-break sensors (Adafruit, product ID: 2168). Beam-break data were acquired and analyzed alongside the microphone signal with an Arduino and custom LabVIEW code that communicated with either a speaker or strobe lights. Depending on the contingency, a targeted perch was associated with light or noise feedback.

Animals
Statistical analyses. Statistics were first performed with two-way ANOVAs to test for effect of condition (strobe or no strobe, noise or no noise) and singing state (singing and non-singing), followed up with post hoc one-sample t tests to test whether specific conditions differed from the null hypothesis that perches would be equally occupied and landed on.
Figure 2. (A) A standard RL network where a single reinforcement signal acts globally on multiple parts of the motor system to shape the policy of multiple behaviors. This architecture predicts that both strobe light and noise burst will be aversive to both vocal motor and navigation systems, i.e. will shape both song syllables and place preference. (B) A multi-agent RL network where each behavior is shaped by its own behavior-specific reinforcement signal. This architecture predicts that noise will shape song but not place preference, and that strobe will shape place preference but not song. (C) Global and behavior-specific reinforcement signals might coexist. Here, it is imagined that strobe light drives reinforcement signals that reach all parts of the motor system, whereas DAF-related reinforcement signals target specifically the vocal motor system. This architecture predicts that DAF will shape song but not place preference, and that strobe will shape both song and place preference.
Strobe light was aversive to the navigation system but was apparently unable to access vocal motor circuits.