Songbirds work around computational complexity by learning song vocabulary independently of sequence

While acquiring motor skills, animals transform their plastic motor sequences to match desired targets. However, because both the structure and temporal position of individual gestures are adjustable, the number of possible motor transformations increases exponentially with sequence length. Identifying the optimal transformation towards a given target is therefore a computationally intractable problem. Here we show an evolutionary workaround for reducing the computational complexity of song learning in zebra finches. We prompt juveniles to modify syllable phonology and sequence in a learned song to match a newly introduced target song. Surprisingly, juveniles match each syllable to the most spectrally similar sound in the target, regardless of its temporal position, resulting in unnecessary sequence errors that they later try to correct. Thus, zebra finches prioritize efficient learning of syllable vocabulary, at the cost of inefficient syntax learning. This strategy provides a non-optimal but computationally manageable solution to the task of vocal sequence learning.

[Supplementary figure legend and methods fragments: ...renditions in birds 1 and 2; colors, pitch of syllables A/A_t and C/C_t (t = target pitch); grayscale, Wiener entropy in neighboring syllables (as in Fig. 2b). Sonograms at bottom and top show song at start and end points. Birds corrected pitch errors in syllables A and C before changing syntax (bird 1 did not change syntax at all; bird 2 matched the target syntax). (d) Fraction of pitch error correction (left) and time (days) to reach 50% pitch match (right) in syllables A and C across experimental birds; black, individual birds (lines connect the two syllables in each bird); red, mean ± s.e.m.; n = 8. In all birds except one (bird 5 in (b)), the pitch of both syllable types shifted towards the spectrally closer targets; in bird 5, the pitch of both syllable types (shown in black and green for visual clarity) shifted towards the spectrally farther targets. Bird ages at the end of the experimental period in task 4.2 were 121, 121, 128, 130 and 153 days post hatch. As the sensitive period for song learning in zebra finches ends around day 90-100 post hatch, it is unlikely that birds in this group that matched the 1-semitone targets were on the way to matching the farther targets. (Depicting bird 1 of task 3.) All birds except one (bird 4 in (a)) matched the vacant target with a syllable type initially external to the song motif, usually a call. In bird 4 (a, bottom), the target B+ was not matched; syllable B shifted to B+ in the "wrong" context (namely, after syllable A), but was also performed sparsely after syllable C. Pitch trajectories in this bird are shown separately for renditions after A and after C (middle plots); the right-most plot shows daily pitch means ± s.e.m. for renditions after A. To avoid a large duration difference between source and target playbacks, in this task the source playbacks included 4 motif repetitions (Audio 10).]

Supplementary Notes
Mathematical Supplement

Birds simplify the quadratic problem of computing performance error to a linear assignment problem

Performance error

Associated with a song $S$ we define a family of performance errors $E(\Delta)$ parameterized by a set of unknown parameters grouped in the matrix $\Delta$. The errors are composed of an overall phonology (spectral) error $E_\Phi(\Delta)$ and a syntax (sequence) error $E_\#(\Delta)$:

$$E(\Delta) = E_\Phi(\Delta) + c\, E_\#(\Delta). \qquad (1)$$

The parameter $c$ represents an unknown tradeoff between phonology and syntax errors.

The overall phonology error $E_\Phi(\Delta)$ between song $S$ and target $T$ is defined as a weighted sum of the local phonological errors $e_{i,j}$ between target element $T_i$ and song element $S_j$,

$$E_\Phi(\Delta) = \sum_{i,j} \delta_{i,j}\, e_{i,j},$$

where the assignment weights $\delta_{i,j} \ge 0$ form the matrix $\Delta$. Target assignments (which syllables are assigned to a specific target) correspond to rows of $\Delta$, and syllable assignments (which targets are assigned to a specific syllable) correspond to columns of $\Delta$. To illustrate this notation, a bird that does not assign a phonology error to syllable $S_2$ entails $\delta_{i,2} = 0$ for all $i$; and a bird that compares $S_2$ with the first target (syllable) $T_1$ entails $\delta_{1,2} = 1$. If there is local chaining of assignments, then a bird that compares $S_1$ to $T_3$ will also compare $S_2$ to $T_4$; in terms of $\Delta$, chaining of assignments means that the condition $\delta_{i,j} = 1$ implies $\delta_{i+1,j+1} = 1$ with high probability.
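As a concrete illustration of this notation (our own sketch, with made-up numbers, not part of the original analysis), the following Python fragment evaluates the weighted sum $E_\Phi(\Delta) = \sum_{i,j}\delta_{i,j}\, e_{i,j}$ for a small local-error matrix and a binary assignment matrix of the kind described above:

```python
import numpy as np

# Local phonology errors e[i, j]: rows index target elements T_i,
# columns index song elements S_j (hypothetical numbers, e.g. squared
# pitch differences in semitones^2).
e = np.array([
    [0.1, 4.0, 9.0],   # errors of S_1..S_3 against target T_1
    [4.0, 0.2, 1.0],   # ... against target T_2
    [9.0, 1.0, 0.3],   # ... against target T_3
])

# Binary assignment matrix Delta: delta[i, j] = 1 means song element S_j
# is compared with target element T_i. Here the bird compares S_2 with
# T_1 (delta[0, 1] = 1) and assigns no phonology error to S_3
# (delta[i, 2] = 0 for all i).
delta = np.array([
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
])

# Overall phonology error: weighted sum of local errors, sum_ij delta_ij * e_ij.
E_phi = np.sum(delta * e)
print(E_phi)  # 4.0 + 4.0 = 8.0
```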

By virtue of the assignment weights $\delta_{i,j}$, the phonology errors may parameterize any imaginable comparison between song and target.

The syntax error $E_\#(\Delta)$ quantifies the amount of resequencing a bird must perform in order to bring its song elements into global alignment with the template. This error quantifies the new transitions to be created among existing song elements. Because of the stepwise acquisition of syntax in songbirds^1, it makes sense to attribute to $E_\#(\Delta)$ a cost proportional to the number of new transitions to be generated.
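One plausible way to operationalize this transition count (a sketch of ours, not the paper's Equation (2); the bigram rule and the circular handling of motif repetition are assumptions consistent with the ABC-to-ACB example discussed below):

```python
# A minimal sketch of a syntax cost that counts new transitions: bigrams
# required by the target ordering of the bird's own syllables that are
# absent from its current song. Circular boundaries reflect motifs being
# repeated several times within a song bout.

def bigrams(seq):
    """Set of adjacent pairs in a circular sequence (motif repeated in a bout)."""
    return {(seq[k], seq[(k + 1) % len(seq)]) for k in range(len(seq))}

def syntax_error(current, required):
    """Number of transitions in `required` that do not yet exist in `current`."""
    return len(bigrams(required) - bigrams(current))

# Example from the text: a bird singing ABC must re-sequence to ACB.
print(syntax_error("ABC", "ACB"))  # 3 new bigrams: AC, CB, and BA (circular)
```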

In the case of binary and one-to-one syllable-target assignments ($\delta_{i,j} \in \{0,1\}$, with at most one nonzero entry per row and per column of $\Delta$), the syntax error counts the number of new transitions implied by the assignment:

$$E_\#(\Delta) = \#\{\text{new transitions to be created among existing song elements}\}, \qquad (2)$$

where transitions are counted with circular boundary conditions, arising from birds' tendency to repeat motifs several times in a song bout. As can be seen, Equation (2) depends on pairs of assignments (whether the targets assigned to neighboring song elements are themselves neighbors in the template), and so is quadratic rather than linear in the entries of $\Delta$.

Equations (1) and (2) model the performance error attributed to any given song. The assignment matrix is not specified therein; therefore, these equations can be thought of as the space of all possible strategies for estimating phonology and syntax errors between song and template (Fig. 1a). Our experiments were designed to resolve birds' strategy in dealing with phonology and syntax errors and the on/off-diagonal structure of the assignment matrix.

Next we describe the process of song learning. In terms of Equation (1), song learning consists of reducing the performance error over development. In the process of song learning, $\Delta$ is either fixed or it evolves in time, possibly giving rise to very complex learning trajectories. If birds want to perform song learning optimally, they will try to compute the initially optimal assignments $\Delta^*$, which are the ones that achieve minimal initial performance error,

$$E^* = \min_{\Delta} E(\Delta), \qquad (3)$$

$$\Delta^* = \arg\min_{\Delta} E(\Delta). \qquad (4)$$

This optimal choice of assignment in Equation (4) amounts to a quadratic assignment problem, whose cost grows rapidly with sequence length and which is computationally intractable in general.

Whatever birds do, we imagined they must be facing a tradeoff between phonology and syntax errors, illustrated by the following example: Consider two birds that need to change their songs from syllable sequence ABC ABC to ACB ACB. The first bird forms 3 new bigrams (AC, CB, and BA) among the existing syllables, which would imply that its initial syntax error is proportional to 3, while its phonology error is zero. The second bird instead keeps its sequence and corrects the phonology of its syllables to match the reordered targets; because this second bird learns the song sequence intrinsically by globally aligning the song to the template ($\Delta$ is the identity matrix), that latter bird needs only to correct phonology errors to also automatically learn the correct syntax. However, global alignment may not be an ideal strategy because it can entail a high phonology cost, which is absent in the first bird. As a tradeoff we imagine that birds may chain alignments locally rather than globally. Such chaining is for example suggested by the sequence requirement for correct acoustic models in white-crowned sparrows^4. A toy version of this tradeoff is sketched below.
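The following sketch (our illustration; the pitch values, the squared-difference phonology error, and the tradeoff constant c are made-up assumptions) enumerates all one-to-one assignments for the ABC-to-ACB example, scores each with Equation (1), and, for comparison, solves the phonology-only problem of Equation (5) below with an off-the-shelf linear assignment solver (a Hungarian-type method in SciPy):

```python
import numpy as np
from itertools import permutations
from scipy.optimize import linear_sum_assignment

# Hypothetical mean pitches (semitones) of the bird's syllables A, B, C and
# of the targets, which demand the order A C B.
song_order   = ["A", "B", "C"]
song_pitch   = {"A": 0.0, "B": 5.0, "C": 9.0}
target_order = ["A", "C", "B"]
target_pitch = {"A": 0.0, "C": 9.0, "B": 5.0}

def bigrams(seq):
    return {(seq[k], seq[(k + 1) % len(seq)]) for k in range(len(seq))}

c = 1.0  # hypothetical phonology/syntax tradeoff constant

best = None
for perm in permutations(range(3)):  # perm[i] = index of the syllable assigned to target i
    # Phonology error: squared pitch difference between each target and its assigned syllable.
    e_phi = sum((target_pitch[target_order[i]] - song_pitch[song_order[perm[i]]]) ** 2
                for i in range(3))
    # Syntax error: new transitions needed so that assigned syllables appear in target order.
    required = [song_order[perm[i]] for i in range(3)]
    e_syn = len(bigrams(required) - bigrams(song_order))
    E = e_phi + c * e_syn
    print(perm, e_phi, e_syn, E)
    if best is None or E < best[1]:
        best = (perm, E)
print("optimal assignment:", best)

# Phonology-only problem (Equation (5) below): a linear assignment problem,
# solvable e.g. with the Hungarian method.
e = np.array([[(target_pitch[t] - song_pitch[s]) ** 2 for s in song_order]
              for t in target_order])
rows, cols = linear_sum_assignment(e)
print("phonology-only assignment:", list(zip(rows, cols)), "cost:", e[rows, cols].sum())
```

For c = 1, the minimum is attained by the phonology-matched assignment, which keeps all phonology errors at zero at the price of three new transitions. Note also that the number of candidate one-to-one assignments grows factorially with the number of syllables, which is what makes the full problem intractable for longer sequences.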

Experimental characterization of $\Delta$

Our experiments provide four constraints, C1 to C4, on error assignments in zebra finches […]. Under these constraints, minimizing the performance error in Equation (1) reduces to minimizing the overall phonology error alone over binary one-to-one assignments,

$$\Delta^* = \arg\min_{\Delta} \sum_{i,j} \delta_{i,j}\, e_{i,j}. \qquad (5)$$

Particularly interesting is that the optimization in Equation (5) does not depend on the tradeoff constant $c$, implying absence of a tradeoff. The optimization in Equation (5) is known as the linear assignment problem, which can be conveniently solved using for example the Hungarian method^5.

In the context of natural language processing, the solution to Equation (5) (the minimum in Equation (5) rather than its argument) is also known as the word mover's distance^6, which represents the distance between two text documents. In that analogy, the words of one document are matched to the words of another according to their pairwise distances, in analogy to the musical chairs competition we find. The word mover's distance outperforms other approaches on many benchmark document categorization tasks^{6,7}.

The fact that birds choose the assignment of minimal overall phonology error, irrespective of syntax, demonstrates a radical way of dealing with the intractability of the general assignment problem. Namely, rather than getting entangled with high complexity and large cognitive demand, birds decide to solve a much simpler, tractable problem and do this remarkably well.

The surprising implication is that birds do not consider the cost of resequencing at all when correcting phonology errors. Phonology errors seem to be associated with a high cost, perhaps reflecting the amount of effort required to change syllable pitch. Counterintuitively, birds behave in this process as if there were no resequencing cost at all, despite the fact that this cost is seemingly very high, given that most birds try to re-sequence their syllable strings but only few succeed. Namely, we found that many birds do not reach the global performance error minimum in Equation (3) but get stuck somewhere on the way, where some syntax errors but usually no phonology errors remain.

In summary, song learning is a modular, two-fold process. In a first process, birds choose assignments $\Delta^*$ by solving a linear problem based on their vocal repertoire but not on their song sequence. In a second process, birds reduce the phonology errors defined by these correspondences and, independently and more slowly, also reduce the resulting syntax error.

Sub-syllabic notes

What is the smallest song unit to which our formalism applies? We deliberately called $S_j$ a song element and $T_i$ a target element, implying that these elements do not necessarily have to represent entire song syllables but could also represent sub-syllabic notes. In the following we discuss this possibility.

In our treatment of the song learning problem, we implicitly assumed that birds compute the phonological error of a syllable by integrating over the errors in its constituent notes. Essentially, we assumed that birds compute the error of a syllable by globally aligning its notes with those of a template syllable. However, we have no evidence for this mini-version of global alignment. Thus, it remains to be explored whether birds can assign one of their syllable notes either to a note in a different syllable of the template or to a note in a different position within the same template syllable.

Although it will not be possible to resolve this issue without further experimenting, we imagine that our discovered assignment strategy cannot apply to ever smaller song units.
Namely, at some point, there must be an overload of short-term memory arising from all these pairwise comparisons between song and template elements. It is therefore likely that the assignment capabilities of zebra finches are limited to the syllable level and do not generalize to smaller song units below that level.

How to match syllable vocabulary using the expectation maximization (EM) algorithm and Gaussian mixture models

Song learning can be considered a density estimation problem in which the unknown parameters of the developing song syllables must be identified such that good matches with the sensory targets are achieved. In a Gaussian mixture model, the observable data points $T_i$ (the target pitches, $i = 1, \ldots, m$) are assumed to be generated by a mixture of $n$ Gaussian components, one for each song syllable $j$, where $S_j$ is the mean pitch of that syllable. It follows that the likelihood density that syllable $j$ will produce the target pitch $T_i$ is

$$P(T_i \mid j) = \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\!\left(-\frac{(T_i - S_j)^2}{2\sigma^2}\right). \qquad (6)$$

Here we assume that $\sigma^2$ is the constant pitch variance of syllable $j$. The closer the mean pitch $S_j$ is to the target pitch $T_i$, the more likely the production of syllable $j$ will match the target $i$.
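For concreteness, a small numerical illustration of Equation (6) (our sketch; pitch values and σ are made up):

```python
import numpy as np

# Hypothetical mean syllable pitches S_j and target pitches T_i (semitones).
S = np.array([0.0, 5.0, 9.0])   # syllables j = 1..3
T = np.array([1.0, 8.0, 4.0])   # targets  i = 1..3
sigma = 1.0                      # assumed constant pitch s.d.

# Equation (6): Gaussian likelihood that syllable j produces target pitch T_i.
lik = np.exp(-(T[:, None] - S[None, :]) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# The syllable whose mean pitch is closest to a target is the most likely to match it.
print(lik.argmax(axis=1))  # -> [0 2 1]
```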

The goal of the EM algorithm is to identify the set of mean syllable pitches $S_j$ that maximize the total probability $P$ (or its logarithm) of reproducing all the target pitches $T_i$. The following function $L$ is usually maximized by the EM algorithm:

$$L = \sum_{i=1}^{m} \log P(T_i), \qquad (7)$$

where $P(T_i) = \sum_{j=1}^{n} p_j\, P(T_i \mid j)$ is the probability that target $i$ is produced by any of the $n$ syllables, $p_j$ being the prior probability of singing syllable $j$. In the following, we assume that all syllables have identical prior probability, $p_j = 1/n$.

The EM algorithm alternates between two steps. In the expectation (E) step, the posterior probability $P_{j|i}$ that target $i$ is matched by syllable $j$ is computed from the current mean pitches:

$$P_{j|i} = \frac{p_j\, P(T_i \mid j)}{\sum_{k=1}^{n} p_k\, P(T_i \mid k)}. \qquad (8)$$

In the maximization (M) step, each mean pitch is updated to the posterior-weighted average of the target pitches:

$$S_j = \frac{\sum_{i=1}^{m} P_{j|i}\, T_i}{\sum_{i=1}^{m} P_{j|i}}. \qquad (9)$$

The EM algorithm operating on Gaussian mixture models just outlined exhibits several similarities with birds' strategy of minimizing performance error. Namely, if we place the Gaussian model in Equation (6) into the function $L$ to be maximized in Equation (7), then the quantity that the EM algorithm effectively maximizes can be written as

$$-\frac{1}{2\sigma^2} \sum_{i,j} P_{j|i}\, (T_i - S_j)^2 + \mathrm{const},$$

with the constant being without relevance for the maximization (because $\sigma$ is assumed to be constant). The maximization in Equation (7) is therefore identical with the minimization in Equation (5), provided we interpret the posterior probabilities $P_{j|i}$ as assignment weights $\delta_{i,j}$ and the local phonology errors as squared pitch differences, $e_{i,j} = (T_i - S_j)^2$.
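The following self-contained sketch (our illustration; initial pitches, σ, and the iteration count are made up) iterates Equations (8) and (9) for n = 3 syllables and m = 3 targets, showing the posteriors approaching a binary assignment when each target has a single spectrally close syllable:

```python
import numpy as np

# Hypothetical initial mean syllable pitches S_j and target pitches T_i (semitones).
S = np.array([0.0, 5.0, 9.0])      # n = 3 syllables
T = np.array([1.0, 4.0, 8.0])      # m = 3 targets
sigma = 1.0                         # assumed constant pitch s.d.
p = np.full(len(S), 1.0 / len(S))   # equal priors p_j = 1/n

for _ in range(20):
    # E step, Equation (8): posterior that target i is produced by syllable j
    # (the Gaussian normalization constant and the equal priors cancel in the ratio).
    lik = np.exp(-(T[:, None] - S[None, :]) ** 2 / (2 * sigma ** 2))  # shape (m, n)
    post = (p * lik) / (p * lik).sum(axis=1, keepdims=True)           # P_{j|i}
    # M step, Equation (9): posterior-weighted mean of target pitches.
    S = (post * T[:, None]).sum(axis=0) / post.sum(axis=0)

print(np.round(post, 3))  # near-binary assignment weights delta_{i,j}
print(np.round(S, 2))     # mean pitches drawn onto the targets: ~[1., 4., 8.]
```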

In the EM algorithm, the posterior probabilities $P_{j|i}$ are not constrained to be binary variables that take values of either zero or one. Nevertheless, the E and M steps in Equations (8) and (9) achieve, to a good approximation, the musical chairs competition we found.

To see this, consider the case in which for a given target there is only a single syllable with similar pitch (the likelihood $P(T_i \mid j)$ is large only for a single syllable $j$). According to Equation (8), $P_{j|i}$ is close to 1 for that best-matching syllable $j$ and close to 0 for the other, non-matching syllables. This means that Equation (8) effectively implements a near-binary, winner-take-all assignment of targets to syllables. In a similar way, the normalization in Equation (9) implements a soft winner-takes-all mechanism. Namely, if for a given syllable $j$ one of the posterior probabilities (assignment weights) $P_{j|i}$ is large and the others are very small, then by the weighted sum in Equation (9), that syllable's pitch is drawn towards the pitch of the assigned target.

To model the slow and gradual song development seen in birds, we simulated a finely discretized version of the EM algorithm in Equations (8) and (9). Because birds change pitch continuously and slowly, unlike the discrete jumps in Equations (8) and (9), we implemented a slow dynamical system in which we replaced the possibly large and discontinuous posterior probability and mean pitch changes in Equations (8) and (9) by gradual iterative processes (iterating over renditions $t$):

1) We replaced the mean pitch $S_j$ defined in Equation (9) by a gradually updated mean pitch $S_j^t$ that enters the likelihood in Equation (6); 2) likewise, the posterior probability in Equation (8) was replaced by a gradually updated $P_{j|i}^t$.

We simulated birds that produced three syllables ($n = 3$) and had to match three targets ($m = 3$). Therefore, we iterated back and forth the above expressions for $S_j^t$ and $P_{j|i}^t$.
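A minimal sketch of one possible slow discretization (our assumption of simple exponential smoothing with rates α and α₂; the exact update rules used in the simulations are not reproduced above, and the pitch values are made up):

```python
import numpy as np

# A sketch of a slow, rendition-by-rendition relaxation of Equations (8) and (9),
# assuming exponential smoothing with integration rates alpha and alpha2
# (an illustrative assumption, not the paper's exact dynamical system).
S = np.array([0.0, 5.0, 9.0])      # current mean pitches S_j^t (hypothetical)
T = np.array([1.0, 4.0, 8.0])      # target pitches T_i (hypothetical)
sigma, alpha, alpha2 = 1.0, 0.05, 0.2
post = np.full((len(T), len(S)), 1.0 / len(S))  # P_{j|i}^t, initially uniform

for t in range(400):                # renditions
    lik = np.exp(-(T[:, None] - S[None, :]) ** 2 / (2 * sigma ** 2))
    target_post = lik / lik.sum(axis=1, keepdims=True)            # Equation (8), equal priors
    post += alpha2 * (target_post - post)                          # gradual posterior update
    target_S = (post * T[:, None]).sum(axis=0) / post.sum(axis=0)  # Equation (9)
    S += alpha * (target_S - S)                                    # gradual pitch shift

print(np.round(S, 2))  # pitches drift slowly towards their assigned targets
```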

In simulations, we realized that the musical chairs competition is not well captured by these equations: a call that did not match any target tended to converge to a nearby target regardless of whether the latter was occupied or not (Supplementary Fig. 5a). To remedy this discrepancy, we hardened the musical chairs competition by adding the constraint that, at each sampled likelihood, $b_{ij}$ could be 1 for at most one vocalization, implying that matched targets could not attract any unassigned syllables or calls (Supplementary Fig. 5b). We achieved this constraint by setting $b_{ij}$ to zero for all $j$ whenever, for a given target $i$, two or more $b_{ij}$'s were sampled to be one (see the sketch below); simulation results are shown in Supplementary Fig. 5b-d.

The interesting property of these equations is that they in essence capture the observations without requiring any parameter fitting other than $\sigma_b$. The latter was set to exceed the pitch standard deviation $\sigma$, which allowed for medium-size pitch shifts of two semitones. The integration rates $\alpha$ and $\alpha_2$ dictate the speed of pitch shifting; these parameters were set to yield smooth-looking transitions.

In summary, the competition we find in birds is harder than that in the standard EM algorithm, in the sense that the EM algorithm brings all Gaussian models (syllables) to observables (targets), even if there are just 2 targets and 3 models, very unlike birds that bring only one syllable or call to each target, in a presumed attempt to limit the syllable resources used. In a sense, birds are more efficient than the traditional EM algorithm, similarly to ongoing machine-learning approaches for restricting the effective number of model parameters to prevent overfitting, such as sparse priors^{8,9}, Bayesian learning, and Dirichlet processes^{10,11}. We believe that greedy and competitive error assignment during vocal learning illustrates the importance of minimizing used resources.
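The following fragment (our illustration; the Bernoulli sampling of the binary indicators $b_{ij}$ from the likelihoods is an assumption, as the sampling procedure is not reproduced above) shows only the hardening step: whenever a target samples a 1 for two or more vocalizations, all of its $b_{ij}$'s are reset to zero, so that an already matched target cannot attract additional syllables or calls.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical likelihoods of each vocalization j matching each target i.
lik = np.array([
    [0.9, 0.7, 0.1],   # target 1 is spectrally close to vocalizations 1 and 2
    [0.1, 0.2, 0.8],   # target 2 is close to vocalization 3
])

# Sample binary match indicators b_ij (assumed Bernoulli draws from the likelihoods).
b = (rng.random(lik.shape) < lik).astype(int)

# Hardened musical chairs: if a target sampled b_ij = 1 for two or more
# vocalizations, set b_ij = 0 for all j, so matched targets attract no extras.
b[b.sum(axis=1) >= 2, :] = 0

print(b)  # each target now attracts at most one vocalization in this rendition
```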