Optimal features for auditory categorization

Humans and vocal animals use vocalizations to communicate with members of their species. A necessary function of auditory perception is to generalize across the high variability inherent in vocalization production and classify vocalizations into behaviorally distinct categories (‘words’ or ‘call types’). Here, we demonstrate that detecting mid-level features in calls achieves production-invariant classification. Starting from randomly chosen marmoset call features, we use a greedy search algorithm to determine the most informative and least redundant features necessary for call classification. High classification performance is achieved using only 10–20 features per call type. Predictions of tuning properties of putative feature-selective neurons accurately match some observed auditory cortical responses. This feature-based approach also succeeds for call categorization in other species, and for other complex classification tasks such as caller identification. Our results suggest that high-level neural representations of sounds are based on task-dependent features optimized for specific computational goals.


Here, we demonstrate using an information-theoretic approach that production-invariant classification of calls can be achieved by detecting mid-level acoustic features. Starting from randomly chosen marmoset call features, we used a greedy search algorithm to determine the most informative and least redundant set of features necessary for call classification. Call classification at >95% accuracy could be accomplished using only 10-20 features per call type. Most importantly, predictions of the tuning properties of putative neurons selective for such features accurately matched some previously observed responses of superficial-layer neurons in primary auditory cortex. Such a feature-based approach succeeded in categorizing calls of other species such as guinea pigs and macaque monkeys, and could also solve other complex classification tasks such as caller identification. Our results suggest that high-level neural representations of sounds are based on task-dependent features optimized for specific computational goals.
Human speech recognition is a highly robust behavior, showing tolerance to variations in prosody, stress, accent, and pitch. For example, speech features such as formant frequencies exhibit large within- and between-speaker variations 1,2, arising from production mechanisms (production variability). To achieve accurate speech recognition, the auditory system must generalize across these variations. This challenge is not uniquely human. Animals produce species-specific vocalizations ('calls') with large within- and between-caller variability 3, and must classify these calls into distinct categories to produce appropriate behaviors. For example, in common marmosets (Callithrix jacchus), a highly vocal New World primate species, critical behaviors such as finding other marmosets when isolated depend on accurate extraction of call-type and caller information 4-8. Similar to human speech, marmoset call categories overlap in their long-term spectra (Fig. 1A), precluding the possibility that calls can be classified based on spectral content alone and requiring selectivity for fine spectrotemporal features. At the same time, marmoset calls also show considerable production variability along a variety of acoustic parameters 8. For example, 'twitter' calls produced by different marmosets vary in parameters such as dominant frequency, length, inter-phrase interval, and harmonic ratio (Fig. 1). Tolerance to large variations in spectrotemporal features within each call type is thus necessary to generalize across this variability. Therefore, there is a simultaneous requirement for fine and broad selectivity for production-invariant call classification. The present study explores how the auditory system resolves these conflicting requirements.

Fig. 1 (caption excerpt): Histograms are overall parameter distributions, split into the training (blue) and testing (red) sets. These data show the large production variability captured by the training and test data sets, over which the model must generalize. No systematic bias is evident between calls used for model training and testing.
This problem of requiring both fine and tolerant feature tuning, necessitated by high variability amongst members belonging to a category, is not unique to the auditory domain. For example, in visual perception, object categories such as faces also possess a high degree of intrinsic variability 9-12. To classify faces from other objects, using an exemplar face as a 'template' typically fails because it does not generalize across within-class variability 12. Face detection algorithms use combinations of mid-level features, such as regions with specific contrast relationships 13,14, or combinations of face parts 12, to accomplish classification. Of these algorithms, the one proposed by Ullman et al. 12 is especially interesting because of its potential to generalize to other classification tasks across sensory modalities. In this algorithm, starting from a set of random fragments of faces, the authors used 'greedy' search to extract the most informative fragments that were highly conserved across all faces despite within-class variability. Post-hoc analyses revealed that these fragments were 'mid-level', i.e., they typically contained combinations of face parts, such as eyes and a nose. The features identified using this algorithm were consistent with some physiological observations, for example at the level of BOLD responses 15. While the differences between visual and auditory processing are vast, these results inspired us to ask whether a similar concept, sound categorization using combinations of acoustic features, could be implemented by the auditory system.
The behavioral salience of calls for marmosets 4-8, and the increasing resources allocated to the processing of calls along the cortical processing hierarchy 17, motivated our choice of marmoset call categorization as a model task.

Fig. 2 (caption excerpt): (B) Schematic of initial random feature generation for a twitter (within-class) versus other calls (outside-class) categorization task. Waveforms (top) were converted to cochleagrams (middle). Random initial features were picked from twitter cochleagrams (for example, magenta box). The maximum value of the normalized cross-correlation function between each call (within-class, blue; outside-class, green) and each random feature was taken to be the 'response' of a feature to a call. (C) Distributions (top) of a feature's responses to 500 within-class (blue) and 500 outside-class (green) calls. The mutual information (bottom) of a feature, computed as a function of a parametrically varied threshold. The dotted line, corresponding to maximal mutual information, marks each feature's optimal threshold. A feature's response must exceed this optimal threshold for the feature to be considered present within a call.

Results
Features of intermediate lengths and complexities are more effective for call classification

We start with the premise that the first step in call processing is the categorization of calls into discrete call types, generalizing across the production variability that is inherent to calls (Fig. 1). We first generated 6000 random initial features from the cochleagrams of 500 twitter calls emitted by 8 marmosets ('training' set, blue histograms in Fig. 1). For the purposes of this study, a 'feature' is a randomly selected rectangular segment of the cochleagram, corresponding to the spatiotemporal activity pattern of a subset of auditory nerve fibers within a specified time window. For each random feature, we determined an optimal threshold at which its utility for classifying twitters from other calls was maximized. The merit of each feature was taken to be the mutual information value at this optimal threshold, in bits (Fig. 2).
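As described in Fig. 2, a feature's 'response' to a call is the maximum of the normalized cross-correlation (NCC) between the feature and the call cochleagram, restricted to the frequency channels the feature spans. A minimal Python sketch of this computation (not the authors' implementation; the helper name `feature_response` and the array layout are our assumptions):

```python
import numpy as np

def feature_response(call_cgram, feature, f_lo):
    """Max normalized cross-correlation of a rectangular feature with a call
    cochleagram. The feature is slid along time only, within the frequency
    channels it spans (rows f_lo : f_lo + feature.shape[0])."""
    n_f, n_t = feature.shape
    band = call_cgram[f_lo:f_lo + n_f, :]      # matching frequency channels
    f = feature - feature.mean()
    f_norm = np.linalg.norm(f)
    best = 0.0
    for t in range(band.shape[1] - n_t + 1):
        seg = band[:, t:t + n_t]
        s = seg - seg.mean()
        denom = f_norm * np.linalg.norm(s)
        if denom > 0:
            best = max(best, float((f * s).sum() / denom))
    return best  # in [-1, 1]
```

A feature is then considered 'detected' in a call when this response exceeds the feature's optimal threshold.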
In Supplementary Fig. 2, we show that features of intermediate lengths and bandwidths tended to carry the highest merit for classification.

Call categorization can be accomplished using a handful of optimal features
Because we generated the initial features at random, many of them have low merit, and many are similar. Therefore, the set of optimal features for classification is expected to be much smaller than this initial set. To determine the set of optimal features that together maximize classification performance, we used a greedy-search algorithm (see Methods). In Figure 3, magenta boxes outline the top 5 MIFs that are optimal for each of these classification tasks (the first five MIFs in Fig. 4A). The optimal features that we arrive at are mostly intuitive; for example, the top MIFs for classifying twitters detect the frequency contour of individual twitter phrases and the repetitive nature of the twitter call. In some cases, features seemed counter-intuitive; for example, the second MIF for trill classification seems to detect 'empty' regions of the cochleagram. In this theoretical framework, the lack of energy at those frequencies is also informative about the presence of a trill.
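The greedy selection of MIFs can be sketched as a toy forward-selection loop. This is a simplification of the paper's procedure: here a candidate feature is added when it most improves weighted-vote accuracy on the training set, whereas the paper uses pairwise maximization of information; the function name and data layout are our own assumptions.

```python
import numpy as np

def greedy_select(detections, labels, weights, max_feats=20):
    """Greedy forward selection of informative features.
    detections: (n_feats, n_calls) bool, True where feature i crossed its
    optimal threshold in call j. labels: (n_calls,) bool, True = within-class.
    weights: (n_feats,) log-likelihood ratios."""
    chosen, best_acc = [], 0.0
    for _ in range(max_feats):
        best_gain, best_i = 0.0, None
        for i in range(detections.shape[0]):
            if i in chosen:
                continue
            trial = chosen + [i]
            # weighted vote over the trial feature set, per call
            evidence = weights[trial] @ detections[trial]
            acc = np.mean((evidence > 0) == labels)
            if acc - best_acc > best_gain:
                best_gain, best_i = acc - best_acc, i
        if best_i is None:          # no remaining feature improves accuracy
            break
        chosen.append(best_i)
        best_acc += best_gain
    return chosen
```

Because redundant features add no accuracy once a similar feature is chosen, the loop naturally returns a small, non-redundant set.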

In Figure 4A, we show the pairwise information added by each MIF, the merits, and the weights of the top 10 MIFs for these classification tasks.

To validate our model and to test the effectiveness of using only the MIFs for classifying call types, we used a novel set of calls consisting of 500 new within-category and 500 new outside-category calls drawn from the same 8 marmosets. This 'test' call set did not significantly differ from the training set along any of the characterized parameters (red histograms in Fig. 1). We conceptualized each MIF as a simulated template-matching neuron whose 'response' to a stimulus was defined as the maximum value of the normalized cross-correlation (NCC) function. This simulated MIF-selective neuron 'spiked' whenever its response crossed its optimal threshold, i.e., when an MIF was detected in the stimulus. In Fig. 5, we plot the spike rasters of simulated MIF-selective neurons for twitter, phee, and trill (top 10 MIFs shown), responding to a train of randomly selected calls from the novel test set. Each spike was weighted by the log-likelihood ratio of the MIF, and the weighted sum of responses in 50 ms time bins was taken as the evidence in support of the presence of a particular call type. Although occasional false positives and misses occurred, over the set of MIFs, the evidence in support of the correct call type was almost always the highest. Therefore, production-invariant call categorization is a two-step process: first, MIFs are detected in the stimuli, and then each feature is weighted by its log-likelihood ratio to provide evidence for a call type.

We quantified the performance of the entire set of MIFs (n = 11, 16, and 20 for twitter, phee, and trill, respectively) for the classification of novel calls by parametrically varying an overall evidence threshold and computing the hit rate (true positives) and false alarm rate (false positives) at each threshold.
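The evidence-accumulation step (weight each MIF detection by its log-likelihood ratio, sum in 50 ms bins) might look like this in outline (a sketch; the function name and data layout are our assumptions):

```python
import numpy as np

def call_type_evidence(spike_times, llr_weights, duration, bin_ms=50):
    """Evidence trace for one call type: each simulated MIF-selective
    neuron's detection times (in seconds) are binned, weighted by that
    MIF's log-likelihood ratio, and summed across MIFs in 50 ms bins."""
    bin_s = bin_ms / 1000.0
    n_bins = int(np.ceil(duration / bin_s))
    edges = bin_s * np.arange(n_bins + 1)
    evidence = np.zeros(n_bins)
    for times, w in zip(spike_times, llr_weights):
        counts, _ = np.histogram(times, bins=edges)
        evidence += w * counts
    return evidence
```

Running this once per call type and selecting the type with the highest summed evidence implements the two-step readout described above.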
From these data, we plotted receiver operating characteristic (ROC) curves (Fig. 6A). In these plots, the diagonal corresponds to chance, and perfect performance corresponds to the upper left corner.
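The ROC construction, sweeping a global evidence threshold and recording hit and false-alarm rates, can be sketched as follows (hypothetical helper; the threshold grid resolution is our assumption):

```python
import numpy as np

def roc_points(evidence, labels, n_thresh=101):
    """Hit and false-alarm rates as an overall evidence threshold is swept.
    evidence: (n_calls,) summed MIF evidence per call;
    labels: (n_calls,) bool, True = within-class."""
    evidence = np.asarray(evidence, float)
    labels = np.asarray(labels, bool)
    thresholds = np.linspace(evidence.min(), evidence.max(), n_thresh)
    hits = np.array([(evidence[labels] > t).mean() for t in thresholds])
    fas = np.array([(evidence[~labels] > t).mean() for t in thresholds])
    return fas, hits
```

Plotting `hits` against `fas` traces the ROC curve; a classifier at chance lies along the diagonal.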
The MIFs achieved >95% classification performance for all call types with very low false alarm rates.

Fig. 5 (caption excerpt): Spike rasters of simulated MIF-selective neurons for twitter (top, blue), phee (middle, red), and trill (bottom, yellow). Each dot represents spiking of a putative MIF-selective neuron (i.e., when the response of the MIF exceeds its optimal threshold). (C) The evidence for the presence of a particular call type, defined as the normalized sum of the firing rates of all MIF-selective neurons, weighted by their log-likelihood ratios. Over the duration of each call, the call type with the most evidence is considered to be present. Occasional false alarms are usually outweighed by true positive MIF detections.

Control simulations
First, we ensured that our selection of 6000 initial random features adequately sampled stimulus space. To do so, we iteratively selected sets of MIFs using our greedy search algorithm from initial random sets from which previously picked MIFs were excluded. We found that distinct sets of MIFs with similar classification performance could be selected in successive iterations (Supplementary Fig. 3). This suggests that our initial random feature set indeed contained several redundant MIF-like features, confirming the adequacy of our initial sampling.

Second, in order to determine the contributions of various model assumptions and parameters, we repeated this process of random initial feature generation, threshold optimization, and MIF selection in different scenarios. To better visualize these differences, we used detection-error tradeoff curves (Fig. 6B), where perfect performance is the lower left corner, and compared each scenario against the performance of the default model.

Fig. 6B (caption excerpt): Performance when using small features only (<100 ms and <1 oct.) or excluding small features, and when using large features only (>250 ms and >2 oct.) or excluding large features. For trills, some of these conditions fall outside the range of the axes. The bottom row shows performance when the bandwidth and duration of features used for classification were independently varied. Note that because of the short duration of trill calls, we did not test the effect of using only long-duration features.
In this study, we used greedy search and pairwise maximization of information to find optimal features. However, it is possible that the greedy search algorithm does not find an optimal solution because of its inability to overcome local maxima. We do not think this is the case because: 1) the model performs at high accuracy levels, leaving little room for significant improvements; 2) we could arrive at similar sets of MIFs and achieve similar performance levels from different initial feature sets, specifically when highly informative features were excluded (Supp. Fig. 3); and 3) we could match or outperform other machine-learning-based algorithms for marmoset call classification 19.

Therefore, the implemented greedy search algorithm likely converges to a true optimal solution.

Factors contributing to the success of the MIF-based approach
Three factors were critical in the design and implementation of our approach. First, focusing on a behaviorally critical task (call categorization), and choosing model species with rich vocal repertoires and behaviors (marmosets and guinea pigs), allowed us to clearly identify a computational goal of cortical processing: call categorization.
Previous experiments, using both electrophysiological 20-24 and imaging techniques 17,25,26, showing an increase in cortical resources allocated to call processing, validate our choice of call categorization as a critical computational goal in vocal animals. Second, our analyses were based on a large sample of calls recorded from a large number of animals 8. From this data set, we deliberately oversampled a large number of initial potential features. This ensured that the full extent of production variability was represented in this data set. Third, the greedy search algorithm efficiently identified informative features from a training data set of a few hundred calls. Since clean and labelled training data sets are laborious to generate, the efficiency of greedy search provided a significant methodological advantage.

MIF-based reconstruction of call stimuli
The observation that an MIF-based approach successfully generalizes across production variability implies that most calls belonging to a category will contain one or more of the MIFs. Therefore, we asked how well calls could be reconstructed based on MIFs alone, using twitters as a specific example. To do so, we detected model twitter MIF neuron 'spiking' as earlier in the 500 training and 500 test twitters, and convolved these spike times with an alpha function (with a time constant of 20 ms) to detect the peak locations of twitter MIFs within a twitter (Supplementary Fig. 5A). We then placed copies of MIF cochleagrams at these peak locations, or added copies of MIF cochleagrams to previously placed feature cochleagrams. The final summed cochleagram was taken to be the reconstructed call (Supplementary Fig. 5B).

We then asked if the auditory system uses such an optimal feature-based approach to call classification. To explore this possibility, as a first step, we generated 'tuning curves' of putative MIF-selective model neurons responding to commonly used acoustic stimuli, and asked if these tuning curves matched previous experimental observations (Supplementary Fig. 6). In this effort, we were restricted by the appropriateness and availability of previous data. We then compared the MIF responses to available neural data from marmoset primary auditory cortex (A1). Although the MIF model was purely theoretical and did not have prior access to neurophysiological data, we found that model MIF neuron tuning recapitulated actual data to a remarkable degree, both at the population and single-unit levels. For example, the population of model MIFs showed a high preference for natural calls compared to reversed calls (Fig. 8A, bottom), similar to observations by Wang and Kadia 27 (reproduced in Fig. 8A, top). The high sparseness of auditory cortical neurons is well-documented 28-30.
The responses of model MIF-selective neurons were also sparse: only a few MIF neurons were activated by any given stimulus set, and only after extensively optimizing the parameters of the stimulus set to drive specific model MIF neurons. For example, in Fig. 8B (top), we show a single-unit recording from a marmoset A1 L2/3 neuron that did not respond to most stimulus types (reproduced from Sadagopan and Wang 30), and only strongly responded to two-tone stimuli. Twitter MIFs (Fig. 8B, bottom) were similarly unresponsive to most stimulus types, and only responded to carefully optimized linear frequency-modulated (lFM) sweeps. None of the model twitter and trill MIF-selective neurons responded to pure tones (Fig. 8B, bottom), similar to many A1 L2/3 neurons.
Most strikingly, we could recapitulate some specific and highly nonlinear single-neuron tuning properties as well. Figure 8C (top; reproduced from Sadagopan and Wang 30) is a single-unit recording from marmoset A1 L2/3 that did not respond to pure tones, but selectively responded to upward lFM sweeps of specific lengths (~80 ms).
Responses of at least three of the top 5 twitter MIF-selective model neurons showed similar tuning for 80 ms-long upward lFM sweeps (Fig. 8C, bottom). A second peak at ~40 ms was also present in the responses of two model twitter MIF-selective neurons, also matching the experimental data. Figure 8D (top) shows a neuron that responded to trains of lFM sweeps; model MIF-selective neurons showed remarkably similar tuning (Fig. 8D, bottom). These model neurons did not respond to single sweeps either, but responded to trains of 2 or more sweeps occurring with a 50 ms inter-sweep interval. Taken together, these data suggest that neurons tuned to MIF-like features are present in A1 L2/3. Therefore, we would predict that a spectral-content-based representation of calls in the ascending auditory pathway becomes largely a feature-based representation in A1 L2/3. Consistent with the prediction of feature selectivity, we have found neurons in A1 of both marmosets and guinea pigs that respond selectively to conspecific call features.
In Fig. 9, we present spike rasters of example single neurons in both marmoset and guinea pig A1, responding to marmoset (Fig. 9A) and guinea pig calls (Fig. 9B) respectively. We presented multiple exemplars of each call type as stimuli.

Fig. 9 (caption excerpt): Shading corresponds to stimulus duration (different calls have different lengths). Note that spikes occur at specific times, and in response to 2 or 3 call types, suggesting that the neurons are responding to smaller features within these calls. (B) Spike rasters of three single units from guinea pig A1 responding to guinea pig call stimuli.

Task-dependent MIF-based classification as a general auditory computation
Our approach has two limitations. First, the number of auditory tasks that an animal is potentially required to solve is ill-defined. While we mitigate this limitation by choosing ethologically critical tasks such as call categorization, it is likely that we are only probing a small subset of all behaviorally relevant auditory tasks. Consequently, while a subset of neurons in auditory cortex match predictions from our model for call and caller classification, developing a larger bank of natural auditory behaviors (for example, predator sounds versus neutral sounds) will allow us to model and predict a larger fraction of cortical responses. Second, our model derives features from the auditory nerve representation of stimuli. It is well known that this representation is transformed more than once before impinging on cortical neurons. Therefore, the actual representation from which cortical neurons detect features is not accurately modeled here. This limitation arises from the current lack of predictive models for central auditory processing stages. It is possible that the performance of our algorithm would increase if we could accurately model other subcortical processing stages.

Recognizing these limitations, we asked if MIF-based representations of sounds could also be used to optimally solve other tasks, such as caller identification, and if MIF-based call classification also generalized to other vocal species. To test these hypotheses, we performed three proof-of-principle simulations using limited available data sets. For caller identification, we generated training and test sets of 60 twitters each from eight marmosets, and generated 500 initial random features from the training set. We applied the greedy-search algorithm to determine the MIFs for caller identification in a caller-A-versus-all-other-callers task (Fig. 10A). We found that, similar to call categorization, caller identification could also be achieved using a small number of MIFs (n = 4). If caller identification was performed in a binary fashion (four classifications between two animals each), in half of these tasks classification could be accomplished using fewer than 3 MIFs, indicating that the calls of these marmosets probably differed along the frequency axis. This is because if there are clear differences in dominant frequency (for example, Animal 1 vs. 4 in Fig. 1E), all features that lie in one animal's frequency range will detect all of that animal's calls and none of the other animal's calls. During the greedy search procedure, these features will be considered redundant and reduced to a single feature. In the other half, more MIFs were required for caller identification, and in general, MIFs were larger than those for call-type classification. This is likely because the differences between twitters produced by these animals are smaller than the differences between call types, and can only be resolved in a higher-dimensional space. Thus, integration over more frequencies and a larger time window may be necessary to resolve caller differences. In Supplementary Fig. 7, we plot the ROC for caller identification between a pair of marmosets with overlapping dominant frequencies. The MIF-based approach (n = 20 MIFs) achieved >80% hit rates with <10% false alarm rates for caller identification.

For determining the efficacy of MIF-based call classification in other species, we used guinea pig and macaque call classification as examples. Guinea pigs are highly vocal rodents that produce seven main call types 23,31,32, which are highly overlapping in the low-frequency end of the spectrum, and show high production variability. We used the MIF-based approach to classify guinea pig call types ('whine', 'wheek', and 'rumble') from all other guinea pig call types. Similar to marmosets, guinea pig call classification could be accomplished using a handful of features (12, 9, and 3 MIFs for whine, wheek, and rumble), and MIF-based classification achieved high performance levels (Fig. 10B).
Similarly, we implemented the MIF-based algorithm to classify macaque calls (using 5, 4, and 9 MIFs for coos, grunts, and harmonic arches) from a limited macaque call data set 33 and achieved high classification performance (Fig. 10C). These proof-of-principle experiments demonstrate that an MIF-based approach indeed succeeds for different auditory classification tasks and in different species, suggesting that building representations of sounds using task-relevant features in auditory cortex may be a general auditory computation.

Discussion
In these experiments, we set out to understand the computations performed by the auditory system that enable the categorization of behaviorally critical sounds, such as calls, despite wide variations in the spectrotemporal structure of calls belonging to a category (production variability). We found that the optimal theoretical solution is to detect the presence of informative mid-level features (termed MIFs) in calls. These MIFs generalize over production variability, and conjunctions of MIFs accomplish production-invariant call classification with high accuracy. Critically, the tuning properties of putative MIF-selective neurons match previous recordings from marmoset A1 to a surprising degree. MIF-based classification was also successful for other tasks (marmoset caller identification) and in other species (guinea pig and macaque call recognition). Our results suggest that the representation of sounds in higher auditory cortical areas might enable the performance of auditory tasks based on the detection of optimal task-relevant features.

Comparison to previous theoretical and experimental methods
An implication of our results is that in higher auditory processing stages, neural representations of sounds serve specific behavioral purposes. For example, the MIF-based classification approach that we propose here is targeted to solve well-defined classification problems. At earlier stages of the auditory pathway, however, it may be more important to faithfully represent sounds using basis sets that enable the accurate encoding of novel stimuli. Previous theoretical studies have proposed, for example, that natural sounds can be efficiently encoded using spike patterns, where each spike represents the magnitude and timing of input acoustic features 34. However, when optimized to encode the complete waveforms of natural sound ensembles, the kernel functions that elicit each spike show a striking similarity to cochlear filters. The advantage of this approach is that novel stimuli can be completely encoded using these kernel functions. The input to our model implements a similar encoding schematic: in the cochleagram, inputs are encoded as spatiotemporal spike patterns, where each spike is the result of cochlear filtering. In this early representation, while information about category identity is present, it is distributed in the activity of many neurons in a high-dimensional space. We propose that in later processing stages, this early representation is transformed into a representation where category identity is more easily separable. By encoding MIF-like features, sound representation in later processing stages is less useful for high-fidelity encoding, but is instead goal-oriented. However, this means that each task will require a distinct set of MIFs for optimal performance, and animals likely perform a large number of such behaviorally relevant tasks.
The observed 1000-fold increase between the number of cochlear inputs and auditory cortical neurons may partially result from this necessity to encode a multitude of such task-specific feature sets.

Previous experimental studies have described call selectivity primarily using two methods: 1) categorization of neural tuning along an exhaustive list of call parameters 41, and 2) categorization of call tuning as tuning for regions of the modulation spectrum 42-44. In the former study, marmoset calls were parametrized along multiple acoustic dimensions. Some of these parameters were common to all call types, such as the length or dominant frequency of a call. The more distinguishing parameters, however, were unique to individual call types, such as the inter-phrase interval for twitters, or the sinusoidal frequency modulation rate for trills. Neural tuning to calls was described using tuning to these parameters, but did not use the same set of parameters across call types. In our study, different MIFs are used for the classification of different call types, but MIFs are parametrized along the same axes (bandwidth and integration window), allowing a uniform basis for comparisons. In the latter set of studies, neural tuning for birdsong was described using selectivity for specific frequency and temporal modulations. In this case, tuning could be expressed in a unified stimulus space (of spectral and temporal modulation rates). Both these methods, however, serve to describe neural tuning, and not to explain why tuning to certain parameters or regions of modulation space is necessary in the first place. Our results suggest that generating selectivity for task-relevant features explains why selectivity for stimulus parameters arises in the first place.

Possible mechanisms of generation of MIF-based representations
MIF-based representations are constructed from MIF-selective neurons. Neural selectivity for MIFs may be generated 1) gradually along the ascending auditory pathway, or 2) de novo in cortex. Single-neuron feature selectivity often (but not always, see below) leads to selectivity for one or a few call types, and analyzing the call selectivity of neurons at different auditory processing stages could provide insight into where MIF-based representations might be generated in the auditory pathway. In early auditory processing stages, evidence for call selectivity at the single-neuron level is minimal. For example, at the level of the cochlear nucleus, few single neurons in species other than mice show call selectivity 45. At the level of the inferior colliculus, a population-level bias in call selectivity has been reported 45-47, but evidence for single-neuron call selectivity is equivocal 48. It is only at the level of auditory cortex that clear single-neuron selectivity for calls or call features has been observed. Therefore, it is quite likely that selectivity for MIF-like features in species with spectrotemporally complex calls is generated at the level of auditory cortex. This is supported by the expansion in the number of cortical neurons mentioned above. Importantly, the cortical emergence of MIF-based representations is also supported by the fact that MIF-like responses have been observed in the superficial layers of marmoset A1 30.

We propose the following hierarchical model for auditory processing, based on the representation of task-relevant features. In the thalamorecipient layers of A1, the representation of sound identity is still based on spectral content. This is reflected in the strongly tone-tuned responses of A1 L4 neurons. From these neurons, tuning for MIF-like features may be generated using nonlinear mechanisms such as combination sensitivity, as suggested by the tuning properties of the marmoset A1 responses shown in Fig. 8.

Computations underlying the perception of auditory categories
In conclusion, we propose a hierarchical model for solving a central problem in auditory perception: the goal-oriented categorization of sounds that show high within-category variability, such as speech 1,2 or animal calls 3. Our work has broad implications as to where in the auditory pathway categorization begins to emerge, and what features are optimal to learn in categorization tasks. For example, the lack of distinction of the perceptual categories of English /r/ and /l/ by native Japanese speakers, and the success of bilingual Japanese speakers in accomplishing this classification, suggest that categorical differences can be learned 50. Our model suggests that native speakers do not distinguish /r/-/l/ differences because the optimal features necessary for /r/-/l/ categorization are not encoded, as this categorization is not task-relevant for Japanese speech. fMRI evidence supports this conjecture 51. Our model would predict that what is learned in bilingual speakers are optimal features that maximize /r/-/l/ differences. Our model would further predict that this learning would be primarily reflected in changes to the A1 L2/3 circuit. Consistent with this hypothesis, a recent study showed that training humans to categorize monkey calls resulted in finer tuning for call features in the auditory cortex 52. We therefore suggest that the neural representation of sounds at higher cortical processing stages uses task-dependent features as building blocks, and that new blocks can be added to this representation to enable novel perceptual requirements.

Our use of multiple binary classification tasks follows a previous study that demonstrated the advantages of features learnt using multiple binary classifications compared to those learnt using a single multi-way classification. Specifically, in that study, multiple binary classifications resulted in features that were distinctive and highly tolerant to distortions 56.
For each classification task, we first generated training data sets, which consisted of 500 random within-class calls (e.g., twitters) produced by 8 animals (about 60 calls per animal), and 500 random outside-class calls (e.g., trills, phees, and other calls) produced by the same 8 animals. To convert the sound waveforms of the calls into a physiologically meaningful quantity, we transformed these calls into cochleagrams using a previously published auditory nerve model 54 with human auditory nerve parameters and high spontaneous rate. We used human auditory nerve parameters because of the close similarity between marmoset and human audiograms 55 . The output of this model was the time-varying activity pattern of the entire population of auditory nerve fibers, and resembles the spectrogram of the call (Fig. 2A, B). We then extracted 6000 random features from these 500 within-class cochleagrams.
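The cochleagram itself comes from a detailed auditory nerve model (ref. 54). As a rough, self-contained stand-in for illustration only, a log-frequency power spectrogram produces a similar frequency-by-time activity pattern; all names and parameter values below are our own choices, not the published model:

```python
import numpy as np

def cochleagram(waveform, fs, n_channels=64, win_ms=5.0, fmin=200.0, fmax=20000.0):
    """Crude cochleagram stand-in: a log-spaced power spectrogram.

    The paper uses a detailed auditory nerve model; this substitutes a
    log-frequency spectrogram purely to illustrate the data format
    (channels x time frames).
    """
    win = int(fs * win_ms / 1000)
    hop = win // 2
    n_frames = max(1, (len(waveform) - win) // hop + 1)
    window = np.hanning(win)
    freqs = np.fft.rfftfreq(win, 1.0 / fs)
    # log-spaced channel edges between fmin and fmax
    edges = np.geomspace(fmin, fmax, n_channels + 1)
    coch = np.zeros((n_channels, n_frames))
    for t in range(n_frames):
        seg = waveform[t * hop:t * hop + win] * window
        power = np.abs(np.fft.rfft(seg)) ** 2
        for ch in range(n_channels):
            band = (freqs >= edges[ch]) & (freqs < edges[ch + 1])
            coch[ch, t] = power[band].sum() if band.any() else 0.0
    return coch

# usage: a 100-ms pure tone at 4 kHz concentrates energy in one channel band
fs = 48000
t = np.arange(int(0.1 * fs)) / fs
coch = cochleagram(np.sin(2 * np.pi * 4000 * t), fs)
```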
To do so, we randomly chose a center frequency, bandwidth, onset time, and length, and extracted a snippet of activity from the cochleagram. Each feature thus corresponded to the spatiotemporal pattern of activity of a subset of auditory nerve fibers within a specified time window (magenta box in Fig. 2B). We used rectangular feature shapes rather than other shapes to minimize assumptions; for example, an ellipse-shaped feature would imply that the weighting of individual auditory nerve fibers changes over time. To ensure that smaller features were well-sampled, 2000 of these features were restricted to have a bandwidth of less than 1 octave and a duration of less than 100 ms. The bandwidth and duration of the remaining 4000 features were not constrained.
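Random rectangular feature extraction can be sketched as follows; the function name, the optional bandwidth/duration caps (standing in for the 1-octave and 100-ms constraints), and the (low channel, bandwidth, onset, duration) coordinates are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_feature(coch, max_bw_ch=None, max_dur_frames=None):
    """Extract one random rectangular snippet ('feature') from a
    cochleagram of shape (n_channels, n_frames). Optional caps on
    bandwidth (in channels) and duration (in frames) mimic the
    constrained subset of features described in the text."""
    n_ch, n_fr = coch.shape
    bw_cap = max_bw_ch or n_ch
    dur_cap = max_dur_frames or n_fr
    bw = int(rng.integers(1, min(bw_cap, n_ch) + 1))    # bandwidth
    dur = int(rng.integers(1, min(dur_cap, n_fr) + 1))  # duration
    f0 = int(rng.integers(0, n_ch - bw + 1))            # lowest channel
    t0 = int(rng.integers(0, n_fr - dur + 1))           # onset frame
    return coch[f0:f0 + bw, t0:t0 + dur].copy(), (f0, bw, t0, dur)

# usage: draw a handful of unconstrained features from a toy cochleagram
coch = rng.standard_normal((64, 200))
feats = [random_feature(coch)[0] for _ in range(10)]
```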
Threshold optimization: We defined the 'response' of a feature to a call as the maximum value of the normalized cross-correlation (NCC) function between the feature's cochleagram and the call's cochleagram, restricted to the auditory nerve fibers that are represented in the feature. We effectively implemented a one-dimensional version of the NCC by only considering the auditory nerve fibers that overlapped between the call and the feature. Note that this means features can only be detected in the frequency range that they span, but can be detected anywhere in time within a call. The NCC is a commonly used metric to quantify template match. A feature was considered detected in a call if its response exceeded a threshold, and for each candidate threshold we computed the mutual information between feature detection and call category, where the prior probability of the target category P(C) was assumed to be 0.10. We empirically verified that the features identified were insensitive to variations of this value. The optimal threshold for each feature was taken to be the threshold value at which the mutual information was maximal, and the merit of each feature was taken to be the maximum mutual information value in bits (Fig. 2C). The 'weight' of each feature was taken to be its log-likelihood ratio. At the end of this procedure, each of the initial 6000 features was allocated a merit, a weight, and an optimal threshold at which that individual feature's utility for classifying calls as within- or outside-class was maximized. Note that merit and weight are distinct quantities that need not be monotonically related. For example, if the lack of energy in a frequency band is indicative of a target category, features that contain energy in this frequency band will be detected often in the other categories, but not in the target category. Such a feature will thus have high merit for classification, as it is informative by its absence, but a negative weight.
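A minimal sketch of the feature response and of threshold optimization, assuming detection is a simple threshold on the maximum NCC and that mutual information is scanned over a fixed grid of candidate thresholds (the grid size, function names, and the smoothing constant in the weight are our own choices):

```python
import numpy as np

def ncc_response(call_coch, feature, f0):
    """Max normalized cross-correlation of a feature slid in time over
    the matching frequency channels of a call cochleagram; f0 is the
    feature's lowest channel index within the call."""
    bw, dur = feature.shape
    band = call_coch[f0:f0 + bw, :]
    f = feature - feature.mean()
    fn = np.linalg.norm(f)
    best = -1.0
    for t in range(band.shape[1] - dur + 1):
        c = band[:, t:t + dur]
        c = c - c.mean()
        cn = np.linalg.norm(c)
        if fn > 0 and cn > 0:
            best = max(best, float((f * c).sum() / (fn * cn)))
    return best

def merit_and_threshold(resp_in, resp_out, p_c=0.10, n_thresh=50):
    """Scan thresholds; return (merit = max mutual information in bits,
    optimal threshold, log-likelihood weight at that threshold)."""
    eps = 1e-6
    best = (0.0, None, 0.0)
    for th in np.linspace(-1, 1, n_thresh):
        p_d_c = np.mean(resp_in > th)    # P(detect | within-class)
        p_d_nc = np.mean(resp_out > th)  # P(detect | outside-class)
        p_d = p_c * p_d_c + (1 - p_c) * p_d_nc
        mi = 0.0  # I(detection; category) = sum P(c,d) log2(P(d|c)/P(d))
        for pc, pd_c in ((p_c, p_d_c), (1 - p_c, p_d_nc)):
            for pd, pj in ((p_d, pd_c), (1 - p_d, 1 - pd_c)):
                if pj > 0 and pd > 0:
                    mi += pc * pj * np.log2(pj / pd)
        if mi > best[0]:
            weight = float(np.log2((p_d_c + eps) / (p_d_nc + eps)))
            best = (mi, float(th), weight)
    return best

# usage: a feature planted in a call is detected with NCC ~ 1
rng = np.random.default_rng(1)
call = rng.standard_normal((64, 300))
feat = call[10:22, 50:90].copy()
resp = ncc_response(call, feat, f0=10)
mi, th, w = merit_and_threshold(np.array([0.8, 0.9, 0.85, 0.7]),
                                np.array([0.1, 0.2, 0.15, 0.05]))
```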
Greedy search: Because the initial features were chosen at random, many of them individually provided little information about call category, and many of the best features for classification were self-similar, or redundant. Therefore, to extract maximal information from a minimal set of features for classification, we used a greedy search algorithm 12 to iteratively 1) eliminate redundant features, and 2) pick features that added the most information to the set of selected features. The minimal set of features that together maximized information about call type was termed the maximally informative features (MIFs). The first MIF was chosen to be the feature with maximal merit from the set of all 6000 initial random features. Every consecutive MIF was chosen to maximize pairwise added information with respect to the previously chosen MIFs. Note that these consecutive features need not have high merit individually. We iteratively added MIFs until we could no longer increase the hit rate without increasing the false alarm rate.
Practically, this meant adding features until the total information reached 0.999 bits, or until individual features added less than 0.001 bits, whichever was reached earlier. At the end of this procedure, a small set of MIFs, containing the optimal set of features for call classification, was obtained.
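The greedy loop can be sketched as follows. Note one deliberate simplification: instead of the pairwise added-information criterion of the published algorithm, this version scores the information that the selected set's summed detection count carries about the category label; it captures the same add-informative, skip-redundant intuition but is not the exact procedure:

```python
import numpy as np

def greedy_mif_search(detections, labels, max_feats=20, min_gain=0.001):
    """Greedy feature selection sketch. detections: (n_features, n_calls)
    0/1 matrix (feature detected at its optimal threshold); labels:
    (n_calls,) boolean (within-class). At each step, add the feature whose
    inclusion most increases the mutual information between the selected
    set's total detection count and the label; stop when the best gain
    falls below min_gain bits."""
    def info(counts, labels):
        # mutual information (bits) between an integer summary and the label
        mi = 0.0
        vals, inv = np.unique(counts, return_inverse=True)
        p1 = labels.mean()
        for v in range(len(vals)):
            sel = inv == v
            pv = sel.mean()
            for lab, plab in ((True, p1), (False, 1 - p1)):
                pj = np.mean(sel & (labels == lab))
                if pj > 0:
                    mi += pj * np.log2(pj / (pv * plab))
        return mi

    selected, total, base = [], np.zeros(detections.shape[1], int), 0.0
    for _ in range(max_feats):
        gains = []
        for i in range(detections.shape[0]):
            if i in selected:
                gains.append(-np.inf)
                continue
            gains.append(info(total + detections[i], labels) - base)
        best = int(np.argmax(gains))
        if gains[best] < min_gain:
            break  # no feature adds enough information: stop
        selected.append(best)
        total += detections[best]
        base += gains[best]
    return selected

# usage: a redundant copy of a perfect feature should not be selected twice
labels = np.array([1, 1, 1, 0, 0, 0], dtype=bool)
detections = np.array([
    [1, 1, 1, 0, 0, 0],   # perfectly informative
    [1, 0, 1, 0, 1, 0],   # weakly informative
    [1, 1, 1, 0, 0, 0],   # redundant copy of the first
])
selected = greedy_mif_search(detections, labels)
```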

Analysis and statistics: To test how well novel calls could be classified using these MIFs alone, we generated, from the same 8 animals, a test set of 500 within- and outside-class calls that the model had not been exposed to before. We computed the NCC between each test call and each MIF, and considered the MIF to be detected in the call if the maximum value of the NCC function exceeded its optimal threshold. If detected, the MIF provided evidence in favor of the test call belonging to the call type, proportional to its log-likelihood ratio. We then summed the evidence provided by all MIFs and generated ROC curves of classification performance by systematically varying an overall evidence threshold. We used the area under the curve (AUC) to compare ROC curves for classification performance by MIFs generated with different constraints (see Results).
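Evidence summation and ROC construction can be sketched as below; the function names and the trapezoidal integration of the curve are our own choices:

```python
import numpy as np

def classify_evidence(detected, weights):
    """Total evidence for a test call: sum of the log-likelihood weights
    of the MIFs detected in it (detected: 0/1 vector, one entry per MIF)."""
    return float(np.sum(np.asarray(weights) * np.asarray(detected)))

def roc_auc(ev_in, ev_out):
    """Sweep an overall evidence threshold across all observed evidence
    values to trace the ROC curve, then integrate the area under it."""
    thresholds = np.sort(np.concatenate([ev_in, ev_out]))[::-1]
    hits = np.array([np.mean(ev_in >= th) for th in thresholds])
    fas = np.array([np.mean(ev_out >= th) for th in thresholds])
    hits = np.concatenate([[0.0], hits, [1.0]])
    fas = np.concatenate([[0.0], fas, [1.0]])
    # trapezoidal area under the (false alarm, hit) curve
    return float(np.sum((fas[1:] - fas[:-1]) * (hits[1:] + hits[:-1]) / 2))

# usage: perfectly separated evidence distributions give AUC = 1
weights = np.array([2.0, 1.5, -0.5])
ev = classify_evidence([1, 0, 1], weights)
auc = roc_auc(np.array([3.0, 4.0, 5.0]), np.array([0.0, 1.0, 2.0]))
```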
Statistical significance was evaluated using non-parametric methods for comparing between these conditions, and for comparing performance to a large number of simulations generated using random MIFs.

The NCC computation yielded values that could be conceptualized as equivalent to membrane potential (Vm) responses. These were converted to firing rates by applying a power-law nonlinearity of the form:

FR = k * max(Vm − θ, 0)^p

where FR is the firing rate response in spk/s, θ is the MIF's optimal threshold, p is the exponent of the nonlinearity, set to a value of 4, and k is an arbitrary scaling factor.
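Read as a rectified power law (the rectification and the default k below are our assumptions, since only the threshold, exponent, and scaling factor are described), the nonlinearity is a one-liner:

```python
import numpy as np

def firing_rate(vm, theta, p=4, k=100.0):
    """Rectified power-law nonlinearity converting NCC-derived 'Vm'
    values into firing rates: FR = k * max(Vm - theta, 0)^p."""
    return k * np.maximum(0.0, np.asarray(vm) - theta) ** p

# usage: subthreshold Vm gives zero rate; rate grows steeply above threshold
sub = firing_rate(0.4, 0.5)
supra = firing_rate(1.5, 0.5, p=4, k=1.0)
```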
Call reconstruction from MIFs: To reconstruct calls, we conceptualized MIFs as MIF-selective neurons, and considered the times at which NCC values exceeded the optimal threshold to be the spike times of these neurons. MIF spike times were computed with a time resolution of 2 ms to simulate refractoriness, and alpha functions were convolved with the spike times to determine the peak time at which each MIF was detected. A copy of the MIF cochleagram was then placed at the peak time, or summed (with log-likelihood weights) if it overlapped with a previously placed cochleagram. The accuracy of reconstruction was defined as the NCC between the original stimulus and its reconstructed version at zero lag.
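A simplified version of the reconstruction and its accuracy metric is sketched below; it places weighted cochleagram copies at given detection peaks and omits the spike-time and alpha-function machinery (all names are illustrative):

```python
import numpy as np

def reconstruct(mif_peaks, mif_cochs, weights, n_ch, n_frames):
    """Sum weighted copies of each MIF's cochleagram at its detection
    peaks. mif_peaks: per MIF, a list of (low channel, frame) positions;
    overlapping copies simply add, as in the text."""
    recon = np.zeros((n_ch, n_frames))
    for peaks, coch, w in zip(mif_peaks, mif_cochs, weights):
        bw, dur = coch.shape
        for (f0, t0) in peaks:
            t1 = min(t0 + dur, n_frames)  # clip copies at the call's end
            recon[f0:f0 + bw, t0:t1] += w * coch[:, :t1 - t0]
    return recon

def reconstruction_accuracy(original, recon):
    """Zero-lag NCC between original and reconstructed cochleagrams."""
    a = original - original.mean()
    b = recon - recon.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float((a * b).sum() / denom) if denom > 0 else 0.0

# usage: one 'MIF' spanning the whole call reconstructs it exactly
rng = np.random.default_rng(2)
original = rng.standard_normal((32, 100))
recon = reconstruct([[(0, 0)]], [original], [1.0], 32, 100)
acc = reconstruction_accuracy(original, recon)
```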
Electrophysiology methods: Predictions generated from the MIFs were compared to earlier recordings from marmoset A1. Details of recording procedures are available from the original experimental data sources. All recordings were from adult marmosets. Population data comparing natural to reversed twitters were obtained from Wang and Kadia 27 . These experiments were performed in anesthetized marmosets. Single-neuron data regarding feature selectivity were obtained from Sadagopan and Wang 30 . These recordings were from awake, passively listening marmosets. Single-neuron data regarding feature selectivity in guinea pigs were obtained from adult, head-fixed, passively listening guinea pigs at the University of Pittsburgh. Briefly, a headpost and recording chambers were secured to the skull using dental cement following aseptic procedures. Animals were placed in a double-walled, anechoic, sound-attenuated booth. A small craniotomy was performed over auditory cortex. High-impedance tungsten electrodes (3-5 MΩ; A-M Systems Inc. or FHC, Inc.) were advanced through the dura into cortex to record neural activity. Stimuli were generated in MATLAB and presented (TDT Inc.) from the best location in an azimuthal speaker array (B&W-600S3 or Fostex FT-28D for marmosets; TangBand 4" full-range driver for guinea pigs). Single units were sorted online using a template-matching algorithm (Alpha Omega Inc. or Ripple, Inc.), and for guinea pigs, refined offline (MKSort). All analyses were performed using custom MATLAB code.
Code availability: Custom code will be provided upon request to the corresponding author (SS).