Zebra finches identify individuals using vocal signatures unique to each call type

Individual recognition is critical in social animal communication, but it has not been demonstrated for a complete vocal repertoire. Deciphering the nature of individual signatures across call types is necessary to understand how animals solve the problem of combining, in the same signal, information about identity and behavioral state. We show that distinct signatures differentiate zebra finch individuals for each call type. The distinctiveness of these signatures varies: contact calls bear strong individual signatures while calls used during aggressive encounters are less individualized. We propose that the costly solution of using multiple signatures evolved because of the limitations of the passive filtering properties of the birds’ vocal organ for generating sufficiently individualized features. Thus, individual recognition requires the memorization of multiple signatures for the entire repertoire of conspecifics of interest. We show that zebra finches excel at these tasks.

This figure complements Figures 4 and 6 and compares the performance (percent of correct classification, PCC) of three different regularized classifiers (Linear Discriminant Analysis, LDA; Quadratic Discriminant Analysis, QDA; Random Forest, RF) and two feature spaces (18 pre-defined acoustic features, PAFs; spectrogram, Spectro) for testing the existence of voice cues. The performance of the classifiers is always quantified by cross-validation, separating training and testing data. For the Same condition in A and the diagonal in B, the testing dataset is composed of other renditions of the same call type. For the Other condition in A and the off-diagonal in B, the testing dataset is composed of call renditions from other call types. If the same acoustic features are used to classify vocalizers irrespective of call type, then the performance of a given classifier should be similar between Same and Other. A. Classifiers' performance at categorizing vocalizers when tested on each call type (DC, Te, Tu, So, Th, Wh, Ne, Ws; labels are defined in Figure 4A) and trained either with vocalizations of the Same call type (S) or with vocalizations of all Other call types (O). Error bars indicate 95% confidence intervals. Although the performance drops drastically for every vocalization type between the Same and Other conditions, several performance values stay above chance level, indicating some degree of transferability of the acoustic features learned to discriminate vocalizers from all other categories. Changing the features representing the vocalizations or the type of classifier used does not drastically change this result. B. Performance of classifiers on pairwise sets of call types in the two different feature spaces. The color code indicates the classifier performance when trained with the call type indicated in columns and tested on the call type indicated in rows. Red stars indicate significance of the PCC compared to chance level (direct binomial test, p<0.05).
Irrespective of the feature space or the classifier, classification performance drops when the training and testing sets are not from the same call type (bins outside of the diagonal).

Supplementary Fig. 4.
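The Same/Other logic can be illustrated with a toy Python sketch: a classifier trained to separate two vocalizers on one call type is tested either on new renditions of that call type (Same) or on a second, acoustically shifted call type (Other). All data here are simulated Gaussians standing in for acoustic features; nothing below comes from the study's recordings or code.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Toy stand-in for feature vectors: each vocalizer has its own offset, and
# each call type shifts the features differently (simulated, illustrative).
def calls(vocalizer_offset, call_type, n=50):
    return rng.normal(vocalizer_offset + 3 * call_type, 1.0, (n, 4))

y = np.repeat([0, 1], 50)
clf = LinearDiscriminantAnalysis().fit(
    np.vstack([calls(0, 0), calls(2, 0)]), y)

# Same condition: other renditions of the training call type.
pcc_same = clf.score(np.vstack([calls(0, 0), calls(2, 0)]), y)
# Other condition: renditions of a different call type.
pcc_other = clf.score(np.vstack([calls(0, 1), calls(2, 1)]), y)
```

In this toy setting the Same score stays near perfect while the Other score collapses toward chance, mirroring the performance drop off the diagonal in B.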
Increased role of temporal and fundamental features in coding identity information as compared to call type information. This figure complements Figure 8C and shows scatter plots of classifier performance, measured as the percent of correct classification (PCC), at categorizing pairs of vocalizers within each call type when trained and tested in the space defined by 3 sets of PAFs: Spect, the 8 spectral parameters only; Temp, the 5 temporal parameters only; Fund, the 5 fundamental parameters only. The dashed lines indicate the relative performance of the classifier at discriminating call type (Call Type classification) when using the same sets of PAFs. Dots above the dashed line indicate call types for which the relative contribution of the set of PAFs on the y-axis vs the set of PAFs on the x-axis is larger for identity discrimination than for Call Type classification. For example, in the first two scatter plots, contact calls (DC, LT and Te) are all above the dashed lines, indicating that the relative contribution of temporal and fundamental features as compared to spectral features is larger for encoding identity information than for encoding call type information.

Subjects and housing conditions
Subjects used for the behavioral experiments were thirteen adult domestic zebra finches (Taeniopygia guttata; 7 females and 6 males) bred in our colony at the University of California, Berkeley. The vocalization databank used as stimuli for the behavioral experiments and for the acoustic analyses has been previously described (see (1)) and was obtained from acoustic recordings of another set of 45 birds (20 females, 23 males and two chicks of unknown sex).
Birds were maintained at a constant temperature of 22-24°C and with a 14:10 light-dark cycle. Before the beginning of experiments, birds were housed in groups of 6-12 birds in a mixed-sex environment. Food and water were provided ad libitum, with salad and egg supplement given once a week. For the duration of the shaping and testing days and while not in the testing chamber, the subjects were housed individually or in pairs in the colony room and food-restricted: their food intake was fixed to 1.5 g of mixed finch seeds per individual, given at the end of each day upon returning to the colony room. The weight of each subject was closely monitored daily so that it remained between 85 and 90% of the initial body weight.
Behavioral experiments: study of the discrimination of vocalizers by zebra finches.
Chamber apparatus and test procedure
The sound level was calibrated on song recordings to match the natural peak intensity level of 70 dB SPL at 10 cm. The behavior of the subject was further monitored using a webcam (Logitech) placed inside the soundproof booth.
Sound playbacks and the various functions of the test chamber were controlled by a computer running a custom program (Matlab, Mathworks, Cambridge, MA, USA) that communicated with the test chamber through a simple DAQ card (Measurement Computing Corporation, Norton, MA 02766, USA). The control of the test chambers included illuminating the key-pad, recording pecking events at a 10 Hz sampling rate and activating the feeder. A test consisted of three sessions of 30 min each per day, with a minimum inter-session rest period of 90 min. The illumination of the key-pad signaled to the bird that it was active. The code detected the beginning of each session (when the bird pecked the key-pad for the first time) and ended the session 30 min later. Each hit on the key-pad triggered the playback of a different 6s stimulus. An interruption occurred when the bird pecked while the computer was playing a 6s stimulus, resulting in the immediate triggering of another stimulus.
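The trial logic described above can be sketched as follows (an illustrative Python simulation, not the Matlab control code; `run_trials` and its bookkeeping are our own invention):

```python
# Hypothetical sketch of the trial logic: each key-pad peck triggers a 6-s
# stimulus; a peck during playback counts as an interruption and immediately
# triggers another stimulus.

def run_trials(peck_times, stim_dur=6.0):
    """Classify pecks as plain trial starts vs interruptions.

    peck_times: sorted times (s) of key-pad pecks within a session.
    Returns (n_stimuli_triggered, n_interruptions).
    """
    n_stim, n_int = 0, 0
    playing_until = -1.0
    for t in peck_times:
        if t < playing_until:      # peck during playback -> interruption
            n_int += 1
        n_stim += 1                # every peck triggers a (new) stimulus
        playing_until = t + stim_dur
    return n_stim, n_int

print(run_trials([0.0, 2.0, 10.0, 20.0]))  # → (4, 1)
```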

Acoustic stimuli
Each acoustic stimulus consisted of a sequence of six or three band-pass filtered (0.25-12 kHz) vocalizations of the same vocalizer and of the same call type, randomly arranged within a 6s window. More precisely, for the longer Begging sequences and Songs, each stimulus consisted of sequences of 3 different renditions, while for the other call types (Distance call, Nest call, Tet call, Thuk call, Whine call, Wsst call and Long Tonal call) each stimulus consisted of 6 different renditions. Each stimulus started and ended with a rendition. The 5 or 2 intervals between renditions in a given stimulus were randomly drawn from a uniform distribution. Before each session, the computer randomly constructed a minimum of 80 Re stimuli and 320 NoRe stimuli using a vocalization bank of 5-104 (37.7±1.4) different renditions per vocalizer and per call type (see Supplementary Table 2). A total of 3283 vocalizations were used for these experiments. Each of the 400 stimuli (i.e. sequences of six or three different renditions) was only played once during a session.
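Stimulus assembly can be sketched as follows (illustrative Python, not the authors' code; the exact way the inter-rendition intervals were drawn is an assumption here, implemented as a random split of the leftover silence):

```python
import random

def build_stimulus(renditions, n=6, total=6.0, rng=random):
    """Illustrative sketch of stimulus assembly: draw n renditions
    (name, duration) of one vocalizer and call type, start and end the
    `total`-second window with a rendition, and split the leftover
    silence into n-1 random gaps."""
    chosen = rng.sample(renditions, n)
    silence = total - sum(dur for _, dur in chosen)
    if silence < 0:
        raise ValueError("renditions do not fit in the window")
    # n-2 random cut points partition the silence into n-1 gaps
    cuts = sorted(rng.uniform(0, silence) for _ in range(n - 2))
    gaps = [b - a for a, b in zip([0.0] + cuts, cuts + [silence])]
    onsets, t = [], 0.0
    for i, (name, dur) in enumerate(chosen):
        onsets.append((name, t))
        t += dur
        if i < n - 1:
            t += gaps[i]
    return onsets
```

By construction the first rendition starts at 0 s and the last one ends exactly at the end of the window, matching the constraint that each stimulus starts and ends with a rendition.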

Shaping
Birds were shaped to use the operant chamber over a short period of time (2-5 days) using two songs from different male zebra finches as Re and NoRe stimuli. Shaping consisted of the following steps: acclimation to the cage, finding the feeder, learning the association between pecking the key-pad and triggering a vocalization playback, and learning the association between hearing a Re vocalization and earning access to the feeder. Once the procedure to activate the feeder using the key-pad was acquired, birds were encouraged to interrupt by introducing the NoRe vocalization and increasing its probability in steps up to 80%. A subject was considered to have learned the task if it was pecking at least 50 times per day, interrupting the NoRe stimuli at least 20% of the time, and if the percentage of interruption of NoRe stimuli was at least 20% higher than the percentage of interruption of Re stimuli.

Testing
For every subject, tests started on Day 0 with 3 sessions of discrimination between the two songs used during the shaping procedure. This way, each subject started the series of tests with the same prior experience with the apparatus, and having only heard stimuli that were different from those used in the actual experiment.
For each subject, a random pair of males, a random pair of females and a random pair of chicks were chosen from 24 vocalizers of our vocalization bank (7 females, 6 males and 11 chicks). Subjects were then tested for their ability to discriminate these vocalizers across all call types using the following 6 different types of discrimination tasks (Supplementary Table 2).
Male vocalizer all-call-type discrimination (1 test per subject, All-M): discrimination of 2 male vocalizers across all call types. In this test, each of the 7 adult call types was represented by 60 (NoRe vocalizer) and 12 (Re vocalizer) stimuli. The categories tested were: Distance calls, Nest calls, Songs, Tet calls, Thuk calls, Whine calls, and Wsst calls. As in the vocalizer single-call-type discrimination tasks, each stimulus was constructed by randomly selecting and combining renditions of the same call type emitted by the same individual.
Female vocalizer single-call-type discrimination (6 tests per subject): discrimination of 2 female vocalizers within the same call type (same call types as with male vocalizations with the omission of the Song that is not emitted by females, each tested on consecutive days: Distance calls, DC-F; Nest calls, Ne-F; Tet calls, Te-F; Thuk calls, Th-F; Whine calls, Wh-F; and Wsst calls, Ws-F). Acoustic stimuli were constructed following the same procedure as in Male vocalizer single-call-type discrimination.
Female vocalizer all-call-type discrimination (1 test per subject, All-F): discrimination of 2 female vocalizers across all call types. In this test, each of the 6 female adult call types was represented by 54 (NoRe vocalizer) and 14 (Re vocalizer) stimuli. The categories tested were: Distance calls, Nest calls, Tet calls, Thuk calls, Whine calls, and Wsst calls. As in the vocalizer single-call-type discrimination tasks, each stimulus was constructed by randomly selecting and combining renditions of the same call type emitted by the same individual.
Young vocalizer single-call-type discrimination (2 tests per subject): discrimination of 2 young vocalizers (chicks or C) within the same call type (2 call types each tested on consecutive days: Long Tonal call, LT-C and Begging calls, Be-C). Acoustic stimuli were constructed following the same procedure as in Male vocalizer single-call-type discrimination.
Random test (1 test per subject, RAN): Acoustic stimuli from two vocalizers of the same sex were prepared as for a Vocalizer all-call-type discrimination test but stimuli were randomly assigned to either the Re stimulus category or the NoRe stimulus category.
Note that vocalizer single-call-type discrimination tests were always performed before vocalizer all-call-type discrimination tests. For the vocalizer single-call-type discrimination tests, the order in which call types were tested was randomly assigned to each subject. Some tests were removed from the dataset because of stimulus assignment errors (Supplementary Table 2). All tests were performed in series of at most 10 consecutive days and always started after a shaping day (Day 0).
To investigate the effect of the familiarity with vocalizations acquired during vocalizer single-call-type discrimination tests on the behavioral performance of birds during vocalizer all-call-type discrimination tests, 7 female subjects ran an additional set of vocalizer all-call-type discrimination tests (All-F2 and All-M2) on vocalizations of birds they had never heard before.
An overall OR was also calculated for each test by estimating probabilities using all the trials (shown as a large diamond marker placed to the right of the time-lines in Fig. 1B, 3C, 6A and 6C). To correct for the biases due to small numbers of trials, the median unbiased estimates proposed by Parzen et al. (2) were used for the calculation of the probabilities of interruption p_NoRe and p_Re.
For a given test, the significance of the overall OR being different from 1 was calculated using an exact test: its value was compared to the distribution of OR values expected from two binomial distributions for the Re and NoRe interruptions, each with the corresponding observed number of trials, and assuming p_Re = p_NoRe = 0.5. A given value of OR was called significant if it fell in the upper percentile of the random distribution (p < 0.01).
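The odds-ratio computation and its exact test can be sketched in Python as follows (our illustration, not the analysis code; the small-sample correction p = (k + 0.5)/(n + 1) is used here as a simple stand-in for the median unbiased estimate of Parzen et al., and the null distribution is simulated rather than enumerated):

```python
import numpy as np

rng = np.random.default_rng(0)

def interruption_odds_ratio(k_nore, n_nore, k_re, n_re):
    """Odds ratio of interrupting NoRe vs Re stimuli, with a simple
    small-sample correction of the interruption probabilities."""
    p_nore = (k_nore + 0.5) / (n_nore + 1)
    p_re = (k_re + 0.5) / (n_re + 1)
    return (p_nore / (1 - p_nore)) / (p_re / (1 - p_re))

def or_exact_p(k_nore, n_nore, k_re, n_re, n_sim=20000):
    """Compare the observed OR to ORs simulated under the null
    p_Re = p_NoRe = 0.5 with the observed trial counts."""
    obs = interruption_odds_ratio(k_nore, n_nore, k_re, n_re)
    sim_nore = rng.binomial(n_nore, 0.5, n_sim)   # null NoRe interruptions
    sim_re = rng.binomial(n_re, 0.5, n_sim)       # null Re interruptions
    null = interruption_odds_ratio(sim_nore, n_nore, sim_re, n_re)
    return obs, (null >= obs).mean()
```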
Using the fitglme function of Matlab, the behavioral performance across subjects was statistically tested with binomial Generalized Linear Mixed Effects models (GLME) where the response variable is the probability of interruption (Int) and the random variable is the bird subject (Subject). Birds are deemed able to perform the task when models that include the vocalizer type (VocType), which codes whether a stimulus is from the Rewarded (Re) or Non-Rewarded (NoRe) vocalizer, perform significantly better than models that do not include VocType (significance tested with a likelihood ratio test, LRT) and the NoRe beta coefficient is positive. To investigate the effects of the call type (CallType) and of the session (Session) on the probability of stimulus interruption, these variables were added as covariates to the previous model, and the comparison of models with and without each covariate was conducted with an LRT to determine its significance. The subject's ID was set as a random variable in all these models.
To test whether the subjects were memorizing each rendition or generalizing across renditions to perform the discrimination, we calculated, for each daily test, the rank of rendition presentation (RendRank) and the rank of vocalizer type (Re vs NoRe) presentation (VocRank). In this binomial GLME, the response variable was the probability of correct response (CR, the probability of interrupting a NoRe stimulus or refraining from interrupting a Re stimulus), and the random variables were Subject and the day of the test (Date) nested within Subject. The response variable was changed here to maintain the power of the test despite the increase in the number of random groups in the GLME. Because VocRank and RendRank were highly correlated, the effect of RendRank was revealed by measuring its predictive power on CR once the effect of VocRank was removed.
This was achieved by comparing two GLME with and without RendRank as a variable, but both predicting CR, with VocType as a co-variate, Subject and Date:Subject as random variables, and an offset based on the predictions p of a third GLME. This third GLME was predicting CR with VocType and VocRank as variables, Subject and Date:Subject as random variables. The offset was calculated as log(p/(1-p)).
A detailed description of all the GLME tests performed is given in Supplementary Table 1.
Acoustical analysis: study of the discrimination of vocalizers by classifiers.

Feature Spaces
The PAF (predefined acoustic features) consisted of 18 features describing the spectral (8), temporal (5) and fundamental (5) characteristics of each sound (see also (1)).
The spectral features were extracted from the frequency power spectrum (called the spectral envelope here). The spectral envelope was estimated using Welch's average periodogram (window = 49 ms, 50% overlap, Hanning window). From the spectral envelope normalized to have unit integral, we calculated its first moments: the spectral mean, the spectral standard deviation (i.e. the spectral bandwidth), the spectral skew and the spectral kurtosis. To capture an overall measure of spectral envelope variability, we also calculated the spectral entropy. Finally, we also calculated the 3 quartiles (the 25% quartile, the median and the 75% quartile) as these are often used in bioacoustical analyses.
A temporal envelope was estimated by rectifying the sound pressure waveform and low-pass filtering it at 20 Hz. From the normalized temporal envelope, we obtained the temporal mean, the temporal standard deviation (i.e. the duration), the temporal skew and the temporal kurtosis. Overall variability was quantified with the temporal entropy.
Five fundamental parameters were obtained from a time-varying estimation of the instantaneous fundamental frequency (1 kHz sampling). The fundamental (F0) was estimated using a hybrid temporal/spectral approach: the auto-correlation function of the signal was first analyzed to estimate the period of F0, based on the largest non-zero time-lagged peak in the auto-correlation function with a frequency below 1500 Hz; this initial estimate was then used as an initial guess for matching the spectral periodicity found in the spectrogram at the corresponding time window (see Elie and Theunissen, 2016, for more details). The ratio of the amplitude of the non-zero delay peak in the auto-correlation function to the peak at zero delay was used to estimate the periodicity of the sound at each time point.
The pitch saliency of each vocalization was taken as the average value of this amplitude ratio over time points. F0 was only estimated for periodic time points showing values of pitch saliency above 0.5. In addition, we obtained the mean F0, the min F0, the max F0, and the coefficient of variation of F0. Equations and additional details for the calculations of PAF can be found in Elie and Theunissen (2016). Note that in this analysis, we did not use any features that described the intensity of the sound (e.g. RMS, peak amplitude) because these might have been affected by systematic differences in the position of the birds relative to the microphone and could bias the classifier for discriminating vocalizer identity. In some of our analyses, we used only the 8 spectral or only the 5 temporal or only the 5 fundamental features in the classifiers in order to compare the relative importance of these three types of acoustic features.
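As an illustration, the eight spectral features can be computed as follows (our own Python sketch following the description above, not the soundsig implementation; the entropy normalization and quartile interpolation are assumptions):

```python
import numpy as np
from scipy.signal import welch

def spectral_features(x, fs):
    """Sketch of the 8 spectral PAFs: moments, entropy and quartiles of the
    normalized spectral envelope from Welch's periodogram (~49 ms windows,
    50% overlap, Hann window)."""
    nperseg = int(0.049 * fs)
    f, pxx = welch(x, fs=fs, window="hann",
                   nperseg=nperseg, noverlap=nperseg // 2)
    p = pxx / pxx.sum()                         # normalize to unit integral
    mean = (f * p).sum()
    sd = np.sqrt(((f - mean) ** 2 * p).sum())
    skew = (((f - mean) / sd) ** 3 * p).sum()
    kurt = (((f - mean) / sd) ** 4 * p).sum()
    # entropy scaled to [0, 1] by the log of the number of frequency bins
    entropy = -(p * np.log2(p + 1e-12)).sum() / np.log2(p.size)
    cdf = np.cumsum(p)
    q25, q50, q75 = (f[np.searchsorted(cdf, q)] for q in (0.25, 0.50, 0.75))
    return dict(mean=mean, sd=sd, skew=skew, kurt=kurt,
                entropy=entropy, q25=q25, q50=q50, q75=q75)
```

For a pure tone the spectral mean and median sit at the tone frequency and the normalized entropy is low, as expected for a narrow-band sound.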
In addition to the PAF, we also used a practically invertible spectrographic representation to describe the sounds (3). The spectrogram was estimated using Gaussian-shaped windows (52 Hz wide in the spectral domain and, correspondingly, 3 ms in the time domain) and resulted in 231 frequency bands between 0 and 12 kHz and a sampling rate of 1017 Hz, yielding 357 points in time for the 350ms window used to frame each vocalization. In this representation, the vocalizations were centered within this 350ms window based on the time of the peak of their amplitude envelopes. Vocalizations shorter than this spectrographic window were padded with zeros, and vocalizations longer than this window were truncated. In this manner, all sounds could be represented by the same 357 x 231 = 82,467 feature vector. As with the PAF, we did not want to take into account the amplitude of the sound signal as an indicator of vocalizer identity. Thus, we normalized all spectrograms relative to their maximum amplitude.
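The fixed-window framing can be sketched as follows (illustrative Python; the peak of the amplitude envelope is approximated here by the spectrogram column with the largest summed energy, an assumption for simplicity):

```python
import numpy as np

def frame_spectrogram(spec, n_t=357):
    """Sketch of the fixed-window framing: center a (freq x time)
    spectrogram on its peak-amplitude column, zero-pad short calls,
    truncate long ones, and normalize by the maximum value."""
    n_f, t = spec.shape
    out = np.zeros((n_f, n_t))
    peak = int(np.argmax(spec.sum(axis=0)))   # column of envelope peak
    shift = n_t // 2 - peak                   # so `peak` lands at center
    src0, src1 = max(0, -shift), min(t, n_t - shift)
    out[:, src0 + shift: src1 + shift] = spec[:, src0:src1]
    return out / (out.max() or 1.0)           # amplitude-normalize
```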

Classifiers
The three supervised classifiers (Linear discriminant analysis or LDA, quadratic discriminant analysis or QDA and random forest or RF) were used on all the data and with the same regularization procedure. Before training each classifier, principal component analysis was applied to the feature space chosen for sound representation (PAF or spectrograms) in order to minimize over-fitting. In previous work (Elie and Theunissen, 2016), we systematically varied the number of principal components (PCs) and chose the number that gave the best performance on cross-validated data. Here, to minimize computational time and to use the same dimensionality reduction for all three classifiers, we used a prescriptive rule: the number of PCs used was equal to the square root of n/5, where n is the number of sounds used to train the classifier (this corresponds to approximately 10 degrees of freedom for each entry in the feature space covariance matrix). This dimensionality reduction step allowed us to obtain robust estimates of the stimulus covariance matrix.
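The regularization rule can be sketched as a PCA + LDA pipeline with scikit-learn (simulated data standing in for feature vectors of two vocalizers; this is our illustration, not the discriminate.py implementation):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Toy data standing in for 18-dimensional PAF vectors of two vocalizers.
n_per, n_feat = 100, 18
X = np.vstack([rng.normal(0, 1, (n_per, n_feat)),
               rng.normal(1, 1, (n_per, n_feat))])
y = np.repeat([0, 1], n_per)

# Prescriptive rule from the text: number of PCs = sqrt(n / 5).
n_pcs = max(1, int(np.sqrt(len(X) / 5)))      # sqrt(200 / 5) ≈ 6
clf = make_pipeline(PCA(n_components=n_pcs),
                    LinearDiscriminantAnalysis())
pcc = cross_val_score(clf, X, y, cv=5).mean()  # cross-validated PCC
```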
The classification performances of regularized LDA, QDA and RF were very similar (see Supplementary Fig. 1), and results from LDA only are reported in the main paper. The classifier software was based on the scikit-learn library (version 0.19.1) for Python 2.7 (http://scikit-learn.org/stable/), augmented with the dimensionality reduction and cross-validation algorithms implemented in Theunissen Lab code (discriminate.py, found in https://github.com/theunissenlab/soundsig ; tutorials in https://github.com/theunissenlab/BiosoundTutorial ).
For the analyses of vocalizer discriminability, we chose a pair-wise approach in which classifiers were trained and tested on all possible pair-wise comparisons of vocalizers. The pair-wise approach will be useful for comparing the performance described in this paper with future work (in this or other species) that investigates vocalizer or voice identity and where the number of vocalizers tested or measured will vary. Indeed, we suggest that the methodology proposed here be used as a standard approach to the study of individual recognition, so that comparative studies and meta-analyses are facilitated.
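The pair-wise approach amounts to looping over all vocalizer pairs (illustrative Python with simulated features; the real analysis also includes the PCA regularization step described above):

```python
from itertools import combinations
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Toy feature vectors for 4 "vocalizers" (simulated stand-ins).
data = {v: rng.normal(v, 1.0, (40, 5)) for v in range(4)}

def pairwise_pcc(data):
    """Train/test a classifier on every pair of vocalizers and report
    the cross-validated percent correct for each pair."""
    out = {}
    for a, b in combinations(data, 2):
        X = np.vstack([data[a], data[b]])
        y = np.array([0] * len(data[a]) + [1] * len(data[b]))
        out[(a, b)] = cross_val_score(
            LinearDiscriminantAnalysis(), X, y, cv=5).mean()
    return out

pccs = pairwise_pcc(data)
```

Each pair yields one cross-validated PCC, so reported performance does not depend on the total number of vocalizers in the dataset.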

Statistical analysis
The significance of the classifier performance for a given pair of vocalizers against chance (50%) was tested by an exact binomial test based on the number of vocalizations correctly classified and the total number of vocalizations tested in the cross-validation procedure. Discrimination for a bird pair was considered significant if p<0.05. We then performed a second exact binomial test to determine whether the proportion of significant bird pair discriminations for a particular call type was above the 5% expected from Type 1 errors alone.
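Both levels of binomial testing can be sketched with scipy (our illustration of the procedure described above; function names are ours):

```python
from scipy.stats import binomtest

def pair_significant(n_correct, n_total, alpha=0.05):
    """Exact binomial test of a pair's cross-validated accuracy vs
    chance (50%)."""
    return binomtest(n_correct, n_total, p=0.5).pvalue < alpha

def call_type_significant(n_sig_pairs, n_pairs, alpha=0.05):
    """Second-level test: are more pairs significant than the 5%
    expected from Type 1 errors alone? (one-sided)"""
    return binomtest(n_sig_pairs, n_pairs, p=alpha,
                     alternative="greater").pvalue < alpha
```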
To obtain the average performance across all bird pair comparisons and for all call types, we fitted Generalized Linear Mixed Effects models (GLME) where the response variable is the number of cross-validated trials correctly classified versus the total number tested, the fixed effect is the call type (CallType), the distribution is set to binomial and the random effect is the pair of vocalizers. The model coefficients of these GLMEs are then used as the average responses and plotted on summary plots (such as in Fig. 4B). Furthermore, the effects of call type (CallType), training set (TrainSet) and the set of acoustic features used (Feature Space) were tested for significance by likelihood ratio tests that compare the reduction in deviance in models with, as compared to without, these explanatory variables of interest to the expected reduction in deviance that would be obtained by chance (Chi-square test). GLME were fitted in R with the lme4 library. The model coefficients and their 95% confidence intervals were obtained with the R effects library. A detailed description of all the GLME tests performed is given in Supplementary Table 1.