Learning metrics on spectrotemporal modulations reveals the perception of musical instrument timbre


Humans excel at using sounds to make judgements about their immediate environment. In particular, timbre is an auditory attribute that conveys crucial information about the identity of a sound source, especially for music. While timbre has been primarily considered to occupy a multidimensional space, unravelling the acoustic correlates of timbre remains a challenge. Here we re-analyse 17 datasets from published studies between 1977 and 2016 and observe that original results are only partially replicable. We use a data-driven computational account to reveal the acoustic correlates of timbre. Human dissimilarity ratings are simulated with metrics learned on acoustic spectrotemporal modulation models inspired by cortical processing. We observe that timbre has both generic and experiment-specific acoustic correlates. These findings provide a broad overview of former studies on musical timbre and identify its relevant acoustic substrates according to biologically inspired models.
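The idea of simulating dissimilarity ratings with a learned metric can be illustrated with a toy sketch. It is not the authors' implementation: the weighted Euclidean distance, the Pearson-correlation objective, the candidate weights and the toy `features` and `ratings` are all assumptions chosen for illustration, and a real model would use high-dimensional spectrotemporal modulation representations and a proper optimizer rather than a three-candidate search.

```python
import math
from itertools import combinations


def weighted_distance(x, y, w):
    """Weighted Euclidean distance between two feature vectors."""
    return math.sqrt(sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))


def pearson(a, b):
    """Pearson correlation between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def fit_score(features, ratings, w):
    """Correlation between metric distances and ratings over all sound pairs."""
    pairs = combinations(range(len(features)), 2)
    distances = [weighted_distance(features[i], features[j], w) for i, j in pairs]
    return pearson(distances, ratings)


# Hypothetical toy data: 4 sounds described by 2 acoustic features.
features = [(0.0, 3.0), (1.0, 0.0), (2.0, 5.0), (4.0, 1.0)]
# Hypothetical ratings for pairs (0,1), (0,2), (0,3), (1,2), (1,3), (2,3).
# They were built to depend on the first feature only.
ratings = [1.0, 2.0, 4.0, 1.0, 3.0, 2.0]

# "Learning" here is a search over a few hand-picked candidate weightings.
candidates = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
best_weights = max(candidates, key=lambda w: fit_score(features, ratings, w))
print(best_weights, fit_score(features, ratings, best_weights))
```

In this toy case the search selects the weighting that ignores the second feature, which is the sense in which fitted weights can point to the acoustic correlates that matter for a given set of ratings.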


Fig. 1: Two different approaches to investigate the auditory perception of musical instrument timbre.
Fig. 2: Replicability of the MDS-based approach.
Fig. 3: Correspondence between fitted metrics and standard deviations of the stimuli.
Fig. 4: Generalizability of the metrics learned for the different datasets.

Data availability

The data that support the findings of this study are available from the corresponding author upon request and at

Code availability

Custom codes that support the findings of this study are available from the corresponding author upon request and at




This work was supported by grants from the Canadian Natural Sciences and Engineering Research Council awarded to S.M. (grant nos. RGPIN-2015-05280 and RGPAS 478121-15) and to P.D. (RGPIN-2018-05662), as well as a Canada Research Chair (grant nos. 950-223484 and 950-231872) awarded to S.M. E.T. was funded through ILCB/BLRI grant nos. ANR-16-CONV-0002 (ILCB) and ANR-11-LABX-0036 (BLRI) and the Excellence Initiative of Aix-Marseille University (A*MIDEX). B.C. was funded through an EU Marie Skłodowska-Curie fellowship (Project MIM, H2020-MSCA-IF-2014, grant agreement no. 659232). B.C. acknowledges STMS IRCAM-CNRS-Sorbonne Université in Paris, where he received support from a Marie Skłodowska-Curie research fellowship at the beginning of the project. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. We thank the authors of the studies re-analysed for providing the stimuli and data from their experiments; G. Mestdagh, E. Ponsot and B. Morillon for helpful discussions on earlier versions of the manuscript; and M. Elhilali and D. Pressnitzer for help in the initial implementation of the optimization framework.

Author information




E.T., B.C., P.D. and S.M. worked on the conceptualization and methodology, and also reviewed and edited the article. E.T. and B.C. worked on the software, formal analysis and investigation. E.T. conducted data curation, wrote the original draft and worked on the visualization. S.M. supervised the work, conducted project administration and obtained funding.

Corresponding author

Correspondence to Etienne Thoret.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Primary Handling Editor: Marike Schiffer.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Details of the datasets.

Summary of the properties of the 17 datasets from the 8 different studies: dataset name (when applicable), number of stimuli in the dataset (Nb Sounds), fundamental frequency of the stimuli in Hz (f0), number and type of participants, type of sounds, and supplemental information when applicable.

Extended Data Fig. 2 Multi-Dimensional Scaling analysis.

Squared Spearman correlations (ρ²) of the log attack time (LAT) and spectral centroid (SC) values with the positions of the stimuli along the first two dimensions of the timbre spaces, obtained with the same MDS method for all datasets. The full statistics are provided in Supplementary Table 1.
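As a reminder of what a squared Spearman correlation measures, the sketch below ranks two variables (averaging ranks over ties) and computes the Pearson correlation of the ranks. The function names and the toy values are hypothetical and are not taken from the study.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)


# Hypothetical descriptor values and coordinates on one MDS dimension.
descriptor = [-2.1, -1.5, -0.9, -0.4, -1.2]
dimension_1 = [-0.8, -0.1, 0.5, 0.9, 0.2]
rho = spearman_rho(descriptor, dimension_1)
print(rho ** 2)
```

Because only ranks enter the computation, any monotonic relation between a descriptor and a perceptual dimension gives ρ² close to 1, whatever the shape of the relation.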

Extended Data Fig. 3 Replicability of the MDS-based analyses.

Squared Spearman correlations (ρ²) of LAT and SC with the perceptual dimensions, as reported in the original studies and as recomputed here with the same MDS parameters for all datasets (see Methods). For almost all datasets, the correlations reported in the original studies are lower than those computed in this meta-analysis. The full statistics are reported in Supplementary Table 1.

Extended Data Fig. 4 Cross-validation of the metrics.

For each dataset, the metrics were cross-validated to test their generalizability within the dataset. Explained variances (r²) of the human ratings by the cross-validated metrics are presented for each dataset: the Training correlations (fitted on N−1 sounds); the Testing correlations (tested on the removed sound); the Internal consistency, that is, the correlation among the N(N−1)/2 pairs of metrics fitted within each dataset; and the Refitted correlation, the average correlation (r²) between the metric refitted on all sounds and those fitted on the N−1 subsets, which shows how far the refitted metric departs from the cross-validated ones. The median cross-validated r² on the testing sets was 0.51. Within each dataset, the metrics are highly consistent across the N folds (r²: Mdn = 0.85) and correlate strongly with the metric fitted on the whole dataset (r²: Mdn = 0.92). Because these correlations (last column) are high for every dataset, the metric fitted on all sounds can be used to perform the analyses.
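The leave-one-sound-out scheme described in this caption can be sketched as follows. The split rule is an assumption made for illustration: each fold trains on the dissimilarity pairs that do not involve the held-out sound and tests on the pairs that do. The function name `leave_one_sound_out` is hypothetical.

```python
from itertools import combinations


def leave_one_sound_out(n_sounds):
    """Yield (held_out, train_pairs, test_pairs) for each of the N folds."""
    all_pairs = list(combinations(range(n_sounds), 2))
    for held_out in range(n_sounds):
        # Assumed split: training pairs exclude the held-out sound,
        # testing pairs are the ones that include it.
        train_pairs = [p for p in all_pairs if held_out not in p]
        test_pairs = [p for p in all_pairs if held_out in p]
        yield held_out, train_pairs, test_pairs


n = 5  # toy dataset with five sounds
folds = list(leave_one_sound_out(n))

# One metric would be fitted per fold, so internal consistency compares
# every pair of fitted metrics: N * (N - 1) / 2 comparisons.
metric_pairs = list(combinations(range(len(folds)), 2))

for held_out, train_pairs, test_pairs in folds:
    print(held_out, len(train_pairs), len(test_pairs))
print(len(metric_pairs))
```

With N sounds, each fold leaves (N−1)(N−2)/2 pairs for training and N−1 pairs for testing, so the testing correlations rest on far fewer ratings than the training ones.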

Supplementary information

Supplementary Figs. 1–18 and Supplementary Tables 1–19.

Cite this article

Thoret, E., Caramiaux, B., Depalle, P. et al. Learning metrics on spectrotemporal modulations reveals the perception of musical instrument timbre. Nat Hum Behav 5, 369–377 (2021).
