Selective cortical representation of attended speaker in multi-talker speech perception


Humans possess a remarkable ability to attend to a single speaker’s voice in a multi-talker background [1, 2, 3]. How the auditory system manages to extract intelligible speech under such acoustically complex and adverse listening conditions is not known, and, indeed, it is not clear how attended speech is internally represented [4, 5]. Here, using multi-electrode surface recordings from the cortex of subjects engaged in a listening task with two simultaneous speakers, we demonstrate that population responses in non-primary human auditory cortex encode critical features of attended speech: speech spectrograms reconstructed based on cortical responses to the mixture of speakers reveal the salient spectral and temporal features of the attended speaker, as if subjects were listening to that speaker alone. A simple classifier trained solely on examples of single speakers can decode both attended words and speaker identity. We find that task performance is well predicted by a rapid increase in attention-modulated neural selectivity across both single-electrode and population-level cortical responses. These findings demonstrate that the cortical representation of speech does not merely reflect the external acoustic environment, but instead gives rise to the perceptual aspects relevant for the listener’s intended goal.
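The idea of reconstructing a spectrogram from population responses can be illustrated with a minimal sketch. It assumes a generic linear decoder fitted by ridge regression on time-lagged responses and scored by a correlation coefficient; the abstract does not specify the reconstruction method, so the function names, lag count, ridge value and synthetic data below are all hypothetical.

```python
import numpy as np

def lagged(responses, n_lags):
    """Stack time-shifted copies of responses (time x electrodes).

    Row t holds responses at times t, t+1, ..., t+n_lags-1, because the
    neural response to a sound follows the sound in time.
    """
    n_t, n_e = responses.shape
    X = np.zeros((n_t, n_e * n_lags))
    for lag in range(n_lags):
        X[:n_t - lag, lag * n_e:(lag + 1) * n_e] = responses[lag:]
    return X

def fit_reconstruction(responses, spectrogram, n_lags=5, ridge=1.0):
    """Ridge-regression weights mapping lagged responses to a spectrogram."""
    X = lagged(responses, n_lags)
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ spectrogram)

def reconstruct(responses, weights, n_lags=5):
    """Apply fitted weights to new responses."""
    return lagged(responses, n_lags) @ weights

def spectrogram_correlation(a, b):
    """Correlation coefficient between two spectrograms of equal shape."""
    return np.corrcoef(a.ravel(), b.ravel())[0, 1]

# Synthetic stand-in data (an assumption, not recordings): each electrode
# responds to a random mix of frequency channels after a 2-sample delay.
rng = np.random.default_rng(0)
n_t, n_freq, n_elec, delay = 2000, 8, 16, 2
S = rng.standard_normal((n_t, n_freq))           # "spectrogram"
M = rng.standard_normal((n_freq, n_elec))        # hypothetical tuning
R = 0.1 * rng.standard_normal((n_t, n_elec))     # response noise
R[delay:] += S[:-delay] @ M                      # delayed linear response

W = fit_reconstruction(R[:1500], S[:1500])       # fit on the first part
S_hat = reconstruct(R[1500:], W)                 # decode held-out responses
r = spectrogram_correlation(S_hat, S[1500:])
print(f"held-out reconstruction correlation: {r:.2f}")
```

In this toy setting the held-out correlation is high because the simulated responses really are a delayed linear function of the stimulus. Real cortical data would need cross-validated choices of lag window and regularisation.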

At a glance


  1. Figure 1: Acoustic and neural reconstructed spectrograms for speech from a single speaker or a mixture of speakers.

    a, b, Example acoustic waveform and auditory spectrograms of speaker one (male; a) and speaker two (female; b). c, Waveform and spectrogram of the mixture of the two show highly overlapping energy distributions. d, Difference spectrogram highlights the mixture regions where speaker one (blue) or two (red) has more acoustic energy. e, f, Neural-population-based stimulus reconstruction of speaker one (e) and speaker two (f) alone shows spectrotemporal features similar to those of the original spectrograms in a and b. g, h, The reconstructed spectrograms from the same mixture sound when attending to either speaker one (g) or two (h) closely resemble the single-speaker reconstructions shown in e and f, respectively. i, Overlay of the spectrogram contours at 50% of maximum energy from the reconstructed spectrograms in e, f, g and h.

  2. Figure 2: Quantifying the attentional modulation of neural responses.

    a, b, Correlation coefficients of reconstructed mixture spectrograms under attentional control and the corresponding single-speaker original spectrograms in correct and error trials (examples in Fig. 1g, h shown with black outline). c, d, Mean and standard error of correlation values for correct and error trials (28 mixtures). The dashed line corresponds to the average intrinsic correlation between randomly chosen original speech phrases. Brackets indicate pairwise statistical comparisons. NS, not significant. e, f, Average difference reconstructed spectrograms of speakers one and two from responses to single speaker (e) and attended mixture (f). g, Time course of average and standard error of AMI_spec of 28 mixtures for correct (black) and error (red) trials. Grey curve shows the upper bound of AMI_spec.

  3. Figure 3: Decoding spoken words and the identity of the attended speaker.

    a, Classification rate and standard deviation for spoken words (call sign, colour and number) of the attended speaker from the neural responses to the 28 mixtures. Classifiers were trained on single-speaker examples only. Colour and number of the attended speech are decoded with high accuracy (77.2% and 80.2%, P<10×10^−4, t-test) in correct trials, but not the call sign (48.0%, not significant (NS), t-test). b, In error trials, the classifier showed a systematic bias towards the words of the masker speaker (34.1%, 30.0%, 30.1%, P<10×10^−4, t-test). c, Attended speaker identification rate and standard deviation in correct for target, incorrect (for both target and masker), and correct for masker trials.

  4. Figure 4: Attentional modulation of individual electrode sites.

    a, Electrodes with a significant difference between responses to silence and speech sounds (P<0.01, t-test). b, Spectrotemporal receptive field (STRF) of this representative electrode site shows a preference for high-frequency sounds. c, Mixture difference spectrogram for a selected duration containing a high-frequency component for each speaker (circled). d, The electrode shows an increased response to high-frequency sounds of single speakers (dashed lines; the peak neural response is delayed by about 120 ms). However, the neural response to the same mixture sound in the two attention conditions (solid lines) shows an enhanced response to high-frequency sounds only for the target, whereas responses to similar sounds from the masker speaker are suppressed.
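The decoding setup described for Figure 3, in which classifiers are trained on single-speaker examples only and then applied to responses to mixtures, can be illustrated with a toy regularized least-squares classifier (the family of methods in ref. 21). This is a sketch under stated assumptions, not the authors' pipeline: the feature vectors are synthetic, the four word classes are hypothetical, and the simulated mixture features are built so that the attended word dominates and the masker word contributes at 30% strength.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_features = 4, 20        # hypothetical: 4 words, 20 neural features
templates = rng.standard_normal((n_classes, n_features))  # one pattern per word

def add_bias(X):
    """Append a constant column so the classifier can learn an offset."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def fit_rls(X, labels, n_classes, ridge=1.0):
    """Regularized least-squares classifier: ridge regression onto one-hot labels."""
    Xb = add_bias(X)
    Y = np.eye(n_classes)[labels]
    A = Xb.T @ Xb + ridge * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)

def predict(X, W):
    """Pick the class with the largest regression output."""
    return np.argmax(add_bias(X) @ W, axis=1)

# Train on simulated single-speaker trials only.
train_labels = rng.integers(0, n_classes, size=200)
X_train = templates[train_labels] + 0.3 * rng.standard_normal((200, n_features))
W = fit_rls(X_train, train_labels, n_classes)

# Test on simulated mixtures. Assumption: the attended (target) word dominates
# the response and the masker word leaks in at reduced strength.
target = rng.integers(0, n_classes, size=100)
masker = (target + rng.integers(1, n_classes, size=100)) % n_classes  # never equals target
X_mix = (templates[target] + 0.3 * templates[masker]
         + 0.3 * rng.standard_normal((100, n_features)))

pred = predict(X_mix, W)
accuracy = np.mean(pred == target)
print(f"fraction of mixtures decoded as the attended word: {accuracy:.2f}")
```

Because the classifier never sees a mixture during training, any above-chance decoding of the attended word in this toy comes entirely from the assumed dominance of the target pattern in the mixture features.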


References

  1. Cherry, E. C. Some experiments on the recognition of speech, with one and with two ears. J. Acoust. Soc. Am. 25, 975–979 (1953)
  2. Shinn-Cunningham, B. G. Object-based auditory and visual attention. Trends Cogn. Sci. 12, 182–186 (2008)
  3. Bregman, A. S. Auditory Scene Analysis: The Perceptual Organization of Sound (MIT Press, 1994)
  4. Kerlin, J., Shahin, A. & Miller, L. Attentional gain control of ongoing cortical speech representations in a “cocktail party”. J. Neurosci. 30, 620–628 (2010)
  5. Besle, J. et al. Tuning of the human neocortex to the temporal dynamics of attended events. J. Neurosci. 31, 3176–3185 (2011)
  6. Bee, M. & Micheyl, C. The cocktail party problem: what is it? How can it be solved? And why should animal behaviorists study it? J. Comparative Psychol. 122, 235–252 (2008)
  7. Shinn-Cunningham, B. G. & Best, V. Selective attention in normal and impaired hearing. Trends Amplif. 12, 283–299 (2008)
  8. Scott, S. K., Rosen, S., Beaman, C. P., Davis, J. P. & Wise, R. J. S. The neural processing of masked speech: evidence for different mechanisms in the left and right temporal lobes. J. Acoust. Soc. Am. 125, 1737–1743 (2009)
  9. Elhilali, M., Xiang, J., Shamma, S. A. & Simon, J. Z. Interaction between attention and bottom-up saliency mediates the representation of foreground and background in an auditory scene. PLoS Biol. 7, e1000129 (2009)
  10. Chang, E. F. et al. Categorical speech representation in human superior temporal gyrus. Nature Neurosci. 13, 1428–1432 (2010)
  11. Crone, N. E., Boatman, D., Gordon, B. & Hao, L. Induced electrocorticographic gamma activity during auditory perception. Clin. Neurophysiol. 112, 565–582 (2001)
  12. Steinschneider, M., Fishman, Y. I. & Arezzo, J. C. Spectrotemporal analysis of evoked and induced electroencephalographic responses in primary auditory cortex (A1) of the awake monkey. Cereb. Cortex 18, 610–625 (2008)
  13. Scott, S. K. & Johnsrude, I. S. The neuroanatomical and functional organization of speech perception. Trends Neurosci. 26, 100–107 (2003)
  14. Hackett, T. A. Information flow in the auditory cortical network. Hear. Res. 271, 133–146 (2011)
  15. Bolia, R. S., Nelson, W. T., Ericson, M. A. & Simpson, B. D. A speech corpus for multitalker communications research. J. Acoust. Soc. Am. 107, 1065–1066 (2000)
  16. Brungart, D. S. Informational and energetic masking effects in the perception of two simultaneous talkers. J. Acoust. Soc. Am. 109, 1101–1109 (2001)
  17. Mesgarani, N., David, S. V., Fritz, J. B. & Shamma, S. A. Influence of context and behavior on stimulus reconstruction from neural activity in primary auditory cortex. J. Neurophysiol. 102, 3329–3339 (2009)
  18. Bialek, W., Rieke, F., de Ruyter van Steveninck, R. R. & Warland, D. Reading a neural code. Science 252, 1854–1857 (1991)
  19. Pasley, B. N. et al. Reconstructing speech from human auditory cortex. PLoS Biol. 10, e1001251 (2012)
  20. Garofolo, J. S. et al. TIMIT Acoustic-Phonetic Continuous Speech Corpus (Linguistic Data Consortium, 1993)
  21. Rifkin, R., Yeo, G. & Poggio, T. Regularized least-squares classification. Nato Science Series Sub Series III Computer and Systems Sciences 190, 131–154 (2003)
  22. Formisano, E., De Martino, F., Bonte, M. & Goebel, R. “Who” is saying “what”? Brain-based decoding of human voice and speech. Science 322, 970–973 (2008)
  23. Staeren, N., Renvall, H., De Martino, F., Goebel, R. & Formisano, E. Sound categories are represented as distributed patterns in the human auditory cortex. Curr. Biol. 19, 498–502 (2009)
  24. Shamma, S. A., Elhilali, M. & Micheyl, C. Temporal coherence and attention in auditory scene analysis. Trends Neurosci. 34, 114–123 (2010)
  25. Darwin, C. J. Auditory grouping. Trends Cogn. Sci. 1, 327–333 (1997)
  26. Warren, R. M. Perceptual restoration of missing speech sounds. Science 167, 392–393 (1970)
  27. Kidd, G., Jr, Arbogast, T. L., Mason, C. R. & Gallun, F. J. The advantage of knowing where to listen. J. Acoust. Soc. Am. 118, 3804–3815 (2005)
  28. Shen, W., Olive, J. & Jones, D. Two protocols comparing human and machine phonetic discrimination performance in conversational speech. INTERSPEECH 1630–1633 (2008)
  29. Cooke, M., Hershey, J. R. & Rennie, S. J. Monaural speech separation and recognition challenge. Comput. Speech Lang. 24, 1–15 (2010)


Author information


  1. Departments of Neurological Surgery and Physiology, UCSF Center for Integrative Neuroscience, University of California, San Francisco, California 94143, USA

    • Nima Mesgarani &
    • Edward F. Chang


N.M. and E.F.C. designed the experiment, collected the data, evaluated results and wrote the manuscript.

Competing financial interests

The authors declare no competing financial interests.

Supplementary information

PDF files

  1. Supplementary Figures (1.4M)

    This file contains Supplementary Figures 1-3.