Abstract
Human speech sounds are produced through a coordinated movement of structures along the vocal tract. Here we show highly structured neuronal encoding of vowel articulation. In medial–frontal neurons, we observe highly specific tuning to individual vowels, whereas superior temporal gyrus neurons have nonspecific, sinusoidally modulated tuning (analogous to motor cortical directional tuning). At the neuronal population level, a decoding analysis reveals that the underlying structure of vowel encoding reflects the anatomical basis of articulatory movements. This structured encoding enables accurate decoding of volitional speech segments and could be applied in the development of brain–machine interfaces for restoring speech in paralysed individuals.
Introduction
The articulatory features that distinguish different vowel sounds are conventionally described along a two-dimensional coordinate system that naturally represents the position of the highest point of the tongue during articulation, for example, in the International Phonetic Alphabet (IPA) chart1. The two axes of this system are height (tongue vertical position relative to the roof of the mouth or the aperture of the jaw) and backness (tongue position relative to the back of the mouth). How is this structured articulatory production encoded and controlled in brain circuitry? The gross functional neuroanatomy of speech production has been described by multiple imaging, lesion and stimulation studies2,3, and includes primary, supplementary and pre-motor areas, Broca's area, superior temporal gyrus (STG), anterior cingulate cortex (ACC) and other medial–frontal regions2,3,4. The temporal dynamics of collective neural activity have been studied in Broca's area using local field potentials2,5. However, the basic encoding of speech features in the firing patterns of neuronal populations remains unknown.
Here, we study the neuronal encoding of vowel articulation in the human cerebrum at both the single-unit and the neuronal population levels. At the single-neuron level, we find signatures of two structured coding strategies: highly specific, sharp tuning to individual vowels (in medial–frontal neurons) and nonspecific, sinusoidally modulated tuning (in the STG). At the neural population level, we find that the encoding of vowels reflects the underlying articulatory movement structure. These findings may have important implications for the development of high-accuracy brain–machine interfaces for the restoration of speech in paralysed individuals.
Results
Speech-related neurons
Neuronal responses in human temporal and frontal lobes were recorded from 11 patients with intractable epilepsy monitored with intracranial depth electrodes to identify seizure foci for potential surgical treatment (see Methods). Following an auditory cue, subjects uttered one of five vowels (a/a/, e/ε/, i/i/, o/o/ and u/u/) or simple syllables containing these vowels (consonant+vowel: da/da/, de/dε/, di/di/, do/do/ and du/du/...). We recorded the activity of 716 temporal and frontal lobe units. As this study focuses on speech, and owing to the inherent difficulty of distinguishing between auditory- and speech-related neuronal activations, we analysed only the 606 units that did not respond to auditory stimuli. A unit was considered speech-related if its firing rate during speech differed significantly from the pre-cue baseline period (see Methods). Overall, 8% of the analysed units (49) were speech-related, of which more than half (25) were vowel-tuned, showing significantly different activation for the five vowels (see Supplementary Fig. S1).
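For readers who wish to see how such a selection step can be implemented, the following is a minimal sketch, not the authors' original code: it assumes spike counts have already been binned into per-trial baseline and response firing rates for each unit, and applies a paired t-test with Benjamini–Hochberg false discovery rate control, following the criteria given in Methods.

```python
# Illustrative sketch (not the authors' code): selecting "speech-related" units by
# comparing per-trial baseline and response firing rates with a paired t-test,
# then controlling the false discovery rate across units (Benjamini-Hochberg).
import numpy as np
from scipy import stats

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level q."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    m = len(p)
    thresholds = q * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    mask = np.zeros(m, dtype=bool)
    mask[order[:k]] = True
    return mask

def speech_related_units(baseline_rates, response_rates, q=0.05):
    """baseline_rates, response_rates: per-unit arrays of per-trial firing rates."""
    pvals = [stats.ttest_rel(resp, base).pvalue
             for base, resp in zip(baseline_rates, response_rates)]
    return benjamini_hochberg(pvals, q=q)
```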
Sharp and broad vowel tuning
Two areas commonly activated in speech studies2, the STG and a medial–frontal region overlying the rostral anterior cingulate and the adjacent medial orbitofrontal cortex (rAC/MOF; Brodmann areas 11 and 12; see Supplementary Fig. S2 for anatomical locations of the electrodes), had the highest proportions of speech-related (75% and 11%, respectively) and vowel-tuned units (58% and 77% of these units). In imaging and electrocorticography studies, the rACC was shown to participate in speech control2,6, the orbitofrontal cortex in speech comprehension and reading7, and the STG in speech production at the phoneme level8. Involvement of STG neurons in speech production was also observed in earlier single-unit recordings in humans9. We analysed neuronal tuning in these two areas and found that it had divergent characteristics: broadly tuned units that responded to all vowels, with a gradual modulation in the firing rate between vowels, comprised 93% of tuned units in STG (13/14) but were not found in rAC/MOF (0/10), whereas sharply tuned units that had significant activation exclusively for one or two vowels comprised 100% of the tuned rAC/MOF units (10/10) but were rare in STG (1/14).
Figure 1 displays responses of five sharply tuned units in rAC/MOF, each exhibiting strong, robust increases in their firing rate specifically for one or two vowel sounds, whereas for the other vowels firing remains at the baseline rate. For example, a single unit in the right rostral anterior cingulate cortex (rACC) (Fig. 1, top row) elevated its firing rate to an average of 97 spikes/s when the patient said 'a', compared with 6 spikes/s for 'i', 'e', 'o' and 'u' (P<10−13, one-sided two-sample t-test). Anecdotally, in the first two trials of this example (red arrow) the firing rate remained at the baseline level, unlike the rest of the 'a' trials; in these two trials, the patient wrongly said 'ah' rather than 'a' (confirmed by the sound recordings).
Figure 1: Raster plots and peri-stimulus time histograms of five units during the utterance of the five vowels a, e, i, u and o. For each of the units, significant change in firing rate from the baseline occurred for one or two vowels only (Methods). Red vertical dashed lines indicate speech onset. All vertical scale bars correspond to firing rates of 20 spikes/s.
A completely different encoding of vowels was found in the STG, where the vast majority of tuned units exhibited broad variation of the response over the vowel space, during the articulation of both vowels (Fig. 2a) and simple syllables containing these vowels (Supplementary Fig. S3a). This structured variation is well approximated by sinusoidal tuning curves (Fig. 2b and Supplementary Fig. S3b) analogous to the directional tuning curves commonly observed in motor cortical neurons10. Units shown in Fig. 2 had maximal responses ('preferred vowel', in analogy to 'preferred direction') to the vowels 'i' and 'u', which correspond to a closed articulation where the tongue is maximally raised, and minimal ('anti-preferred') response to 'a' and 'o' where it is lowered.
Figure 2: (a) Raster plots and peri-stimulus time histograms during the utterance of the five vowels a, e, i, u and o. Significant change in firing rate from the baseline occurred for all or most vowels, with modulated firing rate (Methods). Red vertical dashed lines indicate speech onset; vertical bars, 10 spikes/s. (b) Tuning curves of the respective units in a over the vowel space, showing orderly variation in the firing rate of STG units with the articulated vowel.
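As an illustration of how such broad tuning can be quantified, the sketch below fits the cosine tuning form given in Methods, a + b·cos(c + 2πi/5), to the mean firing rate per vowel and reports the coefficient of determination. The function names and the use of scipy's curve_fit are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch: fitting a cosine ("sinusoidal") tuning curve over the five
# vowels, in analogy to directional tuning fits in motor cortex. Assumes the mean
# firing rate per vowel has already been computed; not the authors' original code.
import numpy as np
from scipy.optimize import curve_fit

VOWELS = ["a", "e", "i", "u", "o"]  # ordering used for the tuning curves

def cosine_tuning(i, a, b, c):
    """Firing-rate model: a + b*cos(c + 2*pi*i/5), with i = 0..4 indexing the vowel."""
    return a + b * np.cos(c + 2 * np.pi * i / 5)

def fit_tuning_curve(mean_rates):
    """mean_rates: 5 mean firing rates, one per vowel in VOWELS order."""
    mean_rates = np.asarray(mean_rates, dtype=float)
    i = np.arange(5, dtype=float)
    (a, b, c), _ = curve_fit(cosine_tuning, i, mean_rates,
                             p0=[mean_rates.mean(), 1.0, 0.0])
    predicted = cosine_tuning(i, a, b, c)
    ss_res = np.sum((mean_rates - predicted) ** 2)
    ss_tot = np.sum((mean_rates - mean_rates.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return (a, b, c), r2  # units with r2 > 0.7 would count as broadly tuned
```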
Population-level decoding and structure
Unlike directional tuning curves, where angles are naturally ordered, vowels can have different orderings. In the tuning curves of Fig. 2, we ordered the vowels according to their place and manner of articulation as expressed by their location in the IPA chart1, but is this ordering natural to the neural representation? Instead of assuming a certain ordering, we can try to deduce the natural organization of speech features represented in the population-level neural code. That is, we can try to infer a neighbourhood structure (or order) of the vowels in which similar (neighbouring) neuronal representations are used for neighbouring vowels. We reasoned that this neighbourhood structure could be extracted from the error structure of neuronal classifiers: when a decoder, such as the population vector11, errs, it is more likely to prefer a value that is a neighbour of the correct value than a more distant one. Thus, classification error rates are expected to be higher between neighbours than between distant features when the feature ordering accurately reflects the neighbourhood structure of the neural representation. In this case, the classification error rates expressed by the classifier's confusion matrix will have a band-diagonal structure.
To apply this strategy, we decoded the population firing patterns using multivariate linear classifiers with a sparsity constraint to infer the uttered vowel (Methods). The five vowels were decoded with a high average (cross-validated) accuracy of 93% (significantly above the 20% chance level, P<10−5, one-sided one-sample t-test, n=6 cross-validation runs; Supplementary Table S1), and up to 100% when decoding pairs of vowels (Fig. 3a). Next, we selected the vowel ordering that leads to band-diagonal confusion matrices (Fig. 3b). Interestingly, this ordering is consistent across different neuronal subpopulations (Fig. 3b and Supplementary Fig. S4) and exactly matches the organization of vowels according to their place and manner of articulation as reflected by the IPA chart (Fig. 3c). As the vowel chart represents the position of the highest point of the tongue during articulation, the natural organization of speech features by neuronal encoding reflects a functional spatial-anatomical axis in the mouth.
Figure 3: (a) Average decoding accuracy (±s.e.) versus the number of decoded vowel classes. Red dashed lines represent the chance level. (b) Confusion matrix for decoding population activity of all analysed units to infer the uttered vowels. The band-diagonal structure indicates adjacency of vowels in the order a, e, i, u and o in the neural representation. High confusion in the corner (between u and i) implies cyclicity. (c) The IPA vowel chart, representing the highest point of the tongue during articulation, overlaid on a vocal tract diagram. The inferred connections (blue lines) demonstrate neuronal representation of articulatory physiology.
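One possible way to implement the ordering step is sketched below, under the assumption that a 5×5 confusion matrix is already available: enumerate all orderings of the five vowels and keep the one that concentrates the off-diagonal confusion mass on cyclically adjacent classes, that is, the most band-diagonal arrangement. This is an illustrative reconstruction, not the authors' code.

```python
# Illustrative sketch: choosing the class ordering under which a confusion matrix
# is most nearly band-diagonal. Scores every permutation of the five vowels by the
# confusion mass falling on (cyclically) adjacent classes; not the authors' code.
import numpy as np
from itertools import permutations

def band_diagonal_score(conf, order):
    """Sum of off-diagonal confusion mass between neighbours in the given cyclic order."""
    conf = np.asarray(conf, dtype=float)
    score = 0.0
    n = len(order)
    for k in range(n):
        a, b = order[k], order[(k + 1) % n]
        score += conf[a, b] + conf[b, a]
    return score

def best_ordering(conf, labels=("a", "e", "i", "u", "o")):
    """Return the vowel ordering that maximizes adjacency of confusions."""
    n = len(labels)
    best = max(permutations(range(n)), key=lambda p: band_diagonal_score(conf, p))
    return [labels[i] for i in best]
```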
Discussion
These results suggest that speech-related rAC/MOF neurons use sparse coding for vowels, in analogy to the sparse bursts in songbirds' area HVc12 and to the sparse, highly selective responses observed in the human medial–temporal lobe13. In contradistinction, the gradually modulated speech encoding in STG reveals a previously unrecognized correlate of a hallmark of motor cortical control, broad sinusoidal tuning, and implies a role in the motor control of speech production9. Interestingly, speech encoding in these anatomical areas is opposite in nature to that of other modalities: broad tuning for motor control is common in the frontal lobe10 (versus STG in the temporal lobe), whereas sparse tuning to high-level concepts is common in the temporal lobe13 (versus rAC/MOF in the frontal lobe). Analogous to the recently found subpathway between the visual dorsal and ventral streams14, our findings may lend support to a speech-related 'dorsal stream' in which sensorimotor prediction supports speech production through state feedback control3. The sparse rAC/MOF representation may serve as a predictor state, in line with anterior cingulate15 and orbitofrontal16 roles in reward prediction. The broad STG tuning may support evidence that the motor system is capable of modulating the perception system to some degree3,17,18.
Our finding of sharply tuned neurons in rAC/MOF agrees with the DIVA model of the human speech system19, which suggests that higher-level prefrontal cortex regions involved in phonological encoding of an intended utterance sequentially activate speech sound map neurons that correspond to the syllables to be produced. Activation of these neurons leads to the readout of feedforward motor commands to the primary motor cortex. Owing to orbitofrontal connections to both STG20 and ventral pre-motor cortex21, rAC/MOF neurons may also participate in the feedback control map, where sharply tuned neurons may provide a high-level 'discrete' representation of the sound to utter, based on STG input from the auditory error map, before low-level commands are sent to the articulator velocity and position maps via the ventral pre-motor cortex. Our broadly tuned STG units may also be part of the transition from the auditory error map to the feedback control map, providing a lower-level 'continuous' population representation of the sound to utter.
Our results further demonstrate that the neuronal population encoding of vowel generation appears to be organized according to a functional representation of a spatial-anatomical axis: tongue height. This axis was shown to have a significant main effect on decoding from speech motor cortex units22. Whether these structured multi-level encoding schemes also exist in other speech areas, such as Broca's area2 and the speech motor cortex, and how they contribute to the coordinated production of speech are important open questions. Nevertheless, the observed structured encoding naturally facilitates high-fidelity decoding of volitional speech segments and may have implications for restoring speech faculties in individuals who are completely paralysed or 'locked-in'.23,24,25,26,27,28
Methods
Patients and electrophysiology
Eleven patients with pharmacologically resistant epilepsy undergoing invasive monitoring with intracranial depth electrodes to identify the seizure focus for potential surgical treatment29 (9 right handed, 7 females, ages 19–53) participated in a total of 14 recording sessions, each on a different day. Electrode placement was based exclusively on clinical criteria; each patient was implanted with 8–12 electrodes for 1–2 weeks, each terminating in a set of nine 40-μm platinum–iridium microwires. Electrode locations were verified by magnetic resonance imaging or by computed tomography coregistered to preoperative magnetic resonance imaging. Bandpass-filtered signals (0.3–3 kHz) from these microwires and the sound track were synchronously recorded at 30 kHz using a 128-channel acquisition system (Neuroport, Blackrock). Sorted units (WaveClus30 and SUMU31) recorded in different sessions are treated as different units in this study. All studies conformed to the guidelines of the Medical Institutional Review Board at the University of California Los Angeles.
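For illustration only, the sketch below shows the kind of 0.3–3 kHz bandpass that such 30-kHz microwire recordings typically undergo before spike detection; the filter order and the use of zero-phase filtering are assumptions, not a description of the acquisition system used here.

```python
# Illustrative sketch: a 0.3-3 kHz bandpass for 30 kHz microwire recordings prior
# to spike detection. Filter order and zero-phase filtering are assumptions for
# illustration, not a description of the acquisition hardware.
from scipy.signal import butter, filtfilt

def bandpass_spikes(raw, fs=30000.0, low=300.0, high=3000.0, order=4):
    """Zero-phase Butterworth bandpass of a raw voltage trace (1-D array)."""
    nyq = fs / 2.0
    b, a = butter(order, [low / nyq, high / nyq], btype="band")
    return filtfilt(b, a, raw)
```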
Experimental paradigms
Patients first listened to isolated auditory cues (beeps) and to another individual uttering the vowel sounds and three syllables (me/lu/ha) following beeps (auditory controls). Then, following an oral instruction, patients uttered the instructed syllable multiple times, each utterance following a randomly spaced (2–3 s) beep. Syllables consisted of either a monophthongal vowel (a/a/, e/ε/, i/i/, o/o/ and u/u/) or a consonant (one of: d/d/, g/g/, h/h/, j/dε/, l/l/, m/m/, n/n/, r//, s/s/ and v/v/) followed by one of these vowels (for example, da/da/, de/dε/, di/di/, do/do/ and du/du/)1. For simplicity, this paper employs the English rather than the IPA transcription as described above. All sessions were conducted at the patient's quiet bed-side.
Data analysis
Of the 716 recorded units, we analysed 606 that were not responsive during any auditory control (rAC and adjacent MOF cortex (rAC/MOF): 123 of 156; dorsal and subcallosal ACC: 68/72; entorhinal cortex: 124/138; hippocampus: 103/114; amygdala: 92/106; parahippocampal gyrus: 64/66; and STG: 32/64). The anatomical subdivisions of the ACC follow McCormick et al.32 Owing to clinical considerations29, no electrode was placed in the primary or pre-motor cortex in this patient population. Each brain region was recorded in at least three subjects.

A unit is considered speech-related when its firing rate differs significantly between the baseline ([−1,000, 0] ms relative to the beep) and the response ([0, 200] ms relative to speech onset; paired t-test, P<0.05, adjusted for false discovery rate33 control over multiple units and vowels, q<0.05; n ranges between 6 and 12 trials depending on the session). For these units, we found the maximal response among the four 100-ms bins starting 100 ms before speech onset, and computed mean firing rates in a 300-ms window around this bin. Tuned units are speech-related units for which the mean firing rate differs significantly between the five vowel groups (analysis of variance; F-test, P<0.05, false discovery rate33 adjusted, q<0.05; n between 6 and 12 for each group). Broad, sinusoidally tuned units are tuned units whose firing rate is well fit by a + b·cos(c + 2πi/5) (where i=0,...,4 is the index of the vowel in the order a, e, i, u, o)10, with coefficient of determination R2>0.7. Sharply tuned units are tuned units for which the mean firing rate in the three vowel groups with the lowest mean firing rates is the same with high probability (analysis of variance; F-test, P>0.1; n between 6 and 12 for each group).

The vowel decoder is a regularized multivariate linear solver, which minimizes ||x||1 subject to ||Ax−b||2 ≤ σ (the basis pursuit denoising problem34). It has superior decoding performance and speed relative to neuron-dropping decoders35. A contains the feature inputs to the decoder: spike counts of all units in a baseline bin ([−1,000, 0] ms relative to the beep) and in two 100-ms response bins that followed speech onset; b contains 5-element binary vectors coding the individual vowels uniquely. All decoding results were sixfold cross-validated using trials that were not used for decoder training. The decoder was trained on all of the aforementioned features from the training set only, with no selection of the input neurons or their features; instead, the sparse decoder automatically selects task-relevant features by allocating them higher weights under the minimal ||x||1 constraint, whereas task-unrelated features are suppressed by low weights. Owing to the high decoding accuracy, we randomly dropped 20% of the units (in each cross-validation training) when computing confusion matrices, to increase the number of confusions and allow the extraction of a meaningful band-diagonal structure (except for the STG-only training, Supplementary Fig. S4).
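The decoding pipeline can be approximated with standard tools. The sketch below substitutes scikit-learn's L1-penalized Lasso regression for the constrained basis-pursuit-denoising solver used in this study, with one-vs-all binary target vectors and sixfold cross-validation; it is a stand-in under these stated assumptions, not the authors' pipeline, and the regularization strength alpha is an illustrative choice.

```python
# Illustrative sketch of the sparse linear vowel decoder. The Methods solve the
# constrained basis-pursuit-denoising problem (min ||x||_1 s.t. ||Ax - b||_2 <= sigma);
# here an L1-penalized Lasso regression is used as a stand-in, with one-vs-all binary
# target vectors and sixfold cross-validation. Not the authors' original pipeline.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

def decode_vowels(A, vowel_labels, n_classes=5, alpha=0.01, n_folds=6):
    """A: trials x features spike-count matrix; vowel_labels: integer class per trial."""
    A = np.asarray(A, dtype=float)
    vowel_labels = np.asarray(vowel_labels, dtype=int)
    y = np.eye(n_classes)[vowel_labels]          # binary vector coding each vowel
    confusion = np.zeros((n_classes, n_classes))
    accuracies = []
    for train, test in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(A):
        model = Lasso(alpha=alpha, max_iter=10000).fit(A[train], y[train])
        pred = np.argmax(model.predict(A[test]), axis=1)
        accuracies.append(np.mean(pred == vowel_labels[test]))
        for true, p in zip(vowel_labels[test], pred):
            confusion[true, p] += 1              # accumulate cross-validated confusions
    return np.mean(accuracies), confusion
```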
The vowels in Fig. 3c were placed on the IPA chart according to the locations previously calculated for American speakers (ref. 1, page 42), and the overlaid connections (blue lines) were inferred by the maximal non-diagonal element for each row and each column of the confusion matrix.
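A minimal sketch of that inference rule, assuming the cross-validated confusion matrix from the decoding step, is given below; it is illustrative, not the authors' code.

```python
# Illustrative sketch: inferring the adjacency links overlaid on the IPA chart from a
# confusion matrix, by taking the maximal off-diagonal entry of each row and each
# column, as described above. Not the authors' original code.
import numpy as np

def inferred_links(conf, labels=("a", "e", "i", "u", "o")):
    """Return the set of vowel pairs connected by blue lines on the chart."""
    conf = np.asarray(conf, dtype=float).copy()
    np.fill_diagonal(conf, -np.inf)              # ignore correct classifications
    links = set()
    for i in range(conf.shape[0]):
        links.add(tuple(sorted((labels[i], labels[int(np.argmax(conf[i, :]))]))))
        links.add(tuple(sorted((labels[i], labels[int(np.argmax(conf[:, i]))]))))
    return links
```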
Additional information
How to cite this article: Tankus, A. et al. Structured neuronal encoding and decoding of human speech features. Nat. Commun. 3:1015 doi: 10.1038/ncomms1995 (2012).
References
International Phonetic Association. Handbook of the International Phonetic Association (Cambridge University Press, 1999).
Sahin, N. T., Pinker, S., Cash, S. S., Schomer, D. & Halgren, E. Sequential processing of lexical, grammatical, and phonological information within Broca's area. Science 326, 445–449 (2009).
Hickok, G., Houde, J. & Rong, F. Sensorimotor integration in speech processing: computational basis and neural organization. Neuron 69, 407–422 (2011).
Ghosh, S. S., Tourville, J. A. & Guenther, F. H. A neuroimaging study of premotor lateralization and cerebellar involvement in the production of phonemes and syllables. J. Speech Lang. Hear. Res. 51, 1183–1202 (2008).
Halgren, E. et al. Spatiotemporal stages in face and word-processing. 2. Depth- recorded potentials in the human frontal and rolandic cortices. J. Physiol. Paris 88, 51–80 (1994).
Paus, T., Petrides, M., Evans, A. C. & Meyer, E. Role of the human anterior cingulate cortex in the control of oculomotor, manual, and speech responses: a positron emission tomography study. J. Neurophysiol. 70, 453–469 (1993).
Sabri, M. et al. Attentional and linguistic interactions in speech perception. Neuroimage 39, 1444–1456 (2008).
Buchsbaum, B. R., Hickok, G. & Humphries, C. Role of left posterior superior temporal gyrus in phonological processing for speech perception and production. Cognitive Sci. 25, 663–678 (2001).
Ojemann, G. A., Creutzfeldt, O., Lettich, E. & Haglund, M. M. Neuronal activity in human lateral temporal cortex related to short-term verbal memory, naming and reading. Brain 111 (Pt 6), 1383–1403 (1988).
Georgopoulos, A. P., Kalaska, J. F., Caminiti, R. & Massey, J. T. On the relations between the direction of two-dimensional arm movements and cell discharge in primate motor cortex. J. Neurosci. 2, 1527–1537 (1982).
Georgopoulos, A. P., Schwartz, A. B. & Kettner, R. E. Neuronal population coding of movement direction. Science 233, 1416–1419 (1986).
Hahnloser, R. H. R., Kozhevnikov, A. A. & Fee, M. S. An ultra-sparse code underlies the generation of neural sequences in a songbird. Nature 419, 65–70 (2002).
Quian-Quiroga, R., Reddy, L., Kreiman, G., Koch, C. & Fried, I. Invariant visual representation by single neurons in the human brain. Nature 435, 1102–1107 (2005).
Tankus, A. & Fried, I. Visuomotor coordination and motor representation by human temporal lobe neurons. J. Cogn. Neurosci. 24, 600–610 (2011).
Hayden, B. Y., Heilbronner, S. R., Pearson, J. M. & Platt, M. L. Surprise signals in anterior cingulate cortex: neuronal encoding of unsigned reward prediction errors driving adjustment in behavior. J. Neurosci. 31, 4178–4187 (2011).
Wallis, J. D. Cross-species studies of orbitofrontal cortex and value-based decision-making. Nat. Neurosci. 15, 13–19 (2012).
Hickok, G. & Poeppel, D. The cortical organization of speech processing. Nat. Rev. Neurosci. 8, 393–402 (2007).
Hickok, G. Computational neuroanatomy of speech production. Nat. Rev. Neurosci. 13, 135–145 (2012).
Guenther, F. H. & Vladusich, T. A neural theory of speech acquisition and production. J. Neurolinguistics 25, 408–422 (2012).
Cavada, C., Compañy, T., Tejedor, J., Cruz-Rizzolo, R. J. & Reinoso-Suárez, F. The anatomical connections of the macaque monkey orbitofrontal cortex. A review. Cereb. Cortex 10, 220–242 (2000).
Carmichael, S. T. & Price, J. L. Sensory and premotor connections of the orbital and medial prefrontal cortex of macaque monkeys. J. Comp. Neurol. 363, 642–664 (1995).
Brumberg, J. S., Wright, E. J., Andreasen, D. S., Guenther, F. H. & Kennedy, P. R. Classification of intended phoneme production from chronic intracortical microelectrode recordings in speech-motor cortex. Front Neurosci. 5, 65 (2011).
Guenther, F. H. et al. A wireless brain-machine interface for real-time speech synthesis. PLoS ONE 4, e8218 (2009).
Brumberg, J. S., Nieto-Castanon, A., Kennedy, P. R. & Guenther, F. H. Brain-computer interfaces for speech communication. Speech Commun. 52, 367–379 (2010).
Pasley, B. N. et al. Reconstructing speech from human auditory cortex. PLoS Biol. 10, e1001251 (2012).
Pei, X., Barbour, D. L., Leuthardt, E. C. & Schalk, G. Decoding vowels and consonants in spoken and imagined words using electrocorticographic signals in humans. J. Neural Eng. 8, 046028 (2011).
Leuthardt, E. C. et al. Using the electrocorticographic speech network to control a brain-computer interface in humans. J. Neural Eng. 8, 036004 (2011).
Formisano, E., De Martino, F., Bonte, M. & Goebel, R. 'Who' is saying 'what'? Brain-based decoding of human voice and speech. Science 322, 970–973 (2008).
Fried, I. et al. Cerebral microdialysis combined with single-neuron and electroencephalographic recording in neurosurgical patients. J. Neurosurg. 91, 697–705 (1999).
Quian-Quiroga, R., Nadasdy, Z. & Ben-Shaul, Y. Unsupervised spike detection and sorting with wavelets and superparamagnetic clustering. Neural Comput. 16, 1661–1687 (2004).
Tankus, A., Yeshurun, Y. & Fried, I. An automatic measure for classifying clusters of suspected spikes into single cells versus multiunits. J. Neural Eng. 6, 056001 (2009).
McCormick, L. M. et al. Anterior cingulate cortex: an MRI-based parcellation method. NeuroImage 32, 1167–1175 (2006).
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Methodological 57, 289–300 (1995).
van den Berg, E. & Friedlander, M. P. Probing the pareto frontier for basis pursuit solutions. SIAM J. Sci. Comput. 31, 890–912 (2008).
Tankus, A., Fried, I. & Shoham, S. Sparse decoding of multiple spike trains for brain–machine interfaces. J. Neural Eng. 9 (in press) (2012).
Acknowledgements
We thank D. Pourshaban, E. Behnke, T. Fields and Prof. P. Keating of UCLA, and Prof. A. Cohen and A. Alfassy of the Technion for assistance, and the European Research Council (STG 211055), NINDS, Dana Foundation, Lady Davis and L. and L. Richmond research funds for financial support.
Author information
Contributions
A.T., I.F. and S.S. designed the study and wrote the manuscript; I.F. performed neurosurgeries; A.T. prepared experimental setup and performed the experiments; A.T. and S.S. analysed the data.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Figures S1-S4 and Supplementary Table S1 (PDF 885 kb)