Technology that translates neural activity into speech would be transformative for people who are unable to communicate as a result of neurological impairments. Decoding speech from neural activity is challenging because speaking requires very precise and rapid multi-dimensional control of vocal tract articulators. Here we designed a neural decoder that explicitly leverages kinematic and sound representations encoded in human cortical activity to synthesize audible speech. Recurrent neural networks first decoded directly recorded cortical activity into representations of articulatory movement, and then transformed these representations into speech acoustics. In closed vocabulary tests, listeners could readily identify and transcribe speech synthesized from cortical activity. Intermediate articulatory dynamics enhanced performance even with limited data. Decoded articulatory representations were highly conserved across speakers, enabling a component of the decoder to be transferrable across participants. Furthermore, the decoder could synthesize speech when a participant silently mimed sentences. These findings advance the clinical viability of using speech neuroprosthetic technology to restore spoken communication.
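The two-stage structure described above (cortical activity to articulatory movement, then articulatory movement to acoustics) can be sketched as a composition of two recurrent mappings. This is only an illustration of the data flow: the weights are random and untrained, the simple tanh recurrence stands in for whatever recurrent architecture the authors trained, and the electrode count, hidden size and function names are hypothetical. The 33 kinematic and 32 acoustic feature counts are taken from the Extended Data captions.

```python
import math
import random


def make_recurrent_layer(n_in, n_hidden, n_out, rng):
    """Random (untrained) weights for one simple recurrent layer with a linear readout."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {"w_in": mat(n_hidden, n_in),
            "w_rec": mat(n_hidden, n_hidden),
            "w_out": mat(n_out, n_hidden)}


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def run_layer(layer, frames):
    """Map a sequence of input frames to an equally long sequence of output frames."""
    hidden = [0.0] * len(layer["w_rec"])
    outputs = []
    for frame in frames:
        drive = [a + b for a, b in zip(matvec(layer["w_in"], frame),
                                       matvec(layer["w_rec"], hidden))]
        hidden = [math.tanh(d) for d in drive]
        outputs.append(matvec(layer["w_out"], hidden))
    return outputs


def decode(neural_frames, stage1, stage2):
    """Stage 1: neural -> articulatory kinematics. Stage 2: kinematics -> acoustics."""
    kinematics = run_layer(stage1, neural_frames)
    acoustics = run_layer(stage2, kinematics)
    return kinematics, acoustics


rng = random.Random(0)
N_ELECTRODES = 16   # hypothetical; not a figure from the paper
N_KINEMATIC = 33    # kinematic feature count given in the captions
N_ACOUSTIC = 32     # acoustic feature count given in the captions
stage1 = make_recurrent_layer(N_ELECTRODES, 20, N_KINEMATIC, rng)
stage2 = make_recurrent_layer(N_KINEMATIC, 20, N_ACOUSTIC, rng)
neural = [[rng.gauss(0, 1) for _ in range(N_ELECTRODES)] for _ in range(50)]
kin, aco = decode(neural, stage1, stage2)
print(len(kin), len(kin[0]), len(aco), len(aco[0]))  # 50 33 50 32
```

Keeping the articulatory stage as an explicit intermediate output is what would let one stage be reused or retrained separately from the other.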
The data that support the findings of this study are available from the corresponding author upon request.
All code may be freely obtained for non-commercial use by contacting the corresponding author.
Fager, S. K., Fried-Oken, M., Jakobs, T. & Beukelman, D. R. New and emerging access technologies for adults with complex communication needs and severe motor impairments: state of the science. Augment. Altern. Commun. https://doi.org/10.1080/07434618.2018.1556730 (2019).
Brumberg, J. S., Pitt, K. M., Mantie-Kozlowski, A. & Burnison, J. D. Brain–computer interfaces for augmentative and alternative communication: a tutorial. Am. J. Speech Lang. Pathol. 27, 1–12 (2018).
Pandarinath, C. et al. High performance communication by people with paralysis using an intracortical brain–computer interface. eLife 6, e18554 (2017).
Guenther, F. H. et al. A wireless brain–machine interface for real-time speech synthesis. PLoS ONE 4, e8218 (2009).
Bocquelet, F., Hueber, T., Girin, L., Savariaux, C. & Yvert, B. Real-time control of an articulatory-based speech synthesizer for brain computer interfaces. PLOS Comput. Biol. 12, e1005119 (2016).
Browman, C. P. & Goldstein, L. Articulatory phonology: an overview. Phonetica 49, 155–180 (1992).
Sadtler, P. T. et al. Neural constraints on learning. Nature 512, 423–426 (2014).
Golub, M. D. et al. Learning by neural reassociation. Nat. Neurosci. 21, 607–616 (2018).
Graves, A. & Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 18, 602–610 (2005).
Crone, N. E. et al. Electrocorticographic gamma activity during word production in spoken and sign language. Neurology 57, 2045–2053 (2001).
Nourski, K. V. et al. Sound identification in human auditory cortex: differential contribution of local field potentials and high gamma power as revealed by direct intracranial recordings. Brain Lang. 148, 37–50 (2015).
Pesaran, B. et al. Investigating large-scale brain dynamics using field potential recordings: analysis and interpretation. Nat. Neurosci. 21, 903–919 (2018).
Bouchard, K. E., Mesgarani, N., Johnson, K. & Chang, E. F. Functional organization of human sensorimotor cortex for speech articulation. Nature 495, 327–332 (2013).
Mesgarani, N., Cheung, C., Johnson, K. & Chang, E. F. Phonetic feature encoding in human superior temporal gyrus. Science 343, 1006–1010 (2014).
Flinker, A. et al. Redefining the role of Broca’s area in speech. Proc. Natl Acad. Sci. USA 112, 2871–2875 (2015).
Chartier, J., Anumanchipalli, G. K., Johnson, K. & Chang, E. F. Encoding of articulatory kinematic trajectories in human speech sensorimotor cortex. Neuron 98, 1042–1054 (2018).
Mugler, E. M. et al. Differential representation of articulatory gestures and phonemes in precentral and inferior frontal gyri. J. Neurosci. 38, 9803–9813 (2018).
Huggins, J. E., Wren, P. A. & Gruis, K. L. What would brain–computer interface users want? Opinions and priorities of potential users with amyotrophic lateral sclerosis. Amyotroph. Lateral Scler. 12, 318–324 (2011).
Luce, P. A. & Pisoni, D. B. Recognizing spoken words: the neighborhood activation model. Ear Hear. 19, 1–36 (1998).
Wrench, A. MOCHA: multichannel articulatory database. http://www.cstr.ed.ac.uk/research/projects/artic/mocha.html (1999).
Kominek, J., Schultz, T. & Black, A. Synthesizer voice quality of new languages calibrated with mean mel cepstral distortion. In Proc. The first workshop on Spoken Language Technologies for Under-resourced languages (SLTU-2008) 63–68 (2008).
Davis, S. B. & Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. In Readings in speech recognition. IEEE Trans. Acoust. 28, 357–366 (1980).
Gallego, J. A., Perich, M. G., Miller, L. E. & Solla, S. A. Neural manifolds for the control of movement. Neuron 94, 978–984 (2017).
Sokal, R. R. & Rohlf, F. J. The comparison of dendrograms by objective methods. Taxon 11, 33–40 (1962).
Brumberg, J. S. et al. Spatio-temporal progression of cortical activity related to continuous overt and covert speech production in a reading task. PLoS ONE 11, e0166872 (2016).
Mugler, E. M. et al. Direct classification of all American English phonemes using signals from functional speech motor cortex. J. Neural Eng. 11, 035015 (2014).
Herff, C. et al. Brain-to-text: decoding spoken phrases from phone representations in the brain. Front. Neurosci. 9, 217 (2015).
Moses, D. A., Mesgarani, N., Leonard, M. K. & Chang, E. F. Neural speech recognition: continuous phoneme decoding using spatiotemporal representations of human cortical activity. J. Neural Eng. 13, 056004 (2016).
Pasley, B. N. et al. Reconstructing speech from human auditory cortex. PLoS Biol. 10, e1001251 (2012).
Akbari, H., Khalighinejad, B., Herrero, J. L., Mehta, A. D. & Mesgarani, N. Towards reconstructing intelligible speech from the human auditory cortex. Sci. Rep. 9, 874 (2019).
Martin, S. et al. Decoding spectrotemporal features of overt and covert speech from the human cortex. Front. Neuroeng. 7, 14 (2014).
Dichter, B. K., Breshears, J. D., Leonard, M. K. & Chang, E. F. The control of vocal pitch in human laryngeal motor cortex. Cell 174, 21–31 (2018).
Wessberg, J. et al. Real-time prediction of hand trajectory by ensembles of cortical neurons in primates. Nature 408, 361–365 (2000).
Serruya, M. D., Hatsopoulos, N. G., Paninski, L., Fellows, M. R. & Donoghue, J. P. Instant neural control of a movement signal. Nature 416, 141–142 (2002).
Taylor, D. M., Tillery, S. I. & Schwartz, A. B. Direct cortical control of 3D neuroprosthetic devices. Science 296, 1829–1832 (2002).
Hochberg, L. R. et al. Neuronal ensemble control of prosthetic devices by a human with tetraplegia. Nature 442, 164–171 (2006).
Collinger, J. L. et al. High-performance neuroprosthetic control by an individual with tetraplegia. Lancet 381, 557–564 (2013).
Aflalo, T. et al. Decoding motor imagery from the posterior parietal cortex of a tetraplegic human. Science 348, 906–910 (2015).
Ajiboye, A. B. et al. Restoration of reaching and grasping movements through brain-controlled muscle stimulation in a person with tetraplegia: a proof-of-concept demonstration. Lancet 389, 1821–1830 (2017).
Prahallad, K., Black, A. W. & Mosur, R. Sub-phonetic modeling for capturing pronunciation variations for conversational speech synthesis. In Proc. 2006 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP, 2006).
Anumanchipalli, G. K., Prahallad, K. & Black, A. W. Festvox: tools for creation and analyses of large speech corpora. http://www.festvox.org (2011).
Hamilton, L. S., Chang, D. L., Lee, M. B. & Chang, E. F. Semi-automated anatomical labeling and inter-subject warping of high-density intracranial recording electrodes in electrocorticography. Front. Neuroinform. 11, 62 (2017).
Richmond, K., Hoole, P. & King, S. Announcing the electromagnetic articulography (day 1) subset of the mngu0 articulatory corpus. In Proc. Interspeech 2011 1505–1508 (2011).
Paul, D. B. & Baker, J. M. The design for the Wall Street Journal-based CSR corpus. In Proc. Workshop on Speech and Natural Language (Association for Computational Linguistics, 1992).
Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous systems. http://www.tensorflow.org (2015).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Maia, R., Toda, T., Zen, H., Nankaku, Y. & Tokuda, K. An excitation model for HMM-based speech synthesis based on residual modeling. In Proc. 6th ISCA Speech synthesis Workshop (SSW6) 131–136 (2007).
Wolters, M. K., Isaac, K. B. & Renals, S. Evaluating speech synthesis intelligibility using Amazon Mechanical Turk. In Proc. 7th ISCA Speech Synthesis Workshop (SSW7) (2010).
Berndt, D. J. & Clifford, J. Using dynamic time warping to find patterns in time series. In Proc. 10th ACM Knowledge Discovery and Data Mining (KDD) Workshop 359–370 (1994).
We thank M. Leonard, N. Fox and D. Moses for comments on the manuscript and B. Speidel for his help reconstructing MRI images. This work was supported by grants from the NIH (DP2 OD008627 and U01 NS098971-01). E.F.C. is a New York Stem Cell Foundation-Robertson Investigator. This research was also supported by The William K. Bowes Foundation, the Howard Hughes Medical Institute, The New York Stem Cell Foundation and The Shurl and Kay Curci Foundation.
Nature thanks David Poeppel and the other anonymous reviewer(s) for their contribution to the peer review of this work.
The authors declare no competing interests.
Extended data figures and tables
a, b, Median spectrograms, time-locked to the acoustic onset of phonemes from original (a) and decoded (b) audio (/i/, n = 112; /z/, n = 115; /p/, n = 69; /ae/, n = 86). These phonemes represent the diversity of spectral features. Original and decoded median phoneme spectrograms were well-correlated (Pearson’s r > 0.9 for all phonemes, P = 1 × 10^−18).
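As a rough illustration of the comparison in this caption, Pearson's r between two spectrograms can be computed by flattening each time-by-frequency grid into a vector. Flattening is an assumption made here for the sketch; the caption does not say how the spectrograms were vectorised, and the function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_x * ss_y)


def spectrogram_correlation(spec_a, spec_b):
    """Correlate two spectrograms (lists of frames) after flattening them."""
    flat_a = [value for frame in spec_a for value in frame]
    flat_b = [value for frame in spec_b for value in frame]
    return pearson_r(flat_a, flat_b)


# Toy 3-frame, 2-band "spectrograms"; the second is a scaled, shifted copy.
original = [[1.0, 2.0], [3.0, 5.0], [4.0, 8.0]]
decoded = [[2.5, 4.5], [6.5, 10.5], [8.5, 16.5]]
print(round(spectrogram_correlation(original, decoded), 6))  # 1.0
```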
a, b, WERs for individually transcribed trials for pools with a size of 25 (a) or 50 (b) words. Listeners transcribed synthesized sentences by selecting words from a defined pool of words. Word pools included the correct words found in the synthesized sentence and random words from the test set. One trial is one transcription by one listener of one synthesized sentence.
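For readers unfamiliar with the metric, a word error rate (WER) for a single trial can be sketched with the standard definition: the word-level edit distance (substitutions, deletions and insertions) divided by the number of words in the reference sentence. This is the conventional formula, assumed here; the caption does not spell out the exact scoring procedure, and the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference sentence must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because the denominator is the reference length, a WER above 1 is possible when the transcription contains many inserted words.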
MRI reconstructions of participants’ brains with overlay of electrocorticographic (ECoG) electrode array locations. P1–5, participants 1–5.
Data from participant 1. a, Correlations of all 33 decoded articulatory kinematic features with ground-truth (n = 101 sentences). EMA features represent x and y coordinate traces of articulators (lips, jaw and three points of the tongue) along the midsagittal plane of the vocal tract. Manner features represent complementary kinematic features to EMA that further describe acoustically consequential movements. b, Correlations of all 32 decoded spectral features with ground-truth (n = 101 sentences). MFCC features are 25 mel-frequency cepstral coefficients that describe power in perceptually relevant frequency bands. Synthesis features describe glottal excitation weights necessary for speech synthesis. Box plots as described in Fig. 2.
Extended Data Fig. 5 Comparison of cumulative variance explained in kinematic and acoustic state–spaces.
For each representation of speech—kinematics and acoustics—a principal components analysis was computed and the explained variance for each additional principal component was cumulatively summed. Kinematic and acoustic representations had 33 and 32 features, respectively.
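The procedure in this caption can be sketched as follows: take the eigenvalues of the feature covariance matrix (the variance along each principal component), sort them in descending order, and report the running sum as a fraction of the total. The sketch below uses only the standard library, so it finds the eigenvalues with Jacobi rotations; that solver, the toy data and all function names are choices made for this illustration rather than details from the paper.

```python
import math
import random


def covariance_matrix(rows):
    """Sample covariance (features x features) of a list of observation rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
            for i in range(d)]


def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-18):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations, descending."""
    a = [row[:] for row in matrix]
    d = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
        if off < tol:
            break
        for p in range(d - 1):
            for q in range(p + 1, d):
                if a[p][q] == 0:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(d):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(d):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(d)), reverse=True)


def cumulative_explained_variance(rows):
    """Cumulative fraction of variance explained by successive principal components."""
    eigenvalues = symmetric_eigenvalues(covariance_matrix(rows))
    total = sum(eigenvalues)
    running, fractions = 0.0, []
    for value in eigenvalues:
        running += value
        fractions.append(running / total)
    return fractions


# Toy data: the third feature is nearly a copy of the first, so two
# components should capture almost all of the variance.
rng = random.Random(1)
data = []
for _ in range(200):
    x, y = rng.gauss(0, 1), rng.gauss(0, 1)
    data.append([x, y, x + rng.gauss(0, 0.05)])
print([round(f, 3) for f in cumulative_explained_variance(data)])
```

A representation whose curve rises faster is lower-dimensional in this sense, which is the comparison the figure draws between the kinematic and acoustic feature sets.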
Acoustic similarity matrix compares acoustic properties of decoded phonemes and originally spoken phonemes. Similarity is computed by first estimating a Gaussian kernel density for each phoneme (both decoded and original) and then computing the Kullback–Leibler (KL) divergence between a pair of decoded and original phoneme distributions. Each row compares the acoustic properties of a decoded phoneme with originally spoken phonemes (columns). Hierarchical clustering was performed on the resulting similarity matrix. Data from participant 1.
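The similarity computation described here can be illustrated in one dimension: fit a Gaussian kernel density estimate to the samples of each phoneme, then integrate p(x) log(p(x)/q(x)) numerically to obtain the KL divergence. The one-dimensional setting, the fixed bandwidth, the integration grid and the density floor are all simplifying assumptions for this sketch; the actual acoustic features are multi-dimensional and the caption gives no bandwidth.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a 1-D Gaussian kernel density estimate as a callable."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


def kl_divergence(p, q, lo, hi, steps=2000, floor=1e-12):
    """Approximate KL(p || q) on [lo, hi] with the midpoint rule.

    Densities are floored to avoid log(0) in the tails.
    """
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        px = max(p(x), floor)
        qx = max(q(x), floor)
        total += px * math.log(px / qx) * dx
    return total


# With a single sample each and bandwidth 1, the estimates are N(0, 1) and
# N(1, 1), whose KL divergence is 0.5 in closed form.
decoded = gaussian_kde([0.0], bandwidth=1.0)
original = gaussian_kde([1.0], bandwidth=1.0)
print(round(kl_divergence(decoded, original, -10.0, 11.0), 3))  # 0.5
```

KL divergence is asymmetric, so a matrix built this way need not be symmetric: the entry for decoded /p/ against original /b/ can differ from the entry for decoded /b/ against original /p/.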
The acoustic properties of ground-truth spoken phonemes are compared with one another. Similarity is computed by first estimating a Gaussian kernel density for each phoneme and then computing the Kullback–Leibler divergence between a pair of phoneme distributions. Each row compares the acoustic properties of two ground-truth spoken phonemes. Hierarchical clustering was performed on the resulting similarity matrix. Data from participant 1.
a, b, Comparison metrics included spectral distortion (a) and the correlation between decoded and original spectral features (b). Decoder performance for these two types of sentences was compared and no significant difference was found (P = 0.36 (a) and P = 0.75 (b), n = 51 sentences, Wilcoxon signed-rank test). A novel sentence consists of words and/or a word sequence not present in the training data. A repeated sentence is a sentence that has at least one matching word sequence in the training data, although with a unique production. Comparison was performed on participant 1 and the evaluated sentences were the same across both cases with two decoders trained on differing datasets to either exclude or include unique repeats of sentences in the test set. ns, not significant; P > 0.05. Box plots as described in Fig. 2.
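The paired comparison reported in this caption uses a Wilcoxon signed-rank test. A minimal sketch of that test follows, assuming the common textbook variant: zero differences are dropped, tied absolute differences receive average ranks, and the two-sided P value comes from the normal approximation without continuity or tie correction. Those choices and the function name are assumptions; the caption does not state which variant was used.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test for paired samples.

    Returns (statistic, p_value), where statistic is min(W+, W-) and the
    P value uses the normal approximation.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    statistic = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (statistic - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return statistic, p_value


# Hypothetical paired scores for the same sentences under two decoders.
novel = [5.1, 4.8, 6.0, 5.5, 4.9, 5.7]
repeated = [5.0, 5.0, 5.8, 5.6, 4.7, 5.9]
print(wilcoxon_signed_rank(novel, repeated))
```

The normal approximation is crude for very small samples like the toy data above; with around 50 paired sentences, as in the caption, it is usually considered adequate.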
Extended Data Fig. 9 Kinematic state–space trajectories for phoneme-specific vowel–consonant transitions.
Average trajectories of principal components 1 (PC1) and 2 (PC2) for transitions from either a consonant or a vowel to specific phonemes. Trajectories are 500 ms and centred at the transition between phonemes. a, Consonant to corner vowels (n = 1,387, 1,964, 2,259, 894, respectively, for aa, ae, iy and uw). PC1 shows separation of all corner vowels and PC2 delineates between front vowels (iy, ae) and back vowels (uw, aa). b, Vowel to unvoiced plosives (n = 2,071, 4,107 and 1,441, respectively, for k, p and t). PC1 was more selective for velar constriction (k) and PC2 for bilabial constriction (p). c, Vowel to alveolars (n = 3,919, 3,010 and 4,107, respectively, for n, s and t). PC1 shows separation by manner of articulation (nasal, plosive or fricative) whereas PC2 is less discriminative. d, PC1 and PC2 show little, if any, delineation between voiced and unvoiced alveolar fricatives (n = 3,010 and 1,855, respectively, for s and z).
This file contains: a) place-manner tuples used to augment EMA trajectories; b) sentences used in listening tests (original source: the MOCHA-TIMIT dataset, Wrench 1999); c) class sizes for the listening tests; d) transcription interface for the intelligibility assessment; and e) number of listeners used for intelligibility assessments.
The video presents examples of synthesized audio from neural recordings of spoken sentences. In each example, electrode activity corresponding to a sentence is displayed (top). Next, the simultaneous decoding of kinematics and acoustics is presented visually and audibly. Decoded articulatory movements are displayed (middle left) as the synthesized speech spectrogram unfolds. Following the decoding, the original audio, as spoken by the patient during neural recording, is played. Lastly, the decoded movements and synthesized speech are presented once again. This format is repeated for a total of five examples (from participants P1 and P2). In the last example, kinematics and audio are also decoded and synthesized for silently mimed speech.
Cite this article
Anumanchipalli, G.K., Chartier, J. & Chang, E.F. Speech synthesis from neural decoding of spoken sentences. Nature 568, 493–498 (2019). https://doi.org/10.1038/s41586-019-1119-1