Speech synthesis from neural decoding of spoken sentences

Abstract

Technology that translates neural activity into speech would be transformative for people who are unable to communicate as a result of neurological impairments. Decoding speech from neural activity is challenging because speaking requires very precise and rapid multi-dimensional control of vocal tract articulators. Here we designed a neural decoder that explicitly leverages kinematic and sound representations encoded in human cortical activity to synthesize audible speech. Recurrent neural networks first decoded directly recorded cortical activity into representations of articulatory movement, and then transformed these representations into speech acoustics. In closed vocabulary tests, listeners could readily identify and transcribe speech synthesized from cortical activity. Intermediate articulatory dynamics enhanced performance even with limited data. Decoded articulatory representations were highly conserved across speakers, enabling a component of the decoder to be transferrable across participants. Furthermore, the decoder could synthesize speech when a participant silently mimed sentences. These findings advance the clinical viability of using speech neuroprosthetic technology to restore spoken communication.
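
The decoder described here is staged: recurrent networks first map cortical activity to articulatory kinematics, and a second network maps kinematics to acoustic features that are then synthesized into audio. Below is a minimal sketch of that two-stage arrangement using stacked bidirectional LSTMs (refs 9, 46) in TensorFlow (ref 45); the channel count, layer sizes and loss are illustrative assumptions rather than the authors' published configuration, which is available on request (see Code availability).

```python
# Minimal sketch (an assumption, not the authors' released code) of a two-stage
# decoder: stage 1 maps ECoG high-gamma activity to articulatory kinematics,
# stage 2 maps kinematics to acoustic features. Sizes are illustrative.
import tensorflow as tf

N_CHANNELS = 256   # ECoG electrodes (assumed)
N_KINEMATIC = 33   # articulatory kinematic features (see Extended Data Fig. 4a)
N_ACOUSTIC = 32    # spectral features, e.g. MFCCs plus synthesis parameters

def build_stage(n_in, n_out, units=128):
    """One stacked bidirectional-LSTM stage over variable-length sequences."""
    inp = tf.keras.Input(shape=(None, n_in))
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(units, return_sequences=True))(inp)
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(units, return_sequences=True))(x)
    out = tf.keras.layers.Dense(n_out)(x)
    return tf.keras.Model(inp, out)

ecog_to_kinematics = build_stage(N_CHANNELS, N_KINEMATIC)
kinematics_to_acoustics = build_stage(N_KINEMATIC, N_ACOUSTIC)

# Chain the stages: neural activity -> kinematics -> acoustics.
ecog = tf.keras.Input(shape=(None, N_CHANNELS))
acoustics = kinematics_to_acoustics(ecog_to_kinematics(ecog))
decoder = tf.keras.Model(ecog, acoustics)
decoder.compile(optimizer="adam", loss="mse")
```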

Fig. 1: Speech synthesis from neurally decoded spoken sentences.
Fig. 2: Synthesized speech intelligibility and feature-specific performance.
Fig. 3: Speech synthesis from neural decoding of silently mimed speech.
Fig. 4: Kinematic state–space representation of speech production.

Data availability

The data that support the findings of this study are available from the corresponding author upon request.

Code availability

All code may be freely obtained for non-commercial use by contacting the corresponding author.

References

  1. Fager, S. K., Fried-Oken, M., Jakobs, T. & Beukelman, D. R. New and emerging access technologies for adults with complex communication needs and severe motor impairments: state of the science. Augment. Altern. Commun. https://doi.org/10.1080/07434618.2018.1556730 (2019).

  2. Brumberg, J. S., Pitt, K. M., Mantie-Kozlowski, A. & Burnison, J. D. Brain–computer interfaces for augmentative and alternative communication: a tutorial. Am. J. Speech Lang. Pathol. 27, 1–12 (2018).

  3. Pandarinath, C. et al. High performance communication by people with paralysis using an intracortical brain–computer interface. eLife 6, e18554 (2017).

  4. Guenther, F. H. et al. A wireless brain–machine interface for real-time speech synthesis. PLoS ONE 4, e8218 (2009).

  5. Bocquelet, F., Hueber, T., Girin, L., Savariaux, C. & Yvert, B. Real-time control of an articulatory-based speech synthesizer for brain computer interfaces. PLoS Comput. Biol. 12, e1005119 (2016).

  6. Browman, C. P. & Goldstein, L. Articulatory phonology: an overview. Phonetica 49, 155–180 (1992).

  7. Sadtler, P. T. et al. Neural constraints on learning. Nature 512, 423–426 (2014).

  8. Golub, M. D. et al. Learning by neural reassociation. Nat. Neurosci. 21, 607–616 (2018).

  9. Graves, A. & Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 18, 602–610 (2005).

  10. Crone, N. E. et al. Electrocorticographic gamma activity during word production in spoken and sign language. Neurology 57, 2045–2053 (2001).

  11. Nourski, K. V. et al. Sound identification in human auditory cortex: differential contribution of local field potentials and high gamma power as revealed by direct intracranial recordings. Brain Lang. 148, 37–50 (2015).

  12. Pesaran, B. et al. Investigating large-scale brain dynamics using field potential recordings: analysis and interpretation. Nat. Neurosci. 21, 903–919 (2018).

  13. Bouchard, K. E., Mesgarani, N., Johnson, K. & Chang, E. F. Functional organization of human sensorimotor cortex for speech articulation. Nature 495, 327–332 (2013).

  14. Mesgarani, N., Cheung, C., Johnson, K. & Chang, E. F. Phonetic feature encoding in human superior temporal gyrus. Science 343, 1006–1010 (2014).

  15. Flinker, A. et al. Redefining the role of Broca’s area in speech. Proc. Natl Acad. Sci. USA 112, 2871–2875 (2015).

  16. Chartier, J., Anumanchipalli, G. K., Johnson, K. & Chang, E. F. Encoding of articulatory kinematic trajectories in human speech sensorimotor cortex. Neuron 98, 1042–1054 (2018).

  17. Mugler, E. M. et al. Differential representation of articulatory gestures and phonemes in precentral and inferior frontal gyri. J. Neurosci. 38, 9803–9813 (2018).

  18. Huggins, J. E., Wren, P. A. & Gruis, K. L. What would brain–computer interface users want? Opinions and priorities of potential users with amyotrophic lateral sclerosis. Amyotroph. Lateral Scler. 12, 318–324 (2011).

  19. Luce, P. A. & Pisoni, D. B. Recognizing spoken words: the neighborhood activation model. Ear Hear. 19, 1–36 (1998).

  20. Wrench, A. MOCHA: multichannel articulatory database. http://www.cstr.ed.ac.uk/research/projects/artic/mocha.html (1999).

  21. Kominek, J., Schultz, T. & Black, A. Synthesizer voice quality of new languages calibrated with mean mel cepstral distortion. In Proc. First Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU-2008) 63–68 (2008).

  22. Davis, S. B. & Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28, 357–366 (1980).

  23. Gallego, J. A., Perich, M. G., Miller, L. E. & Solla, S. A. Neural manifolds for the control of movement. Neuron 94, 978–984 (2017).

  24. Sokal, R. R. & Rohlf, F. J. The comparison of dendrograms by objective methods. Taxon 11, 33–40 (1962).

  25. Brumberg, J. S. et al. Spatio-temporal progression of cortical activity related to continuous overt and covert speech production in a reading task. PLoS ONE 11, e0166872 (2016).

  26. Mugler, E. M. et al. Direct classification of all American English phonemes using signals from functional speech motor cortex. J. Neural Eng. 11, 035015 (2014).

  27. Herff, C. et al. Brain-to-text: decoding spoken phrases from phone representations in the brain. Front. Neurosci. 9, 217 (2015).

  28. Moses, D. A., Mesgarani, N., Leonard, M. K. & Chang, E. F. Neural speech recognition: continuous phoneme decoding using spatiotemporal representations of human cortical activity. J. Neural Eng. 13, 056004 (2016).

  29. Pasley, B. N. et al. Reconstructing speech from human auditory cortex. PLoS Biol. 10, e1001251 (2012).

  30. Akbari, H., Khalighinejad, B., Herrero, J. L., Mehta, A. D. & Mesgarani, N. Towards reconstructing intelligible speech from the human auditory cortex. Sci. Rep. 9, 874 (2019).

  31. Martin, S. et al. Decoding spectrotemporal features of overt and covert speech from the human cortex. Front. Neuroeng. 7, 14 (2014).

  32. Dichter, B. K., Breshears, J. D., Leonard, M. K. & Chang, E. F. The control of vocal pitch in human laryngeal motor cortex. Cell 174, 21–31 (2018).

  33. Wessberg, J. et al. Real-time prediction of hand trajectory by ensembles of cortical neurons in primates. Nature 408, 361–365 (2000).

  34. Serruya, M. D., Hatsopoulos, N. G., Paninski, L., Fellows, M. R. & Donoghue, J. P. Instant neural control of a movement signal. Nature 416, 141–142 (2002).

  35. Taylor, D. M., Tillery, S. I. & Schwartz, A. B. Direct cortical control of 3D neuroprosthetic devices. Science 296, 1829–1832 (2002).

  36. Hochberg, L. R. et al. Neuronal ensemble control of prosthetic devices by a human with tetraplegia. Nature 442, 164–171 (2006).

  37. Collinger, J. L. et al. High-performance neuroprosthetic control by an individual with tetraplegia. Lancet 381, 557–564 (2013).

  38. Aflalo, T. et al. Decoding motor imagery from the posterior parietal cortex of a tetraplegic human. Science 348, 906–910 (2015).

  39. Ajiboye, A. B. et al. Restoration of reaching and grasping movements through brain-controlled muscle stimulation in a person with tetraplegia: a proof-of-concept demonstration. Lancet 389, 1821–1830 (2017).

  40. Prahallad, K., Black, A. W. & Mosur, R. Sub-phonetic modeling for capturing pronunciation variations for conversational speech synthesis. In Proc. 2006 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP, 2006).

  41. Anumanchipalli, G. K., Prahallad, K. & Black, A. W. Festvox: tools for creation and analyses of large speech corpora. http://www.festvox.org (2011).

  42. Hamilton, L. S., Chang, D. L., Lee, M. B. & Chang, E. F. Semi-automated anatomical labeling and inter-subject warping of high-density intracranial recording electrodes in electrocorticography. Front. Neuroinform. 11, 62 (2017).

  43. Richmond, K., Hoole, P. & King, S. Announcing the electromagnetic articulography (day 1) subset of the mngu0 articulatory corpus. In Proc. Interspeech 2011 1505–1508 (2011).

  44. Paul, D. B. & Baker, J. M. The design for the Wall Street Journal-based CSR corpus. In Proc. Workshop on Speech and Natural Language (Association for Computational Linguistics, 1992).

  45. Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous systems. http://www.tensorflow.org (2015).

  46. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).

  47. Maia, R., Toda, T., Zen, H., Nankaku, Y. & Tokuda, K. An excitation model for HMM-based speech synthesis based on residual modeling. In Proc. 6th ISCA Speech synthesis Workshop (SSW6) 131–136 (2007).

  48. Wolters, M. K., Isaac, K. B. & Renals, S. Evaluating speech synthesis intelligibility using Amazon Mechanical Turk. In Proc. 7th ISCA Speech Synthesis Workshop (SSW7) (2010).

  49. Berndt, D. J. & Clifford, J. Using dynamic time warping to find patterns in time series. In Proc. 10th ACM Knowledge Discovery and Data Mining (KDD) Workshop 359–370 (1994).

Acknowledgements

We thank M. Leonard, N. Fox and D. Moses for comments on the manuscript and B. Speidel for his help reconstructing MRI images. This work was supported by grants from the NIH (DP2 OD008627 and U01 NS098971-01). E.F.C. is a New York Stem Cell Foundation-Robertson Investigator. This research was also supported by The William K. Bowes Foundation, the Howard Hughes Medical Institute, The New York Stem Cell Foundation and The Shurl and Kay Curci Foundation.

Reviewer information

Nature thanks David Poeppel and the other anonymous reviewer(s) for their contribution to the peer review of this work.

Author information

Contributions

G.K.A., J.C. and E.F.C. conceived the study; G.K.A. inferred articulatory kinematics; G.K.A. and J.C. designed the decoder; J.C. performed decoder analyses; G.K.A., E.F.C. and J.C. collected data and prepared the manuscript; E.F.C. supervised the project.

Corresponding author

Correspondence to Edward F. Chang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Median original and decoded spectrograms.

a, b, Median spectrograms, time-locked to the acoustic onset of phonemes from original (a) and decoded (b) audio (/i/, n = 112; /z/, n = 115; /p/, n = 69; /ae/, n = 86). These phonemes represent the diversity of spectral features. Original and decoded median phoneme spectrograms were well correlated (Pearson's r > 0.9 for all phonemes, P = 1 × 10⁻¹⁸).
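
A minimal sketch of this comparison, under the assumption that original and decoded audio are represented as time-by-frequency spectrogram arrays with known phoneme-onset frames (all names and the window length are assumptions), is:

```python
# Median phoneme spectrograms and their correlation (a sketch, not the authors' code).
import numpy as np
from scipy.stats import pearsonr

def median_phoneme_spectrogram(spec, onsets, win=50):
    """spec: (time, freq) spectrogram; onsets: frame indices of one phoneme's onsets;
    win: number of frames kept after each onset (assumed window length)."""
    windows = [spec[t:t + win] for t in onsets if t + win <= len(spec)]
    return np.median(np.stack(windows), axis=0)           # (win, freq)

def median_spectrogram_correlation(orig_spec, dec_spec, onsets):
    m_orig = median_phoneme_spectrogram(orig_spec, onsets)
    m_dec = median_phoneme_spectrogram(dec_spec, onsets)
    return pearsonr(m_orig.ravel(), m_dec.ravel())         # (r, P value)
```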

Extended Data Fig. 2 Transcription WER for individual trials.

a, b, WERs for individually transcribed trials with word pools of 25 (a) or 50 (b) words. Listeners transcribed synthesized sentences by selecting words from a defined pool that included the correct words of the synthesized sentence plus random words from the test set. One trial is one transcription, by one listener, of one synthesized sentence.
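
WER here is the word-level edit distance between a listener's transcription and the actual sentence, normalized by the length of the actual sentence; a minimal reference implementation (not the authors' evaluation code) is:

```python
# Word error rate: (substitutions + insertions + deletions) / reference length.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# e.g. word_error_rate("the cat sat", "the cat sang") == 1/3
```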

Extended Data Fig. 3 Electrode array locations for participants.

MRI reconstructions of participants' brains with an overlay of electrocorticography (ECoG) electrode array locations. P1–5, participants 1–5.

Extended Data Fig. 4 Decoding performance of kinematic and spectral features.

Data from participant 1. a, Correlations of all 33 decoded articulatory kinematic features with ground truth (n = 101 sentences). EMA features represent the x and y coordinate traces of articulators (lips, jaw and three points of the tongue) along the midsagittal plane of the vocal tract. Manner features are kinematic features complementary to EMA that further describe acoustically consequential movements. b, Correlations of all 32 decoded spectral features with ground truth (n = 101 sentences). MFCC features are 25 mel-frequency cepstral coefficients that describe power in perceptually relevant frequency bands. Synthesis features describe the glottal excitation weights necessary for speech synthesis. Box plots as described in Fig. 2.
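
Each box in this figure summarizes one correlation value per sentence and feature; a sketch of that computation (a plausible reconstruction, with all names as assumptions) is:

```python
# Per-feature Pearson correlations between decoded and ground-truth traces.
import numpy as np
from scipy.stats import pearsonr

def feature_correlations(decoded, ground_truth):
    """decoded, ground_truth: lists of time-aligned (time, n_features) arrays,
    one per sentence. Returns an (n_sentences, n_features) array of correlations."""
    corrs = []
    for dec, ref in zip(decoded, ground_truth):
        corrs.append([pearsonr(dec[:, f], ref[:, f])[0] for f in range(dec.shape[1])])
    return np.asarray(corrs)
```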

Extended Data Fig. 5 Comparison of cumulative variance explained in kinematic and acoustic state–spaces.

For each representation of speech—kinematics and acoustics—a principal components analysis was computed and the explained variance for each additional principal component was cumulatively summed. Kinematic and acoustic representations had 33 and 32 features, respectively.
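
The cumulative-variance curve for either representation can be obtained as sketched below (assuming scikit-learn, with frames from all sentences concatenated into a single matrix):

```python
# Cumulative variance explained by successive principal components (a sketch).
import numpy as np
from sklearn.decomposition import PCA

def cumulative_explained_variance(features):
    """features: (n_frames, n_dims) matrix of kinematic (33) or acoustic (32) features."""
    pca = PCA().fit(features)
    return np.cumsum(pca.explained_variance_ratio_)
```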

Extended Data Fig. 6 Decoded phoneme acoustic similarity matrix.

Acoustic similarity matrix compares acoustic properties of decoded phonemes and originally spoken phonemes. Similarity is computed by first estimating a Gaussian kernel density for each phoneme (both decoded and original) and then computing the Kullback–Leibler (KL) divergence between a pair of decoded and original phoneme distributions. Each row compares the acoustic properties of a decoded phoneme with originally spoken phonemes (columns). Hierarchical clustering was performed on the resulting similarity matrix. Data from participant 1.
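
A plausible reconstruction of this computation, using SciPy's Gaussian kernel density estimate and a Monte Carlo estimate of the KL divergence (the sample size and all names are assumptions), is:

```python
# Phoneme acoustic similarity via Gaussian KDEs and KL divergence (a sketch).
import numpy as np
from scipy.stats import gaussian_kde
from scipy.cluster.hierarchy import linkage

def kl_divergence(kde_p, kde_q, n_samples=2000):
    """Monte Carlo estimate of KL(p || q) from samples drawn from p."""
    samples = kde_p.resample(n_samples)                   # (n_dims, n_samples)
    return np.mean(np.log(kde_p(samples)) - np.log(kde_q(samples)))

def acoustic_similarity(decoded, original):
    """decoded, original: dicts mapping phoneme -> (n_frames, n_features) arrays."""
    phonemes = sorted(decoded)
    kde_dec = {ph: gaussian_kde(decoded[ph].T) for ph in phonemes}
    kde_org = {ph: gaussian_kde(original[ph].T) for ph in phonemes}
    kl = np.array([[kl_divergence(kde_dec[a], kde_org[b]) for b in phonemes]
                   for a in phonemes])
    return kl, phonemes          # row: decoded phoneme, column: original phoneme

# Usage sketch: rows of the matrix are then grouped by hierarchical clustering, e.g.
#   kl, phons = acoustic_similarity(decoded_feats, original_feats)
#   Z = linkage(kl, method="average")
```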

Extended Data Fig. 7 Ground-truth acoustic similarity matrix.

The acoustic properties of ground-truth spoken phonemes are compared with one another. Similarity is computed by first estimating a Gaussian kernel density for each phoneme and then computing the Kullback–Leibler divergence between a pair of phoneme distributions. Each row compares the acoustic properties of one ground-truth spoken phoneme with those of the others (columns). Hierarchical clustering was performed on the resulting similarity matrix. Data from participant 1.

Extended Data Fig. 8 Comparison between decoding novel and repeated sentences.

a, b, Comparison metrics included spectral distortion (a) and the correlation between decoded and original spectral features (b). Decoder performance for these two types of sentences was compared and no significant difference was found (P = 0.36 (a) and P = 0.75 (b); n = 51 sentences; Wilcoxon signed-rank test). A novel sentence consists of words and/or a word sequence not present in the training data. A repeated sentence has at least one matching word sequence in the training data, although each production is unique. The comparison was performed on participant 1; the evaluated sentences were the same in both cases, with two decoders trained on datasets that either excluded or included repeated productions of the test sentences. ns, not significant (P > 0.05). Box plots as described in Fig. 2.
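
The distortion metric and the paired test can be sketched as follows; the mel-cepstral distortion constant follows the standard formulation cited in ref. 21, and the pairing assumes the same test sentences decoded under both conditions (all names are assumptions):

```python
# Mel-cepstral distortion and a paired Wilcoxon signed-rank comparison (a sketch).
import numpy as np
from scipy.stats import wilcoxon

def mel_cepstral_distortion(mfcc_ref, mfcc_dec):
    """mfcc_ref, mfcc_dec: time-aligned (n_frames, n_coeffs) arrays, typically
    excluding the 0th (energy) coefficient. Returns distortion in dB."""
    diff = mfcc_ref - mfcc_dec
    frame_dist = np.sqrt(np.sum(diff ** 2, axis=1))
    return (10.0 * np.sqrt(2.0) / np.log(10.0)) * np.mean(frame_dist)

def compare_conditions(mcd_novel, mcd_repeated):
    """Paired test over the same sentences decoded as 'novel' vs 'repeated'."""
    return wilcoxon(mcd_novel, mcd_repeated)              # (statistic, P value)
```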

Extended Data Fig. 9 Kinematic state–space trajectories for phoneme-specific vowel–consonant transitions.

Average trajectories of principal components 1 (PC1) and 2 (PC2) for transitions from either a consonant or a vowel to specific phonemes. Trajectories are 500 ms long and centred at the transition between phonemes. a, Consonant to corner vowels (n = 1,387, 1,964, 2,259 and 894, respectively, for aa, ae, iy and uw). PC1 separates all corner vowels and PC2 delineates front vowels (iy, ae) from back vowels (uw, aa). b, Vowel to unvoiced plosives (n = 2,071, 4,107 and 1,441, respectively, for k, p and t). PC1 was more selective for velar constriction (k) and PC2 for bilabial constriction (p). c, Vowel to alveolars (n = 3,919, 3,010 and 4,107, respectively, for n, s and t). PC1 separates by manner of articulation (nasal, plosive or fricative), whereas PC2 is less discriminative. d, PC1 and PC2 show little, if any, delineation between voiced and unvoiced alveolar fricatives (n = 3,010 and 1,855, respectively, for s and z).
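
A sketch of how such averaged trajectories can be computed is given below, assuming kinematic features sampled at 200 Hz (so 50 frames on either side of a transition span 500 ms) and a PCA already fitted to the full kinematic data; the frame rate and all names are assumptions:

```python
# Averaged PC1/PC2 trajectories around phoneme transitions (a sketch).
import numpy as np

def average_transition_trajectory(kinematics, transition_frames, pca, half_win=50):
    """kinematics: (n_frames, n_features) array; transition_frames: frame indices
    of the phoneme transitions of interest; pca: fitted sklearn PCA;
    half_win=50 frames is 250 ms at the assumed 200-Hz frame rate."""
    pcs = pca.transform(kinematics)[:, :2]                # keep PC1 and PC2
    windows = [pcs[t - half_win:t + half_win]
               for t in transition_frames
               if half_win <= t <= len(pcs) - half_win]
    return np.mean(np.stack(windows), axis=0)              # (2 * half_win, 2)
```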

Supplementary information

Supplementary Information

This file contains: a) place–manner tuples used to augment EMA trajectories; b) sentences used in the listening tests (original source: MOCHA-TIMIT dataset, ref. 20); c) class sizes for the listening tests; d) the transcription interface for the intelligibility assessment; and e) the number of listeners used for the intelligibility assessments.

Reporting Summary

Supplemental Video 1: Examples of decoded kinematics and synthesized speech production

The video presents examples of audio synthesized from neural recordings of spoken sentences. In each example, the electrode activity corresponding to a sentence is displayed (top). Next, the simultaneous decoding of kinematics and acoustics is presented visually and audibly: the decoded articulatory movements are displayed (middle left) as the synthesized speech spectrogram unfolds. The original audio, as spoken by the participant during neural recording, is then played. Lastly, the decoded movements and synthesized speech are presented once more. This format is repeated for a total of five examples (from participants P1 and P2). In the last example, kinematics and audio are also decoded and synthesized for silently mimed speech.

About this article

Cite this article

Anumanchipalli, G.K., Chartier, J. & Chang, E.F. Speech synthesis from neural decoding of spoken sentences. Nature 568, 493–498 (2019). https://doi.org/10.1038/s41586-019-1119-1
