Article

Speech synthesis from neural decoding of spoken sentences

Nature volume 568, pages 493–498 (2019)

Abstract

Technology that translates neural activity into speech would be transformative for people who are unable to communicate as a result of neurological impairments. Decoding speech from neural activity is challenging because speaking requires very precise and rapid multi-dimensional control of vocal tract articulators. Here we designed a neural decoder that explicitly leverages kinematic and sound representations encoded in human cortical activity to synthesize audible speech. Recurrent neural networks first decoded directly recorded cortical activity into representations of articulatory movement, and then transformed these representations into speech acoustics. In closed vocabulary tests, listeners could readily identify and transcribe speech synthesized from cortical activity. Intermediate articulatory dynamics enhanced performance even with limited data. Decoded articulatory representations were highly conserved across speakers, enabling a component of the decoder to be transferrable across participants. Furthermore, the decoder could synthesize speech when a participant silently mimed sentences. These findings advance the clinical viability of using speech neuroprosthetic technology to restore spoken communication.
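The two-stage design described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the decoder in the paper uses trained recurrent neural networks, whereas the code below uses one untrained tanh recurrent layer per stage with random weights, only to show how data flows from neural features through an articulatory intermediate to acoustic features. The electrode count, hidden size, sequence length and all function names are assumptions made for illustration; the 33 kinematic and 32 acoustic feature dimensions follow the Extended Data captions.

```python
import math
import random

N_ELECTRODES = 64   # assumption: number of neural input channels
N_KINEMATIC = 33    # articulatory kinematic features (Extended Data Fig. 4)
N_ACOUSTIC = 32     # spectral features (Extended Data Fig. 4)
HIDDEN = 16         # assumption: hidden state size, kept small for speed


def init_stage(rng, n_in, n_out, hidden=HIDDEN):
    """Random (untrained) weights for one recurrent stage."""
    def matrix(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "W_in": matrix(hidden, n_in),     # input -> hidden
        "W_rec": matrix(hidden, hidden),  # hidden -> hidden (recurrence)
        "W_out": matrix(n_out, hidden),   # hidden -> output
    }


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def run_stage(params, sequence):
    """Map a sequence of input frames to a sequence of output frames."""
    h = [0.0] * len(params["W_rec"])
    outputs = []
    for frame in sequence:
        drive = matvec(params["W_in"], frame)
        recur = matvec(params["W_rec"], h)
        h = [math.tanh(a + b) for a, b in zip(drive, recur)]
        outputs.append(matvec(params["W_out"], h))
    return outputs


def decode(neural, stage1, stage2):
    """Stage 1: neural -> kinematics. Stage 2: kinematics -> acoustics."""
    kinematics = run_stage(stage1, neural)
    acoustics = run_stage(stage2, kinematics)
    return kinematics, acoustics


rng = random.Random(0)
stage1 = init_stage(rng, N_ELECTRODES, N_KINEMATIC)
stage2 = init_stage(rng, N_KINEMATIC, N_ACOUSTIC)
# 50 time steps of synthetic "neural" input (random numbers, not real data)
neural = [[rng.gauss(0.0, 1.0) for _ in range(N_ELECTRODES)] for _ in range(50)]
kinematics, acoustics = decode(neural, stage1, stage2)
print(len(kinematics), len(kinematics[0]), len(acoustics), len(acoustics[0]))
```

The point of the sketch is the intermediate representation: the second stage never sees the neural input directly, only the decoded kinematics, which is what allows that stage to be treated as a separate component.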

Data availability

The data that support the findings of this study are available from the corresponding author upon request.

Code availability

All code may be freely obtained for non-commercial use by contacting the corresponding author.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Fager, S. K., Fried-Oken, M., Jakobs, T. & Beukelman, D. R. New and emerging access technologies for adults with complex communication needs and severe motor impairments: state of the science. Augment. Altern. Commun. https://doi.org/10.1080/07434618.2018.1556730 (2019).

  2. Brumberg, J. S., Pitt, K. M., Mantie-Kozlowski, A. & Burnison, J. D. Brain–computer interfaces for augmentative and alternative communication: a tutorial. Am. J. Speech Lang. Pathol. 27, 1–12 (2018).

  3. Pandarinath, C. et al. High performance communication by people with paralysis using an intracortical brain–computer interface. eLife 6, e18554 (2017).

  4. Guenther, F. H. et al. A wireless brain–machine interface for real-time speech synthesis. PLoS ONE 4, e8218 (2009).

  5. Bocquelet, F., Hueber, T., Girin, L., Savariaux, C. & Yvert, B. Real-time control of an articulatory-based speech synthesizer for brain computer interfaces. PLOS Comput. Biol. 12, e1005119 (2016).

  6. Browman, C. P. & Goldstein, L. Articulatory phonology: an overview. Phonetica 49, 155–180 (1992).

  7. Sadtler, P. T. et al. Neural constraints on learning. Nature 512, 423–426 (2014).

  8. Golub, M. D. et al. Learning by neural reassociation. Nat. Neurosci. 21, 607–616 (2018).

  9. Graves, A. & Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 18, 602–610 (2005).

  10. Crone, N. E. et al. Electrocorticographic gamma activity during word production in spoken and sign language. Neurology 57, 2045–2053 (2001).

  11. Nourski, K. V. et al. Sound identification in human auditory cortex: differential contribution of local field potentials and high gamma power as revealed by direct intracranial recordings. Brain Lang. 148, 37–50 (2015).

  12. Pesaran, B. et al. Investigating large-scale brain dynamics using field potential recordings: analysis and interpretation. Nat. Neurosci. 21, 903–919 (2018).

  13. Bouchard, K. E., Mesgarani, N., Johnson, K. & Chang, E. F. Functional organization of human sensorimotor cortex for speech articulation. Nature 495, 327–332 (2013).

  14. Mesgarani, N., Cheung, C., Johnson, K. & Chang, E. F. Phonetic feature encoding in human superior temporal gyrus. Science 343, 1006–1010 (2014).

  15. Flinker, A. et al. Redefining the role of Broca’s area in speech. Proc. Natl Acad. Sci. USA 112, 2871–2875 (2015).

  16. Chartier, J., Anumanchipalli, G. K., Johnson, K. & Chang, E. F. Encoding of articulatory kinematic trajectories in human speech sensorimotor cortex. Neuron 98, 1042–1054 (2018).

  17. Mugler, E. M. et al. Differential representation of articulatory gestures and phonemes in precentral and inferior frontal gyri. J. Neurosci. 38, 9803–9813 (2018).

  18. Huggins, J. E., Wren, P. A. & Gruis, K. L. What would brain–computer interface users want? Opinions and priorities of potential users with amyotrophic lateral sclerosis. Amyotroph. Lateral Scler. 12, 318–324 (2011).

  19. Luce, P. A. & Pisoni, D. B. Recognizing spoken words: the neighborhood activation model. Ear Hear. 19, 1–36 (1998).

  20. Wrench, A. MOCHA: multichannel articulatory database. http://www.cstr.ed.ac.uk/research/projects/artic/mocha.html (1999).

  21. Kominek, J., Schultz, T. & Black, A. Synthesizer voice quality of new languages calibrated with mean mel cepstral distortion. In Proc. The first workshop on Spoken Language Technologies for Under-resourced languages (SLTU-2008) 63–68 (2008).

  22. Davis, S. B. & Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. In Readings in speech recognition. IEEE Trans. Acoust. 28, 357–366 (1980).

  23. Gallego, J. A., Perich, M. G., Miller, L. E. & Solla, S. A. Neural manifolds for the control of movement. Neuron 94, 978–984 (2017).

  24. Sokal, R. R. & Rohlf, F. J. The comparison of dendrograms by objective methods. Taxon 11, 33–40 (1962).

  25. Brumberg, J. S. et al. Spatio-temporal progression of cortical activity related to continuous overt and covert speech production in a reading task. PLoS ONE 11, e0166872 (2016).

  26. Mugler, E. M. et al. Direct classification of all American English phonemes using signals from functional speech motor cortex. J. Neural Eng. 11, 035015 (2014).

  27. Herff, C. et al. Brain-to-text: decoding spoken phrases from phone representations in the brain. Front. Neurosci. 9, 217 (2015).

  28. Moses, D. A., Mesgarani, N., Leonard, M. K. & Chang, E. F. Neural speech recognition: continuous phoneme decoding using spatiotemporal representations of human cortical activity. J. Neural Eng. 13, 056004 (2016).

  29. Pasley, B. N. et al. Reconstructing speech from human auditory cortex. PLoS Biol. 10, e1001251 (2012).

  30. Akbari, H., Khalighinejad, B., Herrero, J. L., Mehta, A. D. & Mesgarani, N. Towards reconstructing intelligible speech from the human auditory cortex. Sci. Rep. 9, 874 (2019).

  31. Martin, S. et al. Decoding spectrotemporal features of overt and covert speech from the human cortex. Front. Neuroeng. 7, 14 (2014).

  32. Dichter, B. K., Breshears, J. D., Leonard, M. K. & Chang, E. F. The control of vocal pitch in human laryngeal motor cortex. Cell 174, 21–31 (2018).

  33. Wessberg, J. et al. Real-time prediction of hand trajectory by ensembles of cortical neurons in primates. Nature 408, 361–365 (2000).

  34. Serruya, M. D., Hatsopoulos, N. G., Paninski, L., Fellows, M. R. & Donoghue, J. P. Instant neural control of a movement signal. Nature 416, 141–142 (2002).

  35. Taylor, D. M., Tillery, S. I. & Schwartz, A. B. Direct cortical control of 3D neuroprosthetic devices. Science 296, 1829–1832 (2002).

  36. Hochberg, L. R. et al. Neuronal ensemble control of prosthetic devices by a human with tetraplegia. Nature 442, 164–171 (2006).

  37. Collinger, J. L. et al. High-performance neuroprosthetic control by an individual with tetraplegia. Lancet 381, 557–564 (2013).

  38. Aflalo, T. et al. Decoding motor imagery from the posterior parietal cortex of a tetraplegic human. Science 348, 906–910 (2015).

  39. Ajiboye, A. B. et al. Restoration of reaching and grasping movements through brain-controlled muscle stimulation in a person with tetraplegia: a proof-of-concept demonstration. Lancet 389, 1821–1830 (2017).

  40. Prahallad, K., Black, A. W. & Mosur, R. Sub-phonetic modeling for capturing pronunciation variations for conversational speech synthesis. In Proc. 2006 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP, 2006).

  41. Anumanchipalli, G. K., Prahallad, K. & Black, A. W. Festvox: tools for creation and analyses of large speech corpora. http://www.festvox.org (2011).

  42. Hamilton, L. S., Chang, D. L., Lee, M. B. & Chang, E. F. Semi-automated anatomical labeling and inter-subject warping of high-density intracranial recording electrodes in electrocorticography. Front. Neuroinform. 11, 62 (2017).

  43. Richmond, K., Hoole, P. & King, S. Announcing the electromagnetic articulography (day 1) subset of the mngu0 articulatory corpus. In Proc. Interspeech 2011 1505–1508 (2011).

  44. Paul, B. D. & Baker, M. J. The design for the Wall Street Journal-based CSR corpus. In Proc. Workshop on Speech and Natural Language (Association for Computational Linguistics, 1992).

  45. Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous systems. http://www.tensorflow.org (2015).

  46. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).

  47. Maia, R., Toda, T., Zen, H., Nankaku, Y. & Tokuda, K. An excitation model for HMM-based speech synthesis based on residual modeling. In Proc. 6th ISCA Speech synthesis Workshop (SSW6) 131–136 (2007).

  48. Wolters, M. K., Isaac, K. B. & Renals, S. Evaluating speech synthesis intelligibility using Amazon Mechanical Turk. In Proc. 7th ISCA Speech Synthesis Workshop (SSW7) (2010).

  49. Berndt, D. J. & Clifford, J. Using dynamic time warping to find patterns in time series. In Proc. 10th ACM Knowledge Discovery and Data Mining (KDD) Workshop 359–370 (1994).

Acknowledgements

We thank M. Leonard, N. Fox and D. Moses for comments on the manuscript and B. Speidel for his help reconstructing MRI images. This work was supported by grants from the NIH (DP2 OD008627 and U01 NS098971-01). E.F.C. is a New York Stem Cell Foundation-Robertson Investigator. This research was also supported by The William K. Bowes Foundation, the Howard Hughes Medical Institute, The New York Stem Cell Foundation and The Shurl and Kay Curci Foundation.

Reviewer information

Nature thanks David Poeppel and the other anonymous reviewer(s) for their contribution to the peer review of this work.

Author information

Author notes

  1. These authors contributed equally: Gopala K. Anumanchipalli, Josh Chartier

Affiliations

  1. Department of Neurological Surgery, University of California San Francisco, San Francisco, CA, USA

    Gopala K. Anumanchipalli, Josh Chartier & Edward F. Chang

  2. Weill Institute for Neurosciences, University of California San Francisco, San Francisco, CA, USA

    Gopala K. Anumanchipalli, Josh Chartier & Edward F. Chang

  3. University of California Berkeley and University of California San Francisco Joint Program in Bioengineering, Berkeley, CA, USA

    Josh Chartier & Edward F. Chang

Authors

  1. Gopala K. Anumanchipalli

  2. Josh Chartier

  3. Edward F. Chang

Contributions

G.K.A., J.C. and E.F.C. conceived the study; G.K.A. inferred articulatory kinematics; G.K.A. and J.C. designed the decoder; J.C. performed decoder analyses; G.K.A., E.F.C. and J.C. collected data and prepared the manuscript; E.F.C. supervised the project.

Competing interests

The authors declare no competing interests.

Corresponding author

Correspondence to Edward F. Chang.

Extended data figures and tables

  1. Extended Data Fig. 1 Median original and decoded spectrograms.

    a, b, Median spectrograms, time-locked to the acoustic onset of phonemes from original (a) and decoded (b) audio (/i/, n = 112; /z/, n = 115; /p/, n = 69; /ae/, n = 86). These phonemes represent the diversity of spectral features. Original and decoded median phoneme spectrograms were well-correlated (Pearson’s r > 0.9 for all phonemes, P = 1 × 10^−18).

  2. Extended Data Fig. 2 Transcription WER for individual trials.

    a, b, WERs for individually transcribed trials for pools with a size of 25 (a) or 50 (b) words. Listeners transcribed synthesized sentences by selecting words from a defined pool of words. Word pools included correct words found in the synthesized sentence and random words from the test set. One trial is one transcription of one listener of one synthesized sentence.

  3. Extended Data Fig. 3 Electrode array locations for participants.

    MRI reconstructions of participants’ brains with overlay of electrocorticographic (ECoG) electrode array locations. P1–5, participants 1–5.

  4. Extended Data Fig. 4 Decoding performance of kinematic and spectral features.

    Data from participant 1. a, Correlations of all 33 decoded articulatory kinematic features with ground-truth (n = 101 sentences). EMA features represent x and y coordinate traces of articulators (lips, jaw and three points of the tongue) along the midsagittal plane of the vocal tract. Manner features represent complementary kinematic features to EMA that further describe acoustically consequential movements. b, Correlations of all 32 decoded spectral features with ground-truth (n = 101 sentences). MFCC features are 25 mel-frequency cepstral coefficients that describe power in perceptually relevant frequency bands. Synthesis features describe glottal excitation weights necessary for speech synthesis. Box plots as described in Fig. 2.

  5. Extended Data Fig. 5 Comparison of cumulative variance explained in kinematic and acoustic state–spaces.

    For each representation of speech—kinematics and acoustics—a principal components analysis was computed and the explained variance for each additional principal component was cumulatively summed. Kinematic and acoustic representations had 33 and 32 features, respectively.

  6. Extended Data Fig. 6 Decoded phoneme acoustic similarity matrix.

    Acoustic similarity matrix compares acoustic properties of decoded phonemes and originally spoken phonemes. Similarity is computed by first estimating a Gaussian kernel density for each phoneme (both decoded and original) and then computing the Kullback–Leibler (KL) divergence between a pair of decoded and original phoneme distributions. Each row compares the acoustic properties of a decoded phoneme with originally spoken phonemes (columns). Hierarchical clustering was performed on the resulting similarity matrix. Data from participant 1.

  7. Extended Data Fig. 7 Ground-truth acoustic similarity matrix.

    The acoustic properties of ground-truth spoken phonemes are compared with one another. Similarity is computed by first estimating a Gaussian kernel density for each phoneme and then computing the Kullback–Leibler divergence between a pair of phoneme distributions. Each row compares the acoustic properties of two ground-truth spoken phonemes. Hierarchical clustering was performed on the resulting similarity matrix. Data from participant 1.

  8. Extended Data Fig. 8 Comparison between decoding novel and repeated sentences.

    a, b, Comparison metrics included spectral distortion (a) and the correlation between decoded and original spectral features (b). Decoder performance for these two types of sentences was compared and no significant difference was found (P = 0.36 (a) and P = 0.75 (b), n = 51 sentences, Wilcoxon signed-rank test). A novel sentence consists of words and/or a word sequence not present in the training data. A repeated sentence is a sentence that has at least one matching word sequence in the training data, although with a unique production. Comparison was performed on participant 1 and the evaluated sentences were the same across both cases with two decoders trained on differing datasets to either exclude or include unique repeats of sentences in the test set. ns, not significant; P > 0.05. Box plots as described in Fig. 2.

  9. Extended Data Fig. 9 Kinematic state–space trajectories for phoneme-specific vowel–consonant transitions.

    Average trajectories of principal components 1 (PC1) and 2 (PC2) for transitions from either a consonant or a vowel to specific phonemes. Trajectories are 500 ms and centred at transition between phonemes. a, Consonant to corner vowels (n = 1,387, 1,964, 2,259, 894, respectively, for aa, ae, iy and uw). PC1 shows separation of all corner vowels and PC2 delineates between front vowels (iy, ae) and back vowels (uw, aa). b, Vowel to unvoiced plosives (n = 2,071, 4,107 and 1,441, respectively, for k, p and t). PC1 was more selective for velar constriction (k) and PC2 for bilabial constriction (p). c, Vowel to alveolars (n = 3,919, 3,010 and 4,107, respectively, for n, s and t). PC1 shows separation by manner of articulation (nasal, plosive or fricative) whereas PC2 is less discriminative. d, PC1 and PC2 show little, if any, delineation between voiced and unvoiced alveolar fricatives (n = 3,010 and 1,855, respectively, for s and z).
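Two of the evaluation measures named in the captions above, the word error rate (WER) of listener transcriptions and Pearson's correlation between decoded and ground-truth features, are standard and can be sketched in a few lines. The code below is an illustrative sketch using textbook definitions, not the authors' analysis code; the function names and the example sentences are hypothetical, and the WER here is the usual word-level edit distance divided by the reference length.

```python
import math


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical example: one substitution and one deletion in six words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(wer, 3))

# Hypothetical decoded vs. ground-truth trace for a single feature.
r = pearson_r([0.1, 0.4, 0.9, 1.2], [0.0, 0.5, 1.0, 1.1])
print(round(r, 3))
```

In the closed-vocabulary setting described for Extended Data Fig. 2, the hypothesis would be the word sequence a listener selected from the pool and the reference would be the sentence that was synthesized; per-feature correlations as in Extended Data Fig. 4 would apply `pearson_r` to each decoded feature trace in turn.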

Supplementary information

  1. Supplementary Information

    This file contains: a) Place-manner tuples used to augment EMA trajectories; b) Sentences used in listening tests (original source: MOCHA-TIMIT dataset, ref. 20); c) Class sizes for the listening tests; d) Transcription interface for the intelligibility assessment; and e) Number of listeners used for intelligibility assessments.

  2. Reporting Summary

  3. Supplemental Video 1: Examples of decoded kinematics and synthesized speech production

    The video presents examples of synthesized audio from neural recordings of spoken sentences. In each example, electrode activity corresponding to a sentence is displayed (top). Next, the simultaneously decoded kinematics and acoustics are presented visually and audibly. Decoded articulatory movements are displayed (middle left) as the synthesized speech spectrogram unfolds. Following the decoding, the original audio, as spoken by the patient during neural recording, is played. Lastly, the decoded movements and synthesized speech are presented once again. This format is repeated for a total of five examples (from participants P1 and P2). In the last example, kinematics and audio are also decoded and synthesized for silently mimed speech.

About this article

DOI

https://doi.org/10.1038/s41586-019-1119-1
