Speech synthesis from neural decoding of spoken sentences

Anumanchipalli, Gopala K.; Chartier, Josh; Chang, Edward F.

doi:10.1038/s41586-019-1119-1

Article
Published: 24 April 2019

Speech synthesis from neural decoding of spoken sentences

Gopala K. Anumanchipalli^1,2^na1,
Josh Chartier^1,2,3^na1 &
Edward F. Chang^1,2,3

Nature volume 568, pages 493–498 (2019)Cite this article

79k Accesses
447 Citations
3068 Altmetric
Metrics details

Subjects

Abstract

Technology that translates neural activity into speech would be transformative for people who are unable to communicate as a result of neurological impairments. Decoding speech from neural activity is challenging because speaking requires very precise and rapid multi-dimensional control of vocal tract articulators. Here we designed a neural decoder that explicitly leverages kinematic and sound representations encoded in human cortical activity to synthesize audible speech. Recurrent neural networks first decoded directly recorded cortical activity into representations of articulatory movement, and then transformed these representations into speech acoustics. In closed vocabulary tests, listeners could readily identify and transcribe speech synthesized from cortical activity. Intermediate articulatory dynamics enhanced performance even with limited data. Decoded articulatory representations were highly conserved across speakers, enabling a component of the decoder to be transferrable across participants. Furthermore, the decoder could synthesize speech when a participant silently mimed sentences. These findings advance the clinical viability of using speech neuroprosthetic technology to restore spoken communication.

Access through your institution

Buy or subscribe

This is a preview of subscription content, access via your institution

Access options

Access through your institution

Buy this article

Purchase on Springer Link
Instant access to full article PDF

Buy now

Prices may be subject to local taxes which are calculated during checkout

**Fig. 1: Speech synthesis from neurally decoded spoken sentences.**

**Fig. 2: Synthesized speech intelligibility and feature-specific performance.**

**Fig. 3: Speech synthesis from neural decoding of silently mimed speech.**

**Fig. 4: Kinematic state–space representation of speech production.**

A neural speech decoding framework leveraging deep learning and speech synthesis

Article Open access 08 April 2024

Real-time synthesis of imagined speech processes from minimally invasive recordings of neural activity

Article Open access 23 September 2021

A high-performance speech neuroprosthesis

Article Open access 23 August 2023

Data availability

The data that support the findings of this study are available from the corresponding author upon request.

Code availability

All code may be freely obtained for non-commercial use by contacting the corresponding author.

References

Fager, S. K., Fried-Oken, M., Jakobs, T. & Beukelman, D. R. New and emerging access technologies for adults with complex communication needs and severe motor impairments: state of the science. Augment. Altern. Commun. https://doi.org/10.1080/07434618.2018.1556730 (2019).
Article Google Scholar
Brumberg, J. S., Pitt, K. M., Mantie-Kozlowski, A. & Burnison, J. D. Brain–computer interfaces for augmentative and alternative communication: a tutorial. Am. J. Speech Lang. Pathol. 27, 1–12 (2018).
Article Google Scholar
Pandarinath, C. et al. High performance communication by people with paralysis using an intracortical brain–computer interface. eLife 6, e18554 (2017).
Article Google Scholar
Guenther, F. H. et al. A wireless brain–machine interface for real-time speech synthesis. PLoS ONE 4, e8218 (2009).
Article ADS Google Scholar
Bocquelet, F., Hueber, T., Girin, L., Savariaux, C. & Yvert, B. Real-time control of an articulatory-based speech synthesizer for brain computer interfaces. PLOS Comput. Biol. 12, e1005119 (2016).
Article ADS Google Scholar
Browman, C. P. & Goldstein, L. Articulatory phonology: an overview. Phonetica 49, 155–180 (1992).
Article CAS Google Scholar
Sadtler, P. T. et al. Neural constraints on learning. Nature 512, 423–426 (2014).
Article ADS CAS Google Scholar
Golub, M. D. et al. Learning by neural reassociation. Nat. Neurosci. 21, 607–616 (2018).
Article CAS Google Scholar
Graves, A. & Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 18, 602–610 (2005).
Article Google Scholar
Crone, N. E. et al. Electrocorticographic gamma activity during word production in spoken and sign language. Neurology 57, 2045–2053 (2001).
Article CAS Google Scholar
Nourski, K. V. et al. Sound identification in human auditory cortex: differential contribution of local field potentials and high gamma power as revealed by direct intracranial recordings. Brain Lang. 148, 37–50 (2015).
Article Google Scholar
Pesaran, B. et al. Investigating large-scale brain dynamics using field potential recordings: analysis and interpretation. Nat. Neurosci. 21, 903–919 (2018).
Article CAS Google Scholar
Bouchard, K. E., Mesgarani, N., Johnson, K. & Chang, E. F. Functional organization of human sensorimotor cortex for speech articulation. Nature 495, 327–332 (2013).
Article ADS CAS Google Scholar
Mesgarani, N., Cheung, C., Johnson, K. & Chang, E. F. Phonetic feature encoding in human superior temporal gyrus. Science 343, 1006–1010 (2014).
Article ADS CAS Google Scholar
Flinker, A. et al. Redefining the role of Broca’s area in speech. Proc. Natl Acad. Sci. USA 112, 2871–2875 (2015).
Article ADS CAS Google Scholar
Chartier, J., Anumanchipalli, G. K., Johnson, K. & Chang, E. F. Encoding of articulatory kinematic trajectories in human speech sensorimotor cortex. Neuron 98, 1042–1054 (2018).
Article CAS Google Scholar
Mugler, E. M. et al. Differential representation of articulatory gestures and phonemes in precentral and inferior frontal gyri. J. Neurosci. 38, 9803–9813 (2018).
Article CAS Google Scholar
Huggins, J. E., Wren, P. A. & Gruis, K. L. What would brain–computer interface users want? Opinions and priorities of potential users with amyotrophic lateral sclerosis. Amyotroph. Lateral Scler. 12, 318–324 (2011).
Article Google Scholar
Luce, P. A. & Pisoni, D. B. Recognizing spoken words: the neighborhood activation model. Ear Hear. 19, 1–36 (1998).
Article CAS Google Scholar
Wrench, A. MOCHA: multichannel articulatory database. http://www.cstr.ed.ac.uk/research/projects/artic/mocha.html (1999).
Kominek, J., Schultz, T. & Black, A. Synthesizer voice quality of new languages calibrated with mean mel cepstral distortion. In Proc. The first workshop on Spoken Language Technologies for Under-resourced languages (SLTU-2008) 63–68 (2008).
Davis, S. B. & Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. In Readings in speech recognition. IEEE Trans. Acoust. 28, 357–366 (1980).
Article Google Scholar
Gallego, J. A., Perich, M. G., Miller, L. E. & Solla, S. A. Neural manifolds for the control of movement. Neuron 94, 978–984 (2017).
Article CAS Google Scholar
Sokal, R. R. & Rohlf, F. J. The comparison of dendrograms by objective methods. Taxon 11, 33–40 (1962).
Article Google Scholar
Brumberg, J. S. et al. Spatio-temporal progression of cortical activity related to continuous overt and covert speech production in a reading task. PLoS ONE 11, e0166872 (2016).
Article Google Scholar
Mugler, E. M. et al. Direct classification of all American English phonemes using signals from functional speech motor cortex. J. Neural Eng. 11, 035015 (2014).
Article ADS Google Scholar
Herff, C. et al. Brain-to-text: decoding spoken phrases from phone representations in the brain. Front. Neurosci. 9, 217 (2015).
Article Google Scholar
Moses, D. A., Mesgarani, N., Leonard, M. K. & Chang, E. F. Neural speech recognition: continuous phoneme decoding using spatiotemporal representations of human cortical activity. J. Neural Eng. 13, 056004 (2016).
Article ADS Google Scholar
Pasley, B. N. et al. Reconstructing speech from human auditory cortex. PLoS Biol. 10, e1001251 (2012).
Article CAS Google Scholar
Akbari, H., Khalighinejad, B., Herrero, J. L., Mehta, A. D. & Mesgarani, N. Towards reconstructing intelligible speech from the human auditory cortex. Sci. Rep. 9, 874 (2019).
Article ADS Google Scholar
Martin, S. et al. Decoding spectrotemporal features of overt and covert speech from the human cortex. Front. Neuroeng. 7, 14 (2014).
Article Google Scholar
Dichter, B. K., Breshears, J. D., Leonard, M. K. & Chang, E. F. The control of vocal pitch in human laryngeal motor cortex. Cell 174, 21–31 (2018).
Article CAS Google Scholar
Wessberg, J. et al. Real-time prediction of hand trajectory by ensembles of cortical neurons in primates. Nature 408, 361–365 (2000).
Article ADS CAS Google Scholar
Serruya, M. D., Hatsopoulos, N. G., Paninski, L., Fellows, M. R. & Donoghue, J. P. Instant neural control of a movement signal. Nature 416, 141–142 (2002).
Article ADS CAS Google Scholar
Taylor, D. M., Tillery, S. I. & Schwartz, A. B. Direct cortical control of 3D neuroprosthetic devices. Science 296, 1829–1832 (2002).
Article ADS CAS Google Scholar
Hochberg, L. R. et al. Neuronal ensemble control of prosthetic devices by a human with tetraplegia. Nature 442, 164–171 (2006).
Article ADS CAS Google Scholar
Collinger, J. L. et al. High-performance neuroprosthetic control by an individual with tetraplegia. Lancet 381, 557–564 (2013).
Article Google Scholar
Aflalo, T. et al. Decoding motor imagery from the posterior parietal cortex of a tetraplegic human. Science 348, 906–910 (2015).
Article ADS CAS Google Scholar
Ajiboye, A. B. et al. Restoration of reaching and grasping movements through brain-controlled muscle stimulation in a person with tetraplegia: a proof-of-concept demonstration. Lancet 389, 1821–1830 (2017).
Article Google Scholar
Prahallad, K., Black, A. W. & Mosur, R. Sub-phonetic modeling for capturing pronunciation variations for conversational speech synthesis. In Proc. 2006 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP, 2006).
Anumanchipalli, G. K., Prahallad, K. & Black, A. W. Festvox: tools for creation and analyses of large speech corpora. http://www.festvox.org (2011).
Hamilton, L. S., Chang, D. L., Lee, M. B. & Chang, E. F. Semi-automated anatomical labeling and inter-subject warping of high-density intracranial recording electrodes in electrocorticography. Front. Neuroinform. 11, 62 (2017).
Article ADS Google Scholar
Richmond, K., Hoole, P. & King, S. Announcing the electromagnetic articulography (day 1) subset of the mngu0 articulatory corpus. In Proc. Interspeech 2011 1505–1508 (2011).
Paul, B. D. & Baker, M. J. The design for the Wall Street Journal-based CSR corpus. In Proc. Workshop on Speech and Natural Language (Association for Computational Linguistics, 1992).
Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous systems. http://www.tensorflow.org (2015).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Article CAS Google Scholar
Maia, R., Toda, T., Zen, H., Nankaku, Y. & Tokuda, K. An excitation model for HMM-based speech synthesis based on residual modeling. In Proc. 6th ISCA Speech synthesis Workshop (SSW6) 131–136 (2007).
Wolters, M. K., Isaac, K. B. & Renals, S. Evaluating speech synthesis intelligibility using Amazon Mechanical Turk. In Proc. 7th ISCA Speech Synthesis Workshop (SSW7) (2010).
Berndt, D. J. & Clifford, J. Using dynamic time warping to find patterns in time series. In Proc. 10th ACM Knowledge Discovery and Data Mining (KDD) Workshop 359–370 (1994).

Download references

Acknowledgements

We thank M. Leonard, N. Fox and D. Moses for comments on the manuscript and B. Speidel for his help reconstructing MRI images. This work was supported by grants from the NIH (DP2 OD008627 and U01 NS098971-01). E.F.C. is a New York Stem Cell Foundation-Robertson Investigator. This research was also supported by The William K. Bowes Foundation, the Howard Hughes Medical Institute, The New York Stem Cell Foundation and The Shurl and Kay Curci Foundation.

Reviewer information

Nature thanks David Poeppel and the other anonymous reviewer(s) for their contribution to the peer review of this work.

Author information

These authors contributed equally: Gopala K. Anumanchipalli, Josh Chartier

Authors and Affiliations

Department of Neurological Surgery, University of California San Francisco, San Francisco, CA, USA
Gopala K. Anumanchipalli, Josh Chartier & Edward F. Chang
Weill Institute for Neurosciences, University of California San Francisco, San Francisco, CA, USA
Gopala K. Anumanchipalli, Josh Chartier & Edward F. Chang
University of California Berkeley and University of California San Francisco Joint Program in Bioengineering, Berkeley, CA, USA
Josh Chartier & Edward F. Chang

Authors

Gopala K. Anumanchipalli
View author publications
You can also search for this author in PubMed Google Scholar
Josh Chartier
View author publications
You can also search for this author in PubMed Google Scholar
Edward F. Chang
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

G.K.A., J.C. and E.F.C. conceived the study; G.K.A. inferred articulatory kinematics; G.K.A. and J.C. designed the decoder; J.C. performed decoder analyses; G.K.A., E.F.C. and J.C. collected data and prepared the manuscript; E.F.C. supervised the project.

Corresponding author

Correspondence to Edward F. Chang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Median original and decoded spectrograms.

a, b, Median spectrograms, time-locked to the acoustic onset of phonemes from original (a) and decoded (b) audio (/i/, n = 112; /z/, n = 115; /p/, n = 69, /ae/, n = 86). These phonemes represent the diversity of spectral features. Original and decoded median phoneme spectrograms were well-correlated (Pearson’s r > 0.9 for all phonemes, P = 1 × 10⁻¹⁸).

Extended Data Fig. 2 Transcription WER for individual trials.

a, b, WERs for individually transcribed trials for pools with a size of 25 (a) or 50 (b) words. Listeners transcribed synthesized sentences by selecting words from a defined pool of words. Word pools included correct words found in the synthesized sentence and random words from the test set. One trial is one transcription of one listener of one synthesized sentence.

Extended Data Fig. 3 Electrode array locations for participants.

MRI reconstructions of participants’ brains with overlay of electrocorticographic electrode (ECoG) array locations. P1–5, participants 1–5.

Extended Data Fig. 4 Decoding performance of kinematic and spectral features.

Data from participant 1. a, Correlations of all 33 decoded articulatory kinematic features with ground-truth (n = 101 sentences). EMA features represent x and y coordinate traces of articulators (lips, jaw and three points of the tongue) along the midsagittal plane of the vocal tract. Manner features represent complementary kinematic features to EMA that further describe acoustically consequential movements. b, Correlations of all 32 decoded spectral features with ground-truth (n = 101 sentences). MFCC features are 25 mel-frequency cepstral coefficients that describe power in perceptually relevant frequency bands. Synthesis features describe glottal excitation weights necessary for speech synthesis. Box plots as described in Fig. 2.

Extended Data Fig. 5 Comparison of cumulative variance explained in kinematic and acoustic state–spaces.

For each representation of speech—kinematics and acoustics—a principal components analysis was computed and the explained variance for each additional principal component was cumulatively summed. Kinematic and acoustic representations had 33 and 32 features, respectively.

Extended Data Fig. 6 Decoded phoneme acoustic similarity matrix.

Acoustic similarity matrix compares acoustic properties of decoded phonemes and originally spoken phonemes. Similarity is computed by first estimating a Gaussian kernel density for each phoneme (both decoded and original) and then computing the Kullback–Leibler (KL) divergence between a pair of decoded and original phoneme distributions. Each row compares the acoustic properties of a decoded phoneme with originally spoken phonemes (columns). Hierarchical clustering was performed on the resulting similarity matrix. Data from participant 1.

Extended Data Fig. 7 Ground-truth acoustic similarity matrix.

The acoustic properties of ground-truth spoken phonemes are compared with one another. Similarity is computed by first estimating a Gaussian kernel density for each phoneme and then computing the Kullback–Leibler divergence between a pair of a phoneme distributions. Each row compares the acoustic properties of two ground-truth spoken phonemes. Hierarchical clustering was performed on the resulting similarity matrix. Data from participant 1.

Extended Data Fig. 8 Comparison between decoding novel and repeated sentences.

a, b, Comparison metrics included spectral distortion (a) and the correlation between decoded and original spectral features (b). Decoder performance for these two types of sentences was compared and no significant difference was found (P = 0.36 (a) and P = 0.75 (b), n = 51 sentences, Wilcoxon signed-rank test). A novel sentence consists of words and/or a word sequence not present in the training data. A repeated sentence is a sentence that has at least one matching word sequence in the training data, although with a unique production. Comparison was performed on participant 1 and the evaluated sentences were the same across both cases with two decoders trained on differing datasets to either exclude or include unique repeats of sentences in the test set. ns, not significant; P > 0.05. Box plots as described in Fig. 2.

Extended Data Fig. 9 Kinematic state–space trajectories for phoneme-specific vowel–consonant transitions.

Average trajectories of principal components 1 (PC1) and 2 (PC2) for transitions from either a consonant or a vowel to specific phonemes. Trajectories are 500 ms and centred at transition between phonemes. a, Consonant to corner vowels (n = 1,387, 1,964, 2,259, 894, respectively, for aa, ae, iy and uw). PC1 shows separation of all corner vowels and PC2 delineates between front vowels (iy, ae) and back vowels (uw, aa). b, Vowel to unvoiced plosives (n = 2,071, 4,107 and 1,441, respectively, for k, p and t). PC1 was more selective for velar constriction (k) and PC2 for bilabial constriction (p). c, Vowel to alveolars (n = 3,919, 3,010 and 4,107, respectively, for n, s and t). PC1 shows separation by manner of articulation (nasal, plosive or fricative) whereas PC2 is less discriminative. d, PC1 and PC2 show little, if any, delineation between voiced and unvoiced alveolar fricatives (n = 3,010 and 1,855, respectively, for s and z).

Supplementary information

Supplementary Information

This file contains: a) Place-manner tuples used to augment EMA trajectories; b) Sentences used in listening tests Original Source: MOCHA-TIMIT20 dataset; c) Class sizes for the listening tests; d) Transcription interface for the intelligibility assessment; and e) Number of listeners used for intelligibility assessments.

Reporting Summary

Supplemental Video 1: Examples of decoded kinematics and synthesized speech production

The video presents examples of synthesized audio from neural recordings of spoken sentences. In each example, electrode activity corresponding to a sentence is displayed (top). Next, simultaneous decoding of kinematics and acoustics are visually and audible presented. Decoded articulatory movements are displayed (middle left) as the synthesized speech spectrogram unfolds. Following the decoding, the original audio, as spoken by the patient during neural recording, is played. Lastly, the decoded movements and synthesized speech is once again presented. This format is repeated for a total of five examples (from participants P1 and P2). On the last example, kinematics and audio are also decoded and synthesized for silently mimed speech.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Anumanchipalli, G.K., Chartier, J. & Chang, E.F. Speech synthesis from neural decoding of spoken sentences. Nature 568, 493–498 (2019). https://doi.org/10.1038/s41586-019-1119-1

Download citation

Received: 29 October 2018
Accepted: 21 March 2019
Published: 24 April 2019
Issue Date: 25 April 2019
DOI: https://doi.org/10.1038/s41586-019-1119-1

This article is cited by

Brain control of bimanual movement enabled by recurrent neural networks
- Darrel R. Deo
- Francis R. Willett
- Krishna V. Shenoy
Scientific Reports (2024)
Online speech synthesis using a chronically implanted brain–computer interface in an individual with ALS
- Miguel Angrick
- Shiyu Luo
- Nathan E. Crone
Scientific Reports (2024)
A neural speech decoding framework leveraging deep learning and speech synthesis
- Xupeng Chen
- Ran Wang
- Adeen Flinker
Nature Machine Intelligence (2024)
Single-neuronal elements of speech production in humans
- Arjun R. Khanna
- William Muñoz
- Ziv M. Williams
Nature (2024)
Artificial intelligence in neurology: opportunities, challenges, and policy implications
- Sebastian Voigtlaender
- Johannes Pawelczyk
- Sebastian F. Winter
Journal of Neurology (2024)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.