A convolutional neural-network model of human cochlear mechanics and filter tuning for real-time applications

Abstract

Auditory models are commonly used as feature extractors for automatic speech-recognition systems or as front-ends for robotics, machine-hearing and hearing-aid applications. Although auditory models can capture the biophysical and nonlinear properties of human hearing in great detail, these biophysical models are computationally expensive and cannot be used in real-time applications. We present a hybrid approach where convolutional neural networks are combined with computational neuroscience to yield a real-time end-to-end model for human cochlear mechanics, including level-dependent filter tuning (CoNNear). The CoNNear model was trained on acoustic speech material and its performance and applicability were evaluated using (unseen) sound stimuli commonly employed in cochlear mechanics research. The CoNNear model accurately simulates human cochlear frequency selectivity and its dependence on sound intensity, an essential quality for robust speech intelligibility at negative speech-to-background-noise ratios. The CoNNear architecture is based on parallel and differentiable computations and has the power to achieve real-time human performance. These unique CoNNear features will enable the next generation of human-like machine-hearing applications.

A preprint version of the article is available at arXiv.


Fig. 1: CoNNear overview.
Fig. 2: CoNNear hyperparameter tuning.
Fig. 3: Comparing cochlear excitation patterns across model architectures.
Fig. 4: Effect of adding context to the CoNNear simulations.
Fig. 5: Generalizability of CoNNear to unseen input.
Fig. 6: Cochlear dispersion and DPOAEs.

Data availability

The source code of the TL-model v.1.1 used for training is available via https://doi.org/10.5281/zenodo.3717431 or https://github.com/HearingTechnology/Verhulstetal2018Model. The TIMIT speech corpus used for training can be found online (ref. 45). Most figures in this paper can be reproduced using the CoNNear model repository.

Code availability

The code for the trained CoNNear model, including instructions on how to execute it, is available from https://github.com/HearingTechnology/CoNNear_cochlea or https://doi.org/10.5281/zenodo.4056552. A non-commercial, academic Ghent University licence applies.

References

  1. von Békésy, G. Travelling waves as frequency analysers in the cochlea. Nature 225, 1207–1209 (1970).
  2. Narayan, S. S., Temchin, A. N., Recio, A. & Ruggero, M. A. Frequency tuning of basilar membrane and auditory nerve fibers in the same cochleae. Science 282, 1882–1884 (1998).
  3. Robles, L. & Ruggero, M. A. Mechanics of the mammalian cochlea. Physiol. Rev. 81, 1305–1352 (2001).
  4. Shera, C. A., Guinan, J. J. & Oxenham, A. J. Revised estimates of human cochlear tuning from otoacoustic and behavioral measurements. Proc. Natl Acad. Sci. USA 99, 3318–3323 (2002).
  5. Oxenham, A. J. & Shera, C. A. Estimates of human cochlear tuning at low levels using forward and simultaneous masking. J. Assoc. Res. Otolaryngol. 4, 541–554 (2003).
  6. Greenwood, D. D. A cochlear frequency-position function for several species—29 years later. J. Acoust. Soc. Am. 87, 2592–2605 (1990).
  7. Jepsen, M. L. & Dau, T. Characterizing auditory processing and perception in individual listeners with sensorineural hearing loss. J. Acoust. Soc. Am. 129, 262–281 (2011).
  8. Bondy, J., Becker, S., Bruce, I., Trainor, L. & Haykin, S. A novel signal-processing strategy for hearing-aid design: neurocompensation. Sig. Process. 84, 1239–1253 (2004).
  9. Ewert, S. D., Kortlang, S. & Hohmann, V. A model-based hearing aid: psychoacoustics, models and algorithms. Proc. Meet. Acoust. 19, 050187 (2013).
  10. Mondol, S. & Lee, S. A machine learning approach to fitting prescription for hearing aids. Electronics 8, 736 (2019).
  11. Lyon, R. F. Human and Machine Hearing: Extracting Meaning from Sound (Cambridge Univ. Press, 2017).
  12. Baby, D. & Van hamme, H. Investigating modulation spectrogram features for deep neural network-based automatic speech recognition. In Proc. Interspeech 2479–2483 (ISCA, 2015).
  13. de Boer, E. Auditory physics. Physical principles in hearing theory. I. Phys. Rep. 62, 87–174 (1980).
  14. Diependaal, R. J., Duifhuis, H., Hoogstraten, H. W. & Viergever, M. A. Numerical methods for solving one-dimensional cochlear models in the time domain. J. Acoust. Soc. Am. 82, 1655–1666 (1987).
  15. Zweig, G. Finding the impedance of the organ of Corti. J. Acoust. Soc. Am. 89, 1229–1254 (1991).
  16. Talmadge, C. L., Tubis, A., Wit, H. P. & Long, G. R. Are spontaneous otoacoustic emissions generated by self-sustained cochlear oscillators? J. Acoust. Soc. Am. 89, 2391–2399 (1991).
  17. Moleti, A. et al. Transient evoked otoacoustic emission latency and estimates of cochlear tuning in preterm neonates. J. Acoust. Soc. Am. 124, 2984–2994 (2008).
  18. Epp, B., Verhey, J. L. & Mauermann, M. Modeling cochlear dynamics: interrelation between cochlea mechanics and psychoacoustics. J. Acoust. Soc. Am. 128, 1870–1883 (2010).
  19. Verhulst, S., Dau, T. & Shera, C. A. Nonlinear time-domain cochlear model for transient stimulation and human otoacoustic emission. J. Acoust. Soc. Am. 132, 3842–3848 (2012).
  20. Zweig, G. Nonlinear cochlear mechanics. J. Acoust. Soc. Am. 139, 2561–2578 (2016).

  21. Hohmann, V. in Handbook of Signal Processing in Acoustics (eds Havelock, D. et al.) 205–212 (Springer, 2008).
  22. Rascon, C. & Meza, I. Localization of sound sources in robotics: a review. Robot. Auton. Syst. 96, 184–210 (2017).
  23. Morgan, N., Bourlard, H. & Hermansky, H. in Speech Processing in the Auditory System (eds Greenberg, S. et al.) 309–338 (Springer, 2004).
  24. Patterson, R. D., Allerhand, M. H. & Giguère, C. Time-domain modeling of peripheral auditory processing: a modular architecture and a software platform. J. Acoust. Soc. Am. 98, 1890–1894 (1995).
  25. Shera, C. A. Frequency glides in click responses of the basilar membrane and auditory nerve: their scaling behavior and origin in traveling-wave dispersion. J. Acoust. Soc. Am. 109, 2023–2034 (2001).
  26. Shera, C. A. & Guinan, J. J. in Active Processes and Otoacoustic Emissions in Hearing (eds Manley, A. et al.) 305–342 (Springer, 2008).
  27. Hohmann, V. Frequency analysis and synthesis using a Gammatone filterbank. Acta Acust. United Acust. 88, 433–442 (2002).
  28. Saremi, A. et al. A comparative study of seven human cochlear filter models. J. Acoust. Soc. Am. 140, 1618–1634 (2016).
  29. Lopez-Poveda, E. A. & Meddis, R. A human nonlinear cochlear filterbank. J. Acoust. Soc. Am. 110, 3107–3118 (2001).
  30. Lyon, R. F. Cascades of two-pole-two-zero asymmetric resonators are good models of peripheral auditory function. J. Acoust. Soc. Am. 130, 3893–3904 (2011).
  31. Saremi, A. & Lyon, R. F. Quadratic distortion in a nonlinear cascade model of the human cochlea. J. Acoust. Soc. Am. 143, EL418–EL424 (2018).
  32. Altoè, A., Charaziak, K. K. & Shera, C. A. Dynamics of cochlear nonlinearity: automatic gain control or instantaneous damping? J. Acoust. Soc. Am. 142, 3510–3519 (2017).
  33. Baby, D. & Verhulst, S. SERGAN: speech enhancement using relativistic generative adversarial networks with gradient penalty. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 106–110 (2019).
  34. Pascual, S., Bonafonte, A. & Serrà, J. SEGAN: speech enhancement generative adversarial network. In Interspeech 2017 3642–3646 (ISCA, 2017).
  35. Drakopoulos, F., Baby, D. & Verhulst, S. Real-time audio processing on a Raspberry Pi using deep neural networks. In 23rd International Congress on Acoustics (ICA) (2019).
  36. Altoè, A., Pulkki, V. & Verhulst, S. Transmission line cochlear models: Improved accuracy and efficiency. J. Acoust. Soc. Am. 136, EL302–EL308 (2014).
  37. Verhulst, S., Altoè, A. & Vasilkov, V. Computational modeling of the human auditory periphery: auditory-nerve responses, evoked potentials and hearing loss. Hear. Res. 360, 55–75 (2018).
  38. Oxenham, A. J. & Wojtczak, M. in Oxford Handbook of Auditory Science: Hearing (ed. Plack, C. J.) Ch. 2 (Oxford Univ. Press, 2010); https://doi.org/10.1093/oxfordhb/9780199233557.013.0002
  39. Robles, L., Ruggero, M. A. & Rich, N. C. Two-tone distortion in the basilar membrane of the cochlea. Nature 349, 413 (1991).
  40. Ren, T. Longitudinal pattern of basilar membrane vibration in the sensitive cochlea. Proc. Natl Acad. Sci. 99, 17101–17106 (2002).

  41. Precise and Full-Range Determination of Two-Dimensional Equal Loudness Contours (International Organization for Standardization, 2003).
  42. Lorenzi, C., Gilbert, G., Carn, H., Garnier, S. & Moore, B. C. Speech perception problems of the hearing impaired reflect inability to use temporal fine structure. Proc. Natl Acad. Sci. USA 103, 18866–18869 (2006).
  43. Isola, P., Zhu, J. Y., Zhou, T. & Efros, A. A. Image-to-image translation with conditional adversarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 5967–5976 (2017).
  44. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
  45. Garofolo, J. S. et al. DARPA TIMIT: Acoustic-Phonetic Continuous Speech Corpus CD-ROM (Linguistic Data Consortium, 1993).
  46. Shera, C. A., Guinan, J. J. & Oxenham, A. J. Otoacoustic estimation of cochlear tuning: validation in the chinchilla. J. Assoc. Res. Otolaryngol. 11, 343–365 (2010).
  47. Russell, I., Cody, A. & Richardson, G. The responses of inner and outer hair cells in the basal turn of the guinea-pig cochlea and in the mouse cochlea grown in vitro. Hear. Res. 22, 199–216 (1986).
  48. Houben, R. et al. Development of a Dutch matrix sentence test to assess speech intelligibility in noise. Int. J. Audiol. 53, 760–763 (2014).
  49. Gemmeke, J. F. et al. Audio set: an ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 776–780 (IEEE, 2017).
  50. Paul, D. B. & Baker, J. M. The design for the Wall Street Journal-based CSR corpus. In Second International Conference on Spoken Language Processing, ICSLP (ISCA, 1992).
  51. Dorn, P. A. et al. Distortion product otoacoustic emission input/output functions in normal-hearing and hearing-impaired human ears. J. Acoust. Soc. Am. 110, 3119–3131 (2001).
  52. Janssen, T. & Müller, J. in Active Processes and Otoacoustic Emissions in Hearing 421–460 (Springer, 2008).
  53. Verhulst, S., Ernst, F., Garrett, M. & Vasilkov, V. Suprathreshold psychoacoustics and envelope-following response relations: Normal-hearing, synaptopathy and cochlear gain loss. Acta Acust. United Acust. 104, 800–803 (2018).
  54. Verhulst, S., Bharadwaj, H. M., Mehraei, G., Shera, C. A. & Shinn-Cunningham, B. G. Functional modeling of the human auditory brainstem response to broadband stimulation. J. Acoust. Soc. Am. 138, 1637–1659 (2015).
  55. Kell, A. J., Yamins, D. L., Shook, E. N., Norman-Haignere, S. V. & McDermott, J. H. A task-optimized neural network replicates human auditory behavior, predicts brain responses, and reveals a cortical processing hierarchy. Neuron 98, 630–644 (2018).
  56. Akbari, H., Khalighinejad, B., Herrero, J. L., Mehta, A. D. & Mesgarani, N. Towards reconstructing intelligible speech from the human auditory cortex. Sci. Rep. 9, 1–12 (2019).
  57. Kell, A. J. & McDermott, J. H. Deep neural network models of sensory systems: windows onto the role of task constraints. Curr. Opin. Neurobiol. 55, 121–132 (2019).
  58. Amsalem, O. et al. An efficient analytical reduction of detailed nonlinear neuron models. Nat. Commun. 11, 1–13 (2020).
  59. Richards, B. A. et al. A deep learning framework for neuroscience. Nat. Neurosci. 22, 1761–1770 (2019).
  60. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. In Proc. International Conference on Learning Representations (ICLR, 2015).
  61. Chollet, F. et al. Keras v.2.3.1 (2015); https://keras.io
  62. Abadi, M. et al. TensorFlow v.1.13.2 (2015); https://www.tensorflow.org/
  63. Moore, B. C. & Glasberg, B. R. Suggested formulae for calculating auditory-filter bandwidths and excitation patterns. J. Acoust. Soc. Am. 74, 750–753 (1983).
  64. Glasberg, B. R. & Moore, B. C. Derivation of auditory filter shapes from notched-noise data. Hear. Res. 47, 103–138 (1990).
  65. Raufer, S. & Verhulst, S. Otoacoustic emission estimates of human basilar membrane impulse response duration and cochlear filter tuning. Hear. Res. 342, 150–160 (2016).
  66. Ramamoorthy, S., Zha, D. J. & Nuttall, A. L. The biophysical origin of traveling-wave dispersion in the cochlea. Biophys. J. 99, 1687–1695 (2010).
  67. Dau, T., Wegner, O., Mellert, V. & Kollmeier, B. Auditory brainstem responses with optimized chirp signals compensating basilar-membrane dispersion. J. Acoust. Soc. Am. 107, 1530–1540 (2000).
  68. Neely, S. T., Johnson, T. A., Kopun, J., Dierking, D. M. & Gorga, M. P. Distortion-product otoacoustic emission input/output characteristics in normal-hearing and hearing-impaired human ears. J. Acoust. Soc. Am. 126, 728–738 (2009).
  69. Kummer, P., Janssen, T., Hulin, P. & Arnold, W. Optimal L1L2 primary tone level separation remains independent of test frequency in humans. Hear. Res. 146, 47–56 (2000).


Acknowledgements

This work was supported by the European Research Council (ERC) under the Horizon 2020 Research and Innovation Programme (grant agreement no. 678120 RobSpear). We thank C. Shera and S. Shera for their help with the final edits.

Author information

Contributions

D.B. and S.V. conceptualized the study, developed the methodology and wrote the original draft of the manuscript with visualisation support from A.V.D.B.; D.B. and A.V.D.B. developed the software and performed the investigation and validation. D.B. and A.V.D.B. completed the formal analysis and data curation. S.V. reviewed and edited the manuscript, supervised and administered the project and acquired the funding.

Corresponding authors

Correspondence to Deepak Baby or Sarah Verhulst.

Ethics declarations

Competing interests

A patent application (PCTEP2020065893) was filed by Ghent University on the basis of the research presented in this manuscript. Inventors on the application are S.V., D.B., F. Drakopoulos and A.V.D.B.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Overview of the CoNNear architecture parameters.

The table gives a brief description of the fixed CoNNear parameters and hyperparameters.

Extended Data Fig. 2 CNN layer depth comparison.

The first column details the CoNNear architecture. The next columns describe the total number of required model parameters, the required training time per epoch of 2310 TIMIT training sentences and average L1 loss across all windows of the TIMIT training set. Average L1 losses were also computed for BM displacement predictions to a number of unseen acoustic stimuli (click and 1-kHz pure tones) with levels between 0 and 90 dB SPL. Lastly, average L1 loss was also computed for the 550 sentences of the TIMIT test set. For each evaluated category, the best performing architecture is highlighted in bold font.

Extended Data Fig. 3 Activation function comparison.

The first column details the CoNNear architecture. The next columns describe the total number of required model parameters, the required training time per epoch of 2310 TIMIT training sentences and average L1 loss across all windows of the TIMIT training set. Average L1 losses were also computed for BM displacement predictions to a number of unseen acoustic stimuli (click and 1-kHz pure tones) with levels between 0 and 90 dB SPL. Lastly, average L1 loss was also computed for the 550 sentences of the TIMIT test set. For each evaluated category, the best performing architecture is highlighted in bold font.
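The average L1 loss reported in these comparisons can be illustrated with a minimal sketch. It assumes the loss is the mean absolute difference between the reference (TL-model) and predicted (CoNNear) BM displacements over all time samples and channels; the function name `mean_l1_loss` and the toy values are hypothetical, and the exact windowing and scaling used in the paper are not specified here.

```python
def mean_l1_loss(reference, prediction):
    """Mean absolute error over all (time sample, channel) entries.

    Both arguments are lists of rows (one row per time sample),
    each row holding one value per output channel.
    """
    if len(reference) != len(prediction):
        raise ValueError("reference and prediction need the same number of rows")
    total = 0.0
    count = 0
    for ref_row, pred_row in zip(reference, prediction):
        if len(ref_row) != len(pred_row):
            raise ValueError("rows need the same number of channels")
        for ref_value, pred_value in zip(ref_row, pred_row):
            total += abs(ref_value - pred_value)
            count += 1
    if count == 0:
        raise ValueError("inputs are empty")
    return total / count


# Toy example (assumed values): 3 time samples x 2 channels.
tl_output = [[0.0, 0.1], [0.2, -0.1], [0.4, 0.0]]
connear_output = [[0.0, 0.2], [0.1, -0.1], [0.4, 0.2]]
loss = mean_l1_loss(tl_output, connear_output)
```

Averaging such a value across all windows of a data set would give one number per architecture, which is the kind of figure the tables compare.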

Extended Data Fig. 4 Simulated BM displacements for a 10048-sample speech stimulus and a 16384-sample music stimulus.

The stimulus waveform is depicted in panel (a) and panels (b-c) depict instantaneous BM displacement intensities (darker colours = higher intensities) of the simulated TL-model (b) and CoNNear (c) outputs. The N_CF = 201 considered output channels are labelled per channel number: channel 1 corresponds to a CF of 12 kHz and channel 201 to a CF of 100 Hz. The same colour scale was used for both simulations and ranged between -0.5 μm (blue) and 0.5 μm (red). The left panels show simulations to a speech stimulus from the Dutch matrix test (ref. 48) and the right panels show simulations to a music fragment (Radiohead - No Surprises).

Extended Data Fig. 5 Comparing TL and CoNNear model predictions at the median and maximum L1 prediction error.

The figure compares BM displacement intensities of the TL (b) and CoNNear (c) models for the audio fragments that resulted in the median and maximum L1 errors of 0.008 and 0.038, respectively, simulated for the TIMIT test set (Fig. 5). The N_CF = 201 considered output channels are labelled per channel number: channel 1 corresponds to a CF of 12 kHz and channel 201 to a CF of 100 Hz. The same colour scale was used for both simulations and ranged between -0.5 μm (blue) and 0.5 μm (red).

Extended Data Fig. 6 Root mean-square error (RMSE) between simulated excitation patterns of the TL and CoNNear models reported as fraction of the TL excitation pattern maximum (cf. Fig. 3).

Using the PReLU activation function (a) leads to an overall high RMSE as this architecture failed to learn the level-dependent cochlear compression characteristics and filter shapes. The models using the tanh nonlinearity (b,c) did learn to capture the level-dependent properties of cochlear excitation patterns, and performed with errors below 5% for the frequency ranges and stimulus levels captured by the speech training data (CFs below 5 kHz and stimulation levels below 90 dB SPL). The RMSE increased above 5% for all architectures when evaluating their performance on 8- and 10-kHz excitation patterns. This decreased performance results from the limited frequency content of the TIMIT training material.
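The error measure named in this caption, RMSE expressed as a fraction of the TL excitation pattern maximum, can be sketched as follows. This is an illustration under stated assumptions: excitation patterns are taken as one linear-scale value per CF channel, and the function name `normalized_rmse` and the toy values are hypothetical rather than taken from the CoNNear repository.

```python
import math


def normalized_rmse(tl_pattern, connear_pattern):
    """RMSE between two excitation patterns, divided by the TL pattern maximum."""
    if len(tl_pattern) != len(connear_pattern) or not tl_pattern:
        raise ValueError("patterns must be non-empty and of equal length")
    peak = max(abs(value) for value in tl_pattern)
    if peak == 0:
        raise ValueError("TL pattern maximum is zero; the fraction is undefined")
    squared_errors = [
        (tl_value - cnn_value) ** 2
        for tl_value, cnn_value in zip(tl_pattern, connear_pattern)
    ]
    mean_squared_error = sum(squared_errors) / len(squared_errors)
    return math.sqrt(mean_squared_error) / peak


# Toy example (assumed values): four CF channels, one of which deviates.
tl_pattern = [0.0, 2.0, 4.0, 2.0]
connear_pattern = [0.0, 2.0, 3.6, 2.0]
error = normalized_rmse(tl_pattern, connear_pattern)
```

In this toy case the result is about 0.05, that is, an error of 5% of the TL maximum, which is the threshold the caption uses to separate well-captured from poorly captured conditions.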

About this article

Cite this article

Baby, D., Van Den Broucke, A. & Verhulst, S. A convolutional neural-network model of human cochlear mechanics and filter tuning for real-time applications. Nat Mach Intell 3, 134–143 (2021). https://doi.org/10.1038/s42256-020-00286-8
