Introduction

What and how do we hear? Sound waves are transformed into electrical signals through the interactions between the basilar membrane and the inner and outer hair cells, a fragile process that occurs at the first stages of the auditory system. Modeling these hearing processes with sound signal-processing models remains a timely question, in particular for treating deafness through cochlear implant technologies. Cochlear implants indeed still fail to recreate sounds with high fidelity because of limitations in replicating the mechanical-to-electrical transduction that occurs at the cochlear level.

Two opposing theories have influenced the development of signal-processing models of cochlear processes. Both originated as models of pitch perception and have a well-established history within the discipline. On the one hand, Seebeck1 put forth a temporal coding approach and showed that the pitch of a complex tone depends on repeated temporal signal patterns. On the other hand, Helmholtz2 viewed the ear as a frequency analyzer implemented through the mechanical properties of the basilar membrane that processes the spectral components of sounds, known as the place coding theory. This theory, later supported by von Békésy's work on the physiology of hearing3, has remained dominant and continues to influence the design of signal representations used in neuroprosthetic technologies such as cochlear implants.

According to this theory, the inner ear maps sound frequencies to specific places on the basilar membrane, leading to models based on linear spectro-temporal filter banks and Fourier analysis4,5. Despite its widespread acceptance, this model-based view is still challenged by unanswered questions, conflicting views, and incompatible observations that continue to fuel vivid debates6. For instance, the sensitivity of the ear to sound signals has traditionally been measured using pure tones, resulting in equal-loudness curves7 with a maximum sensitivity at around 4000 Hz. However, when the sound level of white noise, covering the whole frequency range, is reduced, neither its timbre nor its pitch changes8, indicating the limitations of considering the ear as a static linear Fourier analyzer. These examples demonstrate the ongoing challenges and limitations in fully understanding and modeling cochlear processes.

Another phenomenon that is incompatible with treating the cochlea as a bank of linear resonators is the perception of phantom sounds9,10, such as the missing fundamental11,12 and combination tones13,14. Although these phenomena have been thoroughly studied2, their computational underpinnings remain unclear. Models based on linear Fourier analysis have incorporated posterior rectifications15,16,17,18, but do not naturally account for these nonlinear phenomena. Whether combination tones and the missing fundamental are caused by the same underlying phenomenon is still a source of debate19, and no consensus has been reached on the origins of these nonlinearities in the auditory system. When no plausible interpretation is found at the peripheral processing level, such phenomena are often attributed to higher cortical processing and modeled using deep neural networks20. There is currently no agreed-upon framework for explaining the generation of these nonlinearities in the auditory system.

What are we missing in the ear's sound processing? Model-based frequency decomposition, inspired by the dominant place coding theory, is not necessary to accurately simulate complex auditory tasks21. The auditory neural code extracted at the cochlear level is adaptively optimized to fit the acoustical structure of natural sounds, such as speech22,23,24. Temporal coding, based on timing cues, also appears to play a crucial role after cochlear processes. Studies have demonstrated that amplitude modulations, driven by temporal coding, provide important acoustic cues both in the cochlea and in the auditory nerve25,26,27,28,29. These findings suggest that both temporal and place coding are present at the cochlear level of our hearing system and are effective in different frequency domains. However, a unified model that can reconcile these seemingly opposing behaviors is still lacking. Some models have attempted to compensate for this discrepancy by adding nonlinearities to their spectral models30, but these approaches do not provide an intrinsically unified explanation of the auditory processes.

In this paper, we tackle the question of modeling how the auditory system processes sound at the cochlear level. Rather than focusing on the resonances of the basilar membrane that inspired Fourier-based decompositions, we examine the sampling occurring at the level of the stereocilia, positioned at the top of the inner hair cell bundles. These stereocilia move in response to the motions of the endolymph, the fluid that fills the cochlea and conveys vibrations, thereby performing interpolations of the incoming sound signal. To model the underlying cochlear processes, we propose to place envelope interpolation at the center of sound coding. This computational model is inspired by the empirical mode decomposition (EMD)31,32 and interpolates a given signal based on its upper and lower envelopes, obtained by interpolating, respectively, the maxima and the minima of the signal. EMD offers a promising solution for methods that take into account hearing specificities in the case of noise33, frequency selectivity10,31, and source separation34. Our framework therefore provides an implementation that is compatible with a range of auditory adaptive and data-driven temporal coding phenomena. It can also account for nonlinear hearing behaviors. Furthermore, this model accounts for phenomena such as pure tone masking in noise35, a canonical example of the cocktail party effect. It also provides a plausible explanation for the adaptive coding23 that occurs at the cochlear level. This framework is coherent with traditional cochlear filter properties and behaves as a constant-Q transform that takes into account the frequency selectivity of the ear. This also drives the perception of roughness, consonance, and dissonance of harmonic sounds12.

This study embraces the challenge of modeling the human hearing sensory system. One issue in computational modeling is to accurately define the level of biological plausibility a model can claim and what exactly it accounts for. The relationship between biological observations and modeling is dynamic, with biological knowledge being shaped by the way we observe a phenomenon and modelers using functional interpretations of biological observations to inform their models. David Marr36 proposed to categorize biological models into three levels (computational, algorithmic, and physical) to distinguish a model's relationship to the underlying biological mechanism from its relationship to the computational function it serves. Here, in the case of hearing, signal-processing representations are used and can be placed at an intermediate level between the algorithmic and physical levels of Marr's framework. These signal representations provide a formal description of the encoding of sounds in the auditory pathway from the cochlea to the primary auditory cortex37 and, more recently, up to cortical areas based on deep-neural-network activations38,39,40,41,42,43. They are directly inspired by biophysical phenomena and measurement paradigms, such as auditory spectrograms that account for critical bandwidths of the basilar membrane, or spectro-temporal modulation models based on neuronal responses in the auditory nerve and primary auditory cortex37,44.

In the following sections, we detail the computational basis of this decomposition and demonstrate how it can be applied to modeling various psychoacoustic phenomena.

Cascaded envelope interpolation

Cascaded envelope interpolation (CEI) is a mathematical concept aiming to decompose signals into a finite set of modes, inspired by the EMD32. EMD is a cascaded process that extracts modes based on envelope interpolation and that shares striking similarities with auditory processes. Here, we introduce CEI, a variation of EMD with only one iteration per mode and a fixed number of modes. This approach differs from EMD in that the extracted modes do not fulfill EMD's mode criteria.

CEI starts by extracting the upper and lower temporal envelopes of the original signal, averaging them to compute an interpolative envelope, and subtracting it from the original signal to obtain the first mode of the decomposition together with a residual signal, the difference between the original signal and the mode (which is the interpolative envelope itself). The residual is then used as the input for the next iteration to extract the next mode, and so on until all the modes have been extracted (see Fig. 1a). This decomposition differs from the EMD algorithm in that each mode is extracted with only one interpolative envelope extraction and is not driven by a convergence threshold. In the framework of David Marr's theory, CEI is proposed as the algorithm that accounts for sound coding at the cochlear level.
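As a concrete illustration, the following is a minimal sketch of the procedure described above, assuming cubic-spline interpolation of the local extrema and a fixed number of modes; function and variable names are ours and not part of the original implementation.

```python
# Minimal sketch of cascaded envelope interpolation (CEI), assuming one
# envelope extraction per mode and cubic-spline interpolation of the extrema.
import numpy as np
from scipy.signal import argrelextrema
from scipy.interpolate import CubicSpline

def interpolative_envelope(x, t):
    """Average of the upper and lower cubic-spline envelopes of x."""
    maxima = argrelextrema(x, np.greater)[0]
    minima = argrelextrema(x, np.less)[0]
    if len(maxima) < 2 or len(minima) < 2:           # too few extrema to interpolate
        return np.zeros_like(x)
    upper = CubicSpline(t[maxima], x[maxima])(t)      # upper envelope (through the maxima)
    lower = CubicSpline(t[minima], x[minima])(t)      # lower envelope (through the minima)
    return 0.5 * (upper + lower)

def cei(x, fs, n_modes=6):
    """Cascaded envelope interpolation: one envelope extraction per mode."""
    t = np.arange(len(x)) / fs
    residual = np.asarray(x, dtype=float).copy()
    modes = []
    for _ in range(n_modes):
        env = interpolative_envelope(residual, t)
        modes.append(residual - env)                  # mode = signal minus interpolative envelope
        residual = env                                # the envelope feeds the next iteration
    return np.array(modes), residual                  # modes plus final residual sum back to x
```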

Fig. 1: Cascaded envelope interpolation.

a The algorithm. The signal is analyzed through a finite iterative process. Local maxima and minima are first identified. Interpolative envelopes are then obtained by interpolators (here cubic splines), providing the upper envelope (in red) and the lower envelope (in yellow). The average of the two envelopes, the so-called interpolative envelope (in blue), is then computed and subtracted from the original signal. The process is repeated for a given number of modes. For each mode, the spectrum can be computed and then summed in order to compute the CEI spectrum. This analysis can also be performed on short-term frames in order to compute the short-term CEI spectrum or spectrogram. These spectral representations are useful for comparison with Fourier representations. b Separation of tonal signal mixtures. A mixture signal (first row) composed of a frequency-modulated sinusoid (x1) and a chirp (x2) is analyzed with CEI. The first two modes are displayed (second row). To investigate the spectral contents of the modes, the short-term Fourier transform (third row) of each mode is computed and summed to provide a representation of the combined spectral contents of the extracted modes. Notice that CEI naturally extracts each signal component x1 and x2 separately in each mode.

It must also be noted that high-frequency modes are computed first and that higher-order modes correspond to lower frequencies. This behavior aligns with what is known about cochlear processes, where the highest frequencies are extracted near the cochlear base while the lowest frequencies are extracted near the apex. The number of modes was arbitrarily fixed at 6 for each studied phenomenon, as the energy of modes beyond the sixth was close to zero.

CEI also differs from traditional envelope extraction, which often relies on the Hilbert transform of a signal and yields symmetric upper and lower envelopes. Conversely, CEI extracts upper and lower envelopes through a numerical process that interpolates the maxima and the minima of the signal and averages the resulting envelopes. This leads to an interpolative envelope that is not perfectly symmetric.

It is also important to note that CEI is fully data-driven, as the maxima and minima depend solely on the signal. In order to compare this signal-driven approach with traditional time-frequency representations, the power spectral density of each mode can be computed and then summed to provide a short-term Fourier transform (STFT) of the CEI modes. It should be noted that this spectral representation of the CEI modes does not imply that a Fourier transform is performed in the auditory system; we only use it as a mathematical tool to compare CEI with common signal representations. By considering that each mode is processed separately, its spectral content can be analyzed and compared with psychoacoustic results.

The CEI decomposition intrinsically differs from Fourier or wavelet transforms in that it is data-driven. Unlike traditional fixed-dictionary transforms, CEI decomposes signals based on their own structure, without any prior assumption on the basis functions of the dictionary. For example, in the case of a signal composed of a chirp, a pure tone, and a modulated pure tone, CEI naturally separates each component (as shown in Fig. 1b), while classical models using fixed bandpass filter banks or Fourier-based decompositions fail to provide such a signal-specific decomposition. In the traditional framework, source separation is achieved with complex models such as convolutional deep neural networks fed with Fourier-based representations such as spectrograms.

Based on these observations and phenomenological considerations, we here challenge the ability of CEI to account jointly for different fundamental auditory phenomena: nonlinearities leading to phantom tone perception; the frequency selectivity of the cochlear filter bank at the origin of roughness, consonance, and dissonance; and the data-driven behavior that treats sound processing at the cochlear level as an adaptive filter bank, which can be revealed by frequency masking experiments.

Results

Phantom sounds and virtual pitches

A vivid debate in hearing sciences concerns the origin of phantom sounds and virtual pitches, which correspond to situations where frequency components are perceived although they are not present in the Fourier spectrum9. These sounds, also known as combination tones (Fig. 2d), are created within the auditory system. They generally occur for signals composed of two pure tones with close frequencies and have a lower frequency than the initial sounds13,14. A closely related situation, known as the missing fundamental (Fig. 2a–c), occurs when the low frequencies of a harmonic signal are missing but are still heard. This phenomenon can be observed when listening to the speaker of an old cellphone with a narrow frequency bandwidth: the fundamental frequency, which is not physically present, is virtually perceived and created within the auditory system, allowing speech prosody to be perceived. How these frequencies are created within the auditory system remains an open and still debated question. However, it relates to an essential nonlinear mechanism that, for instance, plays an important role in communication in noisy environments by restoring masked frequency components.

Fig. 2: Cascaded envelope interpolation accounts for nonlinear phenomena.

Perception of the missing fundamental. Short-term Fourier spectrum of a speech signal without frequencies below 300 Hz (a) and short-term cascaded envelope interpolation spectrum (b). The figure reveals that CEI reconstructs the fundamental frequency just like the ear does. c The Fourier spectrum (in red) and the spectrum of the cascaded envelope decomposition (in blue) for three synthetic pure tones. While the Fourier spectrum only shows the three components, the cascaded envelope decomposition reveals the missing spectral components, which correspond to a signal with a fundamental frequency of 400 Hz, matching the perceived frequency. This illustrates how CEI naturally accounts for virtual pitches while the Fourier spectrum does not. Combination tones. d One example of combination tones. Two pure tones of frequencies f1 and f2 lead to the perception of a third one at the frequency 2f1 − f2. While the Fourier spectrum does not reveal this phantom sound, the CEI spectrum has energy at this frequency. e Correspondence between the theoretical cubic combination tone frequency and the one predicted by CEI for 70 pairs of pure tones, see Methods for details. f Tartini sound. The Fourier spectrum (in red) and the spectrum of the cascaded envelope decomposition (in blue) of a combination of two synthetic harmonic sounds with fundamental frequencies 400 Hz and 533.33 Hz, the perfect fourth. While the Fourier spectrum provides the sum of the individual spectra, the cascaded envelope decomposition reveals components at 133.33 Hz and at 266.66 Hz, which correspond to the perceived phantom Tartini notes. The perceived pitch at 133.33 Hz corresponds to the difference between the two fundamental frequencies, 533.33 Hz and 400 Hz.

Here we observe that CEI naturally produces these nonlinear perceptual phenomena. The fundamental frequency of a speech sound is naturally reconstructed when the low frequencies are removed by filtering (Fig. 2a, b). Similarly, CEI reproduces the most canonical combination tones, which appear when two sinusoids of frequencies f1 and f2 are played at the same time (f1 < f2) and a third component at the frequency 2f1 − f2 is perceived (Fig. 2d, e and Methods). This phenomenon also appears in music when harmonic tones with distinct pitches are played at the same time, leading to a third perceived pitch (Fig. 2c). Combination tones were used by the baroque composer Giuseppe Tartini45 (Fig. 2f). These are striking demonstrations of the nonlinear behavior of hearing. While a posteriori rectifications have been used to account for such behavior15, we observe here that CEI naturally reveals these intra-aurally generated components (Fig. 2a–e). Such a nonlinearity is a direct consequence of the interpolation, which becomes visible in the CEI spectrum. The fact that the CEI spectrum reveals components that are not present in the Fourier spectrum and that match those actually perceived suggests that this nonlinearity could occur directly at the hair-cell level, as suggested by physiological observations9. Further physiological measurements are obviously necessary to confirm this claim and to make the CEI decomposition biologically plausible. This striking similarity suggests that hair-cell bundle motions, driven by the fluid motion triggering the spikes at the input of the auditory nerve, could be at the origin of such virtual sounds.

CEI as an adaptive cochlear filter bank

To satisfactorily serve as a candidate model of cochlear processes, CEI should also be compatible with the processing of the full range of sounds, such as broadband noise, sound textures, and their mixtures with speech and harmonic sounds such as music. Traditionally, the cochlea is believed to perform a constant-Q transform, leading to representations such as auditory spectrograms37. In addition, cochlear processes are malleable and naturally adapt to the spectro-temporal content of incoming sounds23. We here tested the compatibility of CEI with the view of the cochlea as an adaptive filter bank. It is indeed remarkable that the spectral shape of CEI's filter bank naturally adapts to the spectral content of the analyzed signal (Fig. 3a–c). We therefore inquire whether CEI behaves as an adaptive filter bank whose properties fit the equivalent rectangular bandwidth (ERB) model, the gold-standard law accounting for the ear's bandpass processing.

Fig. 3: Cascaded envelope interpolation as an adaptive filter bank.

CEI naturally adapts to decompose the sound signal. a One hundred pink noises. b One hundred pure tones at 100 Hz and 1000 Hz. c A mixture of pink noise and pure tones analyzed with CEI. The spectra of the 100 CEIs have been averaged. The filtering operated by CEI (dashed black lines) matches the spectrum of the original signal (solid red line) and naturally adapts to the signal to be analyzed. d For complex sounds such as speech or environmental noise, the behavior of CEI can be interpreted in terms of equivalent band-pass filter bank properties (center frequency and bandwidth). CEI's equivalent linear bandpass filter bank (blue dots and black curve) follows the equivalent rectangular bandwidth (r(410451) = 0.8, p < 10⁻¹⁰) and one-third octave (r(410451) = 0.8, p < 10⁻¹⁰) models corresponding to the theoretical bandwidth models of cochlear filters (yellow and violet curves). To give an idea of the bandwidth variability with respect to the center frequency, the black curve represents the mean and standard deviation computed over 11 points regularly spaced on a logarithmically sampled center-frequency scale. Here we used a corpus of speech and environmental sounds71. For each sound, the CEI decomposition is first computed, and the spectra of the equivalent linear filters allowing each mode to be computed from the original signal are then derived (see Methods for details). The center frequencies and the bandwidths are finally computed and plotted (blue dots). As a comparison, the 1/3-octave model (in violet) and the ERB model (in yellow) are superimposed and are coherent with the impulse response properties of the CEI equivalent linear bandpass filter bank.

The ear nonlinearly decomposes sounds into a code optimized for speech processing and compatible with cochlear filters46,47. This behavior is often modeled as a band-pass filter bank whose properties, i.e., center frequency and bandwidth, are closely linked and follow a nearly linear relationship. This filter bank is adaptive, which means that the filters automatically adapt their bandwidth to the spectral content of a sound. Here we observe that CEI strikingly follows such an adaptive behavior, with the same frequency-bandwidth dependence as traditional cochlear filters46,47. We tested this by fitting a linear filter bank equivalent to CEI when processing a large database of speech sounds. This allowed us to compute the equivalent center frequencies and bandwidths for each CEI mode, considered as the result of a band-pass filter applied to the input signal (see Fig. 3d). Strikingly, the relationship between center frequency and bandwidth was compatible with psychoacoustic models based on the ERB. Such an adaptive behavior differs from current model-driven constant-Q transforms, which consider the cochlea as a fixed band-pass filter bank. This makes CEI potentially compatible with adaptive efficient coding22,23. It is striking that CEI, like the auditory system, naturally adapts to the spectral content of a sound. CEI therefore opens a door to a computational basis for the acoustical niches that may drive the co-adaptation between acoustic environments and the communication abilities of species. This indeed suggests that the auditory systems of living beings are based on an adaptive sensory coding, tuned to the communication signals of their conspecifics and to the environment, rather than on a fixed model-driven filter bank.

Frequency selectivity

Another property of the cochlea is its ability to separate meaningful signals, often harmonic, from mixtures of noise and concurrent harmonic sounds. This sensory preprocessing is fundamental to perceiving speech in noise and is known as frequency selectivity. It has historically been studied in canonical situations with pure tones masked by noisy mixtures35 or by other pure tones48, where participants had to detect a probe signal in a given mixture and the resulting thresholds were used to fit bandpass filter properties. Here we reproduced these two situations in order to test whether CEI exhibits the same frequency selectivity.

Noise masker

The human ability to detect sinusoids in noise is characterized by two main aspects: (1) the larger the noise frequency bandwidth, the weaker the ability to detect the sinusoid; (2) the higher the frequency of the probed sinusoid, the larger the bandwidth, also known as the auditory critical bandwidth. More precisely, above 500 Hz, the critical bandwidth increases linearly with the logarithm of the sinusoid's frequency. We here wanted to test whether CEI reproduces these phenomena. In Patterson's masking experiment35, sinusoids of a given frequency f0 are played simultaneously with a bandpass-filtered white noise centered at f0 with a given bandwidth Δf. The noise level is increased until the subject is unable to detect the sinusoid, and the corresponding detection threshold is determined accordingly. We simulated such an experiment for 5 frequencies f0 (250 Hz; 500 Hz; 1000 Hz; 1500 Hz; 2000 Hz) and 15 frequency ratios Δf/f0 between 0 and 1. We applied CEI for different bandwidths to determine a detection threshold for the sinusoids. Subjects were replaced by a virtual listener whose responses were simulated by CEI representations and a detection process. For a given frequency and a given ratio, the masking threshold was estimated with an adaptive staircase procedure. CEI reproduces the expected behaviors (see Fig. 4). In particular, we fitted threshold curves similar to known psychophysical tuning curves obtained in perceptual experiments on humans49. This result confirms that CEI is compatible with the behavior of the cochlea as a filter bank; interestingly, the bandwidth at −3 dB of the corresponding filter is proportional to the center frequency.

Fig. 4: Masking thresholds of pure tones in noise.

Masking thresholds of sinusoids with respect to the normalized bandwidth Δf/f0 of notched noise. We report the signal-to-noise ratios (SNR) between the sinusoid and the notched noise that lead to 80% correct responses. For each frequency, the raw curves have been fitted with 10th-order polynomials. For each of the five frequencies, we observe that the normalized bandwidths (cutoff at −3 dB) of the corresponding bandpass filters are almost the same. This confirms that CEI is compatible with classical knowledge of auditory frequency masking.

Pure tone masker

CEI hence naturally accounts for the frequency masking of a pure tone in noise. This is particularly important for the perception of meaningful signals such as speech embedded in background noise. However, frequency masking also appears in contexts such as music, when pure tones interact. In the canonical case of the combination of two pure tones, the resulting beats lead to a sensation called roughness, which also relates to the notions of consonance and dissonance through the frequency selectivity of the cochlear filter bank. In speech, roughness drives the perception of aversiveness50. Formally, auditory roughness can be described as the perception of very fast fluctuations in sounds. It is now well known that for stimuli composed of two pure tones, i.e., two tones that each have a single frequency, the sensation of roughness is driven by the ratio between the frequencies of the components. When the ratio is close to one, the perception leads to slow beats perceived as one single signal. When the ratio increases, the perception gives rise to a rough sensation, revealing that the auditory system is unable to disentangle the two components. When the ratio further increases and reaches the critical bandwidth, the ear separates the components and two frequencies are perceived12. From a mathematical point of view, the sum of two pure tones can indeed be interpreted either as one sinusoid modulated by a slow modulation whose frequency is driven by the difference between the two components, or as the sum of two components. Said differently, below a given frequency ratio, a sensation of roughness is perceived due to the ear's inability to disentangle the two pure tones. Here we aimed to test whether CEI behaves like hearing when decomposing a sum of two pure tones. We simulated situations in which CEI analyzes pairs of sinusoids and evaluated whether one or two frequencies were detected. When the two sinusoids are well separated by CEI, the first and the second modes of the decomposition respectively correspond to the first and the second sinusoid. But when the separability diminishes, either the third mode contains energy that can be related to the sensation of roughness, or the first mode corresponds to a sinusoid slowly modulated by another one. We here define an index of separability d that characterizes these three behaviors: when d = 1, CEI treats the sum of sinusoids as one component; when d = 0, CEI considers that the signal is composed of two separate sinusoids; and when d is between 0 and 1, CEI is not able to separate the two pure tones and beats/roughness appear. Figure 5 shows the value of this separability index for different frequencies f0 and f1 and ratios α with f1 = αf0. We observe that the three behaviors are compatible with human auditory perception. As for the masking with noise, it is noticeable that CEI's separation ability does not depend on the frequency, whereas it is known that roughness maxima change with frequency. This is coherent with the fact that CEI acts as a constant-Q filter bank in the auditory system. However, one current limitation of the model is that the frequency selectivity does not match the theoretical roughness curves, see Fig. 5a.

Fig. 5: Separability of two pure tones.

a CEI separates a sum of two frequencies into either one slowly modulated frequency, two frequencies, or an intermediate situation that can be assimilated to auditory roughness perception, where it does not separate the two pure tones. The graph shows the value of the separability index for 5 different frequencies f0 (250 Hz, 500 Hz, 1000 Hz, 1500 Hz, 2000 Hz) and 50 intervals, i.e., ratio values α. The frequency of the second component is computed as f1 = αf0, resulting in the signal s(t) = cos(2πf0t) + cos(2πf1t). b The CEI separability curves (dashed lines) for different numbers of sifting iterations and the theoretical roughness curve for f0 = 500 Hz (dotted line). c The number of sifting iterations can be adjusted so that the separability index best fits the theoretical roughness curve. The number of iterations necessary for this optimization increases logarithmically with the frequency f0. It is determined by minimizing the error between the frequency of the maximum of the theoretical roughness curve and the frequency at which the simulated separability curve starts to differ from 1, here arbitrarily set to 0.98.

An algorithmic way to address this limitation is to consider the EMD in its original form, which includes a supplementary process called sifting, an iterative process within the computation of each mode (see Methods). However, there is currently no sound hypothesis on how a sifting process could be carried out by cochlear processes, or afterward. We removed this process by keeping only one iteration for each mode, which makes CEI more plausibly implementable, from a biological point of view, by hair-cell stereocilia; an iterative process with a convergence threshold, on the contrary, has no obvious plausible implementation. Nevertheless, by considering this process, we observe that the number of sifting iterations controls the frequency selectivity, which can be adjusted to be more or less pronounced according to the lowest frequency f0. The higher the number of sifting iterations, the tighter the frequency selectivity and therefore the lower the roughness maxima (see Fig. 5b). We then determined the number of sifting iterations necessary for the index of separability to exhibit a roughness maximum similar to the theoretical curve for different frequencies f0. We found a logarithmic relationship between the number of sifting iterations and the frequency f0. Although its implementation in the cochlea remains very speculative, the number of iterations necessary to fit the theoretical roughness is coherent with the distribution of inner hair cells along the cochlea. This distribution is almost constant from the base to the apex, but since the frequency tuning varies logarithmically along the cochlear tonotopy (Fig. 5c)51, the number of inner hair cells involved per frequency band increases logarithmically along the tonotopic axis. Such a correlation would suggest a possible link between the number of sifting iterations and the number of involved hair cells.

Interestingly, it is known that frequency selectivity decreases with aging because of deficient or damaged hair cells, which leads to sensorineural hearing loss52, especially at high frequencies. The origin of this broader bandwidth due to hearing loss remains debated, but the reduction of the number of healthy hair cells is known to contribute to such hearing damage. Here, we provide a concrete perspective for linking the number of healthy hair cells with the increase of the bandwidth of cochlear filters (see Fig. 5b). This opens the possibility of using CEI as a model of sensorineural hearing loss. However, it remains necessary to understand more precisely how sifting, or an equivalent process, could be implemented at a biological level.

Taken together, masking simulations with noise and pure tones reveal that CEI behaves like the auditory system in terms of frequency selectivity. In addition, the effect of sifting iterations can reflect the number of hair cells implicated in mode extraction, as it is in accordance with cochlear tonotopy and with the effect of sensorineural hearing loss. This suggests a strong link between CEI and a possible physiological implementation.

Consonance and dissonance

In relation to auditory roughness, when two musical instruments with different timbres play the same note, roughness leads to beatings perceived as more or less consonant or dissonant2,13. Musical consonance is often associated with the pleasantness of a musical sound, and conversely for dissonance. The origins of musical consonance and dissonance perception have been extensively studied from sensory and cognitive points of view53,54,55,56. In addition, models of the consonance of complex sounds have been proposed based on the roughness curves obtained from pure-tone combinations12. Here we tested the ability to predict theoretical consonance curves directly from the CEI representation. We simulated pairwise comparison experiments between pairs of tones with harmonic spectra (sawtooth). The underlying metric used to simulate the pairwise judgments was based on the separability between the CEI spectrum of the sum of the two harmonic tones and the sum of the CEI spectra of each harmonic tone taken separately. We finally computed an arbitrary dissonance score characterizing the probability that a given interval is judged more dissonant than another (Fig. 6). For the sake of coherence with the literature, dissonance curves were transformed into consonance curves by subtracting them from 1. Interestingly, we observed that well-known consonant/dissonant intervals are naturally revealed by such an analysis. In particular, the octave (P8), the perfect fifth (P5), the perfect fourth (P4), the major sixth (M6), and the major third (M3) provide the most consonant intervals, followed by the minor third (m3) and the tritone (tt). This result is well known in music theory and confirms that CEI also aligns with this well-known auditory phenomenon. In this context, this suggests that consonance/dissonance perception, which can also be driven by cultural and cognitive factors, would be mainly driven by bottom-up acoustic features as a consequence of envelope interpolation at the very first step of the auditory system.

Fig. 6: Dissonance curve of musical sounds.

The dissonance curve was obtained by simulating a pairwise comparison experiment based on the CEI spectrum of harmonic sounds (blue), and theoretical dissonance curves from Plomp and Levelt's model (red dashed), see Methods for details. A strong coherence between the theoretical consonance maxima and the simulated ones is observed. This reveals that CEI accounts for the perception of consonance and suggests that sensory consonance can be processed at the very first steps of the auditory system.

Discussion

We here present a data-driven framework based on a simple computational unit founded on CEI. By revisiting well-known psychophysical phenomena in light of this transformation, we first show that it bridges linear, nonlinear, and adaptive principles of peripheral hearing under a single framework. It supports the idea that envelope extraction by interpolation is at the core of nonlinear and adaptive cochlear processes.

Envelope interpolation might be at the common origin of combination and phantom tones19. One current understanding of the missing fundamental suggests that autocorrelation is performed at the stage of the auditory nerve51. Such a process involves the implicit computation of time delays, which is still not elucidated and is challenged by behavioral observations57. Combination tones have been considered the consequence of different mechanisms, such as nonlinear transduction at the hair-cell level9 or central processes58. None of these models has to date provided a satisfying and unifying account of these psychophysical observations. In particular, whether peripheral and/or central processes are at the origin of these phenomena remains debated. CEI, on the other hand, does not require such a time-delay computation hypothesis. While our study does not yet provide a precise account of how the complete mode extraction might be performed physiologically, the envelope interpolation is compatible with known intracellular recordings inside inner hair cells59. An important perspective is a better understanding of the active mechanism of outer hair cells and, in particular, of the distortion products leading to otoacoustic emissions60. The emitted frequencies corresponding to cubic distortion products could also potentially be explained by envelope interpolation distortions. Of course, this result does not preclude the central system from also contributing to these phenomena, and the interplay between the peripheral and central levels of the auditory system remains to be clarified. Phenomena such as the missing fundamental perception could also be generated at higher levels of the auditory system61. Conversely, adding adaptivity and nonlinearity to peripheral transformation models can also help refine our understanding of the processing of different acoustical patterns, such as the spectro-temporal modulations37 that are central to the perception of speech62 and of musical instrument timbre63,64.

CEI offers a perspective on hearing adaptability by demonstrating that its computationally simple approach is compatible with the adaptive properties of hearing. The efficient-coding kernel functions naturally comply with model-based cochlear filters23. However, the mechanism for extracting adaptive codes within the auditory system is not yet known. We suggest that envelope interpolation may be at the core of this process, as it has a higher physiological plausibility than the matching pursuit algorithms used to derive the efficient auditory code23, which are not implementable at a physiological level. In addition, CEI not only accounts for phenomena ignored by linear auditory models but also simulates well-known phenomena such as the masking of pure tones in noise, auditory roughness, and musical consonance perception. This unification under a single data-driven framework opens avenues for reconsidering still-misunderstood phenomena, such as the cocktail party effect and how the brain tracks meaningful signals in noisy backgrounds.

CEI also provides a perspective on modeling sensorineural hearing loss resulting from hair-cell damage or deficiency. We observed that the increased critical bandwidth associated with sensorineural hearing loss can be accounted for by the number of sifting iterations, although the exact implementation of these iterations at the cochlear level is unknown. One possible implementation is a joint operation of a population of hair cells coding at the same location on the cochlea and computing the iterative envelope by averaging their activity. However, this remains a major limitation of the current model, and future research may help make it more biologically plausible.

Understanding the computational mechanism behind CEI could have implications for hearing aids. Hearing aids and cochlear implants still use passive gammatone filter banks as front-end representations; however, it is well established that temporal fine structure and temporal envelopes are essential for speech perception22. Here, we demonstrated that CEI can replicate the effects of hair-cell loss, which lead to an increase in cochlear filter bandwidth. CEI thus offers a perspective on signal coding through implant electrodes, allowing nonlinear cochlear behaviors to be accounted for directly while remaining compatible with cochlear filtering for non-stationary signals. In a broader context, efficient coding is also a fundamental principle of visual coding65, and CEI may provide insights into how optimal codes can be extracted in other sensory systems such as vision.

From a larger theoretical perspective, CEI offers the possibility to reconsider, within this unified data-driven framework, the processing of speech that occurs after the cochlea, in the primary auditory cortex and at the first stages of the auditory pathway. For speech perception, it has been shown that the brain tracks the sound envelope, which is reminiscent of neural oscillations66. This has been interpreted as a functional principle for processing speech signals by extracting the relevant sensory-motor oscillations; however, the computational basis of the extraction of speech envelope modulations from spectro-temporal information remains unclear. Considering CEI as the input, envelope extraction by interpolation can be achieved at very early processing stages and provides a plausible peripheral mechanism that supports and bridges with the existing literature on neural oscillations.

Methods

All sounds were sampled at 16,000 Hz.

Cascaded envelope interpolation

CEI is a simplified version of the EMD32 with only one envelope extraction at each step. It is interesting to note that this decomposition provides a perfect reconstruction of the original signal by simply summing up the modes. For a given signal, at each iteration, the upper and lower envelopes are extracted and averaged into an interpolative envelope, which is then subtracted from the current signal to yield a mode; the interpolation is achieved mathematically with cubic spline interpolators. The process is then repeated a given number of times, according to the number of defined modes, here 6. See Supplementary Fig. 1 for details.
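As a quick numerical check, the reconstruction property can be verified with the sketch given in the "Cascaded envelope interpolation" section; the module name below is hypothetical and only stands for wherever that sketch is saved.

```python
# Numerical check of the reconstruction property, assuming the cei() sketch
# shown earlier is available in a hypothetical module cei_sketch.py.
import numpy as np
from cei_sketch import cei            # hypothetical module containing the earlier sketch

fs = 16000                            # all sounds in this study are sampled at 16,000 Hz
t = np.arange(fs) / fs
x = np.cos(2 * np.pi * 440 * t) + 0.5 * np.cos(2 * np.pi * 110 * t)

modes, residual = cei(x, fs, n_modes=6)
error = np.max(np.abs(x - (modes.sum(axis=0) + residual)))
print(error)                          # numerically zero: modes plus the final residual rebuild x
```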

Short-term CEI

In the spirit of time-frequency representations, we define the short-term CEI as the time-frequency representation of the different modes of CEI. For each mode, the STFT is computed; the mode STFTs are then summed to form the short-term CEI, which allows comparison with the STFT of the initial signal.
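A minimal sketch of this computation is given below, assuming the cei() function from the earlier sketch and scipy's STFT; we sum the mode power spectrograms rather than the complex STFTs, since summing the complex STFTs would, by linearity, simply reproduce the STFT of the input.

```python
# Sketch of the short-term CEI spectrum: sum of the mode power spectrograms.
import numpy as np
from scipy.signal import stft
from cei_sketch import cei                              # hypothetical module with the earlier sketch

def short_term_cei(x, fs, n_modes=6, nperseg=512):
    modes, _ = cei(x, fs, n_modes=n_modes)
    total = None
    for mode in modes:
        f, t, Z = stft(mode, fs=fs, nperseg=nperseg)     # STFT of one mode
        power = np.abs(Z) ** 2
        total = power if total is None else total + power
    return f, t, total                                   # comparable with the STFT of the signal
```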

Phantom sounds and virtual pitches

Combination tones

Sums of two sinusoidal signals were generated with frequencies f1 and f2 such that f1 < f2 and f2 = αf1, for 7 values of f1 (500 Hz; 750 Hz; 1000 Hz; 1250 Hz; 1500 Hz; 1750 Hz; 2000 Hz) and 10 values of α (1.3300; 1.3711; 1.4122; 1.4533; 1.4944; 1.5356; 1.5767; 1.6178; 1.6589; 1.7000). For each of the 70 pairs of sounds, we determined the frequency of the generated combination tone as the frequency with the highest energy in the spectrum below f1 or above f2. This value was determined with an automated peak detection algorithm and compared to the theoretical combination tone frequency with a regression.
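The sketch below illustrates this procedure under the stated parameters, assuming the cei() function from the earlier sketch; the simple argmax-based peak read-out stands in for the automated peak detection algorithm mentioned above.

```python
# Sketch: for each (f1, alpha) pair, decompose the two-tone signal with CEI and
# locate the strongest spectral peak below f1 or above f2, to be compared with
# the theoretical cubic combination tone 2*f1 - f2.
import numpy as np
from cei_sketch import cei                               # hypothetical module with the earlier sketch

fs = 16000
t = np.arange(fs) / fs                                   # 1-second test signals
f1_values = [500, 750, 1000, 1250, 1500, 1750, 2000]
alphas = np.linspace(1.33, 1.70, 10)

predicted, theoretical = [], []
for f1 in f1_values:
    for alpha in alphas:
        f2 = alpha * f1
        x = np.cos(2 * np.pi * f1 * t) + np.cos(2 * np.pi * f2 * t)
        modes, _ = cei(x, fs, n_modes=6)
        spectrum = np.abs(np.fft.rfft(modes, axis=1)).sum(axis=0)   # summed mode spectra
        freqs = np.fft.rfftfreq(len(t), 1 / fs)
        outside = (freqs < f1) | (freqs > f2)            # search below f1 or above f2
        predicted.append(freqs[outside][np.argmax(spectrum[outside])])
        theoretical.append(2 * f1 - f2)                  # theoretical cubic combination tone
# predicted vs theoretical can then be compared with a linear regression.
```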

Adaptive filter bank

Pure tones + pink noise

Test signals were composed either of pink noise, i.e., a random signal with a spectrum defined by S(f) = 1/f² and random phases, or of two pure tones with frequencies of 100 and 1000 Hz. The CEI decomposition of each signal is first computed, and the CEI power spectrum of each mode is then computed. The operation is repeated 100 times with 100 different pink-noise excerpts. The power spectra of each mode are then averaged.
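A possible sketch of the pink-noise condition (here mixed with the two pure tones, as in Fig. 3c) is given below, assuming the cei() function from the earlier sketch; the frequency-domain synthesis assumes S(f) = 1/f² denotes the power spectrum, so a 1/f amplitude roll-off is used.

```python
# Sketch: average the per-mode CEI power spectra over 100 pink-noise draws
# mixed with two pure tones at 100 Hz and 1000 Hz.
import numpy as np
from cei_sketch import cei                               # hypothetical module with the earlier sketch

fs = n = 16000
freqs = np.fft.rfftfreq(n, 1 / fs)
t = np.arange(n) / fs
tones = np.cos(2 * np.pi * 100 * t) + np.cos(2 * np.pi * 1000 * t)

avg_spectra = 0
for _ in range(100):
    amplitude = np.zeros_like(freqs)
    amplitude[1:] = 1 / freqs[1:]                        # 1/f amplitude (1/f^2 power) roll-off
    phase = np.exp(2j * np.pi * np.random.rand(len(freqs)))
    pink = np.fft.irfft(amplitude * phase, n)
    pink /= np.std(pink)                                 # normalize the noise level
    modes, _ = cei(pink + tones, fs, n_modes=6)
    avg_spectra = avg_spectra + np.abs(np.fft.rfft(modes, axis=1)) ** 2
avg_spectra /= 100                                       # averaged power spectrum of each mode
```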

Sound database

Speech and environmental sounds are 1500 excerpts from the Making Sense of Sounds (MSoS) challenge, which are drawn from the Freesound database67 and the ESC-50 dataset68. We chunked the 1500 excerpts into segments of 800 ms, leading to 148,500 short sound excerpts.

Equivalent linear filter bank

Each excerpt e(t) of the 148,500 segments was first decomposed with CEI into 6 modes. Each mode m(t) was then considered as the result of a linear convolution between the excerpt and a filter with impulse response h(t):

$$m\left(t\right)=(e* h)(t)$$
(1)

h(t) was fitted by computing the cross-correlation between m(t) and e(t). In order to evaluate the correspondence of such a filter bank with the properties of the cochlear filter bank model, each filter h(t) was modeled as a bandpass filter whose center frequency and bandwidth were determined from its transfer function, see Supplementary Fig. 2.
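The sketch below illustrates one way to characterize such an equivalent filter; the regularized spectral division stands in for the cross-correlation fit described above, and the peak / −3 dB read-out of center frequency and bandwidth is a crude illustrative choice, not the authors' exact procedure.

```python
# Sketch: estimate, for one excerpt e and one CEI mode m, an equivalent band-pass
# filter and read off its center frequency and -3 dB bandwidth.
import numpy as np

def equivalent_filter_properties(e, m, fs, eps=1e-8):
    E = np.fft.rfft(e)
    M = np.fft.rfft(m)
    H = M * np.conj(E) / (np.abs(E) ** 2 + eps)      # m = e * h  =>  H ~ M / E (regularized)
    gain = np.abs(H)
    freqs = np.fft.rfftfreq(len(e), 1 / fs)

    peak = np.argmax(gain)
    center_frequency = freqs[peak]
    half_power = gain[peak] / np.sqrt(2)             # -3 dB level
    band = freqs[gain >= half_power]                 # crude read-out of the pass band
    bandwidth = band.max() - band.min()
    return center_frequency, bandwidth
```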

Frequency selectivity

Noise masker

We simulated such an experiment for 5 frequencies f0 (250 Hz; 500 Hz; 1000 Hz; 1500 Hz; 2000 Hz) and 15 frequency ratios Δf/f0 between 0 and 1. We applied CEI for different bandwidths to determine a detection threshold for the sinusoids. Subjects were replaced by a virtual listener whose responses were simulated by CEI representations and the detection process described below. For a given frequency and a given ratio, the masking threshold was estimated with an adaptive staircase procedure. Specifically, to simulate the detection task, we computed the CEI decomposition of each sinusoid + noise mixture and the Euclidean distance between the target sinusoid and each CEI mode. We then kept only the minimum of these distances, which is supposed to reflect the best ability to detect the sinusoid in the mixture. To test whether the sinusoid is detected, we then computed in the same way the Euclidean distance between each CEI mode and the same noise without the target signal. We consider that the sinusoid is detected when the distance to the target sinusoid is smaller than the distance to the noise alone. The noise power was adjusted with an adaptive staircase method (3 down/1 up, averaging the last 30 reversals) so that the detection rate was 80%. For each frequency and each notched noise, we repeated the adjustment 10 times. The average SNR was then computed for each frequency and fitted with a 10th-order polynomial.
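The skeleton below sketches this virtual-listener procedure, assuming the cei() function from the earlier sketch; the decision rule (a mode is closer to the target than to the noise) is our reading of the description above, the SNR is varied instead of the raw noise power, and the threshold is simplified to the mean of the last trial levels rather than the last 30 reversals.

```python
# Sketch of the simulated detection task and a 3-down/1-up staircase.
import numpy as np
from cei_sketch import cei                               # hypothetical module with the earlier sketch

def detected(target, noise, snr_db, fs):
    gain = 10 ** (-snr_db / 20)                          # noise gain (assumes equal-RMS target/noise)
    modes, _ = cei(target + gain * noise, fs, n_modes=6)
    d_target = min(np.linalg.norm(m - target) for m in modes)
    d_noise = min(np.linalg.norm(m - gain * noise) for m in modes)
    return d_target < d_noise                            # some mode resembles the target more

def staircase(target, noise, fs, snr_db=20.0, step=2.0, n_trials=200):
    """3-down/1-up staircase, converging near 79% correct."""
    levels, correct_run = [], 0
    for _ in range(n_trials):
        if detected(target, noise, snr_db, fs):
            correct_run += 1
            if correct_run == 3:                         # three correct in a row: harder
                snr_db -= step
                correct_run = 0
        else:                                            # one error: easier
            snr_db += step
            correct_run = 0
        levels.append(snr_db)
    return np.mean(levels[-30:])                         # simplified threshold estimate
```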

Pure tone masker

Two-tone separation experiments were performed with CEI. Signals composed of two pure tones of frequencies f0 and f1 were generated and analyzed with CEI. In order to determine whether CEI separates the two frequencies, we first define an index of separability:

$$d=\frac{\left\Vert m_{1}(t)-\cos (2\pi \alpha f_{0}t)\right\Vert _{L^{2}}}{\left\Vert \cos (2\pi \alpha f_{0}t)\right\Vert _{L^{2}}}$$
(2)

where m1(t) is the first extracted mode of the CEI. m1 indeed corresponds to the highest frequency of the two pure tones, as CEI extracts the highest frequency first; it is therefore expected to equal cos(2παf0t) when the two pure tones are separated.
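A minimal sketch of this index, following Eq. (2) and assuming the cei() function from the earlier sketch:

```python
# Sketch of the separability index d of Eq. (2): the first CEI mode is compared
# with the higher of the two pure tones (frequency alpha * f0).
import numpy as np
from cei_sketch import cei                               # hypothetical module with the earlier sketch

def separability_index(f0, alpha, fs=16000, duration=1.0):
    t = np.arange(int(fs * duration)) / fs
    signal = np.cos(2 * np.pi * f0 * t) + np.cos(2 * np.pi * alpha * f0 * t)
    reference = np.cos(2 * np.pi * alpha * f0 * t)       # higher-frequency component
    modes, _ = cei(signal, fs, n_modes=6)
    m1 = modes[0]                                        # first mode carries the highest frequencies
    return np.linalg.norm(m1 - reference) / np.linalg.norm(reference)
```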

Sifting

To fit the roughness curves to the theoretical ones, we introduce the sifting iteration, a parameter from the original version of the EMD involved in the extraction of each mode. In CEI, we stop the interpolative envelope extraction after one iteration: the interpolative envelope is computed and then subtracted from the original signal to provide the first mode, and the process is repeated on the interpolative envelope to compute the second mode, and so on. For each mode, EMD instead continues the process until the residual, i.e., the difference between the signal and the interpolative envelope, becomes completely flat or monotonic. In EMD, flatness is defined by a threshold, with no prediction of how many iterations are necessary to converge. This process of unknown complexity, called sifting, makes EMD different from CEI. Here, we tested the influence of the number of sifting iterations on the simulated roughness curves by fixing it to arbitrary values. See Supplementary Fig. 3 for a detailed scheme of the process. We adjusted the number of sifting iterations so that the simulated roughness maximum matches the frequency interval leading to the theoretical roughness maximum according to Plomp's model13. In the simulation, the maximum roughness value is considered reached when the separability index falls below the arbitrarily chosen value d = 0.98.
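The following sketch shows how a fixed number of sifting iterations can be inserted into the mode extraction, assuming the interpolative_envelope() helper from the earlier cei() sketch; with n_sift = 1 it reduces to plain CEI.

```python
# Sketch of mode extraction with a fixed number of sifting iterations.
import numpy as np
from cei_sketch import interpolative_envelope            # hypothetical module with the earlier sketch

def cei_with_sifting(x, fs, n_modes=6, n_sift=1):
    t = np.arange(len(x)) / fs
    residual = np.asarray(x, dtype=float).copy()
    modes = []
    for _ in range(n_modes):
        candidate = residual
        for _ in range(n_sift):                           # repeated envelope removal (sifting)
            candidate = candidate - interpolative_envelope(candidate, t)
        modes.append(candidate)
        residual = residual - candidate                   # what is left feeds the next mode
    return np.array(modes), residual
```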

Consonance and dissonance

We considered pairs of harmonic tones composed of two sawtooth signals with fundamental frequencies f0 and f1 = αf0, with α taking 104 ratio values between the two frequencies. Typical ratios are denoted on the interval axis (abscissa): P1 (1:1, perfect unison, α = 1, f1 = f0), m2 (16:15, minor second), M2 (9:8, major second), m3 (6:5, minor third), M3 (5:4, major third), P4 (4:3, perfect fourth), tt (7:5, tritone), P5 (3:2, perfect fifth), m6 (8:5, minor sixth), M6 (5:3, major sixth), m7 (9:5, minor seventh), M7 (15:8, major seventh), P8 (2:1, perfect octave, α = 2, f1 = 2f0). Thirteen f0 values were considered (261.63 Hz; 277.18 Hz; 293.66 Hz; 311.13 Hz; 329.63 Hz; 349.23 Hz; 370.00 Hz; 392.00 Hz; 415.31 Hz; 440.00 Hz; 466.17 Hz; 493.89 Hz; 523.26 Hz). Ten thousand pairwise comparisons were then simulated by computing the following separability index between the averaged CEI spectra of the two signals s(f0,t) and s(αf0,t):

$$d={\rm{mse}}\left(\log_{10}\left({\rm{CEI}}\left(s(f_{0},t)+s(\alpha f_{0},t)\right)\right),\,\log_{10}\left({\rm{CEI}}\left(s(f_{0},t)\right)+{\rm{CEI}}\left(s(\alpha f_{0},t)\right)\right)\right)+\varepsilon$$
(3)

where mse is the mean square error and ε is a Gaussian random noise added to the separability in order to introduce variability in the decision. A win matrix counting how many times a given interval was judged more dissonant than another was then computed, and a consonance/dissonance score was finally obtained with the Bradley–Terry algorithm69.
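A minimal sketch of the last step is given below; it fits the Bradley–Terry model to the win matrix with a standard minorization-maximization update (an illustrative choice, not necessarily the exact estimator used in ref. 69), assuming every interval wins at least one comparison.

```python
# Sketch: Bradley-Terry dissonance scores from a win matrix W, where W[i, j]
# counts how often interval i was judged more dissonant than interval j.
import numpy as np

def bradley_terry_scores(W, n_iter=200):
    n = W.shape[0]
    p = np.ones(n)                                    # latent "dissonance strength" per interval
    wins = W.sum(axis=1)                              # total wins of each interval
    for _ in range(n_iter):
        denom = np.zeros(n)
        for i in range(n):
            for j in range(n):
                if i != j:
                    n_ij = W[i, j] + W[j, i]          # number of comparisons between i and j
                    denom[i] += n_ij / (p[i] + p[j])
        p = wins / denom                              # MM update (Hunter-style)
        p /= p.sum()                                  # normalize for identifiability
    return p                                          # higher score = judged more dissonant

# Consonance scores, as in the text, can then be obtained as 1 minus the
# (normalized) dissonance scores.
```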

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.