Restoring speech intelligibility for hearing aid users with deep learning

Almost half a billion people worldwide suffer from disabling hearing loss. While hearing aids can partially compensate for this, a large proportion of users struggle to understand speech in situations with background noise. Here, we present a deep learning-based algorithm that selectively suppresses noise while maintaining speech signals. The algorithm restores speech intelligibility for hearing aid users to the level of control subjects with normal hearing. It consists of a deep network that is trained on a large custom database of noisy speech signals and is further optimized by a neural architecture search, using a novel deep learning-based metric for speech intelligibility. The network achieves state-of-the-art denoising on a range of human-graded assessments, generalizes across different noise categories and, in contrast to classic beamforming approaches, operates on a single microphone. The system runs in real time on a laptop, suggesting that large-scale deployment on hearing aid chips could be achieved within a few years. Deep learning-based denoising therefore holds the potential to improve the quality of life of millions of hearing impaired people in the near future.


Introduction
Hearing loss is a debilitating condition that is associated with a large range of negative health outcomes3, including higher levels of social isolation, dementia, depression, cortical thinning4 and increased mortality5. Nevertheless, over 80% of people who would benefit from use of hearing aids do not wear them6, with the majority of hearing aid owners who do not wear them citing difficulties with hearing in noisy situations as a main problem1,2.
Noise reduction in most commercially available hearing aids is done by introducing spatial selectivity, e.g., by beamforming, which improves speech intelligibility for frontal sources in the presence of predominantly non-frontal noise7–10. Non-spatial (single-microphone) noise reduction algorithms employed in hearing aids have so far not been able to provide improvements in speech intelligibility7,11–16. A few recent studies have shown that deep learning-based denoising17–19 or separation of multiple competing speakers20,21 can provide improvements in speech intelligibility for cochlear implant users22 and hearing aid users with severe-to-profound hearing loss6 under fixed signal-to-noise ratio (SNR) conditions17–19. For the majority of hearing aid users, who have less severe hearing loss6, it is more challenging to provide intelligibility improvements through denoising. A very recent study has shown the ability of a deep learning-based denoising system to moderately improve speech intelligibility for hearing aid users23. Our work builds upon these exciting results and demonstrates that deep learning-based denoising may be used to provide large improvements in speech intelligibility for hearing aid users in the near future.
To be adopted into real-world use in hearing aids, a denoising system must: 1) work for a wide variety of a priori unknown speakers, noise types, and SNR values; 2) process sound in real time; and 3) benefit users with mild to severe hearing loss. In this study we present a new deep learning-based denoising system that simultaneously addresses all of these points. Based on 150,000 human ratings of three datasets covering a wide range of speakers, noises, and SNRs, our model improves upon state-of-the-art denoising systems while being able to run in real time on a laptop. Most importantly, in live intelligibility tests with dynamically adapting SNRs, our system improves speech intelligibility for hearing aid users with moderate-to-severe hearing loss to levels comparable to those of normal hearing listeners without our denoising system.

Deep learning-based noise reduction
Our noise reduction system comprises three key components: a denoising network; metrics that reflect human auditory perception; and an algorithm to find network architectures that maximally improve the quality of noisy speech as determined by these metrics (Figure 1). The denoising network has a U-Net24 architecture and predicts a complex-valued ideal ratio mask from the short-time Fourier transform of noisy speech signals (see Methods). The U-Net is trained on tens of thousands of hours of noisy speech to enhance the speech signal and mask unwanted background noise, using a mean squared error loss. Since this loss does not reflect human perception well, we evaluate the network performance using a novel deep learning-based metric that estimates human acoustic perception. To optimize the U-Net architecture, we performed an evolutionary architecture search25–28 guided by our deep learning-based metric.
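To make the masking step concrete, the following minimal sketch shows how a complex ratio mask relates the clean and noisy STFTs and how a predicted mask is applied; the function names and the epsilon regularizer are illustrative, not the paper's implementation.

```python
# Minimal sketch of complex ratio masking; names and the eps regularizer are
# illustrative, not taken from the paper's implementation.
import numpy as np

def ideal_complex_ratio_mask(clean_stft: np.ndarray, noisy_stft: np.ndarray,
                             eps: float = 1e-8) -> np.ndarray:
    """Training target: per-bin complex ratio of clean to noisy STFT."""
    return clean_stft / (noisy_stft + eps)

def enhance(noisy_stft: np.ndarray, predicted_mask: np.ndarray) -> np.ndarray:
    """Apply the predicted mask; complex multiplication adjusts both the
    magnitude and the phase of each time-frequency bin."""
    return predicted_mask * noisy_stft
```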
Human auditory perception of the quality of algorithmically synthesized content is commonly assessed by "mean opinion scores" (MOS) from human users. As obtaining MOS is time-consuming and expensive, they are typically only used to compare the performance of algorithms once they have been optimized, but not for the optimization procedure itself29,30. In particular, human MOS are prohibitively expensive to guide an architecture search, where hundreds or thousands of prototype networks must be evaluated. We address this challenge by gathering approximately 100,000 MOS from human subjects and training a neural network to predict the MOS given a noisy or denoised speech sample and the associated clean version (Methods). The predicted MOS (pMOS) approximates human perception well and can be computed quickly enough to act as a target in the optimization loop, and is therefore used to guide the evolutionary search algorithm (Figure 1). We ran the neural architecture search until the improvement in performance plateaued, after which we trained the best-performing network for over 2.2 million steps (the equivalent of over 2 years of audio).
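The paper does not spell out the search algorithm beyond citing evolutionary approaches25–28; as one plausible variant, the sketch below implements a regularized-evolution loop in which candidates are scored by their pMOS gain. `pmos_gain` and `mutate` are hypothetical stand-ins for briefly training and scoring a candidate, and for perturbing its architecture.

```python
import random

def evolutionary_search(init_population, pmos_gain, mutate, n_rounds=100):
    """Toy regularized-evolution loop scored by predicted-MOS improvement.

    `pmos_gain(arch)` is assumed to train the candidate briefly and return
    the mean pMOS gain of denoised over noisy samples on a validation set.
    """
    population = [(arch, pmos_gain(arch)) for arch in init_population]
    for _ in range(n_rounds):
        # tournament selection: mutate the best of a random subset
        parent = max(random.sample(population, k=min(8, len(population))),
                     key=lambda pair: pair[1])[0]
        child = mutate(parent)
        population.append((child, pmos_gain(child)))
        population.pop(0)  # age-based removal (regularized evolution)
    return max(population, key=lambda pair: pair[1])[0]
```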
Figure 1. Training pipeline of the denoising system using a mean opinion score (MOS)-estimator-guided neural architecture search. The denoising network is trained to predict denoised outputs from mixed speech and noise input STFTs. To optimize the remaining error for human acoustic perception, the denoising network architecture and hyperparameters are selected by an evolutionary neural architecture search25–28. This search is guided by an MOS estimator, a deep neural network trained on a dataset generated from around 100,000 human-rated audio files.

Human evaluation of "deep" denoising
We compared our denoising algorithm to state-of-the-art deep learning-based models (Sound of Silence, Demucs, and MHANet)31–33 using three test datasets, one of which (Valentini34) is commonly used as a performance baseline. We created the two other test datasets (WHAMVox easy and WHAMVox hard) from the publicly available VOXCeleb2 speech35 and WHAM! noise36 datasets (see Methods) to provide a more varied set of speakers and noise types and a wider range of SNR values. The test sets34* and example processed sound files† are available online.
We collected MOS ratings from human listeners for the same sound samples with noisy speech (as a baseline), enhanced by our denoising algorithm, and enhanced by the comparison models (Figure 2A). Our denoising system achieves higher human MOS than the comparison models on all three datasets (p < 0.0005, Wilcoxon signed-rank test).
Human listeners rated the samples processed with our denoising model as better than those of other models for a wide range of signal-to-noise ratios (SNRs; Figure 2B), which is crucial for translating the models to real-world applications. In particular, we provide substantial improvements for the SNR range between -5 and 0 dB, where the other models achieve limited improvements. Our algorithm restores the quality of sounds at -5 dB SNR to the level of unprocessed samples at 7 dB SNR, reflecting an important use case for hearing aid users in a very busy bar or café. Processed sounds at 1 dB SNR show the same human-rated MOS as clean unprocessed sounds at 20 dB SNR or more. Finally, our denoising system also improves the quality of almost-clean sounds (Figure 2B at 20 dB SNR), e.g., by reducing breathing sounds and the white noise often present in recordings.
We also compared the models on commonly used computational metrics and our pMOS estimator. The improvements achieved in human MOS (Figure 2A, B) were not consistently reflected in similar increases in the computational metrics (Figure 2C). In some cases, computational metrics conflicted with human perception (e.g., Valentini dataset, CSIG/COVL vs. human MOS). Nevertheless, pMOS tracks human perception better than traditional speech quality metrics (correlation of 0.90 between human MOS and pMOS vs. 0.82–0.86 for PESQ, CSIG, COVL, and CBAK; see Methods). These differences highlight the need to validate denoising methods on human data to avoid potential overfitting to specific computational metrics.

Improving speech intelligibility
Current noise suppression systems commonly used in hearing aids improve hearing comfort but do not improve speech intelligibility7,11–16. To assess the impact of our denoising system on speech intelligibility, we used the Oldenburger Satztest (OLSA)37 (see Methods). The OLSA measures individual speech reception thresholds (SRTs), defined as the SNR at which a subject correctly identifies 50% of words; lower SRTs are better. We tested three different noise conditions, mixing clean speech with: 1) speech-shaped 'OLSA' noise, 2) restaurant noise, and 3) traffic noise. Each condition was tested with and without our denoising system, for both normal hearing and hearing impaired listeners, i.e., a total of 12 conditions (Figure 3A). Hearing impaired subjects wore their own hearing aids, adjusted to their individual hearing profile.
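For intuition, the sketch below shows a simple adaptive SNR track converging on the 50%-words-correct point; the real OLSA procedure uses a word-score-dependent step size, so the fixed step and the `present_sentence` callback here are simplifying assumptions.

```python
def measure_srt(present_sentence, n_sentences=20, start_snr=0.0, step_db=2.0):
    """Toy adaptive SNR track converging on the 50%-words-correct point.

    `present_sentence(snr)` is a placeholder that plays one 5-word OLSA
    sentence at the given SNR and returns the number of correctly repeated
    words (0-5). The real OLSA procedure adapts the step size; this sketch
    uses a simple fixed-step up-down rule for illustration.
    """
    snr, track = start_snr, []
    for _ in range(n_sentences):
        correct = present_sentence(snr)
        track.append(snr)
        # fewer than half the words correct -> make the task easier
        snr += step_db if correct < 3 else -step_db
    # estimate the SRT as the mean SNR over the converged part of the track
    tail = track[n_sentences // 2:]
    return sum(tail) / len(tail)
```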
Without noise suppression, normal hearing subjects have a median SRT of -5.8 dB for OLSA noise (mean -5.6 dB, median absolute deviation 0.48 dB). The SRTs of hearing impaired subjects are higher, with a median of -3.3 dB (mean -2.1 dB, median absolute deviation 2.4 dB). Activating our denoising system provides median changes of -3.5 dB, -3.5 dB, and -2.8 dB for hearing impaired subjects (mean changes of -4.0 dB, -4.2 dB, and -4.3 dB) for OLSA, restaurant, and traffic noise, respectively (p < 0.0005, Figure 3A, B). These improvements are greater than those reported for any other single-channel noise reduction system. For OLSA noise, they bring the median SRT of hearing impaired subjects to -6 dB SNR, i.e., levels comparable to those of normal hearing listeners (Figure 3A, left). In fact, for all three noise types, SRTs are not significantly different between normal hearing subjects without noise suppression and hearing aid users with denoising (p >= 0.1; Figure 3A). Moreover, even normal hearing users' median SRTs changed by -1.9 dB, -2.5 dB, and -1.3 dB (means of -2.0 dB, -2.2 dB, and -1.5 dB) for the three noise types when denoising was applied (p < 0.0005; Figure 3A, B). The amount of improvement is consistent across the three noise types (Figure 3B): despite large differences in SRTs across noise types before denoising, there are no significant differences in the amount of improvement across noise types within the normal hearing and hearing impaired groups (Friedman test: p > 0.05 for both).
Listeners with more difficulty in noisy situations receive greater benefit from the denoising system (Figure 3C), as indicated by the inverse correlation between the SRT improvement and the SRT without the denoising system across all subjects. Looking at the results for individual listeners, the smallest improvement in the hearing impaired population is -0.9 dB (OLSA noise), for a subject with a middle ear disorder who has a near-normal SRT of -5.1 dB without our denoising system. Conversely, we measured the strongest improvement of -16.8 dB (traffic noise) for the subject with the strongest hearing loss, who had an SRT above 10 dB without our denoising system. This suggests that with denoising, speech understanding becomes more consistent across subjects with different degrees of hearing loss. Indeed, the median absolute SRT deviation, i.e., the variation of speech reception thresholds across subjects, is reduced from 2.4 dB without denoising to 1.2 dB with denoising for OLSA noise. Thus, our denoising system reduces (negative) outliers, thereby leading to an improved "worst-case" outcome for normal hearing and hearing impaired listeners.

Discussion
In summary, we have shown that the suggested noise suppression system improves speech intelligibility for hearing impaired subjects to levels similar to those of normal hearing subjects, across a wide range of noise conditions. Almost all hearing impaired subjects improved to a similar level of speech understanding, indicating that the benefits afforded by our system are larger for individuals with a higher need. Speech intelligibility could be further enhanced by combining single-channel denoising algorithms such as ours with existing multi-channel solutions such as beamforming. Moreover, we expect that further improvements could be achieved by adapting the sound quality metric to the specific perceptual demands of hearing aid users by gathering MOS data from hearing impaired listeners, whose perception can differ substantially from that of normal hearing subjects. This would allow networks to be fine-tuned to hearing aid or cochlear implant users, opening possibilities to improve speech intelligibility in noisy environments beyond what is currently possible with hearing aids. Additionally, our denoising system currently has no built-in mechanism to identify the target speaker in a mixture of multiple speakers. Thus, our system treats multiple speakers simply as "speech" and does not attenuate any of them if they are all prominent, but suppresses simultaneous background noise. For example, if two competing talkers partially overlap at similar loudness in a café with ongoing background babble, the two talkers would remain audible but the babble would not. Separating multiple competing speakers is an active area of research38–40 and forms an additional component that could be added in future iterations of the system. For practical use, the output of our denoising system would be mixed with the input signal, e.g., 90% denoised and 10% original, to avoid a feeling of isolation of the user from their environment and to provide noise "reduction" instead of noise removal. This parameter could also easily be changed according to user preferences.
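The proposed mixing of denoised and original signal is a one-line operation; in this sketch the 90/10 split and the function name are illustrative.

```python
import numpy as np

def mix_output(denoised: np.ndarray, original: np.ndarray,
               denoise_fraction: float = 0.9) -> np.ndarray:
    """Blend the denoised signal with the raw input (e.g., 90/10) so some
    ambient sound remains audible; the fraction could be a user setting."""
    return denoise_fraction * denoised + (1.0 - denoise_fraction) * original
```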
The improvement in speech intelligibility that can be obtained by a denoising system like ours will naturally be limited by the ability of the user to understand speech in quiet environments. Similarly, our system often reaches its performance limit at very low SNRs, at which even normal hearing listeners struggle to understand speech. Conversely, as long as normal hearing subjects are able to discern the speech signals, our denoising system is able to reduce noise while maintaining high speech quality. This is a possible explanation for why our denoising system consistently improves speech intelligibility for hearing impaired listeners in noisy situations up to levels comparable to those of normal hearing listeners without the system, but barely beyond that.
Studies involving traditional noise reduction algorithms available in current hearing aids have commonly failed to show improvements in speech intelligibility for hearing aid users without relying on spatial information7,11–16. This finding is consistent across experimental conditions, including hearing aid type, speech and noise stimuli, noise reduction algorithm, language, and testing paradigm7,11–16. Despite this, such algorithms are commonly used in hearing aids because they offer an improvement in comfort and ease of listening7,13,14, and a decrease in the cognitive load required to concentrate on speech in noisy environments41–43. While we do not test for it here, we also expect improvements in cognitive load with our system, considering that it increases the effective SNR of noisy sounds by up to 16 dB43. In contrast to studies using traditional denoising algorithms, a recent study has shown that deep learning-based denoising23 on hearing aids can improve speech intelligibility for hearing impaired subjects. Our work is in line with other research in this field17–19 and shows the improvements to speech intelligibility that can be made with more powerful neural network models.

Outlook
A challenge for neural network-based denoising algorithms is their computational cost compared to the denoising methods traditionally used in hearing aids. As with most deep learning-based systems, the performance of our network improves with the available computational resources. Here, we limited the size of the network such that the algorithm runs in real time on a laptop. Hence, the computational power required to achieve the presented results is higher than what is available in current hearing aids, and scaling the technology to the point where it fits in a hearing aid still requires more powerful hardware and/or more efficient models. However, the gap is not prohibitively large, and we speculate that Moore's law and the exponential improvement in computational power per watt44 should make an implementation on a hearing aid feasible within a few years. Additionally, the rapid progress in the algorithmic efficiency of neural networks45,46 should further shorten adoption time.
In summary, we have presented a denoising system that enables hearing aid users to achieve speech-in-noise intelligibility levels comparable to those of normal hearing listeners and generalizes across noise environments. Deep learning-based denoising systems could hence facilitate an entirely new type of hearing improvement that is directionally independent and could prove useful not only for hearing impaired users, but also for normal hearing listeners who wish to suppress background noise in loud environments such as crowded restaurants or bars.

Deep learning-based speech enhancement
The denoising system uses a deep neural network (also referred to as "network", Figure 1) with a U-Net24 architecture. The U-Net consists of an encoder and a decoder separated by a bottleneck, with skip connections running from the encoder to the decoder.
The encoder compresses the input using strided convolutions, and the decoder reconstructs the compressed data back to its original dimensions using transposed convolutions. The encoder consists of 3 residual convolutional blocks, each containing several convolutional layers, after which a strided convolution downsamples the data to decrease its dimensions. The blocks differ slightly in each stage and were selected by an evolutionary architecture search25–28. The search variables include the number of layers in each block, the type of layer, the kernel sizes, the number of filters, and the dilation rate. One type of layer (e.g., a 3×3 convolution) is repeated in each block and combined with a residual connection after each layer repetition. If necessary, an additional projection (dense) layer is used to adapt the feature dimensions of the skip connection to match the shape of the tensor it is added to.
The decoder has its own 3 uniquely searched blocks with the same search space as the encoder. However, it replaces the downsampling strided convolution after each block with a transposed convolution to upsample the processed data back to its original dimensions. Additionally, the first residual connection of each block consists of a skip connection from the equivalent layer in the encoder.
The bottleneck consists of two parts: a convolutional block, as above but without strided or transposed convolutions, followed by a hand-designed pyramid pooling block47.
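A minimal Keras sketch of this encoder-bottleneck-decoder skeleton is shown below. The filter counts, kernel sizes, block depths, and input shape are placeholders (in the paper these were chosen by the architecture search), and the pyramid pooling block is omitted for brevity.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters, n_layers=2):
    """Residual block of repeated 3x3 convolutions (one searched layer type);
    a 1x1 projection adapts the skip's feature dimension when needed."""
    skip = layers.Conv2D(filters, 1, padding="same")(x)
    for _ in range(n_layers):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Add()([x, skip])

def build_unet(input_shape=(256, 256, 2), base_filters=32):
    """Encoder -> bottleneck -> decoder with encoder-to-decoder skips. The two
    input/output channels hold real and imaginary parts; the output layer is
    linear, matching the complex-mask prediction described in the text."""
    inp = tf.keras.Input(shape=input_shape)
    x, skips = inp, []
    for depth in range(3):                        # 3 encoder blocks
        x = conv_block(x, base_filters * 2 ** depth)
        skips.append(x)
        x = layers.Conv2D(base_filters * 2 ** depth, 3, strides=2,
                          padding="same", activation="relu")(x)  # downsample
    x = conv_block(x, base_filters * 8)           # bottleneck (pooling omitted)
    for depth in reversed(range(3)):              # 3 decoder blocks
        x = layers.Conv2DTranspose(base_filters * 2 ** depth, 3, strides=2,
                                   padding="same", activation="relu")(x)
        x = layers.Add()([x, skips[depth]])       # skip connection from encoder
        x = conv_block(x, base_filters * 2 ** depth)
    out = layers.Conv2D(2, 1, padding="same")(x)  # linear output: complex mask
    return tf.keras.Model(inp, out)

model = build_unet()
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")  # as in Methods
```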
The types of layers are limited to standard convolutional layers, depth-wise convolutional layers, spatially separable convolutions, and frequency-first/time-first convolutions48. Spatially separable convolutions consist of subsequent distinct convolutional layers with a variable kernel size in one spatial dimension (i.e., time or frequency) and a constant kernel size of 1 in the other. We distinguish two settings, 1) time first and 2) frequency first, indicating whether the convolution is applied in the time or the frequency dimension first.
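A sketch of the time-first/frequency-first separable convolution is shown below, assuming inputs shaped (batch, time, frequency, channels); the kernel size is illustrative.

```python
from tensorflow.keras import layers

def separable_tf_conv(x, filters, k=5, time_first=True):
    """Two 1-D convolutions, one along time and one along frequency; their
    order is the searched 'time first' / 'frequency first' setting."""
    time_conv = layers.Conv2D(filters, (k, 1), padding="same", activation="relu")
    freq_conv = layers.Conv2D(filters, (1, k), padding="same", activation="relu")
    return freq_conv(time_conv(x)) if time_first else time_conv(freq_conv(x))
```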
Throughout the network, each layer uses a rectified linear unit (ReLU) activation function, except the last layer, which uses a linear activation function. In total, the network includes around 4 million parameters. We trained the network with a batch size of 16 samples, each 10 seconds long (cut to 1.8 seconds to match the network input size), to predict the complex-valued ideal ratio mask of each sample using the mean squared error loss, and optimized the weights with the Adam optimizer at a learning rate of 0.0001.
The network input consists of mixed speech and noise signals taken from a large database of speech and noise files. Each mixed input waveform, sampled at 22.5 kHz, is converted to the frequency domain using the short-time Fourier transform (STFT) with a frame length of 512 samples, a frame step of 128 samples, and a Hann window. The network predicts the denoised STFT, which is then converted back to a waveform. During testing, the waveform was sent to the soundcard of the testing device, which in this study was a laptop. The algorithm runs in real time on Ubuntu 16.04 on an Asus FX504GD-DM116 laptop with the following specifications: an Intel Core i5-8300H CPU, 8 GB of DDR4 RAM, and an NVIDIA GTX 1050 graphics card. We did not attempt run-time optimizations of the model and instead focused on improving the network itself during development. With additional implementation effort, similar network models should be able to run on significantly smaller systems.
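The front end can be sketched with the stated STFT parameters (frame length 512, frame step 128, Hann window); the model's expected input layout (real and imaginary parts stacked as channels) is an assumption for illustration.

```python
import tensorflow as tf

def denoise_waveform(waveform, model):
    """waveform: 1-D float32 tensor of samples; `model` is assumed to map a
    noisy STFT (real/imag channels) to a complex ratio mask of equal shape."""
    stft = tf.signal.stft(waveform, frame_length=512, frame_step=128,
                          window_fn=tf.signal.hann_window)
    net_in = tf.stack([tf.math.real(stft), tf.math.imag(stft)], axis=-1)
    mask = model(net_in[tf.newaxis])[0]            # add/drop the batch dim
    mask = tf.complex(mask[..., 0], mask[..., 1])
    denoised_stft = mask * stft                    # complex masking
    return tf.signal.inverse_stft(
        denoised_stft, frame_length=512, frame_step=128,
        window_fn=tf.signal.inverse_stft_window_fn(128))
```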

Deep learning-based speech quality metric
Our custom speech quality metric is generated by a multi-stage neural network trained to predict the opinion scores that human listeners on Amazon Mechanical Turk (MTurk) gave to noisy speech files.

Ratings Dataset
The MOS ratings were obtained using Amazon Mechanical Turk. To ensure that ratings were of high quality, only workers with a Mechanical Turk Masters qualification were accepted to work on the task. Workers rated sound files on a scale of 1.0 (bad) to 5.0 (excellent; see Extended Data Table 1) in batches of 17 samples. Among the 17 samples in each batch, we included 2 "baseline samples", one with a correct rating of 1.0 and the other with a correct rating of 5.0, which had to be rated correctly by the worker for the batch to be accepted into the dataset. Note that on a 1-to-5 MOS scale, ratings saturate at higher SNRs, both because quality cannot be rated above the maximum of 5 and because even clean recorded samples receive a rating of 5 only about 50% of the time.
The files to be rated were processed by a range of different denoising models. Models were selected to cover a large diversity in network architectures and training progress (i.e., from models that were trained for only a few steps to models trained for millions of steps).
Altogether, the final dataset consisted of 15,962 files with 94,225 ratings, giving an average of 5.9 ratings per file. When calculating the average rating for each sample and category, we take the trimmed mean of all ratings, i.e., the lowest and highest rating(s) are ignored.
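As a small illustration of the aggregation, the snippet below computes a trimmed mean over one file's ratings; the exact trim fraction is an assumption, chosen here so that one rating is dropped from each end of six ratings.

```python
import numpy as np
from scipy import stats

ratings = np.array([2.0, 3.5, 3.0, 4.0, 3.5, 1.0])  # ratings for one file
# Drop the lowest and highest ratings before averaging; the trim fraction
# used in the paper is not stated, so 1/6 here is illustrative.
file_mos = stats.trim_mean(ratings, proportiontocut=1 / 6)
```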

Synthetic ratings dataset
To increase the size of the collected human-rated dataset, we trained a multilayer perceptron (MLP) to predict the MOS of each labelled sound file from a range of objective speech quality metrics, including SNR, mean squared error (MSE), mean log error, WSS, and CEP (all described in 49), as well as STOI50, ViSQOL51, and PESQ52. We then used the trained MLP to predict the MOS of 600,000 unrated processed or unprocessed mixed speech and noise files from our datasets. We used this larger synthetic dataset of MLP-rated files to train the MOS prediction network described in the following section.
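A minimal sketch of this metric-to-MOS regressor and its use for synthetic labelling is shown below; the file names, layer sizes, and feature layout are hypothetical.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical precomputed arrays: one row of objective metric scores
# (SNR, MSE, STOI, PESQ, ...) per rated file, plus the human MOS labels.
X = np.load("metric_features.npy")   # shape: (n_files, n_metrics)
y = np.load("human_mos.npy")         # shape: (n_files,)

mlp = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)
mlp.fit(X, y)

# Label a large pool of unrated files to build the synthetic training set
# for the MOS prediction network described in the next section.
X_unrated = np.load("unrated_metric_features.npy")  # hypothetical
synthetic_mos = mlp.predict(X_unrated)
```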

Speech quality estimation network architecture
The MOS network is an intrusive metric estimator that receives both a clean speech file and a corresponding (processed or unprocessed) mixed speech and noise file as inputs. The STFTs and mel spectrograms of the inputs are passed to the network, which predicts the MOS of the noisy sample. The network was trained on the synthetic dataset of 600,000 clean and noisy files described in the previous section.
Once trained, the MOS network evaluates the performance of the denoising network. The MOS network predicts one score for the unprocessed mixed speech and noise file and one for the same file after processing by the denoising network. The difference between the two predicted scores (delta-MOS) is taken as the evaluation metric. The evolutionary neural architecture search uses the delta-MOS as a target to find good hyperparameters and network structures for the denoising network.
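The delta-MOS objective reduces to a difference of two estimator calls; the `mos_net(reference, sample)` signature below is an assumed API for the intrusive estimator.

```python
def delta_mos(mos_net, clean_stft, noisy_stft, denoised_stft):
    """Search objective: predicted-MOS gain of the denoised output over the
    unprocessed input, both scored against the same clean reference."""
    return mos_net(clean_stft, denoised_stft) - mos_net(clean_stft, noisy_stft)
```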

Denoising evaluation test sets
We evaluated the performance of our denoising network and of several state-of-the-art models, which were either downloaded from the authors' websites where pretrained models were available, or reimplemented by us after requesting code from the authors.

Figure 2. Human mean opinion scores (MOS) of state-of-the-art denoising methods. A) Comparison between current denoising models on 3 publicly available test sets34; 150,000 human MOS in total (500 sound files per dataset, 20 human raters per file, 5 models). B) Dependence of human MOS on signal-to-noise ratio (1,500 samples from Figure 2A). Shading: 25th and 75th percentiles; no shading for SNR values with too few audio samples. SNR values are rounded to the nearest integer. C) Comparison of denoising methods using other common speech quality metrics on a 1 to 5 scale.

Figure 3. Reducing speech reception thresholds (SRT) on the OLSA test using our denoising system for hearing impaired (n=16) and normal hearing (n=17) subjects. A) SRTs for different noise types. Significance levels are shown for the Wilcoxon signed-rank test for matched pairs and the Mann-Whitney U test for independent samples. B) SRT improvements for each of the six testing conditions (Friedman test). C) SRT improvement vs. SRT without denoising. Linear regression fits for each of the three noise types, across all subjects (slopes m = -0.69, -0.70, -0.73).