Abstract
The reservoir computing neural network architecture is widely used to test hardware systems for neuromorphic computing. One of the preferred tasks for benchmarking such devices is automatic speech recognition. This task requires acoustic transformations from sound waveforms with varying amplitudes to frequency-domain maps that can be seen as feature extraction techniques. Depending on the conversion method, these transformations sometimes obscure the contribution of the neuromorphic hardware to the overall speech recognition performance. Here, we quantify and separate the contributions of the acoustic transformations and the neuromorphic hardware to the speech recognition success rate. We show that the nonlinearity in the acoustic transformation plays a critical role in feature extraction. We compute the gain in word success rate provided by a reservoir computing device compared to the acoustic transformation only, and show that it is an appropriate benchmark for comparing different hardware. Finally, we experimentally and numerically quantify the impact of the different acoustic transformations for neuromorphic hardware based on magnetic nano-oscillators.
Introduction
Artificial neural network algorithms outperform humans on recognition tasks like image or speech recognition by leveraging deep networks of interconnected nonlinear units called formal neurons^{1}. The goal of neural networks is to extract features and classify input data through learned nonlinear transformations. Running such algorithms on a classical computer is energetically costly; to overcome this issue, neuromorphic approaches^{2,3} propose to implement them physically. In particular, reservoir computing^{4,5} is a kind of recurrent neural network that has been widely used to test the efficiency of hardware for neuromorphic computing^{6,7,8} because it has a simplified architecture and learning procedure. The input is sent to a neural network with fixed recurrent connections called a reservoir. The goal of the reservoir is to separate the different kinds of inputs such that, after this transformation, the classification can be done by a linear transformation. The responses of the reservoir neurons are combined linearly with trained connections to construct the output. Since the connections in the reservoir are random and fixed, the reservoir is easy to fabricate in hardware; the output connections, often emulated in software, are then trained with linear regression.
Speech recognition is a widely used class of benchmark tasks performed to test the efficiency of a neural network. It is especially employed in the case of reservoir computing because the recurrent connections of the reservoir create an intrinsic memory that is useful to classify time-varying inputs. Generally, this task requires frequency decomposition^{9,10,11} prior to the neural network because the acoustic features are contained in the frequency rather than in the amplitude of the time-varying signal. These decompositions return the amplitude of the signal in different frequency channels as a function of time. The neural network then extracts the acoustic features contained in the frequency information. Several frequency decomposition methods have been reported in the literature: Mel-frequency cepstral coefficients (MFCC) and Lyon’s cochlear model (cochleagram) are the most common methods since they mimic the filtering that occurs biologically^{9,12,13}. However, the actual contribution of the acoustic filter to the total speech recognition rate is generally not investigated while performing speech recognition benchmarks with reservoir computing hardware, even if its influence on the final recognition rate may not be negligible^{8}. Furthermore, both of these methods were developed before reservoir computing became popular and, thus, they were designed to extract the useful features of an audio signal independently of modern machine learning.
Here, we first show how the choice of filtering method drastically affects the final speech recognition rate. We quantify the respective contributions of the acoustic filtering and the neural network for a spoken digit recognition task using four frequency decomposition methods with different nonlinear characters: Lyon’s ear cochleagram, the MFCC filter, the linear spectrogram \((\Re ({\rm{Spectro}}))\), and Spectro HP \(({\rm{Spectro}}\,{\rm{HP}}=\sin \sqrt{\Re ({\rm{Spectro}})}\,\cos \sqrt{\Im ({\rm{Spectro}})})\). In a first step, we show that the cochleagram, the Spectro HP and the MFCC filter are powerful standalone feature extractors that can achieve by themselves (without additional processing by a neural network) very high recognition levels: up to 95.8%, 89.0%, and 77.2% for the cochleagram, Spectro HP, and MFCC, respectively. In contrast, the linear spectrogram never achieves recognition levels statistically better than random sampling (10%). However, by adding various levels of nonlinearity to the real part of the spectrogram, we show a large increase of the recognition rate, from about 10% (linear) to 88% (strong nonlinearity). These results indicate that the high recognition level of the cochleagram and MFCC approaches is mainly due to the nonlinear character of these frequency decomposition methods and not to the reservoir itself.
In a second step, we evaluate the gain in recognition rate provided by a particular hardware approach to reservoir computing, based on magnetic nano-oscillators. In order to compare to other hardware implementations in the literature, we model a neural network based on a single dynamical nonlinear magnetic node in the framework of the reservoir computing approach^{6,7,8,14,15,16}. We find that the contribution of the neural network is dominant for the linear spectrogram filter and only plays a small role for the nonlinear cochleagram and MFCC filters. Finally, we present experimental results using a nonlinear and tunable magnetic nano-oscillator exhibiting excellent agreement with our simulations.
Methods
We perform a benchmark task called spoken digit recognition that is common in the reservoir computing community for software^{10} and hardware^{7,8,14,17,18,19,20,21} implementations. The input data, taken from the TI46 database, are audio waveforms of clean isolated spoken digits (0 to 9) pronounced by five different female speakers (see example in Fig. 1a), as is usual in the hardware reservoir computing community^{6,7,8,10,14,17,20,22,23,24,25}.
The chosen part of the TI46 spoken digit database contains 500 (5 speakers × 10 digits × 10 utterances) audio files, which we index using the Greek letter \(\sigma \). To perform speech recognition on these spoken digits, each audio temporal trace in the database is transformed from the time domain to a mixed time/frequency domain with different acoustic filters, two of which are known to create a better representation of human voice characteristics. These acoustic filters give rise to different instances of our speech database containing the following elements: \({X}_{f\tau ,\sigma }^{{\rm{filter}}}\) where filter ∈ \(\{{\rm{Cochlear}},{\rm{MFCC}},{\rm{Spectro}},{\rm{Spectro}}\,{\rm{HP}}\}\), \(f\) is the index of the frequency channel, and \(\tau \) is the index of a new time representation that depends on the time frame window used while performing the time- to frequency-domain transformation. The number of time steps \({N}_{\tau }\) naturally depends on the digit length, while the number of frequency channels \({N}_{f}\) depends only on the type of acoustic filter. For instance, \({N}_{f}^{{\rm{Cochlear}}}=78\), \({N}_{f}^{{\rm{MFCC}}}=13\), and \({N}_{f}^{{\rm{Spectro}}}=65\), while \({N}_{\tau }^{{\rm{Cochlear}}}\) ranges from 16 to 41, \({N}_{\tau }^{{\rm{MFCC}}}\) from 31 to 83, and \({N}_{\tau }^{{\rm{Spectro}}}\) from 24 to 67. Digits with \({N}_{\tau }\) less than the maximum value are padded with zeros.
The construction of the supervised learning task in the reservoir computing framework starts with associating each digit \({{\bf{X}}}_{\sigma }\) with its corresponding target \({{\bf{T}}}_{\sigma }\in {{\mathbb{R}}}^{{N}_{d}\times {N}_{\tau }}\), where \({N}_{d}\) is the number of categories to classify (here \({N}_{d}=10\) as the goal is to recognise the 10 different digits). Each target matrix \({{\bf{T}}}_{\sigma }\) is constructed column-wise and its \({N}_{\tau }\) columns all correspond to the same target vector \({{\bf{t}}}_{\sigma }\) (the target vector \({t}_{d,\sigma }\) with \(d\in [\mathrm{0..9}]\) is zero everywhere except at the index \(d\) corresponding to the digit number, where it is one). The \({{\bf{T}}}_{\sigma }\) matrices would allow us to perform \(\tau \)-wise recognition (on partial digits for instance), but in this study we choose to perform entire-digit-wise recognition by averaging the estimated target matrices \({\hat{{\bf{T}}}}_{\sigma }\) over the different columns (\(\tau \) direction) to end up with estimators \({\hat{{\bf{t}}}}_{\sigma }\) of the target vectors \({{\bf{t}}}_{\sigma }\) as shown below (see Eq. (3)).
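This target construction and the entire-digit-wise averaging can be sketched in a few lines of Python; this is an illustrative sketch (the function names are ours, not from the original code):

```python
import numpy as np

N_d = 10  # number of digit classes (0..9)

def make_target(digit, n_tau):
    """One-hot target matrix T_sigma of shape (N_d, N_tau):
    every column is the same one-hot target vector t_sigma."""
    t = np.zeros(N_d)
    t[digit] = 1.0
    return np.tile(t[:, None], (1, n_tau))

def digit_estimate(T_hat):
    """Average the estimated target matrix over the tau direction,
    then pick the winning class (entire-digit-wise recognition)."""
    t_hat = T_hat.mean(axis=1)
    return int(np.argmax(t_hat))
```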
Our reservoir is a time-multiplexed single device as described in ref. ^{8}. Rather than a set of \({N}_{\theta }\) physical neurons, our reservoir consists of a single physical neuron evaluated at \({N}_{\theta }\) periodic times. To input the data to these virtual neurons, we multiply each value by a time series of length \({N}_{\theta }\) consisting of ones and minus ones and send the resulting time series to the single device. The output of the reservoir is determined by the resulting state of the device at each of the \({N}_{\theta }\) times for each element of the input data string. This output is multiplied by the output weight matrix to give the results. Training consists of determining the optimum set of output weights, which can be found through straightforward linear algebra.
The key computational concept supporting the reservoir computing approach is a nonlinear dynamical transformation of the processed information, i.e. sending the input data to a new space in which simple linear algebra gives the readout of the results^{26}. In this work, the nonlinear transformation is the purpose of our spin-torque nano-oscillator, represented by the function \({\rm{STNO}}(\cdot )\) in Eq. (1). The information is encoded and injected into this nonlinear dynamical system after flattening the data and multiplying each element of the flattened \({{\bf{X}}}_{\sigma }\) by a random binary mask \({\bf{M}}\in {{\mathbb{R}}}^{{N}_{\theta }\times {N}_{f}}\) of 1’s and −1’s. This binary mask starts the time-multiplexing technique, as the value times the mask gives the input to each virtual neuron. As a result, the mask distributes the frequency content of each time step \(\tau \) of the input data into a fixed neural network layer (the reservoir) of \({N}_{\theta }\) nodes. To summarise, Eq. (1) shows the details of our Reservoir Computing implementation:
where flatten(·) takes a \(m\) by \(n\) matrix as input and outputs a vector of length \(mn\) and reshape(·) does the reverse operation, i.e. takes a vector of length \(mn\) as input and outputs a \(m\) by \(n\) matrix.
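A minimal sketch of this mask/flatten/reshape pipeline, with a placeholder nonlinearity standing in for the physical \({\rm{STNO}}(\cdot )\) response (the real device is an oscillator with memory; tanh here is purely illustrative, and the shapes are examples):

```python
import numpy as np

rng = np.random.default_rng(0)
N_f, N_tau, N_theta = 65, 40, 400

X = rng.standard_normal((N_f, N_tau))             # one filtered digit X_sigma
M = rng.choice([-1.0, 1.0], size=(N_theta, N_f))  # fixed random binary mask

# Time multiplexing: each time step tau (one column of X) is spread
# over the N_theta virtual neurons by the mask.
U = M @ X                                         # shape (N_theta, N_tau)

def stno_stub(u):
    """Placeholder for the physical STNO(.) response; the real device
    has nonlinearity AND memory, tanh is only illustrative."""
    return np.tanh(u)

# flatten -> device -> reshape, mirroring Eq. (1); column order ('F')
# keeps the virtual-neuron samples in temporal order.
V = stno_stub(U.flatten(order="F")).reshape((N_theta, N_tau), order="F")
```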
Training (learning) is performed using a simple linear classifier. In this work, good performance is achieved using the Moore-Penrose pseudo-inverse after setting up the optimisation problem for the weight matrix \({\bf{W}}\) over a subset of \({N}_{{\rm{train}}}\) digits, \({\bf{W}}[{{\bf{V}}}_{1},{{\bf{V}}}_{2},\ldots ,{{\bf{V}}}_{{N}_{{\rm{train}}}}]=[{{\bf{T}}}_{1},{{\bf{T}}}_{2},\ldots ,{{\bf{T}}}_{{N}_{{\rm{train}}}}]\):
No regularisation technique is used. The testing (recognition) step is then achieved using the computed weights \({\bf{W}}\) applied to the complementary (unseen) subset of \({N}_{{\rm{test}}}\) digits:
The estimator for a specific digit is given by \({\hat{d}}_{\sigma }={\rm{argmax}}({\hat{{\bf{t}}}}_{\sigma })\) (this corresponds to the Winner-Takes-All strategy, adequate for the present task). Digit \(\sigma \) is correctly recognised when \({\hat{d}}_{\sigma }={\rm{argmax}}({{\bf{t}}}_{\sigma })\). The main performance estimator used in this work is the Word Success Rate (WSR), the percentage of correctly recognised digits over the total number of digits to recognise (\({N}_{{\rm{test}}}\)). Another common performance estimator, useful for identifying overfitting issues, is the Mean Squared Error: \({\rm{MSE}}({\hat{{\bf{t}}}}_{\sigma })={\mathbb{E}}[{({\hat{{\bf{t}}}}_{\sigma }-{{\bf{t}}}_{\sigma })}^{2}]\).
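The pseudo-inverse training and the WSR/MSE evaluation can be sketched as follows, assuming reservoir responses \({{\bf{V}}}_{\sigma }\) and targets \({{\bf{T}}}_{\sigma }\) as defined above (helper names are ours):

```python
import numpy as np

def train_readout(V_list, T_list):
    """Readout weights from W [V_1 ... V_Ntrain] = [T_1 ... T_Ntrain],
    solved with the Moore-Penrose pseudo-inverse (no regularisation)."""
    V = np.hstack(V_list)                 # (N_theta, sum of N_tau)
    T = np.hstack(T_list)                 # (N_d,     sum of N_tau)
    return T @ np.linalg.pinv(V)

def evaluate(W, V_list, T_list):
    """Winner-takes-all word success rate (%) and mean squared error."""
    hits, mse = 0, 0.0
    for V, T in zip(V_list, T_list):
        t_hat = (W @ V).mean(axis=1)      # average estimator over tau
        t = T[:, 0]                       # every column holds the same target
        hits += int(np.argmax(t_hat) == np.argmax(t))
        mse += np.mean((t_hat - t) ** 2)
    return 100.0 * hits / len(V_list), mse / len(V_list)
```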
In all cases, the training and testing sets do not overlap. To avoid any learning bias from selecting samples randomly from the database when choosing the training and testing sets, we organise the 500 input files into 10 subsets of 50 files. Each subset contains one utterance of each digit pronounced by each of the five speakers. A purely random selection would over-represent some speakers and under-represent others. We take \(N\) utterance subsets (each of 50 audio files, one for each digit and each speaker) for training (total training set size \(N\) × 50), and the remaining \(10-N\) utterance subsets for testing (total testing set size (10 − N) × 50). To minimise the fluctuations that arise from the random split between training and testing sets, we employ a cross-validation technique and average over all possible choices. That is, when \(N\) utterance subsets are used for training, we average over the \(10!/[N!(10-N)!]\) possible ways to choose the training and testing sets. This procedure also allows us to estimate the width of the distribution of individual outcomes, indicated by the shaded regions in Fig. 2. All word success rate results reported in the paper are cross-validated test results.
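The enumeration of all balanced train/test partitions can be sketched with itertools (a sketch; the names are ours):

```python
from itertools import combinations

N_SUBSETS = 10  # utterance subsets of 50 files (one per digit per speaker)

def cross_validation_splits(n_train):
    """Yield every balanced train/test partition of the 10 utterance
    subsets; results are then averaged over all C(10, n_train) splits."""
    everything = set(range(N_SUBSETS))
    for train in combinations(sorted(everything), n_train):
        yield list(train), sorted(everything - set(train))

splits = list(cross_validation_splits(9))  # the 10 hold-one-out splits
```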
In reservoir computing, training is fast and always converges because it reduces to basic linear algebra. This behaviour stands in contrast to standard recurrent neural-network approaches, for which learning can be time-consuming and does not necessarily converge to the desired solution. In reservoir computing, the learning process only modifies the readout weights, whereas in other types of recurrent neural networks it modifies the weights of all the constituent layers through complex feedforward/backpropagation algorithms.
The contributions of the frequency filtering and of the reservoir are then analysed separately. To evaluate the impact of the frequency filtering on the input separation capability, a linear classifier is trained directly on the different frequency channels. The classification results including both the frequency filtering and the reservoir are computed by injecting the filtered input into a neural network composed of \({N}_{\theta }\) interconnected neurons. Here, we use \({N}_{\theta }=400\) input neurons that are connected to all of the frequency channels for each time interval \(\tau \) (Fig. 1c), as this number allowed reaching maximum test accuracy. In the framework of reservoir computing, these fixed connections have random weights. 400 neurons are sufficient to reach a high classification rate^{6,8}. The features of the magnetic neurons that we consider are specified in section IV. A linear classifier is trained to map the neuron outputs to the desired results. The contribution of the reservoir to the ultimate success is extracted from the results by subtracting the success rate found using only the frequency filtering methods.
Acoustic filter: role of nonlinearity
First, we compute the digit recognition rate as a function of the number of utterances used in training for the cochleagram and the MFCC methods as shown in Fig. 2(a). The recognition rate increases with the number of trained utterances and then saturates in the case of the cochlear model. It remains almost constant for the MFCC model. Both filters achieve a high recognition rate. In particular, the cochlear model is an excellent acoustic feature extractor with recognition rates up to 95.8% (for 9 trained utterances) whereas the MFCC filter is less powerful, reaching recognition rates up to 77.2%.
These filters are commonly used for speech recognition tasks because of their similarity to audio signal processing in biological ears, which perform complex frequency decompositions with strong nonlinearities. Both the MFCC and cochlear methods use nonlinearities to transform the audio data. For MFCC, the transformed representation corresponds to the log-energy of the Mel frequency filter output^{9}. In the cochleagram approach, the main nonlinear ingredient corresponds to a set of interconnected automatic gain controls^{12,13}. The successful separation of the data achieved by these filtering methods appears to be mainly due to the nonlinear character of the transformation, with only a moderate influence of the specific kind of nonlinearity (much as reservoirs with different kinds of nonlinearity can all work).
To establish the critical role for recognition performance of the nonlinearity contained in the filtering methods, we start by investigating the separation achieved by a very simple linear spectrogram filter. This filter is based on standard Fourier transforms of the audio input over finite time windows. The Fourier transform is a linear operation that outputs a real and an imaginary part. We consider only the real part in the following in order to avoid introducing nonlinearities by computing the norm. After the Fourier transforms, \({{\bf{Z}}}_{\sigma }\) is the matrix of the real parts of the spectrogram with dimension \({N}_{f}\times {N}_{\tau }\), where \({N}_{f}\) is the number of frequency channels and \({N}_{\tau }\) is the number of time steps, which depends on the particular digit. We normalise the data, \({{\bf{X}}}_{\sigma }={{\bf{Z}}}_{\sigma }/\,{\rm{\max }}(|{{\bf{Z}}}_{\sigma }|)\), so that \({X}_{f\tau ,\sigma }\in [-1,1]\) for \(f\in \{1..{N}_{f}\}\) and \(\tau \in \{1..{N}_{\tau }\}\). The normalisation is crucial to ensure that there exists at least one \({X}_{f\tau ,\sigma }\) equal either to 1 or to −1 for each \({{\bf{X}}}_{\sigma }\) when nonlinearities are introduced into the transform.
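A minimal sketch of this linear filter, assuming SciPy's STFT; the sampling rate and window length below are illustrative, not the values of the original study (nperseg = 128 happens to give the 65 one-sided frequency channels quoted above):

```python
import numpy as np
from scipy.signal import stft

def real_spectrogram(waveform, fs=12500, nperseg=128):
    """Real part of the STFT (a purely linear transform of the input),
    normalised so every element lies in [-1, 1] with at least one
    element reaching +1 or -1. fs and nperseg are assumed values."""
    _, _, Z = stft(waveform, fs=fs, nperseg=nperseg)
    X = np.real(Z)                 # N_f x N_tau, N_f = nperseg//2 + 1
    return X / np.max(np.abs(X))   # normalisation to [-1, 1]
```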
To study the influence of a nonlinear transformation on the normalised input data \({{\bf{X}}}_{\sigma }\), we apply a pointwise operation, namely the exponent \(\alpha \in {\mathbb{R}}\), giving rise to the transformation on each element \({X}_{f\tau ,\sigma }^{{\rm{filter}}}\to {({X}_{f\tau ,\sigma }^{{\rm{filter}}})}^{\alpha }\). The impact of the nonlinear exponent \(\alpha \) on the recognition rate is shown in Fig. 2(b). The recognition rate oscillates strongly as a function of the nonlinear exponent and decreases for large \(\alpha \). Some particular values of the recognition rate can be easily understood. For \(\alpha =0\): \(\forall f\) and \(\tau \), \({({X}_{f\tau ,\sigma })}^{\alpha }=1\), so it becomes impossible to discriminate between different digits \({{\bf{X}}}_{\sigma }\) and the success rate is equal to 10% (random choice). As \(\alpha \) approaches zero, the success rate decreases drastically and drops to 10%: for such exponents, all inputs get mapped to nearly the same output, making data separation impossible. For \(\alpha =1\), the real part of the spectrogram corresponds to a linear transformation of the input data, thus there is no nonlinear data separation and the word recognition rate is \(\simeq \) 10% (random choice).
The evolution shown in Fig. 2(b) can be understood by decomposing the exponent \(\alpha \) into an integer part \(n\in {\mathbb{N}}\) and a real part \(\varepsilon \in \,]-0.5,0.5]\) around \(n\): \(\alpha =n+\varepsilon \). For \({X}_{f\tau ,\sigma } < 0\), \({X}_{f\tau ,\sigma }\to {({X}_{f\tau ,\sigma })}^{n+\varepsilon }=|{X}_{f\tau ,\sigma }{|}^{n+\varepsilon }{(-1)}^{n}(\cos (\pi \varepsilon )+i\,\sin (\pi \varepsilon ))\), and for \({X}_{f\tau ,\sigma }\geqslant 0\), \({X}_{f\tau ,\sigma }\to {({X}_{f\tau ,\sigma })}^{n+\varepsilon }=|{X}_{f\tau ,\sigma }{|}^{n+\varepsilon }\). For simplicity, we choose to consider only the real part of the data obtained after applying the nonlinearity, \({R}_{f\tau ,\sigma }=\Re ({({X}_{f\tau ,\sigma })}^{n+\varepsilon })\):
From Eq. (4), for \({X}_{f\tau ,\sigma } < 0\) there is an additional factor \({(-1)}^{n}\,\cos (\pi \varepsilon )\) compared to \({X}_{f\tau ,\sigma } > 0\). Consider the particular case where \(\varepsilon =0\): then the values of \({X}_{f\tau ,\sigma }\) that were initially negative yield values \({R}_{f\tau ,\sigma }\) with the sign \({(-1)}^{n}\). So, depending on the parity of \(n\), there are two possibilities. If \(n\) is even, there is at least one value \({R}_{f\tau ,\sigma }\) equal to 1 in each \({{\bf{R}}}_{\sigma }\) (digit in the database). If \(n\) is odd, the \({{\bf{R}}}_{\sigma }\) digits originating from an input \({{\bf{X}}}_{\sigma }\) where at least one \({X}_{f\tau ,\sigma }=-1\) have a corresponding \({R}_{f\tau ,\sigma }=-1\) (at least one \({R}_{f\tau ,\sigma }=1\) otherwise). Therefore, the oscillating behaviour of the success rate shown in Fig. 2b is related to what happens to the negative input data, as shown in Eq. (4).
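A quick numerical check of Eq. (4): raising through the complex plane reproduces the \({(-1)}^{n}\cos (\pi \varepsilon )\) factor for negative inputs (a sketch; the function name is ours):

```python
import numpy as np

def nonlinear_transform(X, alpha):
    """Element-wise x -> Re(x**alpha). Computing the power through the
    complex plane makes negative inputs pick up the (-1)^n cos(pi*eps)
    factor of Eq. (4), with alpha = n + eps."""
    return np.real(np.power(X.astype(complex), alpha))
```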
The poorer performance on the recognition task for odd \(n\) comes from the fact that the phase from the Fourier transform is essentially arbitrary. When \(n\) is even, the important elements of \({{\bf{R}}}_{\sigma }\) are always positive, but for odd \(n\) they are sometimes positive and sometimes negative. The greater variation in the latter case makes it essentially impossible for the neural network to connect the input data to the appropriate output. This behaviour is most easily seen in the limit where \(n\) becomes large, as shown in the inset of Fig. 2b.
From Eq. (4) we can evaluate the effect of our nonlinear transformation for \(n\to \infty \): \(\mathop{\mathrm{lim}}\limits_{n\to \infty }|{X}_{f\tau ,\sigma }{|}^{n}=0\) for \(|{X}_{f\tau ,\sigma }| < 1\), and \(\forall n\), \(|{X}_{f\tau ,\sigma }{|}^{n}=1\) when \(|{X}_{f\tau ,\sigma }|=1\). In practice, due to numerical truncation on a computer, for \(n\gg 100\) and \(|{X}_{f\tau ,\sigma }| < 1\), \(|{X}_{f\tau ,\sigma }{|}^{n}=0\). So, for very large \(n\), the resulting vector \({{\bf{R}}}_{\sigma }\) contains only zeros and at least one element equal to 1 or −1 after the nonlinear transformation.
There are 500 digits in our spoken digit database; for very large odd values of \(n\), there are 253 \({{\bf{R}}}_{\sigma }\) vectors with one \({R}_{f\tau ,\sigma }=1\) and 247 with one \({R}_{f\tau ,\sigma }=-1\). For each of these vectors, all other elements are mapped to zero. For large even values of \(n\), all 500 \({{\bf{R}}}_{\sigma }\) contain one \({R}_{f\tau ,\sigma }=1\) with all others equal to zero. In this large-exponent limit, the classification task is simple to understand. The nonlinearity selects the largest-magnitude frequency/time component from the transformed audio file. If this component is constant between speakers, the digit can be identified. For large values of \(n\), there are two different success rate values depending on the parity of \(n\). As shown in the inset of Fig. 2b, for large values of \(\alpha \) \((\alpha > 1000)\), the success rate behaviour tends to a square function alternating between very low values (12%) around odd values of \(\alpha \), i.e. α ∈ \([2n+0.5,2n+1.5]\), and a slightly higher value (25.8%) around even values of \(\alpha \), i.e. α ∈ \([2n-0.5,2n+0.5]\), where \(n\in {\mathbb{N}}\) (for large values of \(\alpha \), when \(\alpha =n+0.5\), the success rate is not defined). The difference arises because of the random phases from the Fourier transform. For even \(n\), the phases are irrelevant, but for odd \(n\), the value sometimes gets mapped to one and sometimes to minus one, making classification much more difficult.
Overall, as shown in Fig. 2(b), for a wide range of values of \(\alpha \), the nonlinearity drastically improves the recognition rate. In particular, the recognition rate is very high for low exponents \(\alpha \). An optimum nonlinearity is reached for \(\alpha =0.2\), providing the highest recognition rate of 88%, which is comparable to that obtained for the cochleagram.
We use a \(t\)-Distributed Stochastic Neighbour Embedding (\(t\)-SNE) technique^{27} to represent our \({N}_{f}\)-channel data in a 2D plot (see Fig. 3), in order to visualise how the data separation occurs and to understand the recognition capacity of the different filtering methods. \(t\)-SNE is a nonlinear dimensionality reduction technique used for embedding high-dimensional data into a low-dimensional space of two or three dimensions. During the data reduction, the probability of two vectors being neighbours is conserved, allowing visualisation of the structure in the data. Each digit is represented by coloured dots for all data points of the utterances. For instance, Fig. 3a shows that for the spectrogram with \(\alpha =1\) (linear), for which the recognition rate is about 10% (random choice), there is no data separation: the coloured points appear randomly distributed and, in particular, digits of the same class do not form separate clusters. On the other hand, for the spectrogram with \(\alpha =0.2\) (optimal nonlinearity), data separation can clearly be seen in Fig. 3b, correlating with the better recognition rate of 88% compared to the linear spectrogram \((\alpha =1)\). Furthermore, \(t\)-SNE shows data separation capability for both the cochleagram and the MFCC filter. As shown in Fig. 3c,d, well-defined clusters corresponding to the spoken digits appear and corroborate the high recognition rates exhibited by the two filtering methods.
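Such an embedding can be produced with scikit-learn along these lines; the toy random array below merely stands in for the real \({N}_{f}\)-channel features, and the parameter values are illustrative:

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy stand-in for the filtered database: 500 feature vectors
# (flattened frequency/time maps), 50 per digit class.
rng = np.random.default_rng(0)
features = rng.standard_normal((500, 65))
labels = np.repeat(np.arange(10), 50)

# Nonlinear 2D embedding; each point would then be coloured by its
# digit label to inspect whether same-digit utterances form clusters.
embedding = TSNE(n_components=2, perplexity=30.0,
                 init="pca", random_state=0).fit_transform(features)
```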
To summarise this section, we have shown that the nonlinear transformation applied to the input data by the MFCC filter and the cochleagram plays a role similar to that of the nonlinear nodes in the reservoir neural network prior to the linear classifier. These standalone feature extractors perform data separation thanks to their internal nonlinear transformations. Indeed, we obtain recognition performance close to that of these approaches by adding a simple nonlinear transformation to the individual elements of the conventional spectrogram. Depending on the nonlinearity, the recognition rate varies strongly, from around 10% to 95.8%.
Neural network: Reservoir computing based on nonlinear oscillators
Having shown that nonlinear filtering methods can by themselves achieve high recognition rates, we turn to evaluating the gain in overall performance provided by a reservoir neural network taking the output of these acoustic filters as input. We implement the reservoir with a single nonlinear oscillator^{6}. In this approach, recurrent chains of nonlinear transformations occur in time instead of space. The loss of parallelism is compensated by time multiplexing, which in turn requires the input to be preprocessed. To do that, each point of interval \(\tau \) in Fig. 1(b) is multiplied by a random binary matrix (of dimensions \({N}_{\theta }\times {N}_{f}\)) to induce transient behaviour. This transformation is linear and does not affect the final recognition rate. Each point of the input audio file is thus converted into a sequence of duration \(\tau \) composed of \({N}_{\theta }=400\) points separated by time steps \(\theta \). The time step \(\theta \) is set shorter than the relaxation time of the oscillator to keep the oscillator in the transient regime and generate temporal cascades at each sequence \(\tau \) of the preprocessed input.
We have developed a simple model based on a nonlinear magnetic oscillator^{28}, taking into account the main ingredients for neuromorphic computing: nonlinearity (square-root dependence of the amplitude on the input current) and memory (relaxation time of the oscillator between two different output voltage levels). The evolution of the oscillator output microwave voltage \({v}_{i}^{{\rm{osc}}}\) as a function of the input voltage \({v}_{i}^{{\rm{in}}}\) at time step \(i\) can be solved numerically^{16}:
where \({T}_{{\rm{relax}}}\) is the relaxation time towards the asymptotic value \({v}_{i}^{\infty }\) given by^{29}:
with \(c\) a constant related to the initial bias condition, i.e. the initial emitted voltage of the oscillator, \(R\) the DC resistance of the oscillator, and \({I}_{{\rm{c}}}\) the threshold current above which auto-oscillations can occur. In order to simulate the oscillator response to a time-varying input \(({V}_{i}^{{\rm{in}}})\), we solve Eq. (5) numerically with the following parameters: Δ\(t=5\) ns, \({V}_{i}^{{\rm{in}}}/R=\pm 3\) mA, \({I}_{{\rm{DC}}}=6\) mA, \({I}_{{\rm{c}}}=4.9\) mA, \({T}_{{\rm{relax}}}=410\) ns. These parameters, which constitute hyperparameters of our system, are extracted from experiments as reported elsewhere^{8}.
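A sketch of this numerical integration under an explicit Euler scheme, using the parameters quoted above; the exact square-root form of the asymptotic amplitude \({v}_{i}^{\infty }\) below is our assumption standing in for Eq. (6), and the amplitude units are arbitrary:

```python
import numpy as np

# Parameters quoted in the text; the form of v_inf below (zero below
# threshold, square-root growth above it) is an assumed stand-in.
DT      = 5e-9     # integration step, 5 ns
T_RELAX = 410e-9   # relaxation time, 410 ns
I_DC    = 6e-3     # DC bias current, 6 mA
I_C     = 4.9e-3   # threshold current, 4.9 mA
C_BIAS  = 1.0      # initial-bias constant c (arbitrary units here)

def v_inf(i_in):
    """Assumed asymptotic oscillation amplitude for a total current
    I_DC + i_in: zero below threshold, square-root growth above it."""
    i_tot = I_DC + i_in
    if i_tot <= I_C:
        return 0.0
    return C_BIAS * np.sqrt(i_tot / I_C - 1.0)

def simulate(i_inputs, v0=0.0):
    """Explicit-Euler relaxation dv/dt = (v_inf - v)/T_relax, one
    input current sample per time step (sketch of Eq. (5))."""
    v, out = v0, []
    for i_in in i_inputs:
        v += DT * (v_inf(i_in) - v) / T_RELAX
        out.append(v)
    return np.array(out)
```

Because DT is much shorter than T_RELAX, a single input sample only nudges the amplitude: the oscillator stays in its transient regime and its state retains a memory of past inputs, which is the behaviour exploited by the time multiplexing.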
Even if the recognition rate of the nonlinear filters (MFCC and cochleagram) is already high, there is still room for improvement with the inclusion of a recurrent neural network. The increase in the recognition rate induced by the emulated nonlinear oscillator is shown in Fig. 4a. We determine the increase in recognition rate due to the neural network by subtracting from the total recognition rate the contribution of the acoustic filters previously calculated in Fig. 2(a) and normalising the result by the total recognition rate. The gain provided by the nonlinear oscillator is low for the nonlinear filters (training over 9 data subsets): 3.8% for the cochlear method, 9.6% for the Spectro HP method, and 22% for the MFCC method. The increase is small because the total recognition rate (filter + network) is close to a perfect success rate: up to 99.6%, 98.6%, and 99.2% for the cochleagram, Spectro HP, and the MFCC filter, respectively. For the linear spectrogram, on the other hand, the neural network drastically improves the recognition gain, up to 55.6% (for 9 trained data subsets), but the final recognition rate (filter + neural network) is not as good, around 65.2%. The simulations have been obtained for a specific neural network based on the nonlinear dynamics of an oscillator with time multiplexing in the framework of reservoir computing. As mentioned earlier, we choose this particular framework because it is frequently used for hardware implementations. However, these conclusions hold for very general types of (spatial or temporal) neural networks and learning processes, because the limitation of the gain in recognition in the case of the MFCC, Spectro HP, and cochlear filters is not due to the neural network but to the already excellent separation properties of the filtering.
We compare these simulations to the behaviour of an experimental nonlinear oscillator. In particular, we choose a magnetic nano-oscillator that was recently demonstrated to be an excellent building block for neuromorphic computing^{8,16,30}. This kind of oscillator is small (nanoscale), performs low-power computing, has a high signal-to-noise ratio enabling highly reliable computation, and allows a tunable nonlinearity through the spin-transfer-torque mechanism. Our nanoscale oscillators are circular magnetic tunnel junctions with a 6 nm thick FeB free layer and a diameter of 375 nm. For these dimensions, the magnetisation in the FeB layer has a vortex structure as its ground state. In a small region called the vortex core, the otherwise in-plane-curling magnetisation points out of the plane. Under dc current injection, the core of the vortex steadily gyrates around the centre of the dot with a frequency in the range 250 MHz to 400 MHz. Vortex dynamics driven by spin torque are well understood, well controlled, and have been shown to be particularly stable (more details can be found elsewhere^{31}).
The experimental implementation of the spoken digit recognition task is described in ref. ^{8}. The preprocessed input signal (filtered digits with time multiplexing) is generated and sent to the sample using an arbitrary waveform generator. The microwave voltage across the magnetic tunnel junction is then measured by a real-time oscilloscope, and fast oscillations are observed. The amplitude of the oscillator response is obtained by inserting a microwave diode between the sample and the oscilloscope and is processed as the output signal. The oscillation amplitude is robust to noise thanks to the confinement provided by the counteracting torques exerted by the injected current and the magnetic damping. In addition, the voltage amplitude is highly nonlinear as a function of the injected current, with the same square-root dependence as our simulated oscillator (Eq. (6)). Furthermore, the amplitude of the oscillator voltage intrinsically depends on past inputs when the time step θ is shorter than the relaxation time of the magnetic nano-oscillator. Therefore, this single nanodevice has the two most crucial properties of neurons: nonlinearity and memory.
Tables 1 and 2 show the word success rates and the mean squared errors obtained in simulations and experimentally. The gain in spoken digit recognition induced by the experimental magnetic nano-oscillator is shown for the different acoustic filters in Fig. 4(b). There is very good agreement between the experimental results and the simulations. When 9 data sets are used during training, the gain is 3.8% for the cochleagram, 10.8% for the Spectro HP filter, 22% for the MFCC filter, and 70.4% for the linear spectrogram. We see in Tables 1 and 2, as well as in Fig. 4, that in some cases the magnetic nano-oscillator exhibits a slightly higher recognition gain than the simulations, even though the latter neglect the intrinsic noise. We believe this better performance is mainly due to the higher complexity in the dynamics of the magnetic nano-oscillators, including a relaxation time that varies with current. Finally, we observe that overfitting effects are quite minimal. Some overfitting with respect to the mean squared error (MSE) can be seen when using cochleagram filters. This is because, in this situation, the role of the reservoir is quite minimal, whereas it uses the same number of parameters as in the other situations. However, the overfitting that occurs in the MSE does not significantly degrade the word success rate (WSR) on the test set.
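The two metrics reported in Tables 1 and 2 can be sketched as follows. This is a generic illustration of how WSR and MSE are typically computed in reservoir-computing spoken digit benchmarks, not the paper's own code: the frame-averaging convention and function names are our assumptions.

```python
import numpy as np

def word_success_rate(outputs, labels):
    """Word success rate (%) from per-frame readout outputs.

    `outputs`: one array per utterance, shape (frames, n_classes), from
    the trained linear readout. A common convention (assumed here) is to
    average the readout over all frames of an utterance and pick the
    class with the highest mean response.
    """
    correct = sum(int(np.argmax(o.mean(axis=0)) == y)
                  for o, y in zip(outputs, labels))
    return 100.0 * correct / len(labels)

def mean_squared_error(outputs, targets):
    """MSE between readout outputs and one-hot per-frame targets."""
    num = sum(((o - t) ** 2).sum() for o, t in zip(outputs, targets))
    den = sum(o.size for o in outputs)
    return num / den
```

The distinction matters for the overfitting discussion above: the regression objective is the MSE, but the benchmark score is the WSR obtained after frame averaging and argmax, so a degraded MSE does not necessarily degrade the WSR.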
The different contributions to the spoken digit recognition task are summarised in Fig. 5 for the case in which nine utterances are used during the learning step and one during the recognition. The random choice level is 10% and is shown in grey. The contribution of the filtering methods is shown in blue (not visible in the case of the linear spectrogram, \(\alpha =1\)). Figure 5 also shows the net contribution of a neural network, in our case under the reservoir computing approach, to the spoken digit task. The simulated version of our neural network, i.e. using the simulated dynamics of the spin-torque vortex oscillators, is shown in purple, while the results for the experimental magnetic nano-oscillators are shown in green. The main contribution of the neural network to the spoken digit recognition task occurs when there is a lot of work to perform, i.e. when starting from the random choice level (linear spectrogram). Nevertheless, when our neural network is coupled with well-performing standalone feature extraction techniques like the cochleagram or the MFCC filter, it is capable of bringing the recognition rate to state-of-the-art values (overall WSR of 99.8% for the MFCC and the Spectro HP filters + experimental spin-torque vortex oscillator).
More challenging spoken digit database
The TI46 database is based on clean audio waveforms from a limited number of speakers. We describe above how this database is rather limited for testing a neural network when combined with substantial preprocessing. We are able to test our nano-oscillator-based reservoir computing approach, and demonstrate its effectiveness, by limiting the preprocessing to a basic linear filter (linear spectrogram). However, to test the combination of effective preprocessing with reservoir computing, we need a larger data set with a broader set of speakers and a variety of background noise types and levels. In this section, we simulate the performance of our theoretical reservoir computing implementation (simulated STNO neural network) on the AURORA2 database^{32,33}.
The AURORA2 database provides data for the task of recognising digits taken from the TIDigits database^{34} in noisy and channel-distorted environments (artificially corrupted). To simulate noisy telephony environments, the clean utterances are first downsampled to 8 kHz, and then additive and convolutional noise is added. The AURORA2 database has both clean and multicondition training and test sets. Each type of noise is added to a subset of clean speech utterances at seven different levels of signal-to-noise ratio (SNR). This process generates seven subgroups of test sets for a given noise type: clean (infinite SNR) and SNRs of 20, 15, 10, 5, 0, and −5 dB.
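Mixing noise into a clean waveform at a prescribed SNR, as done to build the AURORA2 corruption levels, can be sketched as below. This is a simplified illustration with a hypothetical function name: the actual AURORA2 tooling also applies channel (convolutional) filtering, which this sketch omits.

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Mix a noise segment into a clean waveform at a target SNR (dB)."""
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(p_clean / p_scaled_noise) == snr_db.
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

# Example: a 1 s, 8 kHz tone corrupted at 10 dB SNR with white noise.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000.0)
noisy = add_noise_at_snr(clean, rng.standard_normal(8000), snr_db=10)
```

Lower `snr_db` values (5, 0, −5 dB) scale the noise up, reproducing the progressively harder test subgroups described above.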
We simulate the preprocessing and reservoir computing on all available isolated spoken digits from 0 to 9 in the training and testA datasets, from 214 female and male speakers. The training dataset contains 2196 clean digits and several subsets of noisy (corrupted) digits with 4 different noise types (subway, babble, car, and exhibition hall noise) at different signal-to-noise ratios. During the training of our model, we select the 2196 clean digits and the corrupted digits with SNR = 20 dB (451 digits), SNR = 15 dB (444 digits), and SNR = 10 dB (430 digits). With this dataset, we train the model in mixed conditions.
The recognition (testing) is performed on the testA dataset, which contains 4 types of added noise at different signal-to-noise ratios. The noise types are the same as in the training set (unlike for the testB and testC subsets of AURORA2, which we do not use). The testA dataset contains a subset of 1040 clean digits and several subsets, each containing 1040 digits corrupted with subway, babble, car, and exhibition hall noise. We choose the following 4 subsets: clean, and corrupted with SNR = 20 dB, 15 dB, and 10 dB. To summarise, training is performed on 3521 digits and testing on 1040 unseen digits from the same categories as the training set. The test set contains 22.8% of the total number of digits (1040/4561).
Tables 3 and 4 give the simulated results for spoken digit recognition using the nano-oscillator-based reservoir computing approach combined with the two filtering methods, MFCC and cochlear, respectively. The results are given as word success rates (%). In parentheses, we give the gain compared to the baseline (control test without the nano-oscillator-based reservoir). Not surprisingly, the results are not as good as those presented above for the TI46 database. While the preprocessing filters do much worse without the reservoir than they do on the TI46 database, they still do much better than linear preprocessing. In all cases, inclusion of the reservoir substantially improves the success rate compared to the baseline.
Comparison of Tables 3 and 4 shows that the MFCC filter is more robust to noise than the cochlear filter and gives better results in most cases. Interestingly, it does worse in almost all cases without the reservoir, but appears to allow the reservoir to make much larger improvements in the success rate. The average word success rate on the testing set containing only clean digits is 92.96% for the MFCC filter. The training was performed on clean and corrupted digits (mixed conditions). The improvement over the baseline is 50.70% (given in parentheses), implying that the baseline value is 42.26% (= 92.96% − 50.70%). When the cochlear filter is used, the average gain brought by the neural network when testing clean digits is +25.90%, about two times smaller than for the MFCC filter. The same holds when noisy conditions are tested. The overall gain is +48.79% (+23.02%) for an overall average recognition rate of 81.20% (68.82%) when the input is preprocessed with the MFCC filter (cochlear filter). The baseline is lower for the MFCC filter than for the cochlear filter, but as the gain is much larger (about 2 times), the absolute performance of our neural network is higher in noisy conditions for the MFCC filter. We suspect that similar results would hold for the class of MFCC-like filters, some of which are even more robust against the inclusion of noise.
There are many differences between the simulations performed on the TI46 and AURORA2 databases. For the TI46 database, there are only 5 female speakers uttering each digit 10 times. Training is performed on some utterances and recognition is performed on the others and the success rate is the average success rate over all combinations. For the AURORA2 database, there are 214 speakers, half of them are female and half are male, and they typically utter each digit twice. In contrast to TI46, the speakers in the training set are different from the speakers in the testing set. Even without added noise, the test is much more difficult and involves almost 9 times more digits than for TI46 (4561 digits in AURORA2 vs 500 digits in TI46).
Our results without the reservoir are consistent with previous results. Cochlear filtering by itself achieves approximately a 60% word success rate (around 40% for MFCC) on clean isolated digits of the AURORA2 database^{25}. We obtain 63.24% (= 89.14% − 25.90%, see the last column of the first line of Table 4) for the cochlear filter and 42.26% (= 92.96% − 50.70%, see the last column of the first line of Table 3) for the MFCC filter.
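The baselines quoted here are recovered from the reported overall success rates and gains by simple subtraction. As a sanity check on the arithmetic (the helper name is ours, purely for illustration):

```python
def baseline_from_gain(wsr_percent, gain_percent):
    """No-reservoir baseline implied by a reported WSR and its gain."""
    return round(wsr_percent - gain_percent, 2)

# The two baselines quoted in the text:
mfcc_baseline = baseline_from_gain(92.96, 50.70)      # from Table 3
cochlear_baseline = baseline_from_gain(89.14, 25.90)  # from Table 4
```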
Conclusion
We test different frequency filtering methods as standalone feature extractors. Training a linear classifier on the \({{\bf{R}}}_{\sigma }\) vectors for the classic TI46 spoken digit database, both the cochleagram and the MFCC filter give high identification rates without further processing. On the other hand, the real part of a linear spectrogram does not separate the inputs of different digit classes. Nonlinearly transforming the spectrogram gives similar results to the cochleagram and MFCC filters, stressing that the separation found for the MFCC and cochlear classifiers is due to the presence of nonlinearity, with only a minor effect due to the particular type of nonlinear transformation.
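The conclusion above, that a pointwise nonlinearity in the feature extraction can turn a linearly inseparable problem into a separable one, can be demonstrated on a toy problem. The data and the choice of |x| as the nonlinearity are purely illustrative (standing in for the nonlinearities inside the MFCC, cochlear, or Spectro HP transforms), not taken from the paper.

```python
import numpy as np

# Class 1: features with large magnitude of either sign; class 0: small
# magnitude. A linear readout on the raw feature fails, because the two
# class-1 clusters sit symmetrically around the class-0 cluster.
x = np.array([-1.0, -0.5, 0.5, 1.0, -0.1, 0.0, 0.1])
y = np.array([1, 1, 1, 1, 0, 0, 0])

def linear_fit_accuracy(features, labels):
    """Least-squares linear classifier with a 0.5 decision threshold."""
    A = np.column_stack([features, np.ones_like(features)])
    w, *_ = np.linalg.lstsq(A, labels, rcond=None)
    pred = (A @ w > 0.5).astype(int)
    return float(np.mean(pred == labels))

acc_linear = linear_fit_accuracy(x, y)             # classes overlap
acc_nonlinear = linear_fit_accuracy(np.abs(x), y)  # |x| separates them
```

After the pointwise |x| transform, the linear classifier reaches perfect accuracy on this toy set, whereas on the raw feature it cannot do better than predicting the majority class.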
In a second part, a nonlinear oscillator is added to process the filtered input. The gain in word recognition due to the nonlinear oscillator is computed for each filtering method. The nonlinear oscillator is simulated and found to be in excellent agreement with experimental results on magnetic nano-oscillators. For the nonlinear methods MFCC, Spectro HP, and cochleagram, the gain in word recognition is small, despite a nearly perfect final word recognition rate. On the other hand, for the linear spectrogram, the gain in word recognition is much higher, even though the final word recognition rate is at most 80%.
An important lesson is that, when evaluating hardware systems with speech recognition tasks, the final word recognition rate should be interpreted with caution. If a very efficient filtering method is used to preprocess the input, the hardware system may not be adding much performance. A hardware system only adds something if it improves the word recognition. It should be noted that the use of more complicated datasets, such as the proprietary spoken digits dataset used in refs ^{21,25}, or the inclusion of babble noise in the dataset, would lead to significantly different results. The takeaway of our work is that, in order to test and compare hardware systems, using a linear spectrogram eases the interpretation of the results, because it does not introduce any separation of the input prior to the hardware system. Furthermore, we show that a simple but powerful transformation like our Spectro HP filter, starting from a simple spectrogram, achieves state-of-the-art results (in simulations and experimentally) without applying any specific acoustic filter that mimics the human auditory system (like the cochleagram or the MFCC filter).
Testing on the AURORA2 dataset reveals that, under noisy conditions, the cochlear filter performs better by itself than the MFCC filter, but the gain brought by the neural network is on average two times larger for the MFCC filter, and the overall word success rate is higher for the MFCC filter than for the cochlear filter.
Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
References
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436 (2015).
Mead, C. Neuromorphic electronic systems. Proc. IEEE 78, 1629–1636, https://doi.org/10.1109/5.58356 (1990).
Mead, C. & Ismail, M. (eds) Analog VLSI implementation of neural systems, vol. 80 (Springer Science & Business Media, 2012).
Maass, W., Natschläger, T. & Markram, H. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Comput. 14, 2531–2560, https://doi.org/10.1162/089976602760407955 (2002).
Jaeger, H. & Haas, H. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science 304, 78–80, https://doi.org/10.1126/science.1091277 (2004).
Appeltant, L. et al. Information processing using a single dynamical node as complex system. Nat. Commun. 2, 468 (2011).
Paquot, Y. et al. Optoelectronic reservoir computing. Sci. Reports 2, 287 (2012).
Torrejon, J. et al. Neuromorphic computing with nanoscale spintronic oscillators. Nature 547, 428 (2017).
Davis, S. & Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoust. Speech, Signal Process. 28, 357–366, https://doi.org/10.1109/TASSP.1980.1163420 (1980).
Verstraeten, D., Schrauwen, B., Stroobandt, D. & Van Campenhout, J. Isolated word recognition with the liquid state machine: a case study. Inf. Process. Lett. 95, 521–528, https://doi.org/10.1016/j.ipl.2005.05.019, Applications of Spiking Neural Networks (2005).
Hinton, G. et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Process. Mag. 29, 82–97 (2012).
Lyon, R. A computational model of filtering, detection, and compression in the cochlea. In ICASSP ’82. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 7, 1282–1285, https://doi.org/10.1109/ICASSP.1982.1171644 (1982).
Slaney, M. Lyon’s Cochlear Model. Apple Computer Technical Report #13 (Apple Computer, Inc., Cupertino, CA, 1988).
Brunner, D., Soriano, M. C., Mirasso, C. R. & Fischer, I. Parallel photonic information processing at gigabyte per second data rates using transient states. Nat. Commun. 4, 1364 (2013).
Vandoorne, K. et al. Experimental demonstration of reservoir computing on a silicon photonics chip. Nat. Commun. 5, 3541 (2014).
Riou, M. et al. Neuromorphic computing through time-multiplexing with a spin-torque nano-oscillator. Electron Devices Meet. (IEDM), 2017 IEEE Int. 36.3.1–36.3.4, https://doi.org/10.1109/IEDM.2017.8268505 (2017).
Larger, L. et al. Photonic information processing beyond Turing: an optoelectronic implementation of reservoir computing. Opt. Express 20, 3241–3249, https://doi.org/10.1364/OE.20.003241 (2012).
Dejonckheere, A. et al. All-optical reservoir computer based on saturation of absorption. Opt. Express 22, 10868–10881, https://doi.org/10.1364/OE.22.010868 (2014).
Vinckier, Q. et al. Highperformance photonic reservoir computer based on a coherently driven passive cavity. Optica 2, 438–446, https://doi.org/10.1364/OPTICA.2.000438 (2015).
Brunner, D. et al. Tutorial: Photonic neural networks in delay systems. J. Appl. Phys. 124, 152004, https://doi.org/10.1063/1.5042342 (2018).
Penkovsky, B., Larger, L. & Brunner, D. Efficient design of hardware-enabled reservoir computing in FPGAs. J. Appl. Phys. 124, 162101, https://doi.org/10.1063/1.5039826 (2018).
Duport, F., Schneider, B., Smerieri, A., Haelterman, M. & Massar, S. All-optical reservoir computing. Opt. Express 20, 22783–22795, https://doi.org/10.1364/OE.20.022783 (2012).
Martinenghi, R., Rybalko, S., Jacquot, M., Chembo, Y. K. & Larger, L. Photonic nonlinear transient computing with multiple-delay wavelength dynamics. Phys. Rev. Lett. 108, 244101, https://doi.org/10.1103/PhysRevLett.108.244101 (2012).
Martinenghi, R., Baylon Fuentes, A., Jacquot, M., Chembo, Y. K. & Larger, L. Towards optoelectronic architectures for integrated neuromorphic computers. Proc. SPIE OPTO 8989, https://doi.org/10.1117/12.2038347 (2014).
Larger, L. et al. Highspeed photonic reservoir computing using a timedelaybased architecture: Million words per second classification. Phys. Rev. X 7, 011015, https://doi.org/10.1103/PhysRevX.7.011015 (2017).
Lukoševičius, M. & Jaeger, H. Reservoir computing approaches to recurrent neural network training. Comput. Sci. Rev. 3, 127–149, https://doi.org/10.1016/j.cosrev.2009.03.005 (2009).
van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Slavin, A. N. & Tiberkevich, V. S. Nonlinear auto-oscillator theory of microwave generation by spin-polarized current. IEEE Transactions on Magn. 45, 1875–1918, https://doi.org/10.1109/TMAG.2008.2009935 (2009).
Grimaldi, E. et al. Response to noise of a vortex-based spin transfer nano-oscillator. Phys. Rev. B 89, 104404, https://doi.org/10.1103/PhysRevB.89.104404 (2014).
Romera, M. et al. Vowel recognition with four coupled spin-torque nano-oscillators. Nature 563, 230–234, https://doi.org/10.1038/s41586-018-0632-y (2018).
Tsunegi, S. et al. High emission power and Q factor in spin torque vortex oscillator consisting of FeB free layer. Appl. Phys. Express 7, 063009 (2014).
Hirsch, H. & Pearce, D. The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In Proceedings of ISCA ITRW ASR2000 on Automatic Speech Recognition: Challenges for the Next Millennium (Paris, France, 2000).
ELRA catalogue (http://catalog.elra.info), AURORA Project Database, v2.0, ISLRN: 977-457-139-304-2, ELRA ID: AURORA/CD0002.
Leonard, R. & Doddington, G. TIDigits. (Linguistic Data Consortium, Philadelphia, 1993).
Acknowledgements
F.A.A. is a Research Fellow of the F.R.S.-FNRS. This work was supported by the European Research Council ERC under Grant bioSPINspired 682955.
Author information
Contributions
The study was designed by F.A.A., J.G. and M.D.S., samples were optimised and fabricated by S.T. and K.Y., experiments were performed by M.R. and J.T., and numerical studies were realised by F.A.A. F.A.A. wrote the core of the manuscript and all authors contributed to the text as well as to the analysis of the results.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Abreu Araujo, F., Riou, M., Torrejon, J. et al. Role of non-linear data processing on speech recognition task in the framework of reservoir computing. Sci Rep 10, 328 (2020). https://doi.org/10.1038/s41598-019-56991-x
DOI: https://doi.org/10.1038/s41598-019-56991-x