Introduction

Artificial neural network algorithms outperform humans on recognition tasks like image or speech recognition by leveraging deep networks of interconnected non-linear units called formal neurons1. The goal of neural networks is to extract features and classify input data through learned non-linear transformations. Running such algorithms on a classical computer is energetically costly; to overcome this issue, neuromorphic approaches2,3 propose to implement them physically. In particular, reservoir computing4,5 is a kind of recurrent neural network that has been widely used to test the efficiency of hardware for neuromorphic computing6,7,8 because it has a simplified architecture and learning procedure. The input is sent to a neural network with fixed recurrent connections called a reservoir. The goal of the reservoir is to separate the different kinds of inputs such that, after this transformation, the classification can be done by a linear transformation. The responses of the neurons of the reservoir are combined linearly through trained connections to construct the output. Since the connections in the reservoir are random and fixed, the reservoir is easier to fabricate in hardware; the output connections, often emulated in software, are then trained with linear regression.

Speech recognition is a widely used class of benchmark tasks for testing the efficiency of a neural network. It is especially employed in the case of reservoir computing because the recurrent connections of the reservoir create an intrinsic memory that is useful for classifying time-varying inputs. Generally, this task requires a frequency decomposition9,10,11 prior to the neural network because the acoustic features are contained in the frequency rather than in the amplitude of the time-varying signal. These decompositions return the amplitude of the signal in different frequency channels as a function of time. The neural network then extracts the acoustic features contained in the frequency information. Several frequency decomposition methods have been reported in the literature: Mel-frequency cepstral coefficients (MFCC) and Lyon’s cochlear model (cochleagram) are the most common methods since they mimic the filtering that occurs biologically9,12,13. However, the actual contribution of the acoustic filter to the total speech recognition rate is generally not investigated when performing speech recognition benchmarks with reservoir computing hardware, even though its influence on the final recognition rate may not be negligible8. Furthermore, both of these methods were developed before reservoir computing became popular, and thus they were designed to extract the useful features of an audio signal independently of modern machine learning.

Here, we first show how the choice of filtering method drastically affects the final speech recognition rate. We quantify the respective contributions of the acoustic filtering and the neural network for a spoken digit recognition task using four frequency decomposition methods with different non-linear characters: Lyon’s ear cochleagram, the MFCC filter, the linear spectrogram \((\Re ({\rm{Spectro}}))\), and Spectro HP \(({\rm{Spectro}}\,{\rm{HP}}=|\,\sin \,\sqrt{|\Re ({\rm{Spectro}})|}|-\) \(|\,\cos \,\sqrt{|\Im ({\rm{Spectro}})|}|)\). In a first step, we show that the cochleagram, the Spectro HP and the MFCC filter are powerful stand-alone feature extractors that can by themselves (without additional processing by a neural network) achieve very high recognition levels: up to 95.8%, 89.0%, and 77.2% for the cochleagram, Spectro HP, and MFCC, respectively. In contrast, the linear spectrogram never achieves recognition levels statistically better than random choice (10%). However, by adding various levels of non-linearity to the real part of the spectrogram, we show a large increase of the recognition rate from about 10% (linear) to 88% (strong non-linearity). These results indicate that the high recognition level of the cochleagram and MFCC approaches is mainly due to the non-linear character of these frequency decomposition methods and not to the reservoir itself.

In a second step, we evaluate the gain in recognition rate provided by a particular hardware approach to reservoir computing, based on magnetic nano-oscillators. In order to compare with other hardware implementations in the literature, we model a neural network based on a single dynamical non-linear magnetic node in the framework of the reservoir computing approach6,7,8,14,15,16. We find that the contribution of the neural network is dominant for the linear spectrogram filter and only plays a small role for the non-linear cochleagram and MFCC filters. Finally, we present experimental results using a non-linear and tunable magnetic nano-oscillator exhibiting excellent agreement with our simulations.

Methods

We perform a benchmark task called spoken digit recognition that is common in the reservoir computing community for software10 and hardware7,8,14,17,18,19,20,21 implementations. The input data, taken from the TI-46 database, are audio waveforms of clean isolated spoken digits (0 to 9) pronounced by five different female speakers (see example in Fig. 1a), as is usual in the hardware reservoir computing community6,7,8,10,14,17,20,22,23,24,25.

Figure 1

Principle of spoken digit recognition. (a) Audio waveform corresponding to the digit 1 pronounced by speaker 1. (b) Filtering to frequency channels for acoustic feature extraction. The signal during each time interval \(\tau \) is decomposed into \({N}_{f}\) frequency channels. The cochlear model filters each point of the audio waveform into 78 frequency channels (13 in the case of the MFCC model and 65 for the spectrogram model). The frequency channels are concatenated in intervals of duration \(\tau \) to form the filtered input. (c) The filtered input is injected into the neural network or directly used to construct the output (no neural network). The neural network is composed of \(N\) interconnected neurons driven by the filtered input. (d) For each digit, the output is constructed from a linear combination of the neuron states \({V}_{\theta \tau ,\sigma }\) (or directly of the filtered input); there are 10 classifiers in total.

The chosen part of the TI-46 spoken digit database contains 500 (5 speakers × 10 digits × 10 utterances) audio files, which we index using the Greek letter \(\sigma \). To perform speech recognition on these spoken digits, each audio temporal trace in the database is transformed from the time domain to a mixed time/frequency domain with different acoustic filters, two of which are known to create a better representation of human voice characteristics. These acoustic filters give rise to different instances of our speech database containing the following elements: \({X}_{f\tau ,\sigma }^{{\rm{filter}}}\), where \({\rm{filter}}\in \{{\rm{Cochlear}},{\rm{MFCC}},{\rm{Spectro}},{\rm{Spectro}}\,{\rm{HP}}\}\), \(f\) is the index for the different frequency channels, and \(\tau \) is the index of a new time representation that depends on the time frame window used while performing the time- to frequency-domain transformation. The number of time steps \({N}_{\tau }\) naturally depends on the digit length, while the number of frequency channels \({N}_{f}\) only depends on the type of acoustic filter. For instance, \({N}_{f}^{{\rm{Cochlear}}}=78\), \({N}_{f}^{{\rm{MFCC}}}=13\), and \({N}_{f}^{{\rm{Spectro}}}=65\), while \({N}_{\tau }^{{\rm{Cochlear}}}\) ranges from 16 to 41, \({N}_{\tau }^{{\rm{MFCC}}}\) ranges from 31 to 83, and \({N}_{\tau }^{{\rm{Spectro}}}\) ranges from 24 to 67. Digits with \({N}_{\tau }\) less than the maximum value are padded with zeros.
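As an illustration, the linear spectrogram branch of this preprocessing can be sketched as follows (Python/NumPy; the window and hop sizes are our assumptions for illustration, chosen so that \({N}_{f}=65\), not values quoted from this work):

```python
import numpy as np
from scipy.signal import stft

def linear_spectrogram(waveform, fs, nperseg=128, hop=64):
    """Real part of the short-time Fourier transform, shape (N_f, N_tau).

    nperseg = 128 gives N_f = 128 // 2 + 1 = 65 frequency channels;
    N_tau depends on the length of the spoken digit.
    """
    _, _, Z = stft(waveform, fs=fs, nperseg=nperseg, noverlap=nperseg - hop)
    return np.real(Z)

def pad_time_axis(X, n_tau_max):
    """Zero-pad digits whose N_tau is less than the maximum value."""
    return np.pad(X, ((0, 0), (0, n_tau_max - X.shape[1])))
```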

The construction of the supervised learning task in the reservoir computing framework starts with associating each digit \({{\bf{X}}}_{\sigma }\) with its corresponding target \({{\bf{T}}}_{\sigma }\in {{\mathbb{R}}}^{{N}_{d}\times {N}_{\tau }}\), where \({N}_{d}\) is the number of categories to classify (here \({N}_{d}=10\) as the goal is to recognise the 10 different digits). Each target matrix \({{\bf{T}}}_{\sigma }\) is constructed column-wise, its \({N}_{\tau }\) columns all corresponding to the same target vector \({{\bf{t}}}_{\sigma }\) (the target vector with elements \({t}_{d,\sigma }\), \(d\in [\mathrm{0..9}]\), is zero everywhere except at the index \(d\) equal to the corresponding digit number, where it is one). The \({{\bf{T}}}_{\sigma }\) matrices would allow us to perform \(\tau \)-wise recognition (on partial digits for instance), but in this study we choose to perform entire-digit recognition by averaging the estimated target matrices \({\hat{{\bf{T}}}}_{\sigma }\) over their columns (the \(\tau \) direction) to end up with estimators \({\hat{{\bf{t}}}}_{\sigma }\) of the target vectors \({{\bf{t}}}_{\sigma }\), as shown below (see Eq. (3)).
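A minimal sketch of this target construction (NumPy; variable names are ours):

```python
import numpy as np

def make_target(digit, n_d=10, n_tau=41):
    """Target matrix T_sigma: all N_tau columns equal the same one-hot
    target vector t_sigma, which is 1 at the digit's index and 0 elsewhere."""
    t = np.zeros(n_d)
    t[digit] = 1.0
    return np.tile(t[:, None], (1, n_tau))  # shape (N_d, N_tau)
```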

Our reservoir is a time-multiplexed single device, as described in ref. 8. Rather than a set of \({N}_{\theta }\) physical neurons, our reservoir consists of a single physical neuron evaluated at \({N}_{\theta }\) periodic times. To input the data to these virtual neurons, we multiply each value by a time series of length \({N}_{\theta }\) consisting of ones and minus ones and send the resulting time series to the single device. The output of the reservoir is determined by the resulting state of the device at each of the \({N}_{\theta }\) times for each element of the input data string. This output is multiplied by the output weight matrix to give the results. Training consists of determining the optimum set of output weights, which can be found through straightforward linear algebra.

The key computational concept supporting the reservoir computing approach is a nonlinear dynamical transformation of the processed information, i.e. sending the input data to a new space in which simple linear algebra gives the read-out of the results26. In this work, the non-linear transformation is performed by our spin-torque nano-oscillator, represented by the function \({\rm{STNO}}(\cdot )\) in Eq. (1). The information is encoded and injected into this nonlinear dynamical system after flattening the data and multiplying each element of the flattened \({{\bf{X}}}_{\sigma }\) by a random binary mask \({\bf{M}}\in {{\mathbb{R}}}^{{N}_{\theta }\times {N}_{f}}\) of 1’s and −1’s. This binary mask implements the time-multiplexing technique, as the value times the mask gives the input to each virtual neuron. As a result, the mask distributes the frequency content of each time step \(\tau \) of the input data into a fixed neural network layer (the reservoir) of \({N}_{\theta }\) nodes. To summarise, Eq. (1) shows the details of our reservoir computing implementation:

$${{\bf{X}}}_{\sigma }\to {{\bf{V}}}_{\sigma }:\{\begin{array}{ll}{{\bf{x}}}_{\sigma }={\rm{flatten}}({\bf{M}}{{\bf{X}}}_{\sigma }) & {x}_{i,\sigma }={\rm{flatten}}(\sum _{f}\,{M}_{\theta f}{X}_{f\tau ,\sigma })\,{\rm{with}}\,i\in [1..{N}_{\theta }{N}_{\tau }],\\ {{\bf{v}}}_{\sigma }={\rm{STNO}}({{\bf{x}}}_{\sigma }) & {v}_{i,\sigma }={\rm{STNO}}({x}_{i,\sigma }),\\ {{\bf{V}}}_{\sigma }={\rm{reshape}}({{\bf{v}}}_{\sigma }) & {V}_{\theta \tau ,\sigma }={\rm{reshape}}({v}_{i,\sigma }),\end{array}$$
(1)

where flatten(·) takes an \(m\) by \(n\) matrix as input and outputs a vector of length \(mn\), and reshape(·) does the reverse operation, i.e. takes a vector of length \(mn\) as input and outputs an \(m\) by \(n\) matrix.
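The whole of Eq. (1) can be sketched in a few lines (NumPy). Here `stno` is a placeholder for the node non-linearity \({\rm{STNO}}(\cdot )\); the actual oscillator model used in this work is given by Eqs. (5) and (6) below:

```python
import numpy as np

def reservoir_states(X, M, stno):
    """Eq. (1): mask, flatten, apply the node non-linearity, reshape.

    X    : (N_f, N_tau) filtered digit X_sigma
    M    : (N_theta, N_f) random binary mask of +1/-1
    stno : vectorised non-linearity of the single physical node
    """
    n_theta, n_tau = M.shape[0], X.shape[1]
    # Flatten column-wise so the N_theta virtual neurons of each time
    # interval tau are fed to the device sequentially (time multiplexing).
    x = (M @ X).flatten(order="F")          # x_i, i in [1 .. N_theta*N_tau]
    v = stno(x)                             # v_i = STNO(x_i)
    return v.reshape(n_theta, n_tau, order="F")  # V_{theta tau, sigma}

# Example with a memoryless placeholder non-linearity (tanh):
rng = np.random.default_rng(0)
M = rng.choice([-1.0, 1.0], size=(400, 65))
V = reservoir_states(rng.standard_normal((65, 30)), M, np.tanh)
```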

Training (learning) is performed using a simple linear classifier. In this work, good performance is achieved using the Moore-Penrose pseudo-inverse after setting up the optimisation problem for the weight matrix \({\bf{W}}\) on a subset of \({N}_{{\rm{train}}}\) digits, \({\bf{W}}[{{\bf{V}}}_{1},{{\bf{V}}}_{2},\ldots ,{{\bf{V}}}_{{N}_{{\rm{train}}}}]=[{{\bf{T}}}_{1},{{\bf{T}}}_{2},\ldots ,{{\bf{T}}}_{{N}_{{\rm{train}}}}]\):

$${\bf{W}}=[{{\bf{T}}}_{1},{{\bf{T}}}_{2},\ldots ,{{\bf{T}}}_{{N}_{{\rm{train}}}}]{[{{\bf{V}}}_{1},{{\bf{V}}}_{2},\ldots ,{{\bf{V}}}_{{N}_{{\rm{train}}}}]}^{+},$$
(2)

where the superscript \(+\) denotes the Moore-Penrose pseudo-inverse. No regularisation technique is used. The testing (recognition) step is then achieved by applying the computed weights \({\bf{W}}\) to the complementary (unseen) subset of \({N}_{{\rm{test}}}\) digits:

$$\begin{array}{ll}{\hat{{\bf{T}}}}_{\sigma }={\bf{W}}{{\bf{V}}}_{\sigma } & {\hat{T}}_{d\tau ,\sigma }=\sum _{\theta }\,{W}_{d\theta }\,{V}_{\theta \tau ,\sigma }\\ {\hat{{\bf{t}}}}_{\sigma }={{\rm{mean}}}_{\tau }({\hat{{\bf{T}}}}_{\sigma }) & {\hat{t}}_{d,\sigma }=\frac{1}{{N}_{\tau }}\,\mathop{\sum }\limits_{\tau =1}^{{N}_{\tau }}\,{\hat{T}}_{d\tau ,\sigma }\end{array}.$$
(3)

The estimator for a specific digit is given by \({\hat{d}}_{\sigma }={\rm{argmax}}({\hat{{\bf{t}}}}_{\sigma })\) (this corresponds to the winner-takes-all strategy, adequate for the present task). Digit \(\sigma \) is well recognised when \({\hat{d}}_{\sigma }={\rm{argmax}}({{\bf{t}}}_{\sigma })\). The main performance estimator used in this work is the Word Success Rate (WSR), which corresponds to the percentage of well recognised digits over the total number of digits to recognise (\({N}_{{\rm{test}}}\)). Another common performance estimator, useful for identifying overfitting issues, is the Mean Squared Error (MSE): MSE\(({\hat{{\bf{t}}}}_{\sigma })={\mathbb{E}}[{({\hat{{\bf{t}}}}_{\sigma }-{{\bf{t}}}_{\sigma })}^{2}]\).
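Putting Eqs. (2) and (3) together, training and evaluation can be sketched as follows (NumPy; a minimal sketch, not the exact code used in this work):

```python
import numpy as np

def train_readout(V_list, T_list):
    """Eq. (2): W = [T_1 .. T_Ntrain] [V_1 .. V_Ntrain]^+ via the
    Moore-Penrose pseudo-inverse, with no regularisation."""
    V = np.hstack(V_list)         # (N_theta, N_train * N_tau)
    T = np.hstack(T_list)         # (N_d,     N_train * N_tau)
    return T @ np.linalg.pinv(V)  # W: (N_d, N_theta)

def word_success_rate(W, V_list, digits):
    """Eq. (3) plus winner-takes-all: average T_hat over tau, then argmax."""
    hits = 0
    for V, d in zip(V_list, digits):
        t_hat = (W @ V).mean(axis=1)   # estimator of the target vector
        hits += int(np.argmax(t_hat) == d)
    return 100.0 * hits / len(digits)  # WSR in percent
```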

In all cases, the training and testing sets do not overlap. To avoid any learning bias while selecting samples randomly from the database when choosing the training and testing sets, we organise the 500 input files into 10 subsets of 50 files. Each subset contains one utterance of each digit pronounced by each of the five speakers. A random selection would produce an over-representation of some speakers and an under-representation of others. We take \(N\) utterance subsets (50 audio files, one for each digit and each speaker) for training (total training set size \(N\times 50\)) and \(10-N\) utterance subsets for testing (total testing set size \((10-N)\times 50\)). To minimise the fluctuations in the results due to the random split between the training and testing sets, we employ a cross-validation technique and average over all possible choices. That is, when \(N\) utterance subsets are used for training, we average over the \(10!/[N!(10-N)!]\) possible ways to choose the training and testing sets. This procedure also allows us to determine the width of the distribution of individual outcomes, indicated by the shaded regions in Fig. 2. All word success rate results reported in the paper are cross-validated test results.
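The cross-validation loop itself is straightforward (a sketch; `train_and_test` stands for one training/evaluation run as above):

```python
from itertools import combinations
import numpy as np

def cross_validated_wsr(n_subsets, n_train, train_and_test):
    """Average the WSR over all 10!/[N!(10-N)!] ways of choosing n_train
    utterance subsets for training; the remaining subsets form the test set."""
    scores = []
    for train_idx in combinations(range(n_subsets), n_train):
        test_idx = [i for i in range(n_subsets) if i not in train_idx]
        scores.append(train_and_test(train_idx, test_idx))
    return np.mean(scores), np.std(scores)  # mean and distribution width
```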

Figure 2

Spoken digit recognition for filtered inputs. (a) Spoken digit cross-validated test recognition rates as a function of the number of data subsets \(N\) used for training (total size of the training set \(5\times 10\times N\)) of the filtered input (without neural network) for four different methods: cochleagram, MFCC filter, Spectro HP, and linear spectrogram \((\alpha =1)\). (b) Spoken digit recognition as a function of the non-linear coefficient for spectrogram methods (inset: word success rate for large non-linear coefficient values from 1000 to 1004). Here, 9 data subsets (90% of the database) are used for training our reservoir computing model and the remaining subset (10% of the database) is used to perform the recognition task. The shaded region corresponds to the uncertainty of the recognition rate (here, the standard deviation).

In reservoir computing, training is fast and always converges because it reduces to basic linear algebra. This behaviour stands in contrast to standard recurrent neural-network approaches, for which learning can be time consuming and does not necessarily converge to the desired solution. In reservoir computing, the learning process only modifies the read-out weights, whereas in other types of recurrent neural networks it modifies the weights of all the other constituent layers through complex back-propagation algorithms.

The contributions of the frequency filtering and of the reservoir computing are then analysed separately. In order to evaluate the impact of the frequency filtering on the input separation capability, a linear classifier is trained directly on the different frequency channels. The classification results including the influence of both the frequency filtering and the reservoir are computed by injecting the filtered input into a neural network composed of \({N}_{\theta }\) interconnected neurons. Here, we use \({N}_{\theta }=400\) input neurons that are connected to all of the frequency channels for each time interval \(\tau \) (Fig. 1c), as this number allowed reaching maximum test accuracy. In the framework of reservoir computing, these fixed connections have random weights. To reach a high classification rate, 400 neurons are sufficient6,8. The features of the magnetic neurons that we consider are specified in section IV. A linear classifier is trained to map the neuron outputs to the desired results. The contribution of the reservoir to the final success rate is extracted from the results by subtracting the success rate found using only the frequency filtering methods.

Acoustic filter: role of non-linearity

First, we compute the digit recognition rate as a function of the number of utterance subsets used in training for the cochleagram and the MFCC methods, as shown in Fig. 2(a). The recognition rate increases with the number of training utterances and then saturates in the case of the cochlear model. It remains almost constant for the MFCC model. Both filters achieve a high recognition rate. In particular, the cochlear model is an excellent acoustic feature extractor with recognition rates up to 95.8% (for 9 training subsets), whereas the MFCC filter is less powerful, reaching recognition rates up to 77.2%.

These filters are commonly used for speech recognition tasks because of their similarity to audio signal processing in biological ears, which perform complex frequency decompositions with strong non-linearities. Both MFCC and cochlear methods use non-linearities to transform the audio data. For MFCC, the transformed representation corresponds to the log-energy of the Mel frequency filter output9. In the cochleagram approach, the main non-linear ingredient corresponds to a set of interconnected automatic gain controls12,13. The successful separation of the data achieved by these filtering methods appears to be mainly due to the non-linear character of the transformation, with a moderate influence of the kind of non-linearity (similar to reservoirs, where several different kinds of non-linearity can work).
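For concreteness, the dominant non-linearity of the MFCC chain is the logarithm applied to the Mel filter-bank energies; the subsequent cosine transform yielding the cepstral coefficients is linear. A minimal sketch (NumPy; the Mel filter bank itself is assumed to be given):

```python
import numpy as np

def log_mel_energies(power_spectrum, mel_filterbank, eps=1e-10):
    """Non-linear step of the MFCC chain: log of the Mel-filtered energies.

    power_spectrum : (N_bins, N_tau) short-time power spectrum
    mel_filterbank : (N_mel, N_bins) triangular Mel filters
    """
    return np.log(mel_filterbank @ power_spectrum + eps)
```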

To establish the critical role played by the non-linearity contained in the filtering methods in the recognition performance, we start by investigating the separation achieved by a very simple linear spectrogram filter. This filter is based on standard Fourier transforms of the audio input over finite time windows. The Fourier transform is a linear operation that outputs a real and an imaginary part. We consider only the real part in the following in order to avoid introducing non-linearities by computing the norm. After the Fourier transforms, \({{\bf{Z}}}_{\sigma }\) is the matrix of the real parts of the spectrogram with dimension \({N}_{f}\times {N}_{\tau }\), where \({N}_{f}\) is the number of frequency channels and \({N}_{\tau }\) is the number of time steps, which depends on the particular digit. We normalise the data, \({{\bf{X}}}_{\sigma }={{\bf{Z}}}_{\sigma }/\,{\rm{\max }}(|{{\bf{Z}}}_{\sigma }|)\), so that \({X}_{f\tau ,\sigma }\in [\,-\,1,1]\) for \(f\in \{1..{N}_{f}\}\) and \(\tau \in \{1..{N}_{\tau }\}\). The normalisation is crucial to ensure that there is at least one \({X}_{f\tau ,\sigma }\) equal either to 1 or to −1 for each \({{\bf{X}}}_{\sigma }\) when non-linearities are introduced into the transform.

To study the influence of a non-linear transformation on the normalised input data \({{\bf{X}}}_{\sigma }\), we choose to apply a point-wise operation, namely the exponent \(\alpha \in {\mathbb{R}}\), giving rise to the elementwise transformation \({X}_{f\tau ,\sigma }^{{\rm{filter}}}\to {({X}_{f\tau ,\sigma }^{{\rm{filter}}})}^{\alpha }\). The impact of the non-linear exponent \(\alpha \) on the recognition rate is shown in Fig. 2(b). The recognition rate oscillates strongly as a function of the non-linear exponent and decreases for large \(\alpha \). Some particular values of the recognition rate can be easily understood. For \(\alpha =0\): \(\forall f\) and \(\tau \), \({X}_{f\tau ,\sigma }=1\), so it becomes impossible to discriminate between different digits \({{\bf{X}}}_{\sigma }\) and the success rate is equal to 10% (random choice). As \(\alpha \) approaches zero, the success rate decreases drastically and drops to 10%: for such exponents, all inputs get mapped to nearly the same output, making data separation impossible. For \(\alpha =1\), the real part of the spectrogram corresponds to a linear transformation of the input data, thus there is no non-linear data separation and the word recognition rate is \(\simeq \) 10% (random choice).

The evolution shown in Fig. 2(b) can be understood by decomposing the exponent \(\alpha \) into an integer part \(n\in {\mathbb{N}}\) and a real part \(\varepsilon \in (\,-\,0.5,0.5]\) around \(n\): \(\alpha =n+\varepsilon \). For \({X}_{f\tau ,\sigma } < 0\), \({X}_{f\tau ,\sigma }^{-}\to {({X}_{f\tau ,\sigma }^{-})}^{n+\varepsilon }=|{X}_{f\tau ,\sigma }^{-}{|}^{n+\varepsilon }{(-1)}^{n}(\cos (\pi \varepsilon )+i\,\sin (\pi \varepsilon ))\), and for \({X}_{f\tau ,\sigma }\geqslant 0\), \({X}_{f\tau ,\sigma }^{+}\to {({X}_{f\tau ,\sigma }^{+})}^{n+\varepsilon }=|{X}_{f\tau ,\sigma }^{+}{|}^{n+\varepsilon }\). For simplicity, we choose to consider only the real part of the data obtained after applying the non-linearity, so the elements of \({{\bf{R}}}_{\sigma }\) are given by \({R}_{f\tau ,\sigma }=\Re ({X}_{f\tau ,\sigma }^{n+\varepsilon })\):

$${{\bf{R}}}_{\sigma }=\{\begin{array}{rcl}\Re ({({X}_{f\tau ,\sigma }^{-})}^{n+\varepsilon }) & = & |{X}_{f\tau ,\sigma }^{-}{|}^{n+\varepsilon }{(-1)}^{n}\,\cos (\pi \varepsilon )\\ \Re ({({X}_{f\tau ,\sigma }^{+})}^{n+\varepsilon }) & = & |{X}_{f\tau ,\sigma }^{+}{|}^{n+\varepsilon }.\end{array}$$
(4)

From Eq. (4), for \({X}_{f\tau ,\sigma } < 0\) there is an additional factor \({(-1)}^{n}\,\cos (\pi \varepsilon )\) compared to \({X}_{f\tau ,\sigma } > 0\). Consider the particular case where \(\varepsilon =0\), then in the case of values of \({X}_{f\tau ,\sigma }\) that were initially negative, the values \({R}_{f\tau ,\sigma }\) have the sign \({(-1)}^{n}\). So, depending on the parity of \(n\), there are two possibilities. If \(n\) is even, there is at least one value \({R}_{f\tau ,\sigma }\) in \({{\bf{R}}}_{\sigma }\) equal to 1 for each \({{\bf{R}}}_{\sigma }\) (digit in the database). If \(n\) is odd, the \({{\bf{R}}}_{\sigma }\) digits originating from an input \({{\bf{X}}}_{\sigma }\) where at least one \({X}_{f\tau ,\sigma }=-\,1\) have a corresponding \({R}_{f\tau ,\sigma }=-\,1\) (at least one \({R}_{f\tau ,\sigma }=1\) otherwise). Therefore, the oscillating behaviour of the success rate shown in Fig. 2b is related to what happens to the negative input data as shown in Eq. (4).
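Numerically, Eq. (4) is simply the real part of the principal-value complex power; a minimal check (NumPy):

```python
import numpy as np

def nonlinear_transform(X, alpha):
    """Elementwise R = Re(X**alpha) for X in [-1, 1] (Eq. (4)).

    Casting to complex selects the principal branch, so a negative x maps
    to |x|**alpha * cos(pi*alpha), reproducing the (-1)**n cos(pi*eps)
    factor for alpha = n + eps."""
    return np.real(np.asarray(X, dtype=complex) ** alpha)

x = np.array([-0.5, 0.5])
print(nonlinear_transform(x, 2.0))  # [ 0.25   0.25 ]  even n: sign erased
print(nonlinear_transform(x, 3.0))  # [-0.125  0.125]  odd n: sign kept
```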

The poorer performance on the recognition task for odd \(n\) comes from the fact that the phase from the Fourier transform is essentially arbitrary. When \(n\) is even, the important elements of \({{\bf{R}}}_{\sigma }\) are always positive, but for odd \(n\) they are sometimes positive and sometimes negative. The greater variation in the latter case makes it essentially impossible for the neural network to connect the input data to the appropriate output. This behaviour is most easily seen in the limit where \(n\) becomes large, as shown in the inset of Fig. 2b.

From Eq. (4) we can evaluate the effect of our non-linear transformation for \(n\to \infty \): \(\mathop{\mathrm{lim}}\limits_{n\to \infty }\,|{X}_{f\tau ,\sigma }{|}^{n}=0\) for \(|{X}_{f\tau ,\sigma }| < 1\) and \(\forall n\) \(|{X}_{f\tau ,\sigma }{|}^{n}=1\) when \(|{X}_{f\tau ,\sigma }|=1\). In practice, due to the numerical truncation on a computer, for \(n\gg 100\) and \(|{X}_{f\tau ,\sigma }| < 1\), \(|{X}_{f\tau ,\sigma }{|}^{n}=0\). So, for very large \(n\), the resulting vector \({{\bf{R}}}_{\sigma }\) contains only zeros and at least one element that is equal to 1 or −1 after the non-linear transformation.

There are 500 digits in our spoken digit database and, for very large odd values of \(n\), there are 253 \({{\bf{R}}}_{\sigma }\) vectors with one \({R}_{f\tau ,\sigma }=1\) and 247 with one \({R}_{f\tau ,\sigma }=-\,1\). For each of these vectors, all other elements are mapped to zero. For large even values of \(n\), all 500 \({{\bf{R}}}_{\sigma }\) contain one \({R}_{f\tau ,\sigma }=1\) with all others equal to zero. In this large-exponent limit, the classification task is simple to understand. The non-linearity selects the largest-magnitude frequency/time component from the transformed audio file. If this component is constant between speakers, the digit can be identified. For large values of \(n\), there are 2 different success rate values depending on the parity of \(n\). As shown in the inset of Fig. 2b, for large values of \(\alpha \) \((\alpha > 1000)\), the success rate tends to a square-wave behaviour alternating between very low values (12%) around odd values of \(\alpha \), i.e. \(\alpha \in [2n+0.5,2n+1.5]\), and a slightly higher value (25.8%) around even values of \(\alpha \), i.e. \(\alpha \in [2n-0.5,2n+0.5]\), where \(n\in {\mathbb{N}}\) (for large values of \(\alpha \), when \(\alpha =n+0.5\), the success rate is not defined). The difference arises because of the random phases coming from the Fourier transform. For even \(n\), the phases are irrelevant, but for odd \(n\), the value sometimes gets mapped to one and sometimes to minus one, making the data much more difficult to classify.

Overall, as shown in Fig. 2(b), for a wide range of values of \(\alpha \), the non-linearity drastically improves the recognition rate. In particular, the recognition rate is very high for low exponents \(\alpha \). An optimum non-linearity is reached for \(\alpha =0.2\) providing the highest recognition rate of 88%, which is comparable to those obtained for the cochleagram.

We use a \(t\)-Distributed Stochastic Neighbour Embedding (\(t\)-SNE) technique27 to represent our \({N}_{f}\)-channel data in a 2D plot (see Fig. 3) in order to visualise how the data separation occurs and to understand the recognition capacity of the different filtering methods. \(t\)-SNE is a nonlinear dimensionality reduction technique used for embedding high-dimensional data into a low-dimensional space of two or three dimensions. During the data reduction, the probability that two vectors are neighbours is preserved, allowing visualisation of the structure in the data. Each digit is represented by coloured dots for all data points of the utterances. For instance, Fig. 3a shows that for the spectrogram with \(\alpha =1\) (linear), for which the recognition rate is about 10% (random choice), there is no data separation as all the coloured points seem to be randomly distributed. In particular, the digits of a same class do not form separate clusters. On the other hand, for the spectrogram with \(\alpha =0.2\) (optimal non-linearity), data separation can clearly be seen in Fig. 3b, correlating with the better recognition rate of 88% compared to the linear spectrogram \((\alpha =1)\). Furthermore, \(t\)-SNE shows data separation capability for both the cochleagram and the MFCC filter. As shown in Fig. 3c,d, well defined clusters corresponding to the spoken digits appear and corroborate the high recognition rates exhibited by the two filtering methods.

Figure 3

2D representation of the two \(t\)-SNE components for: (a) the spectrogram with \(\alpha =1\), (b) the spectrogram with \(\alpha =0.2\), (c) the cochleagram, and (d) the MFCC filtering methods.
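A minimal sketch of how such an embedding can be produced (scikit-learn; the perplexity value is an assumption, not a parameter quoted in this work):

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_embedding(features, perplexity=30.0, seed=0):
    """Embed the filtered data in 2D for visualisation.

    features : (n_points, n_dims) array, e.g. one row per flattened digit;
    colour the returned 2D points by digit class to reveal clusters."""
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=seed)
    return tsne.fit_transform(features)  # (n_points, 2)
```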

To summarise this section, we show that the non-linear transformation applied to the input data by the MFCC filter and the cochleagram plays a role similar to that of the non-linear nodes in the reservoir neural network prior to the linear classifier. We highlight that these stand-alone feature extractors perform data separation due to their internal non-linear transformations. We indeed obtain recognition performances that are close to those found with these approaches by adding a simple non-linear transformation to the individual elements of the conventional spectrogram. Depending on the non-linearity, the recognition rate can vary strongly, from around 10% to 95.8%.

Neural network: Reservoir computing based on non-linear oscillators

Having shown that non-linear filtering methods can by themselves achieve high recognition rates, we turn to evaluating the gain in overall performance provided by a reservoir neural network taking as inputs the outputs of these acoustic filters. We implement the reservoir with a single non-linear oscillator6. In this approach, recurrent chains of non-linear transformations occur in time instead of space. The loss of parallelism is compensated by time-multiplexing, which in turn requires that the input be preprocessed. To do so, each point of interval \(\tau \) in Fig. 1(b) is multiplied by a random binary matrix (of dimensions \({N}_{\theta }\times {N}_{f}\)) to induce transient behaviour. This transformation is linear and does not affect the final recognition rate. Each point of the input audio file is converted into a binary sequence of duration \(\tau \) composed of \({N}_{\theta }=400\) points separated by time steps \(\theta \). The time step \(\theta \) is set shorter than the relaxation time of the oscillator to keep the oscillator in the transient regime and generate temporal cascades at each sequence \(\tau \) of the pre-processed input.

We have developed a simple model based on a non-linear magnetic oscillator28 that takes into account the main ingredients for neuromorphic computing: non-linearity (square-root dependence of the amplitude on the input current) and memory (relaxation time of the oscillator between two different output voltage levels). The dynamics of the oscillator output microwave voltage \({v}_{i}^{{\rm{osc}}}\) as a function of the input voltage \({v}_{i}^{{\rm{in}}}\) at time step \(i\) can be solved numerically16:

$${v}_{i}^{{\rm{osc}}}={v}_{i}^{\infty }(1-{e}^{-\Delta t/{T}_{{\rm{relax}}}})+{v}_{i-1}^{{\rm{osc}}}\cdot {e}^{-\Delta t/{T}_{{\rm{relax}}}},$$
(5)

where \({T}_{{\rm{relax}}}\) is the relaxation time towards the asymptotic value \({v}_{i}^{\infty }\) given by29:

$${v}_{i}^{\infty }=c\sqrt{{I}_{{\rm{DC}}}-{v}_{i}^{{\rm{in}}}/R-{I}_{{\rm{c}}}},$$
(6)

with \(c\) a constant related to the initial bias condition, i.e. the initial emitted voltage of the oscillator, \(R\) the DC resistance of the oscillator, and \({I}_{{\rm{c}}}\) the threshold current above which auto-oscillations can occur. In order to simulate the oscillator response to a time-varying input \(({v}_{i}^{{\rm{in}}})\), we solve Eq. (5) numerically with the following parameters: \(\Delta t=5\) ns, \({v}_{i}^{{\rm{in}}}/R=\pm 3\) mA, \({I}_{{\rm{DC}}}=6\) mA, \({I}_{{\rm{c}}}=4.9\) mA, \({T}_{{\rm{relax}}}=410\) ns. These parameters, which constitute hyperparameters of our system, are extracted from experiments as reported elsewhere8.
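A direct numerical integration of Eqs. (5) and (6) with these parameters can be sketched as follows (NumPy; \(R\) and \(c\) are placeholders since their experimental values are not quoted here, and the amplitude is clamped to zero below the auto-oscillation threshold):

```python
import numpy as np

def simulate_stno(v_in, dt=5e-9, t_relax=410e-9,
                  i_dc=6e-3, i_c=4.9e-3, R=1.0, c=1.0):
    """Oscillator amplitude relaxing towards v_inf with time constant T_relax.

    v_in is scaled so that v_in / R = +/- 3 mA, as in the text."""
    decay = np.exp(-dt / t_relax)
    v_osc = np.empty(len(v_in))
    v_prev = 0.0
    for i, vi in enumerate(v_in):
        # Eq. (6), clamped: below threshold no auto-oscillation occurs.
        v_inf = c * np.sqrt(max(i_dc - vi / R - i_c, 0.0))
        # Eq. (5): exponential relaxation keeps memory of past inputs.
        v_prev = v_inf * (1.0 - decay) + v_prev * decay
        v_osc[i] = v_prev
    return v_osc

# Example: response to a random binary masked input (R = 1 ohm placeholder).
rng = np.random.default_rng(0)
v = simulate_stno(rng.choice([-3e-3, 3e-3], size=1000))
```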

Even if the recognition rate of the non-linear filters (MFCC and cochleagram) is already high, there is still room for improvement with the inclusion of a recurrent neural network. The increase in the recognition rate induced by the emulated non-linear oscillator is shown in Fig. 4a. We determine the increase in recognition rate due to the neural network by subtracting the contribution of the acoustic filters previously calculated in Fig. 2(a) from the total recognition rate and normalising the result by the total recognition rate. The gain provided by the non-linear oscillator is low for the non-linear filters (training over 9 data subsets): 3.8% for the cochlear method, 9.6% for the Spectro HP method, and 22% for the MFCC method. The increase is small because the total recognition rate (filter + network) is close to a perfect success rate: up to 99.6%, 98.6%, and 99.2% for the cochleagram, Spectro HP, and the MFCC filter, respectively. On the other hand, for the linear spectrogram, the neural network drastically improves the recognition gain, up to 55.6% (for 9 trained data subsets), but the final recognition rate (filter + neural network) is not as good, around 65.2%. The simulations have been obtained for a specific neural network based on the non-linear dynamics of an oscillator with time multiplexing in the framework of reservoir computing. As mentioned earlier, we choose this particular framework because it is frequently used for hardware implementations. However, these conclusions hold for very general types of (spatial or temporal) neural networks and learning processes because the limitation of the gain in recognition in the case of the MFCC, Spectro HP, and cochlear filters is not due to the neural network but to the already excellent separation properties of the filtering.

Figure 4

Spoken digit recognition for a neural network. (a) Spoken digit gain on cross-validated test recognition rates as a function of the number of subsets \(N\) used during training for a non-linear oscillator modelled with Eqs. (5) and (6), and (b) for the experimental spin-torque nano-oscillator driven by spin-polarised current. The coloured region corresponds to the uncertainty of the recognition rate, here twice the standard deviation.

We compare these simulations to the behaviour of an experimental non-linear oscillator. In particular, we choose a magnetic nano-oscillator that was recently demonstrated to be an excellent building block for neuromorphic computing8,16,30. This kind of oscillator is small (nanoscale), performs low-power computing, has a high signal-to-noise ratio allowing highly reliable computation, and offers a tunable non-linearity through the spin transfer torque mechanism. Our nanoscale oscillators are circular magnetic tunnel junctions with a 6 nm thick FeB free layer and a diameter of 375 nm. For these dimensions, the ground state of the magnetisation in the FeB layer is a vortex structure. In a small region called the vortex core, the magnetisation, which curls in the plane elsewhere, points out of the plane. Under dc current injection, the core of the vortex steadily gyrates around the centre of the dot with a frequency in the range 250 MHz to 400 MHz. Vortex dynamics driven by spin torque are well understood and well controlled, and have been shown to be particularly stable (more details can be found elsewhere31).

The experimental implementation of the spoken digit recognition task is described in ref. 8. The preprocessed input signal (filtered digits with time multiplexing) is generated and sent to the sample using an arbitrary waveform generator. Then, the microwave voltage across the magnetic tunnel junction is measured by a real-time oscilloscope and fast oscillations are observed. The amplitude of the oscillator response is obtained by inserting a microwave diode between the sample and the oscilloscope, and is processed as the output signal. The oscillation amplitude is robust to noise thanks to the confinement provided by the counteracting torques exerted by the injected current and the magnetic damping. In addition, the voltage amplitude is highly non-linear as a function of the injected current, with the same square-root dependence on the current as our simulated oscillator in Eq. (6). Furthermore, the amplitude of the oscillator voltage intrinsically depends on past inputs when the time step \(\theta \) is shorter than the relaxation time of the magnetic nano-oscillator. Therefore, this single nano-device has the two most crucial properties of neurons: non-linearity and memory.

Tables 1 and 2 show the word success rates, as well as the mean squared errors, obtained in simulations and experimentally. The gain on the spoken digit recognition for the different acoustic filters induced by the experimental magnetic nano-oscillator is shown in Fig. 4(b). There is very good agreement between the experimental results and the simulations. When 9 data subsets are used during the training process, the gain is 3.8% for the cochleagram, 10.8% for the Spectro HP filter, 22% for the MFCC filter, and 70.4% for the linear spectrogram. We see in Tables 1 and 2, as well as in Fig. 4, that for some cases the magnetic nano-oscillator exhibits a slightly higher recognition gain than the simulations, even though the latter neglect the intrinsic noise. We believe the better performance is mainly due to the higher complexity in the dynamics of the magnetic nano-oscillators, including a relaxation time that varies with current. Finally, we observe that overfitting effects are quite minimal. Some overfitting with respect to the mean squared error (MSE) can be seen when using cochleagram filters. This is because, in this situation, the role of the reservoir is quite minimal, whereas it uses the same number of parameters as in the other situations. However, the overfitting that occurs in the MSE does not make the overall performance in terms of word success rate (WSR) on the test set significantly worse.

Table 1 Results for a simulated STNO neural network with N = 400 nodes.
Table 2 Results for an experimental STNO neural network with N = 400 nodes.

The different contributions to the spoken digit recognition task are summarised in Fig. 5 for the case in which nine utterance subsets are used during the learning step and one during the recognition. The random choice level is 10% and is shown in grey. The contribution of the filtering methods is shown in blue (not visible in the case of the linear spectrogram, \(\alpha =1\)). Figure 5 also shows the net contribution of a neural network, in our case under the reservoir computing approach, to the spoken digit task. The simulated version of our neural network, i.e. using the simulated dynamics of the spin-torque vortex oscillators, is shown in purple, while the results for the experimental magnetic nano-oscillators are shown in green. The main contribution of the neural network to the spoken digit recognition task occurs when there is a lot of work to perform, i.e. when starting from the random choice level (linear spectrogram). Nevertheless, when our neural network is coupled with well performing stand-alone feature extraction techniques like the cochleagram or the MFCC filter, it is capable of bringing the recognition rate to state-of-the-art values (overall WSR of 99.8% for the MFCC and the Spectro HP filters + experimental spin-torque vortex oscillator).

Figure 5

Contributions to the spoken digit cross-validated test recognition rate. Random choice level is shown in grey, the filtering methods in blue, and the neural network under the reservoir computing approach in purple and green for the simulations and experiments, respectively. Here, 9 data subsets (90% of the database) are used for training our reservoir computing model and the remaining subset (10% of the database) is used to perform the recognition task.

More challenging spoken digit database

The TI-46 database is based on clean audio waveforms from a limited number of speakers. We describe above how this database is rather limited for testing a neural network when combined with substantive preprocessing. We are able to test our nano-oscillator based reservoir computing approach by limiting the preprocessing to a basic linear preprocessing filter (linear spectrogram) and demonstrate its effectiveness. However, we can test the combination of effective preprocessing with reservoir computing by using a larger data set with a broader set of speakers and a variety of background noise types and levels. In this section, we simulate the performance of our theoretical reservoir computing implementation (simulated STNO neural network) on the AURORA-2 database32,33.

The AURORA-2 database provides the data for the task of recognising digits taken from the TIDigits database34 in noisy and channel-distorted environments (artificially corrupted). To simulate noisy telephony environments, the clean utterances are first down-sampled to 8 kHz, and then additive and convolutional noise is added. The AURORA-2 database has both clean and multi-condition training and test sets. Each type of noise is added to a subset of clean speech utterances with seven different levels of signal-to-noise ratio (SNR). This process generates seven subgroups of test sets for a specified noise type: clean (infinite signal-to-noise ratio) and signal-to-noise ratios of 20, 15, 10, 5, 0, and −5 dB.
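For reference, mixing noise into a clean utterance at a prescribed SNR follows a standard recipe (a sketch, not the exact AURORA-2 tooling):

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Scale `noise` so that the mixture clean + noise has the requested
    signal-to-noise ratio in dB."""
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise
```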

We simulate the preprocessing and reservoir computing on all available isolated spoken digits from 0 to 9 in the training and testA datasets from 214 female and male speakers. The training dataset contains 2196 clean digits and several subsets of noisy (corrupted) digits with 4 different noise types (subway, babble, car, and exhibition hall noise) at different signal-to-noise ratios. During the training of our model, we select the 2196 clean digits and the corrupted digits with SNR = 20 dB (451 digits), SNR = 15 dB (444 digits), and SNR = 10 dB (430 digits). With this dataset, we train the model in mixed conditions.

The recognition (testing) is performed on the testA dataset containing 4 types of added noise at different signal-to-noise ratios. The noise types are the same as in the training set (unlike for the testB and testC subsets of AURORA-2, which we do not use). The testA dataset contains a subset of 1040 clean digits and several subsets each containing 1040 digits corrupted with subway, babble, car, and exhibition hall noise. We choose the following 4 subsets: clean, and corrupted with SNR = 20 dB, 15 dB, and 10 dB. To summarise, training is performed on 3521 digits and testing on 1040 unseen digits from the same categories as the training set. The test set contains 22.8% of the total number of digits (1040/4561).

Tables 3 and 4 give the simulated results for spoken digit recognition using the nano-oscillator based reservoir computing approach combined with the two filtering methods, MFCC and cochlear, respectively. The results are given as word success rates (%). In parentheses, we give the gain compared to the baseline (control test without the nano-oscillator based reservoir). Not surprisingly, the results are not as good as the results presented above for the TI-46 database. While the preprocessing filters do much worse without the reservoir than they do on the TI-46 database, they still do much better than linear preprocessing. In all cases, inclusion of the reservoir substantially improves the success rate compared to the baseline.

Table 3 Word success rate (in percent) for a simulated STNO neural network with \(N=2000\) nodes after filtering the inputs with the MFCC filter combined with the reservoir.
Table 4 Word success rate (in percent) for a simulated STNO neural network with \(N=2000\) nodes after filtering the inputs with the cochlear filter combined with the reservoir.

Comparison of Tables 3 and 4 shows that the MFCC filter is more robust to noise than the cochlear filter and gives better results in most cases. Interestingly, it does worse in almost all cases without the reservoir, but appears to allow the reservoir to make much larger improvements in the success rate. The average word success rate of the testing set containing only clean digits is 92.96% for the MFCC filter. The training was performed on clean and corrupted digits (mixed conditions). The improvement over the baseline is 50.70% (given in parentheses) implying that the baseline value is 42.26% (=92.96% − 50.70%). When the cochlear filter is used, the average gain brought by the neural network is +25.90% when testing clean digits and is about two times smaller than for the MFCC filter. The same holds when noisy conditions are tested. The overall gain is +48.79% (+23.02%) for an overall average recognition rate of 81.20% (68.82%) when the input is preprocessed with the MFCC filter (cochlear filter). The baseline is lower for the MFCC filter than for the cochlear filter but as the gain is much larger (about 2 times), the absolute performance of our neural network is larger in noisy conditions for the MFCC filter. We suspect that similar results would hold for the class of MFCC-like filters, some of which are even more robust against the inclusion of noise.

There are many differences between the simulations performed on the TI-46 and AURORA-2 databases. For the TI-46 database, there are only 5 female speakers uttering each digit 10 times. Training is performed on some utterances and recognition is performed on the others and the success rate is the average success rate over all combinations. For the AURORA-2 database, there are 214 speakers, half of them are female and half are male, and they typically utter each digit twice. In contrast to TI-46, the speakers in the training set are different from the speakers in the testing set. Even without added noise, the test is much more difficult and involves almost 9 times more digits than for TI-46 (4561 digits in AURORA-2 vs 500 digits in TI-46).

Our results without the reservoir are consistent with previous results. Cochlear filtering achieves approximately a 60% word success rate (around 40% for MFCC) by itself on clean isolated digits of the AURORA-2 database25. We obtain 63.24% (89.14% − 25.90%; see the last column of the first line of Table 4) for the cochlear filter and 42.26% (92.96% − 50.70%; see the last column of the first line of Table 3) for the MFCC filter.

Conclusion

We test different frequency filtering methods as stand-alone feature extractors. Training a linear classifier directly on the filtered inputs of the classic TI-46 spoken digit database, both the cochleagram and the MFCC filter give high identification rates without further processing. On the other hand, the real part of a linear spectrogram does not separate the inputs of different digit classes. Non-linearly transforming the spectrogram gives results similar to the cochleagram and MFCC filters, stressing that the separation found for the MFCC and cochlear classifiers is due to the presence of non-linearity, with only a minor effect of the particular type of non-linear transformation.

In a second part, a non-linear oscillator is added to process the filtered input. The gain in word recognition due to the non-linear oscillator is computed for each filtering method. The non-linear oscillator is simulated and found to be in excellent agreement with experimental results on magnetic nano-oscillators. For the non-linear methods (MFCC, Spectro HP, and cochleagram), the gain in word recognition is small, even though the final word recognition is nearly perfect. On the other hand, for the linear spectrogram, the gain in word recognition is much higher even though the final word recognition rate is at most 80%.

An important lesson is that when evaluating hardware systems with speech recognition tasks, the final word recognition rate should be interpreted with caution. If very efficient filtering is used to preprocess the input, the hardware system may not be adding much performance. A hardware system only adds something if it provides improved word recognition. It should be noted that the use of more complicated datasets, such as the proprietary spoken digit dataset used in refs 21,25, or the inclusion of babble noise in the dataset, would lead to significantly different results. The takeaway of our work is that, in order to test and compare hardware systems, using a linear spectrogram eases the interpretation of the results because it does not introduce any separation of the input prior to the hardware system. Furthermore, we show that a simple but powerful transformation like our Spectro HP filter, starting from a simple spectrogram, achieves state-of-the-art results (in simulations and experimentally) without applying any specific acoustic filter that mimics the human auditory system (like the cochleagram or the MFCC filter).

Testing on the AURORA-2 dataset reveals that under noisy conditions the cochlear filter performs better by itself than the MFCC filter, but the gain brought by the neural network is on average twice as large for the MFCC filter, and the overall word success rate is higher for the MFCC filter than for the cochlear filter.