Hopf physical reservoir computer for reconfigurable sound recognition

The Hopf oscillator is a nonlinear oscillator that exhibits limit cycle motion. A reservoir computer built on it exploits the vibratory nature of the oscillator, which makes it an ideal candidate for reconfigurable sound recognition tasks. In this paper, the capabilities of the Hopf reservoir computer performing sound recognition are systematically demonstrated. This work shows that the Hopf reservoir computer can offer superior sound recognition accuracy compared to legacy approaches (e.g., a Mel spectrum + machine learning approach). More importantly, the Hopf reservoir computer operating as a sound recognition system does not require audio preprocessing and has a very simple setup while still offering a high degree of reconfigurability. These features pave the way for applying physical reservoir computing to sound recognition in low power edge devices.


I. INTRODUCTION
There are ubiquitous methods of audio signal classification, particularly for speech recognition [1], [2]. However, machine learning suffers several drawbacks that hinder its wide dissemination on the Internet of Things (IoT) [3]. First, machine learning, especially deep neural networks (DNNs), relies on the cloud infrastructure to conduct massive computation for both model training and inference. State-of-the-art (SOTA) deep learning models, such as GPT-3, can have over 175 billion parameters and training requirements of 3.14 × 10^23 FLOPs (total floating-point operations) [4], [5]. The training of the SOTA speech transcription model, Whisper, used a word library that had as many words as one person would continuously speak for 77 years [6]. None of these technical requirements could be fulfilled by any edge device for IoT; thus, the cloud infrastructure is a necessity for DNN tasks. Second, reliance on cloud computing for machine learning poses great security and privacy risks. Over 60% of previous security breaches happened during the raw data communication between the cloud and the edge for machine learning [7]. Further, each breach carries an average $4.24 million loss, and this number is continuously growing [8]. The privacy concern causes distrust among smart device users and drives the abandonment of smart devices [9], [10]. Third, the environmental impact of implementing DNNs through a cloud infrastructure is often overlooked but cannot be neglected. Training a transformer model with 213 million parameters will generate carbon dioxide emissions equal to four times those of a US manufacturer's vehicle over its whole lifespan [11]. Therefore, the next generation of smart IoT devices needs to possess sufficient computational power to operate machine learning or even deep learning on the edge.
Among efforts to bring machine learning to edge devices, reservoir computing, especially physical reservoir computing, has generated early success over the last two decades. Building on the concepts of liquid state machines and echo state networks, researchers demonstrated that the sound-induced ripples on the surface of a bucket of water could be used to conduct audio signal recognition [12]. In a nutshell, reservoir computing exploits the intrinsic nonlinearity of a physical system to replicate the process of nodal connections in a neural network to extract features from time series signals for machine perception [13], [14]. Reservoir computing directly conducts computations in an analog fashion by using the physical system, which largely eliminates the necessity of separate data storage, organization, and machine learning perception. Notably, reservoir computing is naturally suited for audio processing tasks, since audio signals are a subset of time series signals. It should also be noted that more traditional, non-physical reservoir computing approaches have seen widespread use for the Internet of Things. Some examples include dynamic spectrum management [15], network traffic prediction [16], and UAV trajectory design [17].
Researchers have explored many physical systems to operate as reservoir computers for temporal signal processing. These systems include the field-programmable gate array (FPGA) [18], chemical reactions [19], memristors [20], superparamagnetic tunnel junctions [21], spintronics [22], attenuation of wavelength of lasers in special mediums [23], MEMS (microelectromechanical systems) [24], and others [13], [25]. Though these studies have demonstrated that reservoir computing could handle audio signal processing, the physical system for computing is usually very cumbersome [23], and they all require preprocessing of the original audio clips using methods such as the Mel spectrum, which largely cancels the benefits of reducing the computational requirements of machine learning via reservoir computing. More importantly, to boost the computational power, conventional reservoir computing techniques use time-delayed feedback achieved by a digital-to-analog conversion [26], and the time-delayed feedback will hamper the processing speed of reservoir computing while drastically increasing the envelope of energy consumption for computing. We suggest that the less-than-satisfactory performance of physical reservoir computing is largely caused by the insufficient computational power of the computing systems chosen by the previous works.

arXiv:2212.10370v2 [cs.SD] 26 Jan 2023
Recently, we have discovered that the Hopf oscillator, which is a common model for many physical processes, has sufficient computational power to conduct machine learning. Although this is a very simple physical system, computing can be achieved without the need of additional data handling, time-delayed feedback, or auxiliary electrical components [27], [28], [29], [30]. Notably, reconfigurable performance can also be achieved for traditional, non-physical reservoir computers [31]. The performance of the Hopf oscillator reservoir computer on a set of benchmarking tasks (e.g., logical tasks, emulation of time-series signals, and prediction tasks) is exceptional compared to much more complex physical reservoirs. This paper is an extension of previous work to further demonstrate the outstanding capabilities of the Hopf reservoir computer for audio signal recognition tasks. These results point to the efficacy of using this type of reservoir computer for edge computing, which could pave the way to obtaining edge artificial intelligence and decentralized deep learning in the foreseeable future.

II. HOPF OSCILLATOR & RESERVOIR
The forced Hopf oscillator is represented in eq. 1 [32], [30]. In eq. 1, x and y refer to the first and second states of the Hopf oscillator, respectively. The ω₀ term is the resonance frequency of the Hopf oscillator. The µ parameter affects the radius of the limit cycle motion. For example, without external forcing, the Hopf oscillator would have a limit cycle of radius µ, and it would oscillate at a frequency of ω₀. This parameter also loosely correlates with the quality factor of the oscillator. A is the amplitude of a sinusoidal force.
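The equation itself is not reproduced in this copy of the text. As a sketch only, a forced Hopf oscillator in a commonly used normal form that matches the symbols defined above could be written as follows; the exact scaling of µ in eq. 1 of [32], [30] may differ, and treating Ω as the angular frequency of the sinusoidal force is an assumption.

```latex
\begin{align}
\dot{x} &= \left(\mu - (x^{2} + y^{2})\right)x - \omega_{0}\,y + A\sin(\Omega t), \\
\dot{y} &= \left(\mu - (x^{2} + y^{2})\right)y + \omega_{0}\,x .
\end{align}
```

In this assumed form with A = 0, writing r² = x² + y² gives ṙ = (µ − r²) r, so the unforced limit cycle sits at r = √µ and rotates at angular frequency ω₀. If the limit cycle radius is to equal µ exactly, as stated in the text, the nonlinear term would need a different scaling.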
For the oscillator to classify audio signals, an external forcing signal, f(t), that contains the audio signal, a(t), is constructed, as shown in eq. 2; this is then used as input to the Hopf oscillator. The modified Hopf oscillator as a reservoir is represented by eqs. 3 and 4. The external signal, f(t), is composed of a DC offset and the audio signal, a(t). The DC offset ensures that the radius parameter is non-negative. This external signal is injected into both the radius parameter, µ, and the sinusoid, A sin(Ωt). The Hopf oscillator dynamically responds to the audio signal, and the x state corresponds to the audio features for the machine learning audio classification task. The y state, although not explicitly used in the classification task (as depicted in Fig. 1), likely stores information and aids in the computational task.

Unlike the original form of the Hopf oscillator reservoir computer, we use the Hopf oscillations to extract audio features for classification instead of directly using the two state outputs for time series signal prediction [27]. As such, several changes are made in the computational scheme of the Hopf oscillator reservoir computer. First, this formulation of the reservoir does not include time-multiplexing or a masking procedure. Conventional reservoir computing uses a preset mask multiplying the reservoir outputs to create neurons in the reservoir system. The training of the mask equates to updating parameters when training digitally realized neural networks. However, this method is memory expensive and inefficient for audio signal processing, since the length of the mask should be sufficient to cover the length of the audio clip and the nodal connections necessary for the signal classification. Instead of training masks, we use a more efficient multiple-layer convolutional neural network readout to directly feed forward the reservoir outputs and train the connections between each layer as the parameters. Second, no Gaussian noise is added, as the audio signals already have background noise. Third, instead of using a pseudo-period to guide the training of the machine learning readout, we use the number of samples collected for classification to control the nodal connections within each collected feature point generated from the reservoir processing 1D audio data. N virtual nodes means that for each sampling point of the original audio, the reservoir will generate N − 1 nodal connections in 1D for each reservoir state for classification. For example, with N virtual nodes, a sampled audio data point is processed by the physical node (i.e., x in Fig. 1) N − 1 times, which creates N feature points from one audio sample and N − 1 nodal connections in these N feature points. In the current paper, we set N to 100 for audio processing. This method hinders the sampling speed of the audio signals. Thus, we resample the original full resolution audio data to ensure that we operate experiments within a relatively short period of time. It is worth noting that the length of the audio clips for each classification event effectively builds the pseudo-period in the traditional context of reservoir computing via time-delayed feedback loops (i.e., a fixed length of the audio will produce one classification result, with details provided later).

The eventual nodal connection of the Hopf reservoir computer and output handling could be conceptualized as in Fig. 1. Here, the Hopf reservoir computer is used to compute feature maps, with several representative examples shown in Fig. 2. The x-axis follows the numerical order of the virtual nodes, and the y-axis is the time. The value of the feature map is rescaled from 0 to 1. Consecutive convolutional layers, followed by the flattened layer and fully-connected layers depicted in Fig. 4, construct the machine learning readout for processing the audio signal outputs from the reservoir, which is further described in Section III. Note that a similar approach is applied in the SOTA urban sound recognition on edge devices [33], though we eliminate the computationally expensive preprocessing of the Mel spectrogram by offloading feature extraction to the reservoir computer. More importantly, our approach could use a very coarse sampling (4000 Hz was used here) instead of the Mel spectrogram applied in [33] to capture the granularity of the audio signals. A detailed comparison is provided in the subsequent section to demonstrate the superior feature extraction from the Hopf reservoir computer.

Fig. 1. A schematic showing the nodal connections within a Hopf oscillator for reservoir computing. The original signal, f(t), is sent to the two states of the oscillator (i.e., two physical nodes). Each physical node generates N virtual nodes in time series. The digital readout layers (i.e., machine learning algorithm) will read n samples from the node x of the oscillator (note that we only use one node for audio classification in the present paper). n₀ corresponds to the number of samples from the original audio signal, and N refers to the number of virtual nodes controlled by the readout mechanisms. The signal from the reservoir is then sent to a neural network, which is indicated by the blue dashed arrow; this neural network is described in Fig. 4. The digital readout will classify the n samples corresponding to one audio clip to its class.
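The feature extraction described in this section can be sketched numerically. The following Python snippet is an illustration under stated assumptions, not the circuit implementation: it assumes f(t) = 1 + a(t) as the DC offset plus audio, assumes that f(t) multiplies both µ and the sinusoidal drive in a standard Hopf normal form, and uses hypothetical parameter values (`mu`, `omega0`, `A`, `Omega`) that are not taken from the paper. Each audio sample is held constant while the x state is read `n_virtual` times, giving one row of the feature map per audio sample.

```python
import numpy as np


def hopf_rhs(state, t, f, mu, omega0, A, Omega):
    """Assumed modulated Hopf dynamics: f scales both mu and the sinusoidal drive."""
    x, y = state
    r2 = x * x + y * y
    dx = (mu * f - r2) * x - omega0 * y + A * f * np.sin(Omega * t)
    dy = (mu * f - r2) * y + omega0 * x
    return np.array([dx, dy])


def hopf_feature_map(audio, fs=4000.0, n_virtual=100, mu=5.0,
                     omega0=2 * np.pi * 40.0, A=0.5, Omega=2 * np.pi * 40.0):
    """Return an array of shape (len(audio), n_virtual) holding the x state.

    All parameter defaults are hypothetical placeholders for illustration.
    """
    audio = np.asarray(audio, dtype=float)
    dt = 1.0 / (fs * n_virtual)       # one RK4 step per virtual node
    state = np.array([0.1, 0.0])      # small nonzero initial condition
    features = np.empty((audio.size, n_virtual))
    t = 0.0
    for i, a in enumerate(audio):
        f = 1.0 + a                   # DC offset keeps mu * f >= 0 when |a| <= 1
        for k in range(n_virtual):
            k1 = hopf_rhs(state, t, f, mu, omega0, A, Omega)
            k2 = hopf_rhs(state + 0.5 * dt * k1, t + 0.5 * dt, f, mu, omega0, A, Omega)
            k3 = hopf_rhs(state + 0.5 * dt * k2, t + 0.5 * dt, f, mu, omega0, A, Omega)
            k4 = hopf_rhs(state + dt * k3, t + dt, f, mu, omega0, A, Omega)
            state = state + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
            t += dt
            features[i, k] = state[0]  # read only the x state, as in the text
    return features
```

Calling `hopf_feature_map(clip)` on a clip normalized to [−1, +1] yields one row of N = 100 virtual-node readings per audio sample. A fixed-step Runge-Kutta integrator is used here only to keep the sketch dependency-free.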

III. METHODS
The Hopf physical reservoir computer is realized through a proprietary circuit design proposed in [27]. Following the schematic given in Fig. 3, the circuit is implemented using TL082 operational amplifiers and AD633 multipliers. The input audio signal is first normalized to the range from −1 to +1 and mixed with the sinusoidal forcing signal in MATLAB; it is then sent to the circuit by a National Instruments (NI) cDAQ-9174 data I/O module. The outputs from the circuit, referred to as the x and y states of the Hopf oscillator, are collected with a sampling rate of 10^5 samples/s by the same NI cDAQ-9174 for later machine learning processing.

Three datasets are employed in the sound recognition experiments: urban sound recognition, Qualcomm voice command, and spoken digits. The urban sound recognition dataset consists of 873 audio clips of 10 classes, which are high quality urban sound clips recorded in New York City [34]. Each audio clip is four seconds long with a sampling rate of at least 44.1 kHz. Compared to commonly available datasets, we have an extremely small number of samples. To demonstrate reconfigurability of the Hopf reservoir computer for audio processing, the Qualcomm voice command dataset is also used. This dataset consists of 4270 one-second audio clips of four wake words collected from speakers with diverse speaking speeds and accents [35]. From the dataset, we use 1000 clips for experiments. Compared to the previous urban sound recognition case, the only difference in the processing algorithm is the retraining of the output portion (i.e., after the convolution layers) of the machine learning readout (details are discussed in the later part of the methodology section and the results section of the paper). To compare the proposed Hopf reservoir with other reservoirs, we also conduct a spoken digits recognition experiment, which serves as the standard benchmarking test for reservoir computing. The spoken digits dataset consists of 3000 audio clips, which are spoken by five different speakers [36]. As with the Qualcomm voice command dataset, the total number of audio clips for the experiments is set to be only 1000.
For the sake of processing speed, we resample each audio clip with a sampling rate of 4000 Hz and normalize the data to the range from −1 to +1 before sending it to the analog circuit. 80% of the outputs from the circuit are used for training the machine learning model, with the remaining 20% used for testing.
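The preprocessing and data split described above can be sketched as follows. This is a minimal illustration, assuming peak scaling for the normalization and a random shuffle for the 80/20 split (neither detail is specified in the text). The helper names are hypothetical, and the linear-interpolation resampler is a simplification that applies no anti-aliasing filter.

```python
import numpy as np


def resample_linear(audio, fs_in, fs_out=4000):
    """Naive resampling by linear interpolation (no anti-aliasing filter)."""
    audio = np.asarray(audio, dtype=float)
    n_out = int(round(audio.size / fs_in * fs_out))
    t_in = np.arange(audio.size) / fs_in
    t_out = np.arange(n_out) / fs_out
    return np.interp(t_out, t_in, audio)


def normalize_pm1(audio):
    """Scale so the peak absolute amplitude is 1, i.e. values lie in [-1, +1]."""
    audio = np.asarray(audio, dtype=float)
    peak = np.max(np.abs(audio))
    return audio / peak if peak > 0 else audio


def train_test_split_indices(n_clips, train_fraction=0.8, seed=0):
    """Shuffle clip indices and split them into training and testing sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_clips)
    n_train = int(round(train_fraction * n_clips))
    return order[:n_train], order[n_train:]
```

For a one-second clip recorded at 44.1 kHz, `normalize_pm1(resample_linear(clip, 44100))` returns 4000 samples bounded by ±1, and `train_test_split_indices(1000)` returns 800 training and 200 testing indices.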
In Fig. 1, the nodal connections of the Hopf physical reservoir computer are shown. Although we only collect a 1D data stream from the Hopf circuit, the data stream consists of both input signals and the response from the virtual nodes defined by the sampling speed of the signals [37]. We follow this principle of arranging and manipulating signals by their virtual nodes. The output from the circuit reservoir is first activated using an inverse hyperbolic tangent function [27], [38]:

$$x_{\mathrm{feature}} = \tanh^{-1}(x) \qquad (5)$$

Subsequently, the activated output is rearranged by the order of the virtual nodes as the feature maps for the machine perception. A sample feature map rendering consisting of 10 different classes of urban sound is shown in Fig. 2. The Hopf reservoir computer produces this feature map as described in Section II, which is then used as an input to the neural network shown in Fig. 4. Effectively, the Hopf reservoir computer is offloading the costs of the computationally expensive Mel spectrum. A Swish activation [39] is employed to boost the performance of the machine learning model on processing sparse neuron activation (i.e., dead neuron problems) and the overall accuracy of the machine learning model processing audio data. Note that a future version of the machine learning software using skip connections (generating residual networks) [40] will further boost the robustness of the software for large sets of data. Each 1 second clip of the outputs is further skip-sampled to a 200 (number of time samples) × 100 (number of virtual nodes) array for machine learning processing (as labeled in Fig. 4).
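The steps from the raw x stream to the 200 × 100 feature map can be sketched as below. This is a hedged illustration: the peak rescaling and clipping before the inverse hyperbolic tangent are assumptions added so that the function stays finite, and the function name and the uniform skip-sampling rule are hypothetical choices rather than the authors' code.

```python
import numpy as np


def to_feature_map(x_stream, n_virtual=100, n_time=200, eps=1e-6):
    """Apply eq. 5, arrange by virtual node, skip-sample, and rescale to [0, 1]."""
    x = np.asarray(x_stream, dtype=float)
    # arctanh is only finite on (-1, 1): rescale by the peak and clip (assumption).
    peak = np.max(np.abs(x))
    if peak > 0:
        x = x / peak
    x = np.clip(x, -1.0 + eps, 1.0 - eps)
    feat = np.arctanh(x)
    # Rows follow time, columns follow the order of the virtual nodes.
    n_rows = feat.size // n_virtual
    fmap = feat[: n_rows * n_virtual].reshape(n_rows, n_virtual)
    # Skip-sample the time axis down to n_time rows.
    step = max(n_rows // n_time, 1)
    fmap = fmap[::step][:n_time]
    # Rescale to the range 0..1 used for the grayscale feature maps.
    lo, hi = fmap.min(), fmap.max()
    return (fmap - lo) / (hi - lo) if hi > lo else np.zeros_like(fmap)
```

With one second of output collected at 10^5 samples/s and 100 virtual nodes, the stream reshapes to 1000 × 100 and is skip-sampled by a factor of 5 to reach 200 × 100.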
The machine learning algorithm is implemented using Keras [41] with a TensorFlow backend. The training is conducted on an Nvidia RTX 2080Ti GPU and uses an Adam optimizer with the default learning rate of 0.001 [42]. The loss function is cross entropy [43]. The batch size during training is 5; the number of epochs is 100 for the urban sound recognition dataset, 20 for the Qualcomm voice command dataset, and 100 for the spoken digits dataset.
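For reference, the Swish activation and the cross-entropy loss named above can be written out in a few lines. The sketch below uses plain NumPy rather than the Keras implementation used in the paper, and the function names are illustrative only.

```python
import numpy as np


def swish(z):
    """Swish activation: z * sigmoid(z)."""
    z = np.asarray(z, dtype=float)
    return z / (1.0 + np.exp(-z))


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)


def cross_entropy(logits, labels):
    """Mean categorical cross entropy for integer class labels."""
    p = softmax(np.asarray(logits, dtype=float))
    n = p.shape[0]
    return float(-np.mean(np.log(p[np.arange(n), labels] + 1e-12)))
```

For a batch of 5 clips and 10 classes, an untrained readout with uniform outputs gives a loss of ln(10), which is a useful sanity check at the start of training.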

IV. RESULTS
A. Results for Urban Sound Recognition Dataset
First, we present the results of the Hopf reservoir computer for an urban sound recognition task. As shown in the left column of Fig. 5, the audio features from the Mel spectrum operations (as calculated on the audio clips with a 44.1 kHz sampling rate) show drastic differences between the three examples; using the top example as a reference, the Euclidean distances between the reference and the other two are higher than 25. In comparison, the audio features from the Hopf RC, shown in the right column of Fig. 5, are much more similar across the three examples (e.g., Euclidean distance <12).
The robustness of the audio classification is also of high importance for real-world applications. To highlight this, the Mel spectrum results are compared with the Hopf RC results for three different noise levels. Using the example in the top row of Fig. 5, white noise is added to the original signal to create different signal-to-noise ratios (SNRs); the audio features of these three new signals are computed with the Mel spectrum (using a 44.1 kHz audio sampling rate) and the Hopf reservoir computer (using a 4000 Hz audio sampling rate). The output audio features are shown in Fig. 6. The Mel spectrum-based audio features lose low frequency information when the SNR is reduced to 20, while the features generated by the Hopf reservoir computer maintain a structure similar to that of the original audio counterpart, with a Euclidean distance <5 for an SNR of 20.
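The noise test and the similarity metric can be sketched as follows. Two assumptions are made that the text does not state: the SNR is interpreted in decibels, and the Euclidean distance between two feature maps is taken as the norm of their element-wise difference. Function names are hypothetical.

```python
import numpy as np


def add_white_noise(signal, snr_db, seed=0):
    """Add Gaussian white noise so that the result has the requested SNR in dB."""
    signal = np.asarray(signal, dtype=float)
    rng = np.random.default_rng(seed)
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(p_noise), size=signal.shape)
    return signal + noise


def euclidean_distance(map_a, map_b):
    """Euclidean (Frobenius) distance between two equally shaped feature maps."""
    diff = np.asarray(map_a, dtype=float) - np.asarray(map_b, dtype=float)
    return float(np.linalg.norm(diff))
```

In this sketch, a noisy clip from `add_white_noise(clip, 20)` would be passed through the same feature extraction as the clean clip, and `euclidean_distance` would then be applied to the two resulting feature maps.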
Fig. 6. The robustness of the Hopf RC audio extraction is compared with the Mel spectrum for various signal-to-noise ratios (SNRs). For visualization, the siren example shown in the top of Fig. 5 is used with different levels of noise. From the top to the bottom, three different amounts of noise were added to the original siren audio example. In the left column, the energy of the Mel spectrum is shown. Note that the result starts to lose low frequency information when the SNR drops to 20. In the right column, the audio features that are extracted using the Hopf RC are shown. Note that the result remains largely the same for all noise levels, even when the SNR is equal to 20.
The confusion matrix for the urban sound recognition task is shown in Fig. 7. The proposed audio recognition approach based on the Hopf reservoir computer has a 96.2% accuracy. This is a 10% accuracy improvement compared to [33], with a reduction of >94% in the FLOPS (floating-point operations per second) required for high sampling rate readout and Mel spectrum computation and a reduction of ∼90% in the number of audio clips used for training.

Fig. 8. The confusion matrix of the proposed sound recognition system processing Qualcomm wake words. Each label corresponds to: a) "Hi, Galaxy", b) "Hi, Lumia", c) "Hi, Snapdragon", and d) "Hi, Android".

B. Results for Qualcomm Voice Command Dataset
Using the machine learning model trained from the previous test case (i.e., the urban sound recognition task) as the baseline, we test the Qualcomm voice command dataset to demonstrate the reconfigurability of the Hopf reservoir computer audio recognition system. In this experiment, we purposefully reduce the number of epochs to 20 and freeze the CNN portion of the machine learning model to reconfigure the audio recognition system from the urban sound detection task to a voice command task. In the left portion of Fig. 8, representative audio features of the four classes are shown, which have significant differences compared to the features of the urban sound events (Fig. 2). The audio recognition yields a >99% accuracy, with the confusion matrix depicted in the right portion of Fig. 8. Note that the number of parameters trained for this experiment is about 35,000, which accounts for about 300 KB of dynamic memory for 8-bit input with a batch size of 5 [44], [45], demonstrating the feasibility of running the training of the machine learning readout on low-level edge devices consuming a Li-Po battery level of power.

C. Results for Spoken Digit Dataset
The spoken digit dataset is used to compare the performance of the Hopf reservoir computer for audio recognition with other reservoirs (e.g., [18], [19], [20], [21], [22], [23], [24], [25]). As shown in Fig. 9, the Hopf reservoir computer produces an approximately 97% accuracy for the spoken digit classification task. This result retains the state-of-the-art recognition accuracy on this dataset while only using one physical device (i.e., one consolidated analog circuit) and two physical nodes (x and y states). As a comparison, the best performing reservoir [20] employed 10 memristors and preprocessing of the original audio clips to yield a similar accuracy. We suggest that the vibratory nature of our reservoir largely contributes to the simplicity of the proposed sound event detection system, and the activation of the reservoir using sinusoidal signals boosts the feature extraction of the audio signal using Hopf oscillations (details described later).
Further, we increase the strength of the activation signal (term A in eq. 1) and discard the inverse hyperbolic tangent activation (eq. 5) before the machine learning readout. The results, which are shown in Fig. 10, reach a 96% accuracy, compared with approximately 97% for the case that applies eq. 5 before sending the x state to the machine learning readout. This suggests that the Hopf reservoir computer can be reconfigured not only by its digital readout, similar to traditional physical reservoir computers, but that the Hopf oscillator's computational power could also be drastically enhanced by changing the oscillator's internal physical conditions.

Fig. 10. Summary of the results of the Hopf reservoir computer conducting the spoken digits recognition task. The confusion matrix of the proposed sound recognition system processing the spoken digits dataset with a 10 times increase of activation strength and without the inverse hyperbolic tangent before the machine learning readout.

A. Summary of Results
In this paper, we present the results of sound signal recognition using reservoir computing technology consisting of a Hopf oscillator [27], [28]. Instead of employing computationally expensive preprocessing (e.g., the Mel spectrum) commonly used in other studies [23], [20], [18], [33], we directly take the outputs from the Hopf circuit processing the normalized audio signal for machine learning recognition. We anticipate that this Hopf reservoir computing can be directly implemented in microphones to achieve processing-on-the-sensor in the future.
In Section IV, we systematically demonstrate that our Hopf reservoir computing approach yields a 10% accuracy improvement on a diverse 10-class urban sound recognition task compared to the state-of-the-art results using edge devices [33], whereas we use a surprisingly simple preprocessing that only normalizes the original signal. The wake words recognition results in >99% accuracy using the exact same machine learning readout, with only the MLP retrained. This implies that the Hopf reservoir computer will enable inference and reconfiguration on the edge for the sound recognition system. Additionally, compared to other reservoir computing systems (e.g., [20], [19], [25], [18]), the spoken digit dataset yields superior performance without the need for complex preprocessing, multiple physical devices, or time-multiplexing; in addition, we have also conducted our benchmarking experiments on far more realistic datasets (i.e., the 10-class urban sound recognition dataset and the 4-class wake words dataset). We demonstrate boosted performance of audio signal processing by changing the activation signal strength of the Hopf oscillator, which implies that there are more degrees of freedom for reconfiguring physical reservoir computers as compared to other reservoir implementations.
Lastly, we carefully crafted the algorithms and preprocessing of the data for sound recognition tasks to keep the overall energy consumption, including the digital readout, below 1 mW based on FLOPS operations and the analog sampling rate. The computational load, which uses fewer than 700 sound clips of a 10-class dataset for training machine learning models, is well below the envelope of the computational resources possessed by consumer electronic devices. As such, sound recognition devices using a Hopf reservoir computer could be integrated effortlessly into existing devices with a negligible increase in computational load.

B. Analysis on the Physical Mechanisms of the Hopf Reservoir Computer Sound Recognition
Three elements play important roles in the audio signal recognition. The limit cycle system creates an oscillation signal in the temporal domain with a sinusoidal form, which continuously convolves with the incoming audio signal. This effectively creates a 1D short time Fourier transform, generating unique patterns for audio recognition (e.g., Fig. 2). Interestingly, this process largely replicates the process of the cochlea in extracting the sound signal features perceptible by the neurons. The nonlinear oscillation of the Hopf oscillator in the temporal direction creates nodal connections of the reservoir computer, corresponding to the neuron connections in a DNN. Additionally, the nonlinearity of the Hopf oscillator causes it to respond differently to signals possessing various characteristic features of the audio in a broadband fashion, which produces clean separation of features (Fig. 2 and Fig. 8a). It is worth noting that some recent studies [46], [47] have demonstrated that the cochlea and its directly-connected neurons create a limit cycle system using the previous audio signals as activation to dynamically enhance the performance of the cochlea in performing audio signal feature extraction. The inner ear can be physically modeled as a Hopf oscillator with a time-delayed feedback loop using the signals from previous time instants to activate the limit cycle oscillations. The audio signal recognition actually happens in the inner ear instead of in the brain. An interesting future extension of this work is to explore different activation signals to create an artificial ear, which is capable of on-membrane audio recognition. In the meantime, the two states of the Hopf oscillator affect each other with a time delay, which enhances the memory effects essential to time series signal processing.

C. Discussion and Future Work
The unique advantages of the Hopf reservoir computer demonstrated in this paper pave the way for the next generation of smart IoT devices that exploit the unused computational power in sensor networks. Specifically, the physical mechanisms backing reservoir computing also happen in the microphone membrane with carefully crafted activation signals [46]. One could imagine that future microphones directly operate sound signal recognition using sensor mechanisms instead of dedicated processing rigs. In addition, as shown in Fig. 2, the feature map of sound signals consists of unique patterns that are recognized by a convolutional neural network commonly used for visual signal processing. An extension of the present work will explore the correlations of audio signal feature maps, visual signal feature maps, and other types of time-series data features. As such, reservoir computing could be used as a backbone for multi-modal machine learning in smart IoT paradigms, including sensor fusion, audio video signal combination, and decentralized machine learning. The extremely small amount of training data required for the machine learning operation and clear feature separation described in Section IV could offer surprisingly satisfactory results, which is essential for many use cases without the luxury of unlimited sizes of datasets (e.g., soft user identification) or with noisy environments (e.g., a mix of different signals). One example is shown in Fig. 11: an eight-second long audio signal consisting of multiple different sounds (i.e., car horn, drilling, and siren) is used to demonstrate the proof-of-concept of the Hopf reservoir computer on mixed signal processing. The first four seconds of the audio clip only have the car horn and drilling sounds. For the last four seconds, the siren sound is added with a higher amplitude. As shown in the figure, the audio features generated from the Hopf reservoir computer have a clearly dominant class in the second half of the data and exhibit visually high correlation with the audio features generated by a clean siren sound with the same Hopf reservoir computer (a Euclidean distance less than 8). We anticipate that a pattern matching algorithm originating from computer vision applications could be employed in this type of audio event separation and processing.
There are still limits in the reservoir computing method using the Hopf oscillator in its current form. First, high accuracy sound event recognition requires many virtual nodes to generate diverse features for machine perception. However, increasing the virtual nodes leads to exponential growth of the sampling rate to read high quality audio data. We are actively seeking solutions to separate audio features from the original signal for recognition and recording, which could decrease the required sampling rate. Second, the current circuit-based physical reservoir separates the process of signal mixing and activation of the circuit. Redesigning the circuit is necessary to simplify signal reading for future system deployment. However, the ultimate version of the Hopf reservoir using MEMS will solve this problem, since the computing will happen on the audio sensing mechanisms. Lastly, the signal processing still relies on a digital readout. Though the algorithm is remarkably simple, a microcontroller unit is needed. We anticipate that the short-term solution will be deploying the optimized machine learning model as firmware (consuming less than 1 MB of static memory without optimization and less than 256 KB of dynamic memory for training upgraded machine learning models). A future goal should be using an analog circuit that could detect the spike signals for audio recognition (similar to neurons) to achieve a fully analog computer on edge devices [48].

Fig. 2. Sample feature maps generated by the Hopf oscillator corresponding to different audio events. Each audio clip has a length of 1 sec sampled at 4000 Hz. The x-axis follows the arithmetic order of the virtual nodes, and the y-axis is the time. The reservoir is set to have 100 nodes for the test. The grayscale value (from 0 to 1) of each pixel corresponds to the signal strength of each data point (i.e., feature point of the audio signal). a) Air conditioner. b) Car horn. c) Children playing. d) Dog barking. e) Drilling. f) Engine idling. g) Gunshot. h) Jackhammer. i) Siren. j) Street music.

Fig. 3. A simplified circuit schematic of the Hopf reservoir computer.

Fig. 4. A schematic showing the convolutional neural network-based machine learning readout for classification of the audio events using the Hopf reservoir computer. The light blue boxes in the figure correspond to the feature maps generated from each machine learning operation. The arrows are the different machine learning operations. The numbers above the light blue boxes are the depth of the feature maps, and the bottom numbers are the length and the width of the feature maps, respectively. A max pooling with a size of (2,2) is also applied after two consecutive convolutions to reduce the dimension of the feature maps. Note that for the length and width, we only label the dimensions that are changed after machine learning operations.

Fig. 5. The Mel spectrum is compared with the Hopf RC for the urban sound recognition task. From the top to the bottom, three examples of the siren class are presented. In the left column, the energy of the Mel spectrum is shown. The Mel spectrum operation is conducted upon samples that are four seconds long with a 44.1 kHz sampling rate. The total number of frequency bands is set to be 100, and the time step is set to be 0.025 seconds. In the right column, the audio features extracted from the Hopf reservoir computer are shown for the same samples; each 1 second audio clip is downsampled to 4000 Hz and the number of virtual nodes is set to 100.

Fig. 7. For the urban sound recognition task, the confusion matrix is presented with the recognition accuracy labeled for the ten different audio events. Note that the class labels in this figure are the same as the class labels of Fig. 2.

Fig. 9. Summary of the results of the Hopf reservoir computer conducting the spoken digits recognition task. The confusion matrix of the proposed sound recognition system processing the spoken digits dataset with the original activation strength and the inverse hyperbolic tangent before the machine learning readout.

Fig. 11. A noise resistance test using audio features generated from the urban sound recognition task. During the first four seconds of this eight second clip, drilling and car horn sounds are mixed; in the last four seconds, a siren sound with a high amplitude (two times larger than that of the other two audio classes) is added to the mixed data. As shown in the figure, the latter four seconds of audio features show high similarity to the reference siren sound.