Hybrid neural network based on novel audio feature for vehicle type identification

Because the audio signatures of different vehicle types are distinct, vehicles can be identified accurately from their audio signals. In real life, determining the type of a vehicle does not require visual information; audio information alone is sufficient. In this paper, we extract and stitch together features from three different aspects: Mel frequency cepstrum coefficients (perceptual characteristics), the pitch class profile (psychoacoustic characteristics) and short-term energy (acoustic characteristics). In addition, we improve the neural network classifier by fusing LSTM units into a convolutional neural network. Finally, we feed the novel feature to the hybrid neural network to recognize different vehicles. The results suggest that the novel feature proposed in this paper can increase the recognition rate by 7%; that perturbing the training data by randomly superimposing different kinds of noise improves the anti-noise ability of our identification system; and that, because LSTM has great advantages in modeling time series, adding LSTM to the network improves the recognition rate by 3.39%.


Extraction and splicing of vehicle features
Audio characteristics comprise acoustic, perceptual and psychoacoustic characteristics. Acoustic characteristics consist of time- and frequency-domain parameters, including the short-time energy, the short-time autocorrelation function, the short-time zero-crossing rate and so on. Perceptual characteristics are parameters derived from human auditory perception, including the Mel frequency cepstrum coefficients and their first- and second-order differences, which reflect dynamic characteristics. Psychoacoustic characteristics include loudness, roughness, pitch class profile, etc.
Mel frequency cepstrum coefficient (MFCC). The Mel frequency cepstrum coefficient was first used in automatic speech recognition (ASR) and speaker recognition. MFCC is a cepstral coefficient computed on the Mel frequency scale. After pre-processing of the vehicle audio signal, an MFCC vector is extracted from each frame, forming a vector group. A fast Fourier transform (FFT) is applied to each frame of the audio signal before it is sent to the Mel filter bank, where the original spectrum is transformed into the Mel-frequency spectrum; the Mel scale is related to linear frequency by

$$\mathrm{Mel}(f) = 2595\,\log_{10}\!\left(1 + \frac{f}{700}\right).$$

Then a logarithmic transformation and a discrete cosine transform are applied to the filter-bank outputs to form the MFCC vector. Figure 1 shows the process of extracting MFCC. Writing $a(m)$ for the log energy of the $m$-th Mel filter, we finally apply the discrete cosine transform to $a(m)$ to obtain the Mel frequency cepstrum coefficients:

$$C(n) = \sum_{m=1}^{M} a(m)\,\cos\!\left(\frac{\pi n\,(m - 0.5)}{M}\right), \qquad n = 1, \dots, M,$$

where M is the dimension of the MFCC feature.
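The pipeline above (FFT, Mel filter bank, log, DCT) can be sketched in NumPy as follows. This is a minimal single-frame illustration, not the authors' implementation; the filter count (26) and coefficient count (13) are common illustrative choices, not settings reported in the paper.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(frame, sr, n_filters=26, n_coeffs=13):
    # Power spectrum of one Hamming-windowed frame.
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    # Mel filter-bank energies, then log: a(m).
    log_e = np.log(mel_filterbank(n_filters, len(frame), sr) @ spec + 1e-10)
    # DCT-II of the log energies gives the MFCC vector C(n).
    m = np.arange(n_filters)
    n = np.arange(n_coeffs)[:, None]
    return np.cos(np.pi * n * (2 * m + 1) / (2 * n_filters)) @ log_e
```

In practice a library routine such as `librosa.feature.mfcc` would be used instead of this hand-rolled version.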
Pitch class profile (PCP). The pitch class profile can be used to extract the tonal characteristics of audio. It reconstructs the spectrum into a chroma spectrum, which describes the pitch level16. When extracting the MFCC, the audio signal is sent to the Mel filter bank, which smooths the spectrum and eliminates harmonics; the formants of the audio are highlighted, so the tonal characteristics of the vehicle audio are lost. But vehicle audio contains complex signals, including horn sound, tire friction sound, engine sound, etc. The pitch class profile can therefore be selected as a characteristic for recognizing different types of vehicles. The process of extracting PCP is shown in Fig. 2.
The specific extraction steps are as follows. Step 1: Let x(m) be the collected audio signal. We segment the pre-processed audio into frames of length N with a frame shift of N/2, using a Hamming window as the window function. The spectrum of the kth semitone of the nth frame is obtained by Constant-Q analysis:

$$X_{cqt}(k) = \frac{1}{N_k} \sum_{m=0}^{N_k - 1} w_{N_k}(m)\, x(m)\, e^{-j 2\pi Q m / N_k},$$

where $w_{N_k}(m)$ is the window, $N_k$ is the length of the window, and Q is the constant factor of the Constant-Q spectral analysis.
Step 2: Spectrum mapping. We map the spectrum X_cqt(k) to p(k) in the pitch class domain; a 12-bin tuned chromagram is then calculated from the Constant-Q spectra. The CQT spectrum is mapped to the pitch class bins in a logarithmic manner according to the equal-tempered scale of music theory. Each k in X_cqt(k) is mapped to a bin p of the PCP by

$$p(k) = \left\lfloor 12\,\log_2\!\left(\frac{f_s\,k / N}{f_{ref}}\right) \right\rceil \bmod 12,$$

where f_s is the sampling rate, f_s/N is the frequency-domain interval, f_ref is the reference frequency corresponding to C1 in music, and f_s k/N is the frequency of the kth frequency-domain bin.
Step 3: Accumulating all the frequency bins corresponding to each pitch class yields the 12-bin PCP vector:

$$PCP(p) = \sum_{k:\, p(k) = p} \left| X_{cqt}(k) \right|^2, \qquad p = 0, 1, \dots, 11.$$

Step 4: To make the PCP more dynamic, we take the first-order difference of PCP_p to obtain PCP'_p and then stitch PCP_p and PCP'_p together.
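Steps 2-4 can be sketched as follows. This is a simplified illustration using an ordinary FFT spectrum in place of the Constant-Q transform, so the bin-to-pitch-class mapping and accumulation are the point, not the exact spectral front end; `f_ref` defaults to roughly C1 (about 32.7 Hz).

```python
import numpy as np

def pcp(frame, sr, f_ref=32.703):
    """12-bin pitch class profile of one frame (Steps 2 and 3)."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    n = len(frame)
    profile = np.zeros(12)
    for k in range(1, len(spec)):            # skip the DC bin
        freq = k * sr / n                    # bin frequency f_s * k / N
        p = int(round(12 * np.log2(freq / f_ref))) % 12
        profile[p] += spec[k]                # accumulate energy per pitch class
    return profile / max(profile.sum(), 1e-12)

def pcp_with_delta(frames, sr):
    """Stitch PCP and its first-order difference (Step 4)."""
    p = np.array([pcp(f, sr) for f in frames])
    dp = np.diff(p, axis=0, prepend=p[:1])   # first-order difference over frames
    return np.hstack([p, dp])                # shape (n_frames, 24)
```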
Short-term energy. The energy reflects the vehicle audio feature more comprehensively. Let x_n(m) be the nth frame of the pre-processed audio signal. The short-term energy E_n is

$$E_n = \sum_{m=0}^{L-1} x_n^2(m),$$

where L is the length of the frame.
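Framing and the short-term energy formula above can be sketched as follows; the frame length of 128 and shift of 64 match the preprocessing reported in the experiments, but the helper names are our own.

```python
import numpy as np

def frame_signal(x, frame_len=128, hop=64):
    """Split a signal into overlapping frames (shift = half the frame length)."""
    n = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def short_term_energy(frames):
    """E_n = sum over m of x_n(m)^2 for each frame."""
    return np.sum(np.asarray(frames) ** 2, axis=1)
```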

Hybrid neural network
Convolutional neural network (CNN). Essentially, a CNN is an extended structure of the multi-layer perceptron, and the construction of its layer structure has a great impact on the actual output. A classic convolutional neural network includes an input layer, convolutional and pooling layers combined in various forms, a limited number of fully connected layers, and an output layer. In the convolutional layer, the convolution kernel performs a convolution operation on the vector output from the previous layer, and nonlinearity is introduced by an activation function. The mathematical model can be expressed as

$$x_j^l = f\!\left(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l\right),$$

where M_j is the set of input feature maps, l indexes the network layer, k is the convolution kernel, b is the bias, and x_j^l is the jth output of layer l. We select ReLU as the activation function.
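A minimal NumPy sketch of the layer equation above, for one input channel and several kernels. As in most deep-learning frameworks, the "convolution" here is actually cross-correlation (no kernel flip); this is an illustration of the formula, not the paper's trained network.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1d_layer(x, kernels, biases):
    """Valid 1-D convolution of x with each kernel, then ReLU:
    x_j^l = f(sum_i x_i^{l-1} * k_{ij}^l + b_j^l) for one input channel."""
    outs = []
    for k, b in zip(kernels, biases):
        out_len = len(x) - len(k) + 1
        y = np.array([np.dot(x[i:i + len(k)], k) for i in range(out_len)])
        outs.append(relu(y + b))
    return np.stack(outs)      # shape (n_kernels, out_len)
```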

Long short-term memory (LSTM). Long short-term memory (LSTM) is a special kind of RNN 17 , mainly designed to solve the problems of gradient vanishing and gradient explosion during long-sequence training. Put simply, LSTM performs better on longer sequences than an ordinary RNN. An LSTM model includes an input gate i_t, an output gate o_t, a forget gate f_t and a cell state c_t. The LSTM structure is shown in Fig. 3.
The forget gate can be expressed as

$$f_t = \sigma\!\left(W_f \cdot [h_{t-1}, x_t] + b_f\right),$$

where x_t is the input unit, h_{t-1} is the previous output unit, C_{t-1} is the state of the previous unit, W_f is the weight, b_f is the bias and σ is the sigmoid function. The input gate, cell update and output gate can be expressed as

$$i_t = \sigma\!\left(W_i \cdot [h_{t-1}, x_t] + b_i\right),$$
$$\tilde{C}_t = \tanh\!\left(W_C \cdot [h_{t-1}, x_t] + b_C\right),$$
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t,$$
$$o_t = \sigma\!\left(W_o \cdot [h_{t-1}, x_t] + b_o\right),$$
$$h_t = o_t * \tanh(C_t),$$

where C_t is the updated cell state and h_t is the output of the LSTM unit. The LSTM is fed the softmax output of the CNN as its feature.
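One step of the gate equations above can be written directly in NumPy. The weights are stacked into a single matrix for brevity; `W` and `b` are illustrative parameter shapes, not the paper's trained weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W maps the concatenated [h_{t-1}, x_t] to the
    stacked pre-activations of the four gates; d is the hidden size."""
    d = len(h_prev)
    z = W @ np.concatenate([h_prev, x_t]) + b   # shape (4d,)
    f_t = sigmoid(z[0:d])                       # forget gate
    i_t = sigmoid(z[d:2*d])                     # input gate
    c_tilde = np.tanh(z[2*d:3*d])               # candidate cell state
    o_t = sigmoid(z[3*d:4*d])                   # output gate
    c_t = f_t * c_prev + i_t * c_tilde          # cell update
    h_t = o_t * np.tanh(c_t)                    # hidden output
    return h_t, c_t
```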
The hybrid neural network classifier proposed in this paper is shown in Fig. 4. It includes 1 input layer, 3 convolutional layers, 2 batch normalization layers and 2 pooling layers. Conv1d denotes one-dimensional convolution, and the LSTM adopts one time step.

Experimental results
Effectiveness of novel feature. To verify the validity of the novel feature proposed in this paper with support vector machine (SVM) and hybrid neural network (HNN) classifiers, we use the data of an extensive real-world experiment (http://www.ecs.umass.edu/~mduarte/Software.html), which includes vehicle sensor signals for both an assault amphibian vehicle and a Dragon Wagon. This paper adopts 180 vehicle signal recordings collected at the same sampling location. In this experiment, we divided the dataset into a training set and a test set. Because the amount of experimental data is small, and in order to make the most of the limited dataset, we used tenfold cross-validation to evaluate all methods: 90% of the data was selected for training and 10% for testing, repeated 10 times over non-overlapping test folds, so that every sample is available for training the model. Firstly, we preprocessed the data as follows: the frame and window lengths are both 128, the frame shift is 64, and the signals are resampled to 44.1 kHz. We spliced each pair of the three features and compared them with the novel feature proposed in this paper in order to verify its effectiveness. To avoid experimental contingency, we ran multiple experiments and took the average as the result.
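The tenfold scheme above can be sketched as a generic k-fold splitter (our own illustration, not the authors' exact partitioning code): shuffle once, split into k folds, and rotate which fold is held out.

```python
import numpy as np

def kfold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index arrays for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)           # k non-overlapping test folds
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test
```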
The experimental results are shown in Fig. 5. As can be seen, the novel feature captures audio information from different aspects, retaining information from the psychoacoustic, acoustic and perceptual features, so it represents the vehicle audio more comprehensively and makes recognition more accurate. It can also be seen from the figure that, compared with the other features, the method combining the three feature vectors improves the recognition accuracy by nearly 7%. Meanwhile, the novel feature is one-dimensional, so the complexity of the algorithm does not increase. Features containing MFCC have a better recognition effect, since MFCC is a good simulation of human auditory sensation. As shown in Fig. 5, under tenfold cross-validation the average accuracy with the novel feature as input can reach 100%, which fully demonstrates the superiority of the novel feature extracted in this paper; its recognition rate is higher than that of the other features. Moreover, for different numbers of test sets, the recognition accuracy with the proposed HNN is higher than with SVM.
Verification experiment by augmenting dataset. To verify the generalization ability of the vehicle recognition system, city noise is superimposed on the vehicle audio signal at signal-to-noise ratios (SNRs) of 20 dB, 15 dB, 10 dB, 5 dB and 0 dB. We randomly select 80% of the dataset and add noise at these different SNRs; this both expands the dataset and simulates each vehicle in different environments. After training, we tested the recognition accuracy of the model on the test set, which can be divided into two parts: audio with added noise, and the original data without noise. We use the same data as in the previous experiment under the same conditions and select the novel feature as the input vector. We also use the tenfold cross-validation method described in "Effectiveness of novel feature" to divide the dataset. The dropout rate is set to 0.25, the initial learning rate to 0.001 and the moving-average decay coefficient to 0.9; stochastic gradient descent (SGD) is used to update the learning rate. To reflect the effect of dataset augmentation more clearly, we also trained on the original signals without superimposed noise under the same conditions and tested with the noisy signals. We performed five experiments for each SNR to guarantee reliability. The training set accounted for 70% of the total dataset. The recognition accuracy is shown in Fig. 6. With dataset augmentation, at an SNR of 0 dB the recognition accuracy is 98.46%, whereas without augmentation it is 94.5%. When the SNR is in the range of 0 dB to 20 dB, the stability and accuracy of the model with dataset augmentation are significantly higher than those of the model without it.
The anti-interference ability of the model is thus greatly improved, and the recognition system is verified to maintain high accuracy at different SNRs.
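Superimposing noise at a target SNR, as done in the augmentation above, amounts to scaling the noise so that the signal-to-noise power ratio matches the requested decibel value. A minimal sketch (function name is our own):

```python
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so the mixture signal + noise has the requested SNR in dB."""
    noise = noise[:len(signal)]
    p_sig = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    # Solve 10*log10(p_sig / (scale^2 * p_noise)) == snr_db for scale.
    scale = np.sqrt(p_sig / (p_noise * 10.0 ** (snr_db / 10.0)))
    return signal + scale * noise
```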
Verification experiment with sensor-acquired data. This experiment aims to verify the robustness of the recognition system. We recorded audio of 10 different kinds of vehicles: YUTONG bus, JAC truck, HAVAL H5, SGMW, RANGE ROVER, JETTA, HYUNDAI IX35, BYD e6, JAC V7 and a train. The recording process is shown in Fig. 7. The collection equipment includes a microphone, a PXI vector signal transceiver and a computer. There are 100 samples per vehicle, with a sampling frequency of 44.1 kHz. The audio contains the vehicle's engine sound, horn sound, brake sound and tire friction sound. Because of interference noise from other vehicles during collection, we extracted the 132,300 sample points in which the main vehicle is the dominant component as the sample data. The collected vehicle audio waveforms are shown in Fig. 8.
To verify the impact of the LSTM layer on the recognition accuracy of the system, under the same experimental conditions we randomly selected 80% of the training set and superimposed noise on it. We ran the experiments with and without the LSTM layer, performing 5 experiments for each test-set ratio of 20%, 25%, 30%, 35%, 40% and 45%. The accuracy and recall of the experiments are shown in Table 1.
As shown in Table 1, adding the LSTM layer to the network improves the recognition rate by 3.39%. The reason is that the LSTM unit retains a memory of the preceding and following frames of the audio, so the correlation between the friction sound and the engine sound in the signal becomes apparent. The LSTM unit has great advantages in modelling time series, so adding the LSTM layer to the neural network classifier markedly improves system performance. Figure 9 is the confusion matrix of the experiment, for which we selected 700 samples as the training set. It can be seen that the recognition rate of large vehicles is very high. The reasons are that large vehicles are heavy, so the friction between tires and ground is more distinctive than for small vehicles, and that large vehicles use diesel engines, whose distinctive noise is caused largely by the way the fuel ignites. The BYD e6 is a new-energy vehicle; its motor drive does not produce the same vibration and noise as an internal combustion engine, so its characteristics are relatively distinct compared with gasoline and diesel vehicles, and it is recognized well. Figure 10 shows the accuracy and loss curves of the experiment. The recognition accuracy stabilizes after about 150 iterations, remaining at around 98%. The loss function stabilizes after about 200 iterations, and the test-set loss curve closely follows the training-set curve, tending to 0.05, indicating that the model is robust. To verify the impact of different types of noise, we randomly selected 80% of the audio and superimposed the wind, engine, rain, thunderstorm and helicopter noise contained in the ESC-50 dataset on the vehicle audio, at SNRs of 10 dB and 5 dB respectively.
After extracting features, we put them into the hybrid neural network for training, performed 5 experiments for each kind of noise, and took the average. The comparative results are shown in Fig. 11.
As shown in Fig. 11, the recognition rate stabilizes at around 97.5% when the wind, rain or thunderstorm noise is superimposed. But when we superimposed engine or helicopter noise, the recognition rate was affected, because both types of noise contain mechanical vibration similar to the vehicle audio; during feature extraction this mechanical vibration is easily captured as a vehicle audio characteristic, which reduces recognition accuracy. Even so, the recognition accuracy remains at 93%. This proves that under complex environmental noise the recognition rate of the system stays above 93%.

Conclusion
In a vehicle identification system, the quality of the extracted features and the performance of the classifier determine the recognition accuracy. In this paper, we spliced the Mel frequency cepstrum coefficients, the reformative pitch class profile and the short-term energy as the input vector of a hybrid neural network. The Mel frequency cepstrum coefficients reflect the auditory properties of the human ear; the reformative pitch class profile reflects the tone of the vehicle audio; and when the background noise is low, the short-term energy reflects the audio characteristics very well. Compared with other systems for identifying vehicle type, the system developed in this paper has several advantages. Compared with the support vector machine, the hybrid neural network classifier generally achieves higher recognition accuracy across the different audio features. Compared with the other features, the novel feature reflects the characteristics of vehicle audio very well: the recognition rate reaches 100% on the data of an extensive real-world experiment, and 98% when the vehicle audio is collected by our microphone. When we randomly perturb the training set by superimposing noise, the accuracy of the system is higher, so this augmentation greatly improves the anti-noise ability. We proposed a hybrid neural network combining a convolutional neural network with LSTM units, which fits the time series of vehicle audio better and improves vehicle identification accuracy significantly; experimental results show that the LSTM unit improves the recognition accuracy by 3.39%.
In the future, we would like to collect more vehicle audio and repeat the experiments on the new dataset to confirm whether the identification system has advantages over other systems. This paper only superimposes different types of noise on the actually collected vehicle audio; to better verify the feasibility of the algorithm, future work should collect vehicle sounds in real environments, such as rainy weather and strong winds.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.