## Introduction

Mass shootings in public places such as schools, clubs, and roads are becoming common, and their rate is increasing alarmingly in the US1. Countries such as Canada, Australia, France, and the UK have also seen gun violence in public places. Recognizing the seriousness of the problem, these countries have built dedicated organizations to tackle mass shootings2. Fig. 1 shows public gun violence incidents across the US between 1981 and 2021, according to the Mother Jones magazine database. Almost all major cities fall within the red zone. In India, Naxalite and terrorist attacks are frequent. Clearly, there should be a mechanism to detect gun violence and act on it quickly, irrespective of gun laws. This paper takes a step in this direction by providing a real-time, cost-effective, reliable, and robust methodology for detecting gunshots. Such a system would let people spend time with family, friends, and colleagues with greater peace of mind and take immediate action if anything serious happens around them.

Guns are dangerous, especially when operated by people who want to harm others. Mass shootings in schools and public places are becoming commonplace. Terrorist attacks are also increasing and creating new security challenges. Such incidents kill victims or traumatize survivors for a lifetime, creating a sense of insecurity in society. This highlights the need for a system that increases security in such situations. Even at the military level, it would help develop strategies to protect troops in the field.

The computer vision community is contributing wherever possible to protect people’s lives in such untoward situations. Automated camera-based surveillance systems have previously received considerable attention. These include systems that detect cold weapons and guns and that differentiate handheld items from weapons, which undoubtedly increased the strength of such systems. The problem these researchers focused on is detecting the weapon visually on camera, preventing the mishap, and alerting the authorities.

If visual detection fails, the attack will occur, and security forces will be called to confront the attackers. At that point, the forces do not know what kind of weapons the attackers are using. Knowing the weapon gives security personnel an edge and helps keep them safe. Most of the time, attackers fight from spots not covered by surveillance (including drones). In such cases, the sound of the weapon is the only available source of information about it. A camera has a limited field of view, whereas audio capture is not limited in this way. Adding a small audio capture device would therefore improve these systems.

Previously, researchers widely used convolutional neural network (CNN) architectures to solve computer vision problems. Although promising, CNN-based architectures have some weaknesses. Firstly, because convolutions work with a constant window size, the model captures local information rather than long-range spatial relationships between different parts of the image and the complete image3. Secondly, some local information is lost through max-pooling4. Inspired by natural language processing (NLP), vision transformers have recently been proposed as an alternative to CNNs and have shown promising results in computer vision3. A vision transformer is free of convolutions and treats an image as a sequence of patches, hence overcoming the issues of locality and translation invariance faced by CNNs. This approach has been observed to use hardware resources more efficiently than CNNs and could be pre-trained on the public ImageNet dataset with fewer resources3.

In this paper, we investigate whether different gun types can be classified from the audio of their shots. The hypothesis arises because experienced military personnel can accurately tell which gun a shot was fired from by listening to it. To test this, we created a dataset that consists of 1-s audio clips of shots. Transformers have proved successful in tasks ranging from NLP to vision. However, to the best of our knowledge, they had not been used to classify audio. To replicate a human’s perception of audio, we used Mel-frequency features to represent the audio and developed a transformer-based classifier to predict the class of gun.

The major challenges faced while working in this area are as follows. Firstly, creating such a dataset is difficult and costly due to the legalities and risks involved. We therefore created a dataset using audio from YouTube videos. Secondly, no features had been proposed earlier for attention-based approaches to gun audio classification. It is not known whether attention can be applied directly to shot audio or whether some feature exists on which attention gives a better result.

Our main contributions in this paper are the following:

1. We manually created a dataset of 1-s gunshot audio clips and made it available for public use.

2. We optimized the transformer hyperparameters, showing that the vision transformer significantly improves gunshot detection accuracy compared to other state-of-the-art algorithms.

3. Our proposed approach performs well across every type and aspect of the data, yielding better results for gunshot detection than the state-of-the-art algorithms it is compared with.

The rest of the paper is organized as follows: related work, methodology and dataset creation, results and discussion, conclusion, and limitations and future scope.

## Related work

Choi et al. highlighted the significance of CCTV for better police service5. They also studied CCTV-based intelligent security systems for constructing crime-zero zones6. In 2018, they further analyzed the feasibility of security systems based on their economic value7. Adding an audio-based gunshot detection method will extend the reach of such systems: for a nominal increase in cost, a completely new sensory ability increases their usefulness. Liang et al.8 proposed a method that can detect the shooter’s position using only a few user-generated videos that include the sound of gunshots. In 2017, Liang et al.9 developed a temporal localization framework for intensive audio events in videos, in which the localization results are improved using Localized Self-Paced Reranking (LSPaR).

Morshed et al.10 proposed a robust neural network-based approach for audio classification. They employed an encoder that contains a sequential stack of neural networks and used a temporal interpolation approach to compute scene-level embeddings. Banoorupa et al.11 proposed a hybrid fingerprinting-based approach for audio classification using LSTM. In this approach, the fingerprints were created from the MFCC spectrum, which was then converted into digital images.

Phan et al.12 proposed a multiclass audio classification solution for polyphonic audio event detection. They divided the event categories into multiple sets and created a multi-class problem using a divide-and-conquer approach. Wang et al.13 proposed a 2D CNN-based technique for audio classification. This algorithm has been widely used in various speech recognition and classification tasks. Zhang et al.14 proposed an audio event detection (AED) module called Multi-Scale Time-Frequency Attention (MTFA). It informs the model where to focus along the time and frequency axes by collecting data at different resolutions, which had not been addressed in the past. Zhang et al.15 and Shen et al.16 proposed a multiscale time-frequency convolutional recurrent neural network (MTF-CRNN) for sound event detection.

Deep learning has evolved considerably over time. CNNs were used earlier and continue to work very well. More recently, Transformers were introduced, and state-of-the-art transformer methods have proved their performance in both NLP and vision tasks. In this paper, we search for the best way to classify guns based on their audio, and we ran many experiments for this task with both CNNs and transformers.

### Traditional machine learning-based research

This subsection discusses machine learning-based audio classification approaches; the researchers’ contributions are summarized in Table 1. Gunshot audio depends on various aspects, i.e., (1) firing power (size of bullet), (2) length and width of the nozzle, and (3) environment (echo). A change in firing power or in the length or width of the gun nozzle indicates that the weapon has changed. Vrysis et al.42 compared 1D and 2D CNNs for audio classification using various features. According to them, a 2D CNN with spectrogram input worked best. Nanni et al.39 used an ensemble of CNNs to classify animal sounds using spectrograms and some handcrafted features. In our work, we have used Mel-frequency and spectrogram-based features to identify the audio.

### Transformer as new state of the art

Transformers were first proposed by Vaswani et al.43 for sequence transduction tasks in NLP such as machine translation. The architecture is based entirely on attention and removes recurrence completely; it is faster to train and offers better performance. Experimenting with pairwise and patch-wise self-attention, Zhao et al.44 found that both outperform CNNs. Dosovitskiy et al.3 used transformers for image recognition, with a simple transformer as the encoder. The transformer architecture has thus proven successful in different domains. Inspired by this success, we employed transformers to classify the audio of gunshots.

Research on gun identification was carried out by Kiktova et al.45 as an extension of an intelligent acoustic event detection system. The work extracted a variety of features and used mutual information for feature selection to reduce the length of the feature vectors. A Hidden Markov Model and a Viterbi-based decoding algorithm then used the selected features. In 2021, Dogan46 presented work on predicting the gun model from its sound and developed an intelligent audio forensics tool. Dogan used a fractal H-tree pattern-based classification method, in which fractal and statistical features were used by SVM and kNN classifiers. Researchers have not only explored the classification of gunshots; research has also been done on measuring and analysing gunshot sound, as it may cause hearing impairments47. In that study, acoustic data were collected from four different guns, with sound captured at a sampling rate of 204.8 kHz. The measurement method was based on image processing techniques, where the Short-Time Fourier Transform (STFT) was used to obtain the spectrogram of an audio signal. Once the spectrogram was generated, kNN and random subspaces were used to classify it. The study found that the STFT gives better accuracy than the continuous wavelet transform (CWT). Mares and Blackburn focused on reducing gun violence by evaluating St. Louis’ Acoustic Gunshot Detection System (AGDS)48. They carried out a quasi-experimental longitudinal panel study, conducting various experiments over time. Their study shows a substantial increase (80%) in gunshot responses by the system. Thus, the literature survey shows that gunshot detection plays an important role in forensics, the health sector, and security, which motivated us to work in this area.

## Methodology

### Proposed approach

This section describes in detail the methodology used to classify gunshots. To the best of our knowledge, the proposed approach has not been used earlier to classify audio and is thus considered novel. The approach consists of the following steps:

• Loading the audio samples: To load the .mp3 shot samples, we used the Librosa library. It produces a NumPy array of length 21,624 or 22,200 for a one-second sample. To make the length equal to the nearest perfect square (22,500 = 150 × 150), since the deep learning models used here work on such lengths, we padded it with an array of −1 values.

• Extracting features from the audio: Mel-frequency cepstral coefficients (MFCCs): The cepstrum helps us understand periodic structures in the frequency domain, which gives information related to echoes. In the Mel-frequency cepstrum (MFC), frequency bands are equally spaced on the Mel scale, which matches human auditory perception more closely. MFCCs are the coefficients of the MFC. To obtain the MFCC sequence, we used a sampling rate of 22,500 Hz, 44 output MFCC features, an FFT window length of 2048 samples, 512 samples between successive frames, and the type-2 discrete cosine transform.

• Mel-spectrogram: A spectrogram is a visual representation of how the frequencies of a signal change over time. To the human ear, the frequencies 600 Hz and 1000 Hz may sound different, but 7600 Hz and 8000 Hz sound similar. Because the Mel scale is a non-linear transformation of the frequency scale, pitches (frequencies) that sound less distant also appear less distant on it. In a Mel-spectrogram, the y-axis is the Mel scale and the x-axis is time. The Mel-spectrogram is calculated by splitting the audio signal into windows of 2048 samples with a hop length of 512. A fast Fourier transform is then applied to each window, and the resulting spectrum is separated into frequency bins evenly spaced on the Mel scale; 128 such Mel bins are used. Finally, for each window, the magnitude of the audio signal is decomposed into the components corresponding to these Mel-scale frequencies.

• Vision transformer: For the image classification task, a variation is made to the traditional transformer architecture used in NLP. In our approach, the input is processed by the encoder, and the output of the encoder is flattened and fed to a multilayer perceptron. No decoder is used, as shown in Fig. 2. The approach treats each patch of the image like a token of text.
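The loading, padding, and Mel-scale steps in the list above can be illustrated with a minimal sketch. It makes several assumptions that the text does not state: the padding is appended at the end of the signal, frames are counted with a centred framing convention (one frame per hop plus one), and the Mel conversion uses the common HTK-style formula. The helper names (`pad_to_square`, `num_frames`, `hz_to_mel`) are hypothetical, and random noise stands in for a loaded clip.

```python
import math

import numpy as np

TARGET_LEN = 22_500   # 150 * 150, the padded length given in the text
PAD_VALUE = -1.0      # padding value given in the text
HOP_LENGTH = 512      # samples between successive frames, as in the text


def pad_to_square(signal: np.ndarray, target_len: int = TARGET_LEN) -> np.ndarray:
    """Append PAD_VALUE to a 1-D signal until it has target_len samples.

    Assumption: padding goes at the end; the text does not say where.
    """
    if signal.ndim != 1:
        raise ValueError("expected a 1-D signal")
    if len(signal) > target_len:
        raise ValueError("signal is longer than the target length")
    padding = np.full(target_len - len(signal), PAD_VALUE, dtype=signal.dtype)
    return np.concatenate([signal, padding])


def num_frames(n_samples: int, hop_length: int = HOP_LENGTH) -> int:
    """Frame count under a centred framing convention (assumed)."""
    return 1 + n_samples // hop_length


def hz_to_mel(freq_hz: float) -> float:
    """HTK-style Hz-to-Mel conversion (one common variant, assumed here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


# Stand-in for a loaded one-second clip: random noise, not real audio.
rng = np.random.default_rng(0)
clip = rng.uniform(-1.0, 1.0, size=21_624).astype(np.float32)

padded = pad_to_square(clip)
square = padded.reshape(150, 150)      # the padded signal fits a square grid
frames = num_frames(len(padded))

# Equal 400 Hz gaps shrink on the Mel scale as frequency increases.
low_gap = hz_to_mel(1000) - hz_to_mel(600)
high_gap = hz_to_mel(8000) - hz_to_mel(7600)
print(padded.shape, square.shape, frames, low_gap > high_gap)
```

Under these assumptions the 400 Hz gap between 600 Hz and 1000 Hz spans far more Mel units than the same gap between 7600 Hz and 8000 Hz, which is the perceptual effect the Mel-spectrogram description relies on.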

The transformer takes a sequence of one-dimensional vectors with positional embeddings as input. Therefore, the 2D patches of the image are first flattened into one-dimensional vectors. Then, the positional embedding and the class token are added. As shown in Fig. 3, the encoder consists of many encoder blocks, each built around a multi-headed self-attention mechanism. The output of the encoder is normalized and sent to dense layers for image classification, as shown in Fig. 3. The model is inspired by Dosovitskiy et al.3.
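The patch flattening and token construction described above can be sketched in NumPy. This is an illustration only, not the authors' implementation: it uses the $$224 \times 224$$ input and $$32\times 32$$ patch size reported for the model in this section, while the embedding width `EMBED_DIM`, the random projection, and the random positional embeddings are hypothetical placeholders for learned parameters.

```python
import numpy as np

IMAGE_SIZE = 224   # input resolution reported in this section
PATCH_SIZE = 32    # patch size reported in this section
CHANNELS = 3
EMBED_DIM = 64     # hypothetical width, kept small for the sketch


def image_to_patches(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    grid_h, grid_w = height // patch_size, width // patch_size
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, channels)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (grid_h, grid_w, p, p, C)
    return patches.reshape(grid_h * grid_w, patch_size * patch_size * channels)


rng = np.random.default_rng(0)
image = rng.random((IMAGE_SIZE, IMAGE_SIZE, CHANNELS), dtype=np.float32)

patches = image_to_patches(image, PATCH_SIZE)

# Random stand-ins for parameters that would be learned during training.
projection = rng.normal(0.0, 0.02, size=(patches.shape[1], EMBED_DIM))
class_token = np.zeros((1, EMBED_DIM))
position_embedding = rng.normal(0.0, 0.02, size=(patches.shape[0] + 1, EMBED_DIM))

# Project each patch, prepend the class token, add positional information.
tokens = np.concatenate([class_token, patches @ projection], axis=0)
tokens = tokens + position_embedding
print(patches.shape, tokens.shape)
```

With these sizes the image yields 49 patches of 3072 values each, so the encoder receives 50 tokens once the class token is prepended.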

Pre-trained models have proven successful in many studies. After identifying the vanishing gradient problem, researchers proposed residual connections in the Resnet50 paper49. In our experiments, across several training runs, we found that even the Resnet50 model overfits quickly. We tried using Dropout with a high rate, but a very large gap between training and validation loss remained. As shown in50, transformers are better at handling such situations.

Among the various available vision transformer architectures, we used the L32 model for this research. The input to the transformer is a 3-channel (RGB) image of size $$224 \times 224$$. The image is divided into patches of shape $$32\times 32$$. Each patch is given as input to a linear transformation layer, which converts it into a fixed-length vector. The vector of each patch is given its position information by adding a position embedding. As our problem reduces to image classification, a class descriptor token is also added to the patches. We pass the combined sequence to the transformer encoder. In the L32 model, the input passes through 24 self-attention encoder blocks for feature extraction. To suit our problem, we added two dense layers with 400 and 100 nodes, respectively. Both layers use ReLU activation. The dense layers are followed by a softmax layer, which classifies the input into three classes: handgun, rifle, and none. The L32 model we chose is pre-trained on the ImageNet dataset. We fine-tune the model keeping every layer trainable. The Adam optimizer is used for fine-tuning with a learning rate of $$1\times 10^{-3}$$, scheduled with a warmup phase followed by a linear decay, and categorical cross-entropy is used as the loss function.
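A warmup phase followed by a linear decay can be sketched as a simple function of the training step. The sketch assumes that $$1\times 10^{-3}$$ is the peak learning rate reached at the end of warmup; the warmup length and total step count below are hypothetical, since the text does not report them.

```python
PEAK_LR = 1e-3        # value given in the text, assumed to be the peak rate
WARMUP_STEPS = 500    # hypothetical: not reported in the text
TOTAL_STEPS = 10_000  # hypothetical: not reported in the text


def learning_rate(
    step: int,
    peak_lr: float = PEAK_LR,
    warmup_steps: int = WARMUP_STEPS,
    total_steps: int = TOTAL_STEPS,
) -> float:
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step >= total_steps:
        return 0.0
    remaining = total_steps - step
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 250, 500, 5_000, 10_000):
    print(step, learning_rate(step))
```

Starting from a small rate limits large early updates to the pre-trained weights, and the decay lets the fine-tuned model settle towards the end of training.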

In this paper, we use multi-head attention rather than a single attention function over $$d_{model}$$-dimensional keys, values, and queries, following Vaswani et al.43.
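As an illustration of multi-head self-attention in the sense of Vaswani et al.43, the following NumPy sketch projects the tokens to queries, keys, and values, splits them into heads, applies scaled dot-product attention per head, and merges the heads. The sizes (`SEQ_LEN`, `D_MODEL`, `NUM_HEADS`) and the random weights are hypothetical and much smaller than in the L32 model; biases, dropout, and layer normalization are omitted.

```python
import numpy as np

SEQ_LEN, D_MODEL, NUM_HEADS = 50, 64, 4  # hypothetical sizes for the sketch


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention with num_heads parallel heads."""
    seq_len, d_model = x.shape
    if d_model % num_heads:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    context = weights @ v                                # (heads, seq, d_head)
    merged = context.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ w_o, weights


rng = np.random.default_rng(0)
tokens = rng.normal(size=(SEQ_LEN, D_MODEL))
# Random stand-ins for the learned projection matrices.
w_q, w_k, w_v, w_o = (rng.normal(0.0, 0.1, size=(D_MODEL, D_MODEL)) for _ in range(4))

output, attention = multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, NUM_HEADS)
print(output.shape, attention.shape)
```

Each head attends over the whole token sequence in a lower-dimensional subspace, so different heads can capture different relationships between patches.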

## Experimental setup

### Dataset collection and pre-processing

The dataset used in this work contains short gunshot audio clips of 1 s each. First, 200 videos containing multiple gunshots were collected from YouTube. Of these, 101 videos contain “rifle” shots and 99 contain “handgun” shots. A gunshot usually lasts less than 1 s, so the remainder of each clip is audio taken randomly from before and after the shot.

To segment the 1-s shot audio, manual annotations were refined to millisecond precision using a video player. Thereafter, the ffmpeg tool was used to crop the annotated audio segments from the videos. We kept one to several shots from each video, provided they were different. In each audio clip, the background noise padding is at a different position.
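The cropping step can be sketched as a small helper that builds an ffmpeg command for one annotated segment. The function name, file names, and start time below are hypothetical; the text does not give the exact command used. The sketch relies only on standard ffmpeg options: `-ss` (start position), `-t` (duration), `-i` (input file), and `-vn` (drop the video stream).

```python
def build_crop_command(
    video_path: str,
    start_seconds: float,
    output_path: str,
    duration_seconds: float = 1.0,
) -> list[str]:
    """Build an ffmpeg argument list that extracts one audio segment."""
    if start_seconds < 0:
        raise ValueError("start_seconds must be non-negative")
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return [
        "ffmpeg",
        "-ss", f"{start_seconds:.3f}",    # seek to the annotated start (ms precision)
        "-t", f"{duration_seconds:.3f}",  # keep a 1-s segment by default
        "-i", video_path,
        "-vn",                            # audio only
        output_path,
    ]


# Hypothetical annotation: a shot starting 12.345 s into a video.
command = build_crop_command("rifle_video.mp4", 12.345, "rifle_0001.mp3")
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where ffmpeg is installed; the sketch itself only constructs the arguments.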

To extract clips of random noise containing no gunshot (the “none” class), we reused the annotations made for the gunshot clips: we simply skipped the segments where gunshots were present, as these had already been marked manually to create the other two classes. A total of 322 noise audio samples were obtained.

Finally, all the audio clips were manually verified, and any clip that could not be classified was removed. Each clip in the rifle and handgun classes contains only one gunshot, from either a handgun or a rifle. The dataset contains a total of 1661 audio clips in .mp3 format, comprising 649 handgun shots, 693 rifle shots, and 322 none (random noise) samples.

Each audio file in every class is split into fixed-length frames in order to provide the network with sufficient and relevant data. As a first step, we divided the original audio files into training samples (70% of the original data) and test samples (30% of the original data). This ensures that the network is not evaluated on data previously used to train it, which would hide overfitting and give misleadingly optimistic results. We then performed K-fold cross-validation with K = 5 to test the proposed network effectively, as shown in Table 2.
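The 70/30 split and the 5-fold cross-validation can be sketched with index lists. Two assumptions are made here: the folds are drawn from the training portion (the text does not say whether cross-validation used the training portion or the whole dataset), and the split is a plain random shuffle without stratification by class. The sample count `N_SAMPLES` is a hypothetical round number, not the dataset size.

```python
import random

N_SAMPLES = 1000  # hypothetical count, chosen only for the sketch


def train_test_split(n_samples: int, test_fraction: float = 0.3, seed: int = 0):
    """Shuffle sample indices and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(indices: list[int], k: int = 5):
    """Return k (train, validation) index pairs over the given indices."""
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, validation))
    return splits


train_idx, test_idx = train_test_split(N_SAMPLES)
splits = k_fold(train_idx, k=5)
for fold_number, (training, validation) in enumerate(splits, start=1):
    print(fold_number, len(training), len(validation))
```

Every training index appears in exactly one validation fold, so each sample is used for validation once across the five runs.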

### Comparison with other datasets and system configuration

We test our method on two additional available datasets to confirm the efficacy of the proposed method. (1) TRECVID Gunshot Videos (TREC): we have 57 videos of gunshots from the TRECVID SIN task51. (2) UrbanSound Gunshot Videos (Urban): from the UrbanSound dataset52, we extracted 117 audio files that include gunshots. We ran experiments on these two datasets, and Table 4 reports the comparison of testing accuracy for each dataset.

All the experiments were conducted in Google Colaboratory. The system provided there has an NVIDIA K80 GPU with 12 GB of VRAM, an Intel Xeon processor with two cores, 12 GB of RAM, and 25 GB of disk space. All the implementations were done in Python 3.

## Results and discussion

This section describes the results. Table 1 shows audio classification algorithms employed in the past, with a comparison of their performance by year. Further, we compared our proposed approach with algorithms developed by other authors, as shown in Table 4. To check the generalizability of our proposed approach, we also compared the results on two available datasets: (1) TRECVID Gunshot Videos (TREC) and (2) UrbanSound Gunshot Videos (Urban). Our proposed approach outperformed the state-of-the-art methods on all three datasets, as seen in Table 4. It produced a testing accuracy of 89.0–90.0%. By comparison, zero-shot federated learning produced an accuracy of 83.5–86.0%, the DNN Ensemble model 83.0–84.5%, and the capsule network 82.2–83.6% (Table 4).

So far, CNNs53,54,55 have dominated audio classification tasks. A CNN extracts significant features and edges by applying filters to a section of the data56. This allows the model to learn only the most important elements of the data rather than the fine details. In contrast, our proposed model takes the complete audio data as input, rather than only the sections that the filters can extract (or find relevant). This is one reason why our proposed approach outperforms the state-of-the-art approaches.

As a baseline, we tried training Resnet50 directly on raw audio signals. When Resnet50 is fine-tuned on the raw audio signal, the model overfits quickly. Training accuracy at the 50th epoch is 99.47%, while validation accuracy remains just 77.78%. At earlier epochs, the validation accuracy is far poorer. The training and validation losses were 0.0471 and 1.488, respectively. We found considerable variation in training and validation loss and in classification accuracy. MFCC and Mel-spectrogram features were then tested both individually and combined. The combined features gave better classification accuracy, so we continued with the combined feature as our input.

We fine-tuned Resnet50 on the MFCC and Mel-spectrogram features. As shown in Fig. 4, the Resnet50 model still shows much more variation than the Vision Transformer. The training accuracy and training time for this model are 98.88% and 18 h, respectively. The maximum validation classification accuracy was 93.87% for both Resnet50 and the Transformer (Table 3). However, at this accuracy the vision transformer has a training loss of 0.2509 and a validation loss of 0.1991, whereas Resnet50 shows large variation in training loss ($$4.0\times 10^{-4}$$ to 0.04) and in validation loss (0.2768–1.538).

The classification accuracy of the vision transformer on the testing dataset ranges between 89.0 and 90.0% across different training and testing experiments (Table 4). By comparison, the accuracy of Resnet50 ranges from 84.0 to 87.0% (Table 4). We split the dataset into training and testing sets (the terms validation and testing are used interchangeably), because our dataset size prevents us from dividing the available audio into separate training, validation, and testing sets. During training, we used the training and validation data and trained the model for a fixed number of epochs. For the transformer, the best model is obtained at about 100 epochs of fine-tuning, whereas Resnet50 starts to overfit quickly beyond approximately 50 epochs.

We trained and validated models multiple times. We performed 5-fold cross-validation as shown in Table 2.

Interestingly, the Vision Transformer, which is reputed to overfit quickly, did not overfit when the MFCC+MelSpec feature was passed in the form of an image, but it overfitted when raw audio was given as input. Resnet50 worked well with raw audio but overfitted when the MFCC+MelSpec features were passed as images.

While training on both raw audio and extracted features, we observed that Resnet50 and the Vision Transformer learned their own features. Note that the recording devices and the environment (echo) differed across clips, and background noise was present.

### Vision transformer versus CNN

Vision transformers have shown remarkable performance in several computer vision-based tasks. These architectures work on multi-head self-attention mechanisms that can accept a sequence of image patches to encode contextual cues.

We are intrigued by the fundamental differences between the operation of convolution and self-attention, which have not been extensively explored in the context of robustness and generalization. It is known that convolutions learn local interactions between elements in the input domain. In contrast, self-attention has been shown to learn global interactions effectively, for example, relationships between far-off objects57,58. Given a query embedding, self-attention finds interactions with the other embeddings in the sequence, thereby conditioning on the local content while modeling global relationships59. Convolutions, on the other hand, are content-independent, as the same filter weights are applied to all inputs regardless of their distinct nature. In this paper, our analysis illustrates that vision transformers can adjust their receptive field to cope with noise in the data and improve the expressivity of the representations.

## Conclusion

This paper examined a vision transformer-based approach for audio-based gun-type identification. Features such as MFCC and Mel-spectrogram were experimented with, as previous research suggested. The Vision Transformer was found to work better in terms of the closeness of training and validation loss, thus giving a better-fitting model. The results indicate that, although only a coarse gun audio classification is done in this paper, these techniques can be employed to classify various handguns and rifles based on their shot audio.

Collecting the dataset for such a project is very difficult. It has both legal and financial issues. However, such projects are necessary.

Although our dataset captures gunshot audio in different environments, the audio filters plausibly applied to the videos would have distorted the original audio signal. Because of this distortion, some critical information is missing from the audio. For this reason, we limited the research to classifying gun types only. Had the audio been recorded using the same device, with no audio filter, and in various environments, we could have classified different handguns and rifles based on shot audio. Some attackers use sound suppressors; such audio may also be classifiable.

To extend the scope of the audio-based gun identification task, the first step will be to collect raw gunshot audio. For each gun among the various types of handguns and rifles, with and without suppressors, multiple shots in different environments must be collected. As mentioned in the limitations above, this step requires the support of legal authorities as well as monetary support.

After dataset collection, we can train a model that classifies different types of guns based on the audio of their shots. An audio input device can then be attached to CCTV cameras so that any gunshot is recognized. In this way, such situations can be attended to quickly, bypassing human intervention, which usually delays the response and allows the damage to intensify.