Towards audio-based identification of Ethio-Semitic languages using recurrent neural network

In recent times, there has been increasing interest in employing technology to process natural language with the aim of providing information that benefits society. Language identification refers to the process of detecting which language a speaker is using. This paper presents an audio-based Ethio-Semitic language identification system using a Recurrent Neural Network (RNN). Identifying features that can accurately differentiate between languages is a difficult task because of the very high similarity among these languages. An RNN was used in this paper together with Mel-frequency cepstral coefficient (MFCC) features to bring out the key characteristics that help provide good results. The primary goal of this research is to find the best model for identifying Ethio-Semitic languages such as Amharic, Geez, Guragigna, and Tigrigna. The models were tested on an 8-h collection of audio recordings. Experiments were carried out on our unique dataset with two extended versions of the RNN, Long Short-Term Memory (LSTM) and Bidirectional Long Short-Term Memory (BLSTM), using segment durations of 5 and 10 s, respectively. According to the results, BLSTM with a 5 s segment duration outperformed LSTM. The BLSTM model achieved average results of 98.1%, 92.9%, and 89.9% for training, validation, and testing accuracy, respectively. We can therefore infer that the best performing method for the selected Ethio-Semitic language dataset was the BLSTM algorithm with MFCC features and 5 s segments.


Bidirectional long short term memory
Bidirectional long short-term memory (BLSTM) is a kind of bidirectional recurrent neural network (BRNN). It was originally proposed in 9 and comprises a combination of two hidden layers that expose separate directions to the same output. The output layer can thus gather information from both backward (past) and forward (future) states at the same time. Each LSTM is made up of increasingly complex and often linked subnets called "memory cells". These cells, controlled by gates, enable information to carry long-distance dependencies. As a result, fresh values are added to the activation as it progresses through the layers, avoiding the vanishing gradient problem and making LSTM comparable to residual neural networks (ResNets) 13,14.
Numerous studies on language identification (LID) have been undertaken in the past using conventional machine learning and deep learning approaches. The contributions of this study are as follows. There is no publicly available dataset for the LID of Ethio-Semitic languages; in this study, we created the first audio dataset to address this issue, so it may be used for future research and to compare similar works. In addition, we present an efficient RNN model with appropriate hyper-parameters. Finally, we achieved state-of-the-art LID accuracy for Ethio-Semitic languages using the proposed method.
The rest of the paper is structured as follows: Section "Related work" presents a detailed description of the related work, including the gaps and limitations addressed in this study. Section "Methodology" provides a detailed description of the proposed methodology and the dataset analyzed in this study. The experimental findings and analysis are provided in Section "Results and discussion". Finally, we present the concluding remarks in Section "Conclusion and future work".

Related work
In an attempt to identify the gap in knowledge in this study area, we examined past research that used conventional machine learning and deep learning approaches. This enabled us to identify the knowledge gaps, existing datasets and their sizes, and the methodologies and models employed by various authors. Wondimu and Tekeba 3 used a Gaussian mixture model (GMM) in a LID system for four Ethiopian languages (Amharic, Oromiffa, Guragegna, and Tigrigna). The study employed the MFCC feature extraction approach, with GMM used for classification. There was no fixed duration or segment size for audio splitting in the study. For the four languages, the average accuracy of the utterance-dependent LID test was 93%; the average accuracy of the utterance-independent test was around 70%; and the average accuracy of the speaker-independent test, which was only evaluated in the utterance-dependent scenario, was around 91%.
Athiyaa et al. 4 suggested using a Gaussian Mixture Model (GMM) and Mel-Frequency Cepstral Coefficient (MFCC) features of the speech waveform to discriminate between audio signals in Tamil and Telugu, two South Indian languages. The suggested approach trains the GMM on MFCC features extracted from speech waveforms. With an additional blended feature, the proposed spoken language recognition method improves the accuracy for both languages.
The study by Gonzalez-Dominguez et al. 12 showed how LSTM RNNs effectively exploit temporal correlations in audio data by learning appropriate features for language recognition. The suggested method was compared to i-vector and feed-forward Deep Neural Network (DNN) baselines on the NIST Language Recognition Evaluation 2009 dataset. Despite having far fewer parameters, their results demonstrate that LSTM RNNs outperformed the DNN system. Additionally, the fusion of the various technologies enabled significant gains of up to 28%.

Methodology
This section describes the methods used to achieve the goals of this research. It includes subsections on data collection and annotation, audio preprocessing, data splitting, feature extraction, model training, and model evaluation. Furthermore, we elaborate on the study's unique focus, the identification of particular Ethio-Semitic languages.

Data collection
There is no publicly available audio dataset for Ethio-Semitic languages. Since there was no prepared data for the selected languages, we created our own dataset by recording directly from native speakers of each language in order to carry out the automatic language identification task. We recorded 10 people per language, with a recording time of 12 min per person. Each corpus is composed of a variety of audio recordings covering different ages, genders, and accents. Each language's recordings total about 120 min (2 h), for an overall total of 480 min (8 h). The speech signal has a sampling rate of 44.1 kHz, and each sample is stored as a 16-bit value. The recordings were made using a smartphone.

Proposed system architecture
This section describes the steps undertaken to build the proposed LID model using the architecture in Fig. 1. It primarily focuses on corpus preparation, the overall system architecture, preprocessing methods, preparation of machine-readable datasets (CSV), and model evaluation procedures.

Data pre-processing. After data collection, the first step in the LID system is to pre-process the raw speech data to make it suitable for the study. Pre-processing refers to any processing performed on raw data to get it ready for the next step; it converts the data into a format that can be processed more quickly and efficiently.

Data cleaning. Audio data cleaning is mandatory before any processing step. When collecting recordings, speakers often mix languages, for example Amharic with English or Geez with Amharic. We therefore removed such mixed audio using the pydub package to prepare pure data for the proposed system. The audio files were saved as WAV files.
Silence removal. We used the librosa trim function to remove silence from the speech WAV files. The decibel (dB) measures sound intensity (amplitude). We used a 10 dB threshold, so any portion of the signal that falls below this threshold is removed from the sound file 18. We chose 10 dB after reviewing numerous studies that used the same level; it is a standard value, since the sound must be stronger than the noise, as given by the signal-to-noise ratio (-10 dB ≤ SNR < +10 dB), and speech at or above 10 dB provides adequate quality for speech communication.
Resampling. Our data was originally sampled at 44,100 Hz, which yields a much larger data size than 22,050 Hz. Because of the large amount of data, the dataset was down-sampled from 44,100 Hz to 22,050 Hz. All audio files were also converted from stereo (two channels) to mono (single channel), since a mono channel is better suited to language classification and a sampling rate of 22.05 kHz is sufficient for the basic information 19,20,21,22. Librosa's default settings are a sampling rate of 22,050 Hz, a bit depth of 16 bits, and one channel (mono).

Framing
Framing is the technique of splitting a continuous stream of speech samples into fixed-length units so that the signal can be processed block by block. Speech signals are slowly time-varying, i.e., non-stationary: their properties shift constantly over time, so the speech features cannot be extracted all at once 23. The speech signal is therefore broken down into frames, which are short subdivisions measured in seconds or milliseconds 8. Consecutive frames usually overlap by 30 to 50% to prevent windowing from discarding essential information in the voice signal 24. Different segment lengths can be applied, namely 30 s, 20 s, 10 s, 5 s, and 3 s 25. We selected the intermediate lengths of 5 s and 10 s and segmented each language's audio using the AudioSegment library.
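The 5 s segmentation can be sketched with plain NumPy (the paper uses an AudioSegment library, but the arithmetic is the same): at 22,050 Hz a 5 s segment is 110,250 samples.

```python
import numpy as np

sr = 22050
segment_len = 5 * sr            # 5 s -> 110,250 samples per segment
y = np.random.randn(23 * sr)    # stand-in for a 23 s recording

# Keep only complete segments and stack them into a (n_segments, segment_len) array.
n_segments = len(y) // segment_len
segments = y[: n_segments * segment_len].reshape(n_segments, segment_len)
print(segments.shape)  # (4, 110250): a 23 s clip yields four full 5 s segments
```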

Feature extraction
Feature extraction refers to the method of converting a raw speech signal into a collection of acoustic feature vectors that carry speaker-specific information 26. Using the Librosa package, we extracted useful components from the audio data 27,28. There are different feature extraction techniques, such as Chroma STFT, Mel-frequency cepstral coefficients (MFCCs), mel-spectrogram, spectral contrast, and Tonnetz 27,28. An audio speech segment has numerous features that can vary from language to language; these can be incorporated into various LID system designs, each with a different level of complexity and outcome 29. Several feature extraction techniques exist for language classification 30. The proposed system employs an acoustic-phonetic feature type, namely MFCC, which is widely used for spoken language identification.

Mel-frequency cepstral coefficients feature
Mel-frequency cepstral coefficients (MFCCs) are the most common feature extraction technique for representing speech signal data and expressing acoustic signals as cepstral coefficients for various applications 16. MFCC is a well-known feature used to describe speech signals. Since the technique for computing MFCCs is based on short-term analysis, an MFCC vector is generated from each frame. MFCCs are based on the speech processing performed by the human ear and on the speech signal's cepstrum 30,31. Based on the study in 21, we set the number of MFCC coefficients to 20, since the most important information is found within the first 13 to 20 coefficients 8,21,32. The process of extracting MFCC features is shown in Fig. 2.

A. Windowing
Windowing is the first step in extracting MFCC features from the framed signals; it converts an infinite-duration signal into a finite-duration signal 33. The signal should be attenuated to zero, or very close to zero, at the start and end of each frame in order to reduce discontinuities in the voice signal. Windowing each frame of the signal therefore minimizes these discontinuities and increases the correlation of the Mel Frequency Cepstrum Coefficients (MFCCs) 24.
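As an illustration of this step (a Hamming window is a common choice here, though the text does not name the specific window used):

```python
import numpy as np

frame = np.random.randn(400)       # one 25 ms frame at 16 kHz, for illustration
window = np.hamming(len(frame))    # tapers toward zero at both edges

# Element-wise multiplication attenuates the frame boundaries,
# reducing the discontinuities between consecutive frames.
windowed = frame * window
```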

C. Mel-scale
In this stage, the projected spectra from the previous phase are mapped on the Mel scale to produce an approximation of the energy present at each point using a triangle overlapping window, also known as a triangular filter bank 8,21,22,33,34 .

D. Discrete cosine transform
This is the final stage, converting the given finite-duration sequence into a discrete vector by calculating coefficients from the provided log Mel spectrum. The Discrete Cosine Transform (DCT) is preferred for the coefficient calculation because it compacts significant energy content into its outputs. Finally, the result of applying the DCT is referred to as the MFCC vector 8,23,30. The DCT is a linear transformation that provides the cepstral coefficients (MFCCs) 16. A sample MFCC vector feature is shown in Fig. 3.
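A sketch of the DCT step on hypothetical log mel filter-bank energies (the 40-band input and the choice to keep 20 coefficients are illustrative):

```python
import numpy as np
from scipy.fft import dct

# Hypothetical log mel filter-bank energies for a single frame (40 bands).
log_mel = np.log(np.random.rand(40) + 1e-6)

# DCT-II compacts most of the energy into the leading coefficients;
# keeping the first 20 yields the MFCC vector for this frame.
cepstral = dct(log_mel, type=2, norm='ortho')
mfcc_frame = cepstral[:20]
```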

Model training
Long-Short Term Memory (LSTM) and Bidirectional Long-Short Term Memory (BLSTM) models were used to develop the proposed system.The MFCC features are fed into the proposed recurrent neural networks as input.
The acquired audio dataset was used to train the models.
Proposed LSTM model. The LSTM model is well suited to sequential data; therefore, we used the sequential MFCC feature values. We fed the MFCC feature vectors directly into the LSTM model 36 and ran one experiment with this model using the specified feature extraction technique. Figure 4 depicts the proposed LSTM model.
Proposed BLSTM model. Experimentally, the BLSTM model, like the LSTM model, is well suited to sequential data. We fed the same sequential MFCC feature vectors directly into the BLSTM model and ran one experiment with the MFCC features in this model, using 64 and 128 units in the two bidirectional layers, respectively. Figure 5 depicts the proposed BLSTM model.
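A hedged Keras sketch of such a stacked BLSTM classifier: the 64 and 128 units follow the text, while the input shape, dropout rate, and optimizer are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Bidirectional, Dense, Dropout, Input

n_frames, n_mfcc, n_classes = 216, 20, 4  # ~5 s of MFCC frames, 20 coefficients, 4 languages

model = Sequential([
    Input(shape=(n_frames, n_mfcc)),
    Bidirectional(LSTM(64, return_sequences=True)),  # first BLSTM layer (64 units)
    Bidirectional(LSTM(128)),                        # second BLSTM layer (128 units)
    Dropout(0.3),                                    # illustrative dropout rate
    Dense(n_classes, activation='softmax'),          # one probability per language
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# A forward pass on two random MFCC "segments" yields per-language probabilities.
probs = model.predict(np.random.randn(2, n_frames, n_mfcc), verbose=0)
```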

Model evaluation metrics
Depending on the number of classes, there are two types of classification problems: binary classification, with only two classes, and multi-class classification, with more than two classes 37,38. The confusion matrix is a visual representation of actual versus predicted class values: the actual class indicates the real classification of each language, while the predicted class gives the predictions of our trained models. The confusion matrix evaluates the ability of an algorithm to place samples in the appropriate classes 2. This study used multi-class classification to evaluate the proposed models on four languages, with performance measured by accuracy, precision, recall, and F1 score, given by Eqs. (1)-(4),
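Equations (1)-(4) are not reproduced in this excerpt; their standard per-class forms, consistent with the TP/TN/FP/FN notation used in the text, are:

```latex
\mathrm{Accuracy}  = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)
\mathrm{Precision} = \frac{TP}{TP + FP} \quad (2)
\mathrm{Recall}    = \frac{TP}{TP + FN} \quad (3)
F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (4)
```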
where, TP, TN, FP and FN represent true positive, true negative, false positive and false negative respectively.

Experimental setup
After feature extraction, we split the dataset into training, validation, and testing sets to train, validate, and evaluate the models. Using the train_test_split method from the Scikit-learn library, we divided the data into a 70% training set, a 15% test set, and a 15% validation set. This section presents the models used and the experiments performed, along with the results used to evaluate how well the system performs relative to other systems. To accomplish this, we first describe the dataset used to build our models, then explain the results of the implementation, and finally evaluate those results using the four metrics in Eqs. (1)-(4). A Bayesian hyper-parameter tuning algorithm was used to set the hyper-parameter values of our neural network (NN) models; it searches for a vector of hyper-parameters that works well in the problem domain in order to determine which configuration performs effectively. Hyper-parameters including batch size, learning rate, and dropout rates were optimized. We used the hyper-parameters presented in Table 1.
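The 70/15/15 split described above can be sketched with two calls to `train_test_split` (the sample counts and random seed below are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 216, 20)        # stand-in MFCC feature tensors
y = np.random.randint(0, 4, size=1000)    # labels for the four languages

# First hold out 30%, then split that half-and-half: 70% train, 15% val, 15% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=42, stratify=y_tmp)
print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```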

Human ethics statements
The authors confirm that all experiments were performed in accordance with the guidelines and regulations of the Declaration of Helsinki, that informed consent was obtained from all participants, and that all experimental protocols were approved by the institutional research ethics committee of the authors' institution, whose members include staff of the College of Informatics, University of Gondar, Ethiopia, namely: Tsehaye Wasihun, Yigezu Agonafir, Solomon Zewdie, Sied Hassen, and Ibrahim Gashaw.

Results and discussion
This section presents the implemented models, the experiments and results produced to evaluate the models' performance, and a comparison of the various deep learning methods. Comparing performance at durations of 5 s and 10 s, we observed that the 5 s duration outperforms the 10 s duration. The experimental results using LSTM with MFCC features are presented in Table 2, and those using BLSTM with MFCC features in Table 3. For 5 s, Guragigna achieved a precision, recall, and F1-score of 90%, 92%, and 91%, respectively, while for 10 s it achieved 82%, 86%, and 84%. It can therefore be deduced that the 5 s length was superior to 10 s.
As shown in Table 3, the developed BLSTM model using MFCC features achieved recalls between 80 and 95% for 10 s segments and between 90 and 97% for 5 s segments. Consequently, when the 5 s and 10 s results are compared, the former surpasses the latter, so 5 s was again preferred over 10 s.
As illustrated in Fig. 6, BLSTM outperforms LSTM for the LID models when employing the MFCC features. We can therefore infer that BLSTM is more effective than LSTM for building LID models. With averages of 93.5%, 93.25%, and 93.25% for precision, recall, and F1-score, respectively, we achieved better results than existing systems. In general, for the two proposed models, the results at 5 s were better than at 10 s. The reason is that evaluating speech over a short time frame yields more stable acoustic features.

Confusion matrix of the proposed models
Using the confusion matrix, we observed where the models become confused when making predictions for each language. We discuss the proposed models' confusion matrices for the MFCC feature.

Conclusion and future work

Conclusion
This paper presented the development of a language identification (LID) system for Ethio-Semitic languages using Recurrent Neural Networks (RNN). We developed our own corpus in order to implement the proposed system.
B. Fast Fourier transform

The Fast Fourier Transform (FFT) converts the samples in each frame from the time domain to the frequency domain. The FFT is a fast algorithm for computing the Discrete Fourier Transform (DFT) 33,34.

Figure 7 describes the confusion matrices, in which the models are examined on the test data set by counting the number of correctly and incorrectly classified test samples within each class. Figure 7(a,b) displays the number of samples classified as True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) in each of the four classes. Geez is taken as an illustration: it has a true positive value of 210, a false positive value of 14, a true negative value of 627, and a false negative value of 7. Thus, 210 samples are correctly classified as Geez, while 1 sample is misclassified as Amharic, 4 samples as Guragigna, and 5 samples as Tigrigna. Furthermore, Fig. 8 clearly illustrates the accuracy and cross-entropy (loss) performance of the LSTM model during the training and validation phases. At epoch 35, the training and validation accuracies are 92.50% and 89.16%, respectively, and the LSTM architecture's training and validation losses are 0.0673 and 0.3820, respectively.
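The per-class TP/FP/TN/FN counts discussed above can be read off a scikit-learn confusion matrix; the toy labels below are hypothetical, not the paper's test data:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted language indices for eight test clips
# (0=Amharic, 1=Geez, 2=Guragigna, 3=Tigrigna).
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 1, 1, 1, 2, 3, 3, 3])

cm = confusion_matrix(y_true, y_pred)  # rows = actual class, columns = predicted class

# Per-class counts: TP on the diagonal; FP/FN from column/row sums minus the diagonal.
tp = np.diag(cm)
fp = cm.sum(axis=0) - tp
fn = cm.sum(axis=1) - tp
tn = cm.sum() - (tp + fp + fn)
```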

Figure 9 presents a visual display of the accuracy and cross-entropy (loss) performance of the BLSTM classifier during the training and validation periods. At epoch 35, the observed training and validation accuracies are 94.5% and 90.33%, respectively, and the BLSTM architecture's training and validation losses are 0.0187 and 0.4231, respectively. The BLSTM architecture surpassed the LSTM architecture in both training and validation accuracy. A comparative analysis of the proposed model with existing models is presented in Table 4.

Figure 8. Evaluation metrics of the LSTM model with a duration of 5 s: (a) accuracy, (b) loss.

Figure 9. Evaluation metrics of the BLSTM model with a duration of 5 s: (a) accuracy, (b) loss.

Table 2. Experimental results using LSTM with MFCC as feature.

Table 3. Experimental results using BLSTM with MFCC as feature.

Table 4. Comparison of the proposed model with existing language identification models.