GeneAI 3.0: powerful, novel, generalized hybrid and ensemble deep learning frameworks for miRNA species classification of stationary patterns from nucleotides

Singh, Jaskaran; Khanna, Narendra N.; Rout, Ranjeet K.; Singh, Narpinder; Laird, John R.; Singh, Inder M.; Kalra, Mannudeep K.; Mantella, Laura E.; Johri, Amer M.; Isenovic, Esma R.; Fouda, Mostafa M.; Saba, Luca; Fatemi, Mostafa; Suri, Jasjit S.

doi:10.1038/s41598-024-56786-9

Article
Open access
Published: 26 March 2024

Jaskaran Singh¹,
Narendra N. Khanna²,
Ranjeet K. Rout³,
Narpinder Singh⁴,
John R. Laird⁵,
Inder M. Singh⁶,
Mannudeep K. Kalra⁷,
Laura E. Mantella⁸,
Amer M. Johri⁸,
Esma R. Isenovic⁹,
Mostafa M. Fouda¹⁰,
Luca Saba¹¹,
Mostafa Fatemi¹² &
…
Jasjit S. Suri¹³

Scientific Reports volume 14, Article number: 7154 (2024)

Abstract

Due to the intricate relationship between the small non-coding ribonucleic acid (miRNA) sequences, the classification of miRNA species, namely Human, Gorilla, Rat, and Mouse, is challenging. Previous methods are not robust and accurate. In this study, we present AtheroPoint’s GeneAI 3.0, a powerful, novel, and generalized method for extracting features from the fixed patterns of purines and pyrimidines in each miRNA sequence in ensemble paradigms in machine learning (EML) and convolutional neural network (CNN)-based deep learning (EDL) frameworks. GeneAI 3.0 utilized five conventional (Entropy, Dissimilarity, Energy, Homogeneity, and Contrast) and three contemporary (Shannon entropy, Hurst exponent, Fractal dimension) features to generate a composite feature set from given miRNA sequences, which was then passed into our ML and DL classification framework. A set of 11 new classifiers was designed, consisting of 5 EML and 6 EDL, for binary/multiclass classification. It was benchmarked against 9 solo ML (SML), 6 solo DL (SDL), and 12 hybrid DL (HDL) models, resulting in a total of 11 + 27 = 38 models. Four hypotheses were formulated and validated using explainable AI (XAI) as well as reliability/statistical tests. The order of the mean performance using accuracy (ACC)/area-under-the-curve (AUC) of the 24 DL classifiers was: EDL > HDL > SDL. The mean performance of EDL models with CNN layers was superior to that without CNN layers by 0.73%/0.92%. Mean performance of EML models was superior to SML models with improvements of ACC/AUC by 6.24%/6.46%. EDL models performed significantly better than EML models, with a mean increase in ACC/AUC of 7.09%/6.96%. The GeneAI 3.0 tool produced expected XAI feature plots, and the statistical tests showed significant p-values. Ensemble models with composite features are highly effective and generalized models for classifying miRNA sequences.

Introduction

MicroRNAs (miRNAs) are short RNA molecules that play a crucial role in regulating gene expression^1,2. Typically consisting of 20–25 nucleotides, they are formed through the transcription of longer RNA molecules by cellular enzymes. By binding to target messenger RNA (mRNA), miRNAs can inhibit mRNA’s translation, thereby controlling the expression of specific genes. This mechanism influences various biological processes such as proliferation³, apoptosis⁴, development^5,6, and differentiation⁷. Disruptions in miRNA expression have been associated with diseases like cancer^8,9,10 and cardiovascular disease^11,12,13. Accurately classifying miRNA sequences based on their origin^14,15,16 is crucial due to the diverse roles that miRNA sequences play in disease development across different species^17,18,19. This classification enables the identification of conserved miRNA sequences and their target genes, contributing to a better understanding of miRNA function and the detection of potential threats^20,21,22.

Machine learning has been applied in numerous bioinformatics studies^{23,24,25,26,27,28,29,30,31,32,33}, and several tools have gained attention in the field of miRNA identification. These tools include Mipred²⁵, Triplet³⁴, HeteroMirPred³⁵, micropred³⁶, PlantMiRNAPred³⁷, and mirnaDetect³⁸. They have the ability to extract pre-miRNAs from protein-coding regions that exhibit stem-loop structures similar to genuine pre-miRNAs but have not been identified as such. In addition, numerous computational methods have been developed to enhance miRNA identification. These methods include MatureBayes³⁹, MiRMat⁴⁰, MiRRim2⁴¹, MiRdup⁴², MaturePred⁴³, MiRPara⁴⁴, mirExplorer⁴⁵, Matpred⁴⁶, and MiRduplexSVM⁴⁷. MiRNA identification can be performed using de novo methods, which involve computational tools, or by utilizing next-generation sequencing data^48,49. These methods focus on identifying pre-miRNA sequences that exhibit hairpin-like structures in the input data. They are categorized based on expression-based features or computed sequences.

The intricate nonlinear nature of miRNA sequences poses challenges for these methods, primarily due to the high-dimensional feature spaces associated with the sequences⁵⁰. To address these challenges, primitive methods like ensemble ML (EML) methods that employ voting mechanisms^51,52,53 have been introduced. This was followed by deep learning (DL) models, such as Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM)^54,55. DL models have the capability to capture the nonlinear complexity of miRNA sequences, making them well-suited for characterization and classification tasks^54,55,56,57. Despite the promising results achieved by solo DL (SDL) models in miRNA classification, they often require large labeled datasets and are susceptible to overfitting⁵⁸, which hinders their generalization capabilities⁵⁹. To further improve classification performance, hybrid DL (HDL) and ensemble DL (EDL) models have been proposed^60,61. These models leverage the strengths of multiple DL architectures^62,63,64.

Extracting additional features from miRNA sequences is a valuable strategy for overcoming the aforementioned limitations. Although features like k-mer frequency and dinucleotide composition effectively capture sequence-specific details^65,66,67, they have inherent limitations in extracting comprehensive information. To address these challenges, conventional features such as Energy, Contrast, and Entropy can be employed to capture structural characteristics^68,69,70,71. Additionally, contemporary features like Shannon Entropy and Hurst Exponent can be derived to obtain additional insights. By combining both sequence-specific and structural features into a composite feature set, the effectiveness of DL models can be further enhanced, resulting in a more robust approach. Further, incorporation of CNN layers in this paradigm enhances classification by capturing local patterns and spatial dependencies. Hence usage of CNN-based EDL models with extracted composite features is paramount in building a robust and state-of-the-art framework for miRNA classification.

In the spirit of improving species classification by employing EDL and EML classifiers, along with novel composite feature extraction, we built an extensive set of ensemble-based AI classifiers, focusing on four main hypotheses. First, we investigate the benefits of using EML models with voting compared to SML models for miRNA species classification in binary classification (BC) and multiclass classification (MCC) scenarios. Second, we validate the superiority of EDL models over HDL and SDL models. Third, we explore the advantages of incorporating CNN layers in miRNA species classification, comparing them to models without CNN layers. Lastly, we examine the advantage of transitioning from EML models to EDL models in ensemble-based species classification. By introducing composite features and enhancing ensemble learning, our approach brings a fresh perspective to design and improves the reliability of genomic sequence testing. Consequently, it enhances the accuracy of miRNA sequence classification, surpassing previous research that relied solely on statistical techniques.

Figure 1 presents an overall block diagram of GeneAI 3.0 (AtheroPoint LLC, Roseville, CA, USA). With the input of miRNA species data containing gene sequences, the system performs an intensive data preparation (elliptical preprocessing block), which includes binary encoding of the gene sequence, scaling, augmentation using Adaptive Synthetic Sampling Approach for Imbalanced Learning (ADASYN)⁷², and interpolation. It then performs an elaborate feature extraction (elliptical feature extraction block), where it derives composite features from the binary miRNA sequence. GeneAI 3.0 then incorporates 38 extensive AI models (classification block): nine SML, five EML, six SDL, twelve HDL and six EDL models, and classifies the species along with performance metrics (performance block), consisting of statistical tests and explainable AI (XAI) graphs.

Our research findings validated the advantages of utilizing EDL models in gene classification by conducting experiments that establish the model order as EDL > HDL > SDL. We have also validated the benefits of EML models over SML models in the miRNA classification. Furthermore, we have assessed the performance improvements achieved using EDL models compared to EML models, as well as the advantages gained from incorporating CNN layers in DL models. Alongside our primary contributions, we have investigated the impact of training data size on the model's performance and validated the reliability and stability of our approach through statistical tests. Finally, we have employed XAI plots to interpret our classification findings and offer insights into species classification.

The paper first elaborates on the methodology, covering the extracted features, employed classifiers, and optimization parameters, along with the experimental protocols, in the “Methodology” section. The results are presented in the “Results” section, while the “Performance evaluation” section provides a performance evaluation with Receiver Operating Characteristic (ROC) curves and the influence of training data size. The “Reliability analysis using statistical tests” section demonstrates reliability using statistical tests, while the “Explainable artificial intelligence” section uses XAI plots to enhance interpretability. The “Discussion” section presents a discussion of the principal findings, a benchmarking with previous studies, and an overview of the study's strengths, weaknesses, and extensions. Finally, the “Conclusion” section concludes the paper.

Methodology

In order to explore the connection between miRNA sequences and their corresponding species, we employed statistical ML and DL models for classification in our methodology. The initial stage involved collecting the primary dataset that would serve as the foundation for classification, ensuring its suitability for utilization in the classifiers. Next, we conducted quality control procedures, including categorical encoding of the miRNA sequences, data scaling, oversampling of the minority class, interpolation of missing sequences, and label encoding of the class labels. We also computed both conventional and contemporary features from the dataset. Subsequently, we meticulously designed the architecture of all the AI models used, along with the hyperparameter tuning approaches, loss functions, and training details employed to train these models. Lastly, we defined the performance metrics and experimental protocols utilized in our study.

Data and data preparation

This study utilized the miRNA Database available at http://www.mirbase.org/ for experimental design, data collection, and discussion purposes. The database encompasses miRNA sequences from various species, including Human, Gorilla, Mouse, and Rat. The dataset used in this study consisted of 2654 Human, 369 Gorilla, 1978 Mouse, and 764 Rat miRNA sequences. A ribonucleic acid (RNA) molecule is composed of a backbone comprising the sugar ribose and phosphate groups. In contrast to deoxyribonucleic acid (DNA), whose sugar is deoxyribose, the ribose in RNA is connected to one of four bases: adenine (A), uracil (U), cytosine (C), or guanine (G). To convert miRNA sequences containing these four bases into binary sequences, we applied a set of rules that mapped each base to a corresponding binary digit^73,74:

$${\text{A}}/{\text{G }} \to { 1} \;{\text{and}}\; {\text{C}}/{\text{U }} \to \, 0$$

(1)

This was done using truncation of sequence^75,76,77. This resulted in four datasets of binary sequences from the four species: Humans, Gorillas, Mouse, and Rat. Table 1 lists the specifications of each dataset.
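The mapping in Eq. (1) can be illustrated with a minimal Python sketch. The function name `encode_mirna` and the optional `length` argument are hypothetical; the source states that sequences were truncated but does not give the truncation length, so none is assumed here.

```python
# Purine/pyrimidine mapping from Eq. (1): A/G -> 1 and C/U -> 0.
PURINE_MAP = {"A": 1, "G": 1, "C": 0, "U": 0}


def encode_mirna(sequence, length=None):
    """Convert an RNA string into a list of binary digits.

    `length` is a hypothetical truncation length; when None, the
    whole sequence is kept.
    """
    bits = [PURINE_MAP[base] for base in sequence.upper()]
    if length is not None:
        bits = bits[:length]  # keep only the first `length` positions
    return bits


print(encode_mirna("AGCU"))              # [1, 1, 0, 0]
print(encode_mirna("AGCUAG", length=4))  # [1, 1, 0, 0]
```

Any base outside A, C, G, and U raises a `KeyError` in this sketch, which makes unexpected symbols visible rather than silently encoding them.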

Table 1 Specifications of the miRNA dataset.

The class labels for the four species (Human, Gorilla, Mouse, and Rat) were encoded between 0 and 3 to be used as target classes in the classifiers. This label encoding was employed to convert the category labels for each species into numerical values, allowing for the application of DL techniques to analyze the relationships between miRNA sequences and the different species.

To facilitate this analysis, six binary class datasets and four multiclass datasets were prepared. These datasets were carefully curated and preprocessed to cover various scenarios among the four species. In the binary class datasets, two species were compared using a binary classification approach. The objective was to accurately differentiate between the two species using the provided dataset features. We created multiple datasets with the aim of achieving generalization^78,79,80,81 in species classification. The purpose behind this initiative was to train our model on a variety of datasets, ensuring its effectiveness in real-life scenarios. This approach allows any gene sequence to be pre-processed, features extracted, and utilized by our model. The binary datasets consisted of the following pairwise species comparisons: Human vs. Gorilla, Human vs. Rat, Human vs. Mouse, Mouse vs. Gorilla, Mouse vs. Rat, and Gorilla vs. Rat. For the multiclass datasets, the methodology used was "one vs. all." Each species was considered as one class, while the other three species were treated as the second class. The four multiclass datasets created were: Human vs. All, Rat vs. All, Gorilla vs. All, and Mouse vs. All. By utilizing these processed datasets, researchers could leverage DL techniques to gain insights into the relationships between miRNA sequences and different species.
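As an illustration of how the ten task definitions described above could be enumerated, the following sketch builds the six pairwise comparisons and relabels a label list for a "one vs. all" task. The helper names are hypothetical, and the order of the generated pairs need not match the order listed in the text.

```python
from itertools import combinations

SPECIES = ["Human", "Gorilla", "Mouse", "Rat"]


def binary_tasks(species):
    """Return every unordered pair of species (4 choose 2 = 6 pairs)."""
    return list(combinations(species, 2))


def one_vs_all_labels(labels, target):
    """Relabel samples: 1 for the target species, 0 for the other three."""
    return [1 if label == target else 0 for label in labels]


pairs = binary_tasks(SPECIES)
print(len(pairs))  # 6
print(one_vs_all_labels(["Human", "Rat", "Human", "Mouse"], "Human"))
# [1, 0, 1, 0]
```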

Data availability/availability of data and materials

Due to its proprietary nature, the supporting data cannot be made openly available but are available from the corresponding author on reasonable request.

Quality control

There is an unbalanced distribution of data points among the various classes in the dataset we acquired for our study. Data size plays a vital role in both generalization and memorization protocols. When data size is low, studies have adopted two types of data augmentation^{74,79,82,83,84,85,86,87}. For image data, augmentation consists of increasing the data size by flipping and rotating the images^{82,83,84,85,86,87}. For point or tabular data, augmentation can be accomplished using SMOTE⁷⁴ or ADASYN protocols⁸⁸. To address the class imbalance, we utilized the ADASYN technique, as depicted in Fig. 1. ADASYN is a method that generates synthetic samples for the minority class, thereby achieving a more balanced distribution of data points among the different classes. This approach is beneficial because imbalanced data can hinder the performance of supervised ML algorithms, which often prioritize the majority class and may exhibit poor performance on the minority classes. By employing ADASYN and balancing the representation of the classes in the dataset, we can enhance the performance of various classifiers. Some examples of classifiers that can benefit from this balanced data include Gradient Descent Boosting⁸⁹, Support Vector Machine (SVM)⁷⁷, and Logistic Regression (LR)⁸⁸.

We also employed linear interpolation to handle missing values within the "Human" class. Linear interpolation is a method that estimates the missing values by assuming a linear relationship between the available data points. By applying linear interpolation to the four instances with missing values, we successfully completed the dataset and ensured the integrity of the data for further analysis.
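A minimal sketch of linear interpolation over a numeric series is given below. It assumes missing entries are marked with `None` and that gaps at either end are filled with the nearest observed value; the source does not describe how missing values were represented or how edge gaps were treated, so both choices are assumptions.

```python
def interpolate_missing(values):
    """Fill None entries by linear interpolation between known neighbours."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("at least one observed value is required")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:        # gap at the start: copy the first known value
            filled[i] = values[right]
        elif right is None:     # gap at the end: copy the last known value
            filled[i] = values[left]
        else:                   # interior gap: weight by distance to neighbours
            w = (i - left) / (right - left)
            filled[i] = values[left] + w * (values[right] - values[left])
    return filled


print(interpolate_missing([0.0, None, 1.0]))  # [0.0, 0.5, 1.0]
```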

Additionally, to improve the performance of our algorithms on the imbalanced dataset, we implemented data scaling techniques^73,74 to standardize the features and ensure their similarity in scale. This enabled faster convergence of the algorithms and enhanced the accuracy of predictions. We specifically employed the Min–Max Scaler method, which rescales the data to a fixed range between 0 and 1. This is achieved by subtracting the minimum value and dividing by the range^73,74. By utilizing this method, we standardized the features and reduced their values, which expedited the training process for both ML and DL models.
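The Min–Max rescaling described above, subtracting the minimum and dividing by the range, can be sketched for a single feature column as follows. The function name is hypothetical, and mapping a constant column to zeros is an assumption made here to avoid division by zero.

```python
def min_max_scale(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant feature: the range is zero, so return zeros (assumption).
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


print(min_max_scale([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```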

Feature representation and composite features extraction

The miRNA sequence S_t consists of four nucleotide bases: A, C, U, and G, which can be arranged in different combinations. The presence of these nucleotides in the miRNA sequence signifies their interdependencies, and through the analysis of their patterns, distinct characteristics can be identified to distinguish between various species. In order to differentiate species based on feature representations of miRNA sequences, we developed an innovative approach to uncover these nucleotide co-occurrences. To demonstrate the possible arrangements of these nucleotides in miRNA gene sequences, we utilized co-occurrence matrices generated through vector combinations, as depicted in the provided Table 2.

Table 2 Possible sets of occurrences of nucleobases A, C, U, and G in an RNA sequence formed by the combination of vectors, where I, J, K, L, M, N, O, and P are the co-occurrence matrices.

In order to gain insights into the inherent patterns of miRNA, it is essential to investigate the co-occurrences of nucleobases and analyze both their stationary and non-stationary patterns. To extract valuable information from these patterns, we employed the widely utilized grey-level co-occurrence matrix⁹⁰, a technique commonly employed in texture analysis and pattern recognition⁹¹. We adopted the same features, namely entropy, contrast, energy, homogeneity, and dissimilarity, as previously published by our group^{92,93,94,95,96}. Such algorithms are being used for tissue characterization in medical imaging^97,98. For each miRNA sequence, we computed multiple co-occurrence matrices, namely I, J, K, L, M, N, O, and P. These matrices captured diverse patterns formed by the nucleobases A, C, U, and G. In Tables (ST1–ST8), we present these co-occurrence matrices, which offer an overview of the different nucleobase arrangements and their corresponding frequencies.

The primary objective of constructing co-occurrence matrices from the miRNA sequence S_t is to analyze the occurrence frequency of specific combinations and offsets of the nucleobases A, C, U, and G. The co-occurrence matrix ${\varvec{XC}}{ }$ has a size of $q$ × 4 for a given offset, where $q$ represents the number of distinct nucleobase combinations found in sequence S_t. Each element in the co-occurrence matrices presented in Tables (ST1–ST8), denoted as the ($l$, $m$)^th position, indicates the frequency of the $l$^th and $m$^th nucleobases occurring in the sequence S_t, which has a length of $n$. This relationship can be mathematically expressed using the following equation:

$${\varvec{XC}} = \sum\limits_{i = 1}^{n} \sum\limits_{j = 1}^{n} \begin{cases} 1, & XC\left( {i,j} \right) = l \wedge XC\left( {i + \Delta i,j + \Delta j} \right) = m \\ 0, & \text{otherwise} \end{cases}$$

(2)

The computation of matrix ${\varvec{XC}}$ is contingent upon the spatial relationship defined by the offset ($\Delta i$, $\Delta j$). These co-occurrence matrices are utilized to analyze the frequency of various combinations of the nucleobases A, C, U, and G in the sequence S_t. In order to extract distinctive and discriminative features, the ${\varvec{XC}}$ matrices are subjected to normalization, resulting in the transformed matrices $\user2{ XC}^{\user2{^{\prime}}}$.

$${\varvec{XC}}^{\prime} = \frac{{\varvec{XC}}}{\sum\nolimits_{l = 0}^{q} \sum\nolimits_{m = 0}^{q} {\varvec{XC}}\left( {l,m} \right)}$$

(3)

Subsequently, the normalized co-occurrence matrix ${\varvec{XC}}^{\user2{^{\prime}}}$ is utilized to compute several properties, which include Entropy, Contrast, Energy, Homogeneity, and Dissimilarity^{95,99,100,101}. The mathematical equations for these properties can be found in Table 3. These properties serve as quantitative measures to characterize different aspects of the co-occurrence patterns captured in the matrix ${\varvec{XC}}^{\user2{^{\prime}}}$. Afterwards, the features outlined in Table 3 are computed for each co-occurrence matrix in Tables (ST1–ST8), and the corresponding feature vectors are presented in Table 4. Consequently, these feature vectors are utilized to construct the final feature set representation, denoted as f_Set, for a miRNA sequence S_t:

$${\mathbf{f}}_{{{\mathbf{Set}}}} \, = \,\left( {{\mathbf{f}}_{{\text{I}}} ,{\mathbf{f}}_{{\text{J}}} ,{\mathbf{f}}_{{\text{K}}} ,{\mathbf{f}}_{{\text{L}}} ,{\mathbf{f}}_{{\text{M}}} ,{\mathbf{f}}_{{\text{N}}} ,{\mathbf{f}}_{{\text{O}}} ,{\mathbf{f}}_{{\text{P}}} } \right).$$
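The pipeline of Eqs. (2) and (3) followed by feature computation can be sketched for a one-dimensional sequence as below. This is an illustration under stated assumptions, not the GeneAI 3.0 implementation: it counts symbol pairs at a single offset rather than reproducing the eight matrices I to P of Table 2, and it uses the textbook texture-analysis definitions of the five properties, since Table 3 is not reproduced here. Energy is taken as the sum of squared entries (some libraries report its square root) and entropy uses base-2 logarithms.

```python
import math


def cooccurrence(seq, symbols, offset=1):
    """Count how often symbol l is followed by symbol m at the given offset."""
    index = {s: k for k, s in enumerate(symbols)}
    q = len(symbols)
    counts = [[0] * q for _ in range(q)]
    for i in range(len(seq) - offset):
        counts[index[seq[i]]][index[seq[i + offset]]] += 1
    return counts


def normalize(counts):
    """Divide every entry by the matrix total so that entries sum to 1."""
    total = sum(sum(row) for row in counts)
    if total == 0:
        raise ValueError("sequence too short for the chosen offset")
    return [[c / total for c in row] for row in counts]


def texture_features(p):
    """Standard texture properties of a normalized co-occurrence matrix."""
    f = {"entropy": 0.0, "contrast": 0.0, "energy": 0.0,
         "homogeneity": 0.0, "dissimilarity": 0.0}
    for l, row in enumerate(p):
        for m, v in enumerate(row):
            if v > 0:
                f["entropy"] -= v * math.log2(v)
            f["contrast"] += v * (l - m) ** 2
            f["energy"] += v ** 2
            f["homogeneity"] += v / (1 + (l - m) ** 2)
            f["dissimilarity"] += v * abs(l - m)
    return f


p = normalize(cooccurrence("AACU", "ACGU"))
features = texture_features(p)
```

Concatenating the five values obtained for each of the eight matrices would give a vector of the form f_Set.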

Table 3 Features extracted from a co-occurrence matrix ${\varvec{XC}}^{\user2{^{\prime}}}$ of miRNA sequence S_t.

Table 4 Extracted feature vectors from the co-occurrence matrices.

Shannon entropy

Shannon Entropy (SE) is a valuable metric for quantifying the information content or uncertainty within a given sequence. It assesses the entropy of information in a Bernoulli process where two possibilities (0/1) occur with a probability of $p$ ^{102,103,104,105}. The SE signifies the degree of uncertainty present in a binary string and can be computed using the following formula:

$$SE = - \mathop \sum \limits_{i = 0}^{1} p_{i} \log_{2} (p_{i} )$$

(4)

where $p_{i}$ represents the probability of observing the value $i$ (0 or 1) in the binary sequence, so that $p_{1}$ = $p$ and $p_{0}$ = 1 − $p$. When $p$ = 0, indicating that the event is impossible, there is no ambiguity, and the SE is 0. Likewise, when $p$ = 1, indicating a certain outcome, the SE is also 0. In the case where $p$ = 1/2, the level of uncertainty is at its highest¹⁰⁶, resulting in an SE value of 1.
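Equation (4) can be checked numerically with a short sketch. It assumes the probabilities are estimated as the relative frequencies of 0 and 1 in the sequence and follows the usual convention that a term with zero probability contributes nothing.

```python
import math


def shannon_entropy(bits):
    """Shannon entropy (base 2) of a binary sequence, as in Eq. (4)."""
    n = len(bits)
    if n == 0:
        raise ValueError("empty sequence")
    p1 = sum(bits) / n          # relative frequency of ones
    se = 0.0
    for p in (1.0 - p1, p1):    # p_0 and p_1
        if p > 0:               # skip zero-probability terms
            se -= p * math.log2(p)
    return se


print(shannon_entropy([0, 1, 0, 1]))  # 1.0
print(shannon_entropy([1, 1, 1]))     # 0.0
```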

Hurst exponent

Hurst Exponent (HE) is a measure that characterizes the autocorrelation properties of a time series¹⁰⁷ and finds applications in applied mathematics. It takes values between 0 and 1, where values in the range of [0, 0.5] indicate negative autocorrelation in the time series^108,109,110. Positive autocorrelation, on the other hand, is indicated by values in the range of [0.5, 1]. A HE value of 0.5 suggests that the variable is uncorrelated with its previous values, indicating a random series. HE score increases with the strength of the correlation between successive values. The following equation is used to calculate the HE of a binary sequence D of length $n$, where $D_{i}$ represents the i^th element of the binary sequence D.

$$\frac{\Phi \left( n \right)}{{{\text{V}}\left( n \right)}} = \left( \frac{n}{2} \right)^{HE}$$

(5)

where

$$\Phi \left( n \right) = \max \left( {Y_{1} \ldots Y_{n} } \right) - \min \left( {Y_{1} \ldots Y_{n} } \right)$$

(6)

$$V\left( n \right) = \sqrt {\frac{1}{n}\left[ {\mathop \sum \limits_{i = 1}^{n} \left( {D_{i} - \upmu } \right)^{2} } \right]}$$

(7)

$$Y_{t} = \mathop \sum \limits_{i = 1}^{t} \left( {D_{i} - \upmu } \right), \forall t = 1, 2, 3 \ldots n$$

(8)

$$\upmu = \frac{1}{n} \mathop \sum \limits_{i = 1}^{n} D_{i}$$

(9)
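Equations (5) to (9) can be combined into a single-window estimate by solving Eq. (5) for HE, that is, HE = log(Φ(n)/V(n)) / log(n/2). The sketch below follows that reading. It is an assumption that the estimate is taken over the whole sequence at once; rescaled-range analysis is often performed over several window sizes with a regression fit, which is not shown here.

```python
import math


def hurst_exponent(bits):
    """Single-window Hurst exponent of a binary sequence via Eqs. (5)-(9)."""
    n = len(bits)
    if n <= 2:
        raise ValueError("need more than two samples so that log(n/2) != 0")
    mu = sum(bits) / n                                   # Eq. (9)
    y, running = [], 0.0
    for d in bits:                                       # Eq. (8): cumulative deviations
        running += d - mu
        y.append(running)
    phi = max(y) - min(y)                                # Eq. (6): range
    v = math.sqrt(sum((d - mu) ** 2 for d in bits) / n)  # Eq. (7): standard deviation
    if phi == 0 or v == 0:
        raise ValueError("constant sequence: ratio undefined")
    return math.log(phi / v) / math.log(n / 2)           # Eq. (5) solved for HE


print(hurst_exponent([1, 1, 1, 1, 0, 0, 0, 0]))  # close to 1 (persistent)
print(hurst_exponent([1, 0, 1, 0, 1, 0, 1, 0]))  # close to 0 (anti-persistent)
```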

Fractal dimension

The Fractal Dimension (FD) of miRNA sequences is a widely used feature for analyzing their structural complexity. The first step in calculating the FD involves transforming each miRNA sequence into indicator matrices^111,112. The four nucleotides {A, U, C, G} are represented by the symbol ${\tilde{\text{T}}}_{{{\text{miRNA}}}}$, and $D_{N}$ represents a miRNA sequence of length $N$ composed of four symbols chosen from ${\tilde{\text{T}}}_{{{\text{miRNA}}}}$. The indicator function for each miRNA sequence is defined by the following equation:

$$F:{ }D_{N} \times D_{N} \to \left\{ {0,1} \right\}, and{ }D_{N} = \left\{ {0,1} \right\}$$

(10)

Here the indicator matrix will be:

$$I\left( {N,N} \right) = \begin{cases} 1, & s_{i} = s_{j} \\ 0, & s_{i} \ne s_{j} \end{cases} \quad \text{where}\; s_{i}, s_{j} \in D_{N}$$

(11)
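The indicator matrix of Eq. (11) is straightforward to build: entry (i, j) is 1 when the symbols at positions i and j match. The sketch below covers only this step; the function name is hypothetical, and the computation of σ(k) and the FD itself is not attempted here.

```python
def indicator_matrix(seq):
    """N x N matrix with 1 where seq[i] == seq[j] and 0 elsewhere (Eq. 11)."""
    return [[1 if a == b else 0 for b in seq] for a in seq]


for row in indicator_matrix("AUA"):
    print(row)
# [1, 0, 1]
# [0, 1, 0]
# [1, 0, 1]
```

The matrix is symmetric with ones on the diagonal, which matches the dot-plot picture of a sequence compared against itself.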

To convert the miRNA sequence into a binary representation, a 2D dot-plot image is generated using the $I\left( {N,N} \right)$ matrix, which consists of values 0 and 1. This binary image visually represents the distribution of zeros and ones in the sequence, where white dots represent 0 and black dots represent 1. The FD can be computed from an indicator matrix by averaging the sigma $\sigma \left( k \right)$ values of 1 randomly selected from an $N$ × $N$ indicator matrix^112,113,114. The following equation is used to calculate the FD based on the sigma $\sigma \left( k \right)$ value:

$$FD = - \frac{1}{N}\mathop \sum \limits_{k = 2}^{N} \frac{\log (\sigma \left( k \right))}{{\log k}}$$

(12)

Machine learning and deep learning classifiers

In this comprehensive data analysis, we developed a total of fourteen ML models, including nine SML models and five EML models. Additionally, we constructed 24 DL models, consisting of six SDL models, twelve HDL models, and six EDL models.

Machine learning classifiers

For simplicity and availability, we selected the following ML models: LR^115,116, Linear SVM^117,118,119, Decision Tree (DT)¹²⁰, Random Forest (RF)^121,122,123, Extra Trees (ET)^124,125, Extreme Gradient Boost (XGBoost)^88,126, K-Nearest Neighbors (KNN)^127,128, Linear Discriminant Analysis (LDA)^129,130, Light Gradient Boosting Machine (LGBM)^131,132, and Naive Bayes (NB)¹³³. We specifically chose six nonlinear models (DT, RF, ET, XGBoost, KNN, LGBM) as they are suitable for nonlinear classification tasks, which is crucial for effectively classifying binary-encoded miRNA species. Each model possesses unique strengths and weaknesses, and by evaluating multiple models, we can compare their performances and select the most effective one. Furthermore, we created five EML models: (i) LR and SVM, (ii) DT and KNN, (iii) DT and RF, (iv) RF, DT, and ET, and (v) ET, XGBoost, and LGBM. These models were constructed using a voting-based ensemble classifier approach.
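The voting idea behind the EML models can be sketched with plain majority (hard) voting over per-model predictions. The source does not state whether hard or soft voting was used, so hard voting is an assumption, and the prediction lists below are made-up values for illustration only.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists by majority vote for each sample.

    `predictions` holds one list of labels per model, all of equal length.
    On a tie, the label seen first (from the earliest model) is returned.
    """
    n_samples = len(predictions[0])
    voted = []
    for i in range(n_samples):
        votes = Counter(model[i] for model in predictions)
        voted.append(votes.most_common(1)[0][0])
    return voted


# Hypothetical outputs of three base classifiers on three samples.
rf_pred = [0, 1, 1]
dt_pred = [0, 0, 1]
et_pred = [1, 1, 1]
print(majority_vote([rf_pred, dt_pred, et_pred]))  # [0, 1, 1]
```

With only two base models, as in ensembles (i) to (iii), hard voting ties whenever the models disagree, which is one reason probability-averaged (soft) voting is often preferred in that setting.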

Solo deep learning classifiers

Among the DL models, we developed six SDL models: GRU (Gated Recurrent Unit), Bidirectional GRU (BiGRU), RNN (Recurrent neural network), Bidirectional RNN (BiRNN), LSTM, and Bidirectional LSTM (BiLSTM). These models were specifically designed to capture the temporal dependencies and intricate patterns present in the miRNA sequences, further enhancing the classification performance. We conducted rigorous evaluation and testing to assess the performance and effectiveness of each SDL model, for the selection of the most suitable architecture for miRNA species classification.

Hybrid deep learning classifiers

While these SDL models have shown limited success in miRNA classification, combining them into HDL models has proven to be beneficial in overcoming data scarcity and improving performance^{82,84,134,135}. HDL models can effectively address domain-specific challenges and enhance accuracy in tasks such as miRNA classification by leveraging multiple architectural components. Considering these advantages, we constructed twelve HDL models: (i) LSTM-GRU, (ii) BiLSTM-BiGRU, (iii) LSTM-CNN, (iv) BiLSTM-CNN, (v) GRU-CNN, (vi) BiGRU-CNN, (vii) BiRNN-CNN, (viii) BiGRU-GRU, (ix) BiLSTM-LSTM, (x) BiRNN-RNN, (xi) RNN-CNN, and (xii) LSTM-GRU-CNN.

Ensemble deep learning classifiers

Furthermore, we created six EDL models: (i) BiLSTM-BiGRU and LSTM-GRU, (ii) BiLSTM-BiGRU and BiRNN-RNN, (iii) BiGRU-GRU and LSTM-CNN, (iv) BiRNN-CNN and GRU-CNN, (v) BiLSTM-LSTM and RNN-CNN, and (vi) BiLSTM-CNN and BiGRU-CNN by concatenating their output vectors. By combining these multiple vectors, we can leverage the strengths and advantages of each individual model. The EDL models are depicted in Figures F1, F2, F3, F4, F5, and F6 in the supplementary material. All constituent models are utilized without their output layers and are truncated until the dropout layers. These model components are then concatenated using a concatenate layer and further employed as input to a dense layer network. Finally, the network is connected to a softmax layer for predicting the species.

Hypertuning parameters and optimization

During the study, the models were trained using a batch size of 64. The loss function chosen for training was categorical cross-entropy, which is commonly used for multi-class classification tasks. This loss function quantifies the dissimilarity between the predicted and actual probability distributions^136,137.

The objective is to minimize the discrepancy between these distributions, leading to a reliable system that generates predicted probabilities that closely align with the true distribution. Categorical cross-entropy ensures that the differences between all probabilities are minimized. The mathematical equation for categorical cross-entropy is provided below:

$${\text{L}}_{{\text{CCE}}} = - \frac{1}{{\text{N}}}\sum\limits_{{\text{i}} = 1}^{{\text{N}}} \sum\limits_{{\text{c}} = 1}^{{\text{TC}}} 1_{{{\text{y}}_{{\text{i}}} \in {\text{TC}}_{{\text{c}}} }} \log a_{{\text{model}}} \left( {{\text{y}}_{{\text{i}}} \in {\text{TC}}_{{\text{c}}} } \right)$$

(13)

where N represents the total number of miRNA sequences, TC denotes the number of species categories, and $1_{{{\text{y}}_{{\text{i}}} \in {\text{TC}}_{{\text{c}}} }}$ indicates that the i^th observation belongs to the c^th category. Table ST9 in the supplementary material provides details on the number of epochs, initial learning rates, and optimizers utilized for each EDL model. The implementation of the study was carried out using Python 3.8 and the TensorFlow framework. The system execution occurred on a machine that featured a 12 GB NVIDIA P100 16 Graphics Processing Unit (GPU), an Intel Xeon processor, and 12 GB of RAM.
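A small numerical sketch of categorical cross-entropy is given below, written in the conventional negative log-likelihood form: for each sample, only the predicted probability of its true class contributes. The function name, the integer label encoding, and the `eps` clipping constant are assumptions for illustration; the study itself used TensorFlow.

```python
import math


def categorical_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean negative log-probability assigned to the true class.

    y_true: list of integer class indices (0 .. TC-1).
    y_prob: list of per-sample probability lists, one entry per class.
    eps:    floor applied to probabilities to avoid log(0) (assumption).
    """
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        total += math.log(max(probs[label], eps))
    return -total / len(y_true)


loss = categorical_cross_entropy([0, 1], [[0.5, 0.5], [0.25, 0.75]])
print(round(loss, 4))  # 0.4904
```

The loss approaches zero as the probability assigned to each true class approaches one.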

Performance metrics

The proposed models were assessed for both binary and multiclass classification tasks, with the multiclass approach utilizing the "one vs. all" strategy^138,139 for each species. To evaluate the models, several parameters were considered: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). A sample belonging to a specific species is considered a TP if it is correctly classified as such. Likewise, a sample not belonging to the species is labeled as a TN if it is correctly classified as not belonging. However, if a sample not belonging to the species is incorrectly classified as belonging, it is a FP, and if a sample belonging to the species is incorrectly classified as not belonging, it is a FN. These parameters allow the derivation of various performance evaluation (PE) metrics, including: (i) Accuracy (η): Indicates the proportion of correct overall predictions out of the total predictions made. (ii) Recall (R): Represents the ratio of correctly predicted positive class instances to all positive members in the dataset. (iii) Precision (P): Measures the ratio of correctly predicted positive class instances to the total number of positive predictions. (iv) F1-score (F): The harmonic mean of precision and recall, serving as a valuable metric for evaluating model performance, especially on imbalanced datasets. (v) Area-under-the-curve ($\upalpha$): Quantifies the two-dimensional area beneath the plotted ROC curve and is commonly used to assess model performance in both binary and multiclass classification problems.

In this study, we introduce formulations to measure the overall robustness of the model. To achieve this, six quantities are measured in this section, including $\overline{\upeta }\left( {\text{m, K10}} \right)$, which represents the accuracy of model m summarized over all D datasets, $\overline{\upeta }\left( {\text{d, K10}} \right)$, which indicates the accuracy of dataset d achieved by summarizing M models, $\overline{\upeta }_{{{\text{sys}}}}$, which represents the overall system accuracy achieved by averaging accuracy over M models and D datasets, $\overline{\upalpha }\left( {\text{m, K10}} \right)$, which summarizes the AUC of model m over all D datasets, $\overline{\upalpha }\left( {\text{d, K10}} \right)$, which indicates the robustness of dataset d achieved by summarizing the AUC over M models, and $\overline{\upalpha }_{{{\text{sys}}}}$, which represents the overall system robustness achieved by averaging the AUC over M models and D datasets. These formulations are measured in each section for a combination of ML, SDL, HDL, and EDL models, as well as a combination of six binary and four multiclass datasets and their combinations. All these formulas were computed using the default K10 partition protocol.

$$\upeta = \frac{{\text{TP + TN}}}{{\text{TP + FP + FN + TN}}}$$

(14)

$${\text{R}} = \frac{{{\text{TP}}}}{{\text{TP + FN}}}$$

(15)

$${\text{P}} = \frac{{{\text{TP}}}}{{\text{TP + FP}}}$$

(16)

$$F = 2* \frac{{\text{P * R}}}{{\text{P + R}}}$$

(17)
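As a minimal sketch of Eqs. (14) to (17) under the one-vs-all strategy, the snippet below counts TP, TN, FP, and FN for one species and then evaluates accuracy, recall, precision, and F1-score. The function names and the toy labels are illustrative assumptions and are not taken from the GeneAI 3.0 implementation.

```python
def one_vs_all_counts(y_true, y_pred, species):
    """Count TP, TN, FP, FN with `species` treated as the positive class."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == species and pred == species:
            tp += 1  # belongs to the species, classified as such
        elif truth != species and pred != species:
            tn += 1  # does not belong, classified as not belonging
        elif truth != species and pred == species:
            fp += 1  # does not belong, classified as belonging
        else:
            fn += 1  # belongs, classified as not belonging
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision, and F1-score following Eqs. (14)-(17)."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}


# Hypothetical labels, for illustration only.
y_true = ["Human", "Human", "Rat", "Mouse", "Gorilla", "Human"]
y_pred = ["Human", "Mouse", "Rat", "Human", "Gorilla", "Human"]

counts = one_vs_all_counts(y_true, y_pred, "Human")
scores = metrics(*counts)
print(counts, scores)
```

The zero-denominator guards are a design choice of this sketch; the equations themselves leave those cases undefined.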

$$\overline{\upeta }\left( {\text{m, K10}} \right){ = }\frac{{\mathop \sum \nolimits_{{\text{d = 1 }}}^{{\text{D}}} \upeta \left( {\text{m, d, K10}} \right)}}{{\text{D}}}{ }$$

(18)

$$\overline{\upeta }\left( {\text{d, K10}} \right){ = }\frac{{\mathop \sum \nolimits_{{\text{m = 1}}}^{{\text{M}}} { }\upeta \left( {\text{m, d, K10}} \right)}}{{\text{M}}}$$

(19)

$$\overline{\upeta }_{{{\text{sys}}}} { = }\frac{{\mathop \sum \nolimits_{{\text{d = 1 }}}^{{\text{D}}} \mathop \sum \nolimits_{{\text{m = 1}}}^{{\text{M}}} { }\upeta \left( {\text{m, d, K10}} \right)}}{{{\text{M }} \times {\text{D}}}}$$

(20)

$$\overline{\upalpha }\left( {\text{m, K10}} \right){ = }\frac{{\mathop \sum \nolimits_{{\text{d = 1 }}}^{{\text{D}}} \upalpha \left( {\text{m, d, K10}} \right)}}{{\text{D}}}$$

(21)

$$\overline{\upalpha }\left( {\text{d, K10}} \right){ = }\frac{{\mathop \sum \nolimits_{{\text{m = 1}}}^{{\text{M}}} { }\upalpha \left( {\text{m, d, K10}} \right)}}{{\text{M}}}$$

(22)

$$\overline{\upalpha }_{{{\text{sys}}}} { = }\frac{{\mathop \sum \nolimits_{{\text{d = 1 }}}^{{\text{D}}} \mathop \sum \nolimits_{{\text{m = 1}}}^{{\text{M}}} { }\upalpha \left( {\text{m, d, K10}} \right)}}{{{\text{M}} \times {\text{D}}}}$$

(23)
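The six summary quantities in Eqs. (18) to (23) are plain averages over a model-by-dataset table of accuracy or AUC values. The following sketch illustrates them with a nested dictionary; the table layout, the function names, and the numbers are assumptions made for illustration, and the sketch presumes every model was evaluated on every dataset.

```python
def mean_over_datasets(table, model):
    """Eqs. (18)/(21): average one model's metric over all D datasets."""
    row = table[model]
    return sum(row.values()) / len(row)


def mean_over_models(table, dataset):
    """Eqs. (19)/(22): average one dataset's metric over all M models."""
    return sum(row[dataset] for row in table.values()) / len(table)


def system_mean(table):
    """Eqs. (20)/(23): average over all M x D (model, dataset) pairs."""
    values = [v for row in table.values() for v in row.values()]
    return sum(values) / len(values)


# Hypothetical accuracies: table[model][dataset], obtained under K10.
acc = {
    "m1": {"d1": 0.90, "d2": 0.80},
    "m2": {"d1": 0.70, "d2": 0.60},
}

print(mean_over_datasets(acc, "m1"))  # per-model mean
print(mean_over_models(acc, "d1"))    # per-dataset mean
print(system_mean(acc))               # system-level mean
```

The same three functions apply unchanged to an AUC table, which is why Eqs. (21) to (23) mirror Eqs. (18) to (20).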

Experimental protocols

To verify our hypotheses, we trained nine SML, five EML, six SDL, twelve HDL, and six EDL models, totalling 38 AI models, using a composite feature set. The feature set consisted of conventional features, including Entropy, Dissimilarity, Energy, Homogeneity, and Contrast, as well as contemporary features, such as Shannon entropy, Hurst exponent, and Fractal dimension. To test the resilience of the features on the AI models, we created ten datasets (six binary class and four multiclass) from subsets of the data.

Experiment 1: EDL Models vs. HDL Models vs. SDL Models

The main objective of this study is to examine and compare the effectiveness of SDL, HDL, and EDL models in classifying species using miRNA sequences. To achieve this, we trained and evaluated the performance of 24 AI models: six SDL, twelve HDL, and six EDL. The models were trained and tested using six binary and four multiclass balanced composite feature datasets. To evaluate the performance of these 24 AI models, their predictions were averaged across all ten datasets (6 binary class and 4 multiclass), and a comprehensive comparison was performed. To ensure the reliability of the results, the experiment utilized the K10 Cross-Validation protocols.

Experiment 2: EDL Models with CNN layers vs. without CNN layers

This experiment examines and compares the impact of incorporating CNN layers into EDL models for species classification using miRNA sequences. The training and evaluation were conducted on six AI models, comprising four CNN-based EDL models and two non-CNN-based EDL models. The models were trained and tested using six binary and four multiclass balanced composite feature datasets. To evaluate the performance of these 6 AI models, their predictions were averaged across all ten datasets (6 binary class and 4 multiclass), and a comprehensive comparison was performed. To ensure the reliability of the results, the experiment utilized the K10 Cross-Validation protocols.

Experiment 3: EML Models vs. SML Models

The primary aim of this study is to assess and contrast the efficacy of EML models versus SML models in the classification of species using miRNA sequences. The training and evaluation process involved 14 AI models, including nine SML models and five EML models. The models were trained and tested using six binary and four multiclass balanced composite feature datasets. To evaluate the performance of these 14 AI models, their predictions were averaged across all ten datasets (6 binary class and 4 multiclass), and a comprehensive comparison was performed. To ensure the reliability of the results, the experiment utilized the K10 Cross-Validation protocols.

Experiment 4: EDL Models vs. EML Models

The final objective of this study is to evaluate and compare the advantages offered by EDL models over EML models in stratifying species using miRNA sequences. A total of eleven AI models were trained and evaluated, including five EML models and six EDL models. The models were trained and tested using six binary and four multiclass balanced composite feature datasets. To evaluate the performance of these AI models, their predictions were averaged across all ten datasets (6 binary class and 4 multiclass), and a comprehensive comparison was performed. To ensure the reliability of the results, the experiment utilized the K10 Cross-Validation protocols.

Results

The protocols were employed to conduct tests on miRNA data from ten datasets, comprising six binary class datasets and four multiclass datasets. The binary datasets included Human vs. Gorilla, Human vs. Rat, Human vs. Mouse, Mouse vs. Gorilla, Mouse vs. Rat, and Gorilla vs. Rat datasets. Additionally, there were four multiclass datasets, namely Human vs. All, Rat vs. All, Gorilla vs. All, and Mouse vs. All. To analyze the data, a total of fourteen ML models and twenty-four DL models were utilized. The ML models consisted of nine SML models and five EML models. The DL models consisted of six SDL models, twelve HDL models, and six EDL models. Training used the TensorFlow and Sklearn frameworks and was executed on a Tesla P100 GPU under the K10 partition protocol. Experimental results were obtained based on these procedures.
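The ten tasks follow directly from the four species: every unordered pair gives a binary dataset, and every species gives one "vs. All" dataset. The sketch below only enumerates the task names; the variable names are assumptions, and the pair order produced by `itertools.combinations` may differ from the order in which the paper writes each pair (for example, "Gorilla vs. Mouse" rather than "Mouse vs. Gorilla").

```python
from itertools import combinations

species = ["Human", "Gorilla", "Rat", "Mouse"]

# Six binary tasks: one per unordered pair of species.
binary_tasks = [f"{a} vs. {b}" for a, b in combinations(species, 2)]

# Four "one vs. all" tasks: one per species.
multiclass_tasks = [f"{s} vs. All" for s in species]

all_tasks = binary_tasks + multiclass_tasks
print(len(binary_tasks), len(multiclass_tasks), len(all_tasks))
```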

EDL models vs. HDL models vs. SDL models

In this experiment, we conducted a comparison of six SDL classifiers, twelve HDL models and six EDL models. The performance evaluation involved calculating the average mean accuracy (ACC) and area-under-the-curve (AUC) for all the models across ten datasets, consisting of six binary class datasets and four multiclass datasets. The binary datasets comprised Human vs. Gorilla, Human vs. Rat, Human vs. Mouse, Mouse vs. Gorilla, Mouse vs. Rat, and Gorilla vs. Rat, while the multiclass datasets included Human vs. All, Rat vs. All, Gorilla vs. All, and Mouse vs. All. The results of the experiment are presented in Tables ST10, ST11, ST12, and ST13 given in the supplementary material.

Table ST10 shows that the SDL4 classifier (BiLSTM) achieved the best performance among all SDL models, with an ACC/AUC of 90.06%/0.9112. In Tables ST11 and ST12, the HDL2 classifier (BiLSTM-BiGRU) performed the best among all HDL models, with an ACC/AUC of 92.53%/0.9306. Furthermore, in Table ST13, the EDL6 classifier (BiLSTM-CNN ⊕ BiGRU-CNN) achieved the highest performance among all HDL/EDL models, with an ACC/AUC of 93.38%/0.9407. Table 5 presents the mean comparison, indicating that EDL/HDL classifiers outperformed SDL classifiers on all datasets. The mean accuracy and AUC differences between HDL and SDL across all datasets were 2.17% and 2.4%, respectively. The mean accuracy and AUC differences between EDL and HDL across all datasets were 2.01% and 1.52%, respectively. Additionally, the mean accuracy and AUC differences between EDL and SDL across all datasets were 4.18% and 3.92%, respectively.

Table 5 Comparison of SDL vs. HDL vs. EDL models.

These results validate our hypothesis that HDL classifiers perform better due to the complex nature of miRNA. HDL models can capture intricate nonlinear relationships between input features and output labels by recursively splitting the data into smaller subsets, enabling accurate predictions. Furthermore, combining multiple models in EDL/HDL classifiers allows them to leverage the strengths of different models, leading to improved performance. The ability to customize and adjust these models based on specific problem domains further enhances their effectiveness.

EDL models with CNN layers vs. EDL models without CNN layers

In this experiment, we assessed the impact of adding CNN layers to the architecture of EDL models. Specifically, we evaluated the performance of four CNN-based EDL classifiers (EDL3, EDL4, EDL5, and EDL6) and two non-CNN-based EDL classifiers (EDL1 and EDL2) on ten datasets, comprising six binary class datasets and four multiclass datasets. The average mean accuracy and AUC were calculated and are reported in Table 6. The results demonstrated that incorporating CNN layers in the EDL models significantly enhanced their classification performance. By utilizing feature extraction techniques, the models exhibited improved accuracy and AUC scores. The mean absolute difference in accuracy and AUC across all datasets, resulting from the feature extraction process using contemporary features, was found to be 0.73% and 0.92%, respectively. These findings validated our hypothesis that incorporating CNN layers in DL models can enhance their effectiveness in classifying miRNA sequences. This improvement stems from the ability of CNN layers to capture both temporal and spatial dependencies within the data, enabling the models to learn hierarchical representations. The combination of temporal and spatial information allows for more comprehensive and accurate classification of miRNA sequences.

Table 6 Comparison of EDL models with CNN vs. without CNN layers.

EML models vs. SML models

In this experiment, we conducted a comparison of nine SML classifiers and five EML models. The performance evaluation involved calculating the average mean accuracy and AUC for all the models across ten datasets, consisting of six binary class datasets and four multiclass datasets. The binary datasets comprised Human vs. Gorilla, Human vs. Rat, Human vs. Mouse, Mouse vs. Gorilla, Mouse vs. Rat, and Gorilla vs. Rat, while the multiclass datasets included Human vs. All, Rat vs. All, Gorilla vs. All, and Mouse vs. All. The results obtained from the experiment are presented in Tables ST14 and ST15 in the supplementary material. Table ST14 displays the performance results of the SML models, where the ET classifier achieved the highest performance with an ACC/AUC of 90.33%/0.9049. It was followed by RF with an ACC/AUC of 89.31%/0.8922 and LGBM with an ACC/AUC of 88.06%/0.8896. In Table ST15, the EML4 classifier (DT ⊕ RF ⊕ ET) demonstrated the best performance among all the EML models, achieving an ACC/AUC of 91.14%/0.9171.

Table 7 presents the mean comparison, indicating that the EML classifiers outperformed the SML classifiers on all datasets. The average accuracy and AUC differences between EML and SML across all datasets were 6.24% and 6.46%, respectively. These findings validate our hypothesis that EML models perform better due to the complex nature of miRNA, as they can capture intricate nonlinear relationships by recursively partitioning the data into smaller subsets, enabling accurate predictions. The use of a voting classifier in EML models allows them to combine the strengths of different models, leading to improved performance.

Table 7 Comparison of SML vs. EML models.

EDL models vs. EML models

In this experiment, we conducted a comparison of five EML classifiers and six EDL models. The performance evaluation involved calculating the average mean accuracy and AUC for all the models across ten datasets, consisting of six binary class datasets and four multiclass datasets. The binary datasets comprised Human vs. Gorilla, Human vs. Rat, Human vs. Mouse, Mouse vs. Gorilla, Mouse vs. Rat, and Gorilla vs. Rat, while the multiclass datasets included Human vs. All, Rat vs. All, Gorilla vs. All, and Mouse vs. All. Table 8 presents the mean comparison, indicating that the EDL classifiers outperformed the EML classifiers on all datasets. The average accuracy and AUC differences between EDL and EML across all datasets were 7.09% and 6.96%, respectively.

Table 8 Comparison of EML vs. EDL models.

These findings validate our hypothesis that EDL models outperform EML models due to their ability to capture complex patterns and relationships in the data through multiple layers of non-linear transformations. This can be attributed to their complex architecture, which allows them to automatically learn hierarchical representations of miRNA data, capturing both local and global patterns.

Performance evaluation

The evaluation process encompassed a comprehensive analysis of the models' performance, using visualization techniques such as ROC curves and bar charts. To assess the system's robustness and model stability, we observed the effect of training data size on the classifiers. This provided insight into the reliability and stability of the models and helped identify areas for improvement.

Receiver operating curves, mean accuracy curves, and mean AUC for classifier models

We plotted ROC curves of the two best models, with all their classifiers, on all six binary datasets: Human vs. Gorilla, Human vs. Rat, Human vs. Mouse, Mouse vs. Gorilla, Mouse vs. Rat, and Gorilla vs. Rat. The performance of the models across their complete operating range was thoroughly evaluated, as shown in Fig. 1. In Fig. 2, the ROC curve for the EML4 Model is presented. The AUC score for Rat vs. Gorilla is the highest at 0.9909, followed by Mouse vs. Gorilla with an AUC score of 0.9713. This is followed by Human vs. Gorilla with an AUC of 0.9496 and Mouse vs. Rat with an AUC of 0.9448. The AUC score for Human vs. Rat is 0.9015, and Human vs. Mouse has the lowest AUC score of 0.7908. Figure 3 displays the ROC curve for the best-performing EDL (EDL6) Model. Among the comparisons, Mouse vs. Rat has the highest AUC score of 0.9815, followed by Rat vs. Gorilla with an AUC score of 0.9797. The AUC score for Mouse vs. Gorilla is 0.9761, and Human vs. Gorilla has an AUC of 0.9548. The AUC score for Human vs. Rat is 0.8893, and Human vs. Mouse has the lowest AUC score of 0.8854.

Furthermore, to establish the statistical significance of our results, p-values were computed for all species in each dataset. Our findings indicate that the p-values were less than 0.01, signifying a high confidence level in the observed differences between the species.

Bar charts are effective visual tools for presenting table data. Figure 4 illustrates the accuracy of nine SML, five EML, six SDL, twelve HDL, and six EDL models averaged across multiple binary and multiclass datasets. The mean accuracy increased progressively from 79.56% (SML) to 85.8% (EML), 88.71% (SDL), 90.88% (HDL), and 92.83% (EDL). Additionally, Fig. 5 depicts the AUC of the same models, showing a similar progressive increase in mean AUC from 0.8014 (SML) to 0.866 (EML), 0.8964 (SDL), 0.9204 (HDL), and 0.933 (EDL) when averaged across multiple binary and multiclass datasets.

Effect of training data size on classifier performance: varying partitional protocols

In this experimental study, we investigated the influence of varying training data sizes on the performance of DL models. Performance metrics were evaluated using different Cross-Validation protocols, namely K10 (default), K5, K4, and K2. Our analysis, presented in Table 9, revealed a gradual decline in performance metrics across these protocols. The evaluation included 24 DL classifiers, consisting of 6 SDL, 12 HDL, and 6 EDL models, applied to ten datasets encompassing both binary class and multiclass datasets. The average mean accuracy and AUC were computed, indicating a decrease in mean accuracy from 90.82% (K10) to 85.96% (K2), corresponding to a 4.86% reduction. Similarly, the AUC decreased from 0.9175 (K10) to 0.8634 (K2), indicating a 5.41% decline. Despite the reduced amount of training data in the K2 (50:50) validation protocol, our DL models demonstrated reliable performance metrics. This finding emphasizes the effectiveness of our approach, particularly the benefits gained from using ensemble models along with feature extraction. Hence, our models exhibit strong performance even in scenarios with limited training data, demonstrating their ability to maintain consistent performance under such conditions.
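To make the link between the protocol and the training data size concrete, the sketch below splits a hypothetical set of 100 samples into K contiguous folds and reports the train/test sizes for K10, K5, K4, and K2. It is a simplified illustration: the helper `kfold_indices` is an assumed name, and no shuffling or class stratification is applied, whereas the study's actual partitioning details are not restated here.

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous, near-equal folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


n = 100  # hypothetical number of miRNA sequences
for k in (10, 5, 4, 2):
    train_idx, test_idx = next(kfold_indices(n, k))
    print(f"K{k}: train {len(train_idx)} / test {len(test_idx)}")
```

With 100 samples, each training split holds a fraction (K - 1)/K of the data, so the training share falls from 90% under K10 to 50% under K2, which matches the 50:50 split mentioned above.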

Table 9 Mean performance of 24 DL models on different Cross-Validation protocols.

Reliability analysis using statistical tests

The stability of the system was assessed and validated using three statistical tests conducted on the EDL models across all ten testing sets. Several published studies use statistical tests to establish the reliability and stability of an AI system^{80,81,140,141}. The tests we carried out were the Adjusted R2, Z (Two-Tailed), and ANOVA tests. The purpose of these tests was to determine the significance of the predicted data and to monitor the p-value in the ANOVA test, ensuring it was less than 0.01 (p < 0.01). Detailed results of these tests, conducted following the methodology outlined in^{96,142,143,144,145}, are presented in Table ST16 in the supplementary material. The outcomes revealed that all six EDL models (EDL1, EDL2, EDL3, EDL4, EDL5, and EDL6) exhibited statistical significance with p < 0.01 in the ANOVA test, indicating strong outcomes and highlighting the models' reliability, stability, and clinical importance. The adjusted R-squared test evaluated the accuracy of the models by measuring the extent of feature variance, while the Z-score in the two-tailed tests indicated the deviation of the score from the population mean in units of standard deviation. Therefore, these statistically validated findings reinforce the significance of our results and provide strong support for the reliability of the EDL models in this study.
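For readers unfamiliar with two of these statistics, the sketch below gives their standard textbook definitions: adjusted R-squared penalizes R-squared for the number of predictors, and the two-tailed p-value of a Z-score is the probability of a standard-normal deviation at least that large in either direction. The function names and input values are assumptions for illustration; how the study applied these tests is documented in Table ST16 and the cited references, not here.

```python
import math


def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


def two_tailed_p_from_z(z):
    """Two-tailed p-value of a standard-normal Z statistic."""
    # P(|Z| >= |z|) = erfc(|z| / sqrt(2)) for Z ~ N(0, 1)
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical inputs, for illustration only.
print(adjusted_r2(0.90, n_samples=101, n_features=8))
print(two_tailed_p_from_z(1.96))   # close to 0.05
print(two_tailed_p_from_z(2.576))  # close to 0.01
```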

Explainable artificial intelligence

To gain further insights into the decision-making process of the ML algorithms, we employed XAI techniques, specifically utilizing the SHapley Additive exPlanations (SHAP) method^{146,147,148,149,150}. By leveraging SHAP, we were able to delve into the impact of different features on the classification outcomes, enhancing our understanding of species-specific information and the distinctive effects of individual features on each species. This invaluable information contributes to a deeper comprehension and differentiation among the various species.

Using the SHAP explainer¹⁵¹, we developed an interpretable AI classifier as discussed in Fig. 1 that provided insights into the significance of different features for each species. The SHAP-generated graphs presented in Figs. 6, 7, 8 and 9 revealed that the "Fractal" feature played a crucial role in classifying all species except for the Mouse. In the case of the Mouse species, the most important feature was "f3," followed by "Hurst" and "f9." For the other three species, "Fractal" was the primary feature, accompanied by "Shannon" for Humans and "f4" for Gorilla and Rat. The importance of the remaining features, derived from the co-occurrence matrix as detailed in Table 4, gradually decreased. These findings emphasize the significance of feature selection when constructing accurate and dependable classifiers, particularly in biology and ecology¹⁵².
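SHAP attributions are grounded in Shapley values, which average each feature's marginal contribution over all orderings in which features can be added. The toy sketch below computes exact Shapley values for a made-up three-feature scoring function. The scoring function, its weights, and the input values are hypothetical; the feature names are borrowed only as labels, and this is not the explainer or model used in the study.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values: average marginal contribution over all orderings."""
    names = list(x)
    phi = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)      # start from the baseline input
        previous = predict(current)
        for name in order:
            current[name] = x[name]   # switch one feature to its actual value
            updated = predict(current)
            phi[name] += updated - previous
            previous = updated
    return {name: total / len(orderings) for name, total in phi.items()}


def toy_score(f):
    """Hypothetical score with one interaction term (not a trained model)."""
    return 2.0 * f["Fractal"] + 1.0 * f["Shannon"] + 0.5 * f["Hurst"] * f["Fractal"]


x = {"Fractal": 1.0, "Shannon": 2.0, "Hurst": 4.0}
baseline = {"Fractal": 0.0, "Shannon": 0.0, "Hurst": 0.0}

phi = shapley_values(toy_score, x, baseline)
print(phi)
```

The attributions sum to the difference between the score at `x` and at the baseline, which is the additivity property that gives SHAP its name. Exact enumeration scales factorially with the number of features, so practical explainers rely on approximations.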

Discussion

Principal findings

After conducting an extensive study, we obtained valuable insights and drew conclusions pertaining to our research problem: (i) We devised four hypotheses and, in order to test them, developed a total of 38 AI classifiers, which consisted of nine SML classifiers, five EML classifiers, six SDL classifiers, twelve HDL classifiers, and six EDL classifiers. (ii) For our experimental analysis, we utilized ten pre-processed datasets, comprising six binary classification datasets and four multiclass classification datasets. (iii) To enhance the processing and conversion of miRNA sequences into co-occurrence features, we implemented a novel quality control phase for our system. This involved performing scaling and binary encoding of the sequences. (iv) Our findings indicate that EML classifiers outperformed SML classifiers, yielding a mean accuracy increase of 6.24% and a 6.46% increase in AUC. Furthermore, HDL classifiers exhibited a significant advantage over SDL classifiers, with an increase in accuracy and AUC of 2.17% and 2.4%, respectively. (v) EDL classifiers further improved upon HDL classifiers, with a mean accuracy increase of 2.01% and an AUC increase of 1.52%. (vi) Additionally, EDL classifiers significantly improved upon EML classifiers, with a mean accuracy increase of 7.09% and an AUC increase of 6.96%. (vii) We also observed that CNN-based EDL models with a feature extraction methodology improved performance compared to non-CNN-based EDL models, yielding a mean accuracy increase of 0.73% and a 0.92% increase in AUC. (viii) We ensured the reliability and stability of our system by subjecting the classifiers to statistical tests. (ix) In order to verify the system's stability with smaller gene data sizes, we conducted a power analysis on the six binary class and four multiclass datasets, thereby validating the precision of the GeneAI 3.0 system. (x) We evaluated the impact of training data size by applying the K10, K5, K4, and K2 Cross-Validation protocols. (xi) Finally, we utilized the SHAP explainer to interpret the classification results of the best (EDL6) model. This allowed us to gain insights into the significance of each species' features in their respective classifications.

Benchmarking: a comparative analysis

Numerous methods have been suggested for miRNA classifiers and species-independent lncRNA predictors, such as Precursor miRNAs classification, Non-coding RNA classification, and cross-species miRNA identification. These methods have undergone extensive validation and proven effective in identifying and categorizing miRNA and lncRNA. In contrast, this study introduces a unique approach to classify miRNA based on stationary patterns derived from gene sequences. The primary aim is to determine the species of origin by analyzing specific parameters associated with each species family. While this approach is innovative, its effectiveness and practicality need to be assessed through a comparative analysis with existing methods. Comparing different approaches is crucial for advancing the field of miRNA classification and enhancing our comprehension of miRNA biology. Therefore, it is essential to evaluate the proposed approach's accuracy, efficiency, and generalizability in comparison to established methods.

Table 10 summarizes six studies that developed classifiers for miRNA and lncRNA. Yousef et al.¹⁵³ employed an RF classifier and created a specific feature set called k-mer, which consisted of k-mer Distance, k-mer location distance, and k-mer first-last distance. These features were added to the basic k-mer features to classify Precursor miRNA. Their method was evaluated using a database obtained from USEARCH. Cao et al.¹⁵⁴ explored the utilization of an RF model with incremental feature selection and the Pearson correlation coefficient. Their objective was to predict lncRNA from both lncRNA and mRNA transcripts in a dataset consisting of six species. The dataset used in their study was sourced from Ensemble data repository.

Gu et al.¹⁵⁵ introduced an Ensemble Learning approach for miRNA-related disease classification using a multi-classifier system based on associated probabilities. Their method aimed to discover new potential associations between miRNA and diseases. The results were validated using various versions of the HMDD database, making it a reliable approach that does not rely on known associations between miRNA and diseases. Zhao et al.¹⁵⁶ presented an improved paradigm for miRNA target prediction. They utilized a DT-based meta-strategy and a multi-threshold sequential voting method for meta-prediction. This approach aimed to enhance the accuracy of existing miRNA target prediction schemes.

Jiang et al.¹⁵⁷ implemented a neural network-based scheme for end-to-end classification of pre-miRNA. They utilized a database consisting of 98 features, including n-gram frequency, structural sequence, structural diversity, and energy. The approach incorporated primary and secondary structure information to identify pre-miRNA in seven different species. Amin et al.¹⁵⁸ employed a comprehensive feature extraction approach for non-coding RNA classification. They constructed an extensive feature database and trained it using LR and RF models. The database consisted of peptide features, open reading frame (ORF) features, and whole sequence features, with classifiers individually applied to each feature class. A hierarchical majority voting mechanism was utilized to combine the features.

Table 10 Benchmarking table showing studies that were implemented for miRNA and lncRNA classification.

In our proposed work (R7), we introduce a novel approach for miRNA classification based on species of origin. Multiple LSTM, GRU, CNN, and RNN-based SML, EML, SDL, HDL, and EDL models are employed. A feature extraction module is used to extract both conventional features like entropy and energy, as well as contemporary features such as Shannon entropy, Hurst exponent, and fractal dimension. This integration of different features helps build a more robust model. Our study focuses on achieving generalization, employing XAI as part of scientific validation, and conducting thorough testing to ensure the reliability and stability of the GeneAI 3.0 system.

Special note on ensemble-based feature extraction in miRNA classification

Ensemble-based feature extraction techniques have emerged as a powerful approach in miRNA classification tasks. By combining multiple feature extraction methods, using concatenation and splitting, these ensembles can effectively capture diverse aspects of miRNA sequences, leading to improved classification performance. The ensemble architecture allows for the fusion of features extracted from different methods, such as structural and compositional information, enabling the neural network to leverage complementary information and capture complex patterns in miRNA data. This approach not only enhances the classification accuracy but also helps mitigate overfitting by providing a regularization effect. Additionally, by incorporating different ensemble architectures, including completely different paradigms, the ensemble-based feature extraction further enriches the classification process, allowing for a more comprehensive and robust miRNA classification.

The effectiveness of ensemble-based techniques in miRNA classification is not limited to DL but also observed in traditional ML approaches. Techniques like RF and stacked ML models employ ensembles of multiple ML models to enhance classification performance. The ensemble architectures, such as weighted averaging, hard voting, and soft voting, play a crucial role in combining the predictions or features extracted from different models, leveraging their complementary strengths, and achieving better classification outcomes in miRNA analysis. By harnessing the collective intelligence of multiple models, ensemble-based feature extraction offers a powerful framework to improve the accuracy, sensitivity, and specificity of miRNA classification models. These ensemble-based approaches pave the way for more reliable and robust miRNA classification, enabling researchers to gain deeper insights into the complex world of gene expression and regulation.
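The difference between hard voting, soft voting, and weighted averaging can be sketched in a few lines. The example below combines the outputs of three hypothetical base classifiers (for instance, the DT, RF, and ET members of an ensemble such as EML4); the probabilities are made up, and the helper names are assumptions rather than the study's code.

```python
from collections import Counter


def hard_vote(predicted_labels):
    """Majority vote over the class labels predicted by each base model."""
    return Counter(predicted_labels).most_common(1)[0][0]


def soft_vote(probabilities, weights=None):
    """Weighted average of per-class probabilities; returns (winner, averages)."""
    if weights is None:
        weights = [1.0] * len(probabilities)  # equal weights = plain soft voting
    total = sum(weights)
    classes = list(probabilities[0])
    avg = {
        c: sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in classes
    }
    return max(avg, key=avg.get), avg


# Hypothetical class probabilities from three base models for one sequence.
probs = [
    {"Human": 0.55, "Mouse": 0.45},
    {"Human": 0.60, "Mouse": 0.40},
    {"Human": 0.10, "Mouse": 0.90},
]
labels = [max(p, key=p.get) for p in probs]

hard_winner = hard_vote(labels)
soft_winner, soft_avg = soft_vote(probs)
weighted_winner, _ = soft_vote(probs, weights=[3.0, 3.0, 1.0])
print(hard_winner, soft_winner, weighted_winner)
```

In this toy case two weakly confident models outvote one highly confident model under hard voting, while soft voting lets the confident model dominate; unequal weights can shift the outcome back. Which scheme works better is an empirical question for a given dataset.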

Special note on generalization

To generalize, models have to undergo training and testing on multiple datasets. Our group has previously developed several approaches to generalization^78,79,80,81. In⁸¹, we developed an ensemble-based transfer learning paradigm, successfully classifying skin lesion images from two different and diverse datasets. We trained on one set and classified lesions from the other set. Study⁸⁰ focused on our work in depression detection, where we developed a generalized model for text classification with the primary goal of detecting depression. Study⁷⁹ attempted to achieve generalization in Covid-19 patients' lung segmentation across five different combinations of data by employing unseen data tests and statistical analyses. Finally, Study⁷⁸, focusing on COVID-19 lung computed tomography segmentation, achieved generalization by testing on two unseen datasets, pairing 72 Italian and 80 Croatian patients.

We achieved generalization in these systems by simplifying the model, enabling it to work across multiple domains effectively in various situations through the mixing of domains. In Study⁸⁰, we trained a model to be robust enough for depression detection as well as sentiment analysis by facilitating inter-dataset (cross-domain) training and leveraging knowledge from a multi-domain dataset. Our model demonstrated the capability to detect depression even when trained on a sentiment dataset, while also analysing sentiment when trained on a depression dataset. Likewise in this study, we conducted both multi-class and binary class classification, comprising a total of 10 datasets, where the model demonstrated satisfactory performance. This generalization ensures the effectiveness of our models for use in real-life scenarios, as any gene sequence can be pre-processed, features extracted and utilized by our model.

Strengths, weakness, and extensions

Our study presents a novel approach to gene dataset analysis using 38 AI classifiers, which consisted of nine SML classifiers, five EML classifiers, six SDL classifiers, twelve HDL classifiers, and six EDL classifiers. Through rigorous evaluation, we found that these models demonstrated exceptional performance in both binary and multiclass classification tasks. Furthermore, our study involved building an extensive composite feature set, generating new features such as Shannon entropy, Hurst exponent, and Fractal dimension, which were incorporated with existing co-occurrence features to enhance the AI system's performance. Additionally, our study addressed the challenge of interpretability by incorporating XAI techniques, allowing us to gain insights into the inner workings of the models. This enables us to leverage feature-specific knowledge and concentrate on further research for each species independently. It provides a critical overview of the important features that individually impact the likelihood of a miRNA sequence belonging to a specific species. This knowledge, derived from the feature plots, is crucial for the practical implementation of the machine learning model in our study. It will significantly influence how we hypertune our model and, on a biological level, understand which features (both conventional and contemporary) matter more for each species. Notably, our proposed methodology demonstrated robustness through its consistent performance in multiple statistical tests, including the Adjusted R2 Test, paired T-test, ANOVA, and null-hypothesis significance testing (p-value). Across all six binary and four multiclass datasets, our methodology consistently provided interpretable, reliable and accurate results, highlighting its potential to improve classification accuracy in gene species classification.

One limitation of our gene classification approach is the risk of poor generalization and overfitting due to the limited size of the available training data, especially in binary classification tasks. Although ensemble-based models were employed to mitigate this issue, there is room for improvement by using generative adversarial training mechanisms to synthesize additional data. Another weakness is the absence of attention mechanisms, which could otherwise help the model mitigate overfitting and improve its overall robustness. Incorporating attention-based techniques could offer a more focused and streamlined classification of species, ultimately improving the accuracy and reliability of the gene classification scheme.

In the future, we can further enhance our gene classification scheme by addressing its limitations and implementing potential improvements. One major limitation is the lack of diversity in the dataset, which can hinder the model's ability to generalize. To overcome this, we can incorporate a wider range of gene species and sequences into the dataset, by leveraging big data sources¹⁵⁹ or exploring other public data repositories¹⁶⁰,¹⁶¹. With an expanded dataset, we can train more complex models that exhibit improved accuracy and generalization performance. For gene sequence data, graph neural networks and attention-enabled mechanisms show promise¹⁶²,¹⁶³,¹⁶⁴. These approaches can better capture the intricate relationships between gene sequences and the species of origin, and could enhance both the accuracy and the interpretability of our gene classification scheme. To address the scarcity of training data, we can consider generative adversarial training schemes, which generate synthetic data to augment the training set¹⁶⁵,¹⁶⁶,¹⁶⁷. We also plan to enhance our model by employing a cross-domain framework, which involves training on one gene sequence dataset and testing on another from a different database. Additional gene data can be selected and evaluated to further validate the deep learning methods. Another avenue is the use of autoencoders in gene classification. Autoencoders can reduce dimensionality and extract essential features from the data, so an autoencoder-based paradigm could improve the efficiency and accuracy of gene classification tasks¹⁶⁸,¹⁶⁹,¹⁷⁰. Additionally, applying pruning strategies to AI models¹⁴¹ and studying the comorbidity effect in genomics can contribute to enhancing the classification system. Pruning techniques optimize the model's architecture and computational efficiency, while investigating comorbidity sheds light on the interconnected nature of genetic factors and disease manifestation¹⁷¹. By implementing these potential improvements, we can develop more accurate and robust models with broader applicability in genetics and bioinformatics.

Conclusion

This study presents a novel paradigm for feature extraction in miRNA classification using EDL and EML models. Specifically, we used 38 AI models (nine SML, five EML, six SDL, twelve HDL, and six EDL architectures) together with features extracted from co-occurrence-based binary-coded sequences. The extracted composite features combined contemporary and conventional features, resulting in a total of 43 generated features. We conducted a thorough data analysis using 10 classification algorithms, including binary and multiclass classifiers, and four experimental protocols to evaluate the effectiveness of our proposed scheme. Our results showed that the proposed scheme outperformed existing methods in accuracy, sensitivity, and specificity. Furthermore, we conducted cross-validation to assess the robustness of our model, and the results demonstrated that it was highly reliable even with limited training data. Finally, we conducted statistical tests to demonstrate the reliability and stability of our AI system.

Data availability

The datasets generated and analyzed during the current study are not publicly available due to their proprietary nature but are available from the corresponding author on reasonable request.

Code availability

The code used during the current study is not publicly available due to its proprietary nature but is available from the corresponding author on reasonable request.

Abbreviations

AI:: Artificial intelligence
ACC:: Accuracy
ADASYN:: Adaptive synthetic sampling approach for imbalanced learning
ANOVA:: Analysis of variance
AUC:: Area-under-the-curve
BC:: Binary classification
BiGRU:: Bidirectional GRU
BiLSTM:: Bidirectional LSTM
BiRNN:: Bidirectional RNN
CNN:: Convolutional neural network
DL:: Deep learning
DT:: Decision trees
ET:: Extra trees
EDL:: Ensemble deep learning
EML:: Ensemble machine learning
FD:: Fractal dimension
FN:: False negative
FP:: False positive
GPU:: Graphics processing unit
GRU:: Gated recurrent unit
HDL:: Hybrid deep learning
HE:: Hurst exponent
HINN:: Hierarchical input neural networks
KNN:: K-nearest neighbors
LDA:: Linear discriminant analysis
LGBM:: Light gradient boosting machine
lncRNA:: Long non-coding RNAs
LR:: Logistic regression
LSTM:: Long short-term memory
MCC:: Multiclass classification
miRNA:: MicroRNA
ML:: Machine learning
mRNA:: Messenger RNA
NB:: Naïve Bayes
ORF:: Open reading frame
RF:: Random forest
ROC:: Receiver operating characteristic
RNA:: Ribonucleic acid
RNN:: Recurrent neural network
SDL:: Solo deep learning
SHAP:: Shapley additive explanations
SE:: Shannon entropy
SML:: Solo machine learning
SVM:: Support vector machine
TP:: True positive
TN:: True negative
XAI:: Explainable AI
XGBoost:: Extreme gradient boosting
$\eta$ :: Accuracy
R:: Recall
P:: Precision
F :: F1-score
$\mu$ :: Arithmetic mean
A:: Adenine
U:: Uracil
C:: Cytosine
G:: Guanine
$\mathbf{XC}$ :: Co-occurrence matrix
$\mathbf{XC}'$ :: Normalized co-occurrence matrix
$S_t$ :: miRNA sequence
$n$ :: Length of miRNA sequence
$p$ :: Probability of Bernoulli process
$D_{N}$ :: Binary miRNA sequence
$f_{\text{Set}}$ :: Final feature set representation
$\Phi(n)$ :: Difference between maximum and minimum instances of the binary miRNA sequence
$V(n)$ :: Standard deviation of the binary miRNA sequence
$\tilde{\text{T}}_{\text{miRNA}}$ :: Nucleotides representation: {A, U, C, G}
$\text{L}_{\text{CCE}}$ :: Categorical cross-entropy loss
$\overline{\eta}(\text{m}, \text{K10})$ :: Accuracy of model ‘m’ summarized over all D datasets over K10 protocol
$\overline{\eta}(\text{d}, \text{K10})$ :: Accuracy achieved over dataset ‘d’ over all M models over K10 protocol
$\overline{\eta}_{\text{sys}}$ :: Overall system accuracy over M models and D datasets
$\overline{\alpha}(\text{m}, \text{K10})$ :: AUC of model m summarized over all D datasets
$\overline{\alpha}(\text{d}, \text{K10})$ :: AUC achieved over dataset d over all M models
$\overline{\alpha}_{\text{sys}}$ :: Overall system AUC over M models and D datasets
M:: Total number of Models used in study
D:: Total number of Datasets used in study
⊕ :: Concatenation of two models
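
The aggregate symbols in the list above (per-model, per-dataset, and system-wide accuracy or AUC) can be read as averages over a models-by-datasets table of K10 scores. The sketch below assumes plain arithmetic means, consistent with the symbol for the arithmetic mean in the list, but the exact aggregation is not spelled out in this section. The table `scores`, the model and dataset names, and the numeric values are hypothetical and serve only to illustrate the bookkeeping.

```python
from statistics import mean

# Hypothetical K10 scores (accuracy or AUC): scores[model][dataset].
# Names and values are invented for illustration only.
scores = {
    "model_a": {"d1": 0.90, "d2": 0.80},
    "model_b": {"d1": 0.70, "d2": 0.60},
}


def mean_over_datasets(table, model):
    """Score of one model averaged over all D datasets."""
    return mean(table[model].values())


def mean_over_models(table, dataset):
    """Score on one dataset averaged over all M models."""
    return mean(per_model[dataset] for per_model in table.values())


def system_mean(table):
    """Overall score averaged over all M models and D datasets."""
    return mean(v for per_model in table.values() for v in per_model.values())
```

The same three helpers apply unchanged to AUC values, since only the contents of the table differ.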

References

Anglicheau, D., Muthukumar, T. & Suthanthiran, M. MicroRNAs: Small RNAs with big effects. Transplantation 90(2), 105 (2010).
Nelson, P., Kiriakidou, M., Sharma, A., Maniataki, E. & Mourelatos, Z. The microRNA world: Small is mighty. Trends Biochem. Sci. 28(10), 534–540 (2003).
Pogue, A. et al. Micro RNA-125b (miRNA-125b) function in astrogliosis and glial cell proliferation. Neurosci. Lett. 476(1), 18–22 (2010).
Cheng, A. M., Byrom, M. W., Shelton, J. & Ford, L. P. Antisense inhibition of human miRNAs and indications for an involvement of miRNA in cell growth and apoptosis. Nucleic Acids Res. 33(4), 1290–1297 (2005).
La Torre, A., Georgi, S. & Reh, T. A. Conserved microRNA pathway regulates developmental timing of retinal neurogenesis. Proc. Natl. Acad. Sci. 110(26), E2362–E2370 (2013).
Ren, Z. & Ambros, V. R. Caenorhabditis elegans microRNAs of the let-7 family act in innate immune response circuits and confer robust developmental timing against pathogen stress. Proc. Natl. Acad. Sci. 112(18), E2366–E2375 (2015).
Otto, T. et al. Cell cycle-targeting microRNAs promote differentiation by enforcing cell-cycle exit. Proc. Natl. Acad. Sci. 114(40), 10660–10665 (2017).
Kim, H. S. et al. MicroRNA-31 functions as a tumor suppressor by regulating cell cycle and epithelial-mesenchymal transition regulatory proteins in liver cancer. Oncotarget 6(10), 8089 (2015).
Luo, Q. et al. Tumor-suppressive microRNA-195-5p regulates cell growth and inhibits cell cycle by targeting cyclin dependent kinase 8 in colon cancer. Am. J. Transl. Res. 8(5), 2088 (2016).
Karatas, O. F. et al. miR-33a is a tumor suppressor microRNA that is decreased in prostate cancer. Oncotarget 8(36), 60243 (2017).
Barwari, T., Joshi, A. & Mayr, M. MicroRNAs in cardiovascular disease. J. Am. Coll. Cardiol. 68(23), 2577–2584 (2016).
Small, E. M., Frost, R. J. & Olson, E. N. MicroRNAs add a new dimension to cardiovascular disease. Circulation 121(8), 1022–1032 (2010).
Cheng, Y. & Zhang, C. MicroRNA-21 in cardiovascular disease. J. Cardiovasc. Transl. Res. 3, 251–255 (2010).
Kloosterman, W. P. & Plasterk, R. H. The diverse functions of microRNAs in animal development and disease. Dev. Cell 11(4), 441–450 (2006).
Bhayani, M. K., Calin, G. A. & Lai, S. Y. Functional relevance of miRNA* sequences in human disease. Mutation Res./Fundam. Mol. Mech. Mutagenesis 731(1–2), 14–19 (2012).
Chen, X. et al. WBSMDA: Within and between score for MiRNA-disease association prediction. Sci. Rep. 6(1), 1–9 (2016).
Chen, X., Wu, Q.-F. & Yan, G.-Y. RKNNMDA: Ranking-based KNN for MiRNA-disease association prediction. RNA Biol. 14(7), 952–962 (2017).
You, Z.-H. et al. PBMDA: A novel and effective path-based computational model for miRNA-disease association prediction. PLoS Comput. Biol. 13(3), e1005455 (2017).
Backes, C., Meese, E. & Keller, A. Specific miRNA disease biomarkers in blood, serum and plasma: Challenges and prospects. Mol. Diagn. Ther. 20, 509–518 (2016).
Jadideslam, G. et al. The MicroRNA-326: Autoimmune diseases, diagnostic biomarker, and therapeutic target. J. Cell. Physiol. 233(12), 9209–9222 (2018).
Shah, M. Y. & Calin, G. A. MicroRNAs as therapeutic targets in human cancers. Wiley Interdiscip. Rev. RNA 5(4), 537–548 (2014).
Lin, C.-S. et al. Catalog of Erycina pusilla miRNA and categorization of reproductive phase-related miRNAs and their target gene families. Plant Mol. Biol. 82, 193–204 (2013).
Kleftogiannis, D. et al. Where we stand, where we are moving: Surveying computational techniques for identifying miRNA genes and uncovering their regulatory role. J. Biomed. Inform. 46(3), 563–573 (2013).
Eszlinger, M. et al. Molecular profiling of thyroid nodule fine-needle aspiration cytology. Nat. Rev. Endocrinol. 13(7), 415–424 (2017).
Jiang, P. et al. MiPred: Classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Nucleic Acids Res. 35(2), W339–W344 (2007).
He, Y. et al. A support vector machine and a random forest classifier indicates a 15-miRNA set related to osteosarcoma recurrence. OncoTargets Ther. 15, 253–269 (2018).
Ghobadi, M. Z., Emamzadeh, R. & Afsaneh, E. Exploration of mRNAs and miRNA classifiers for various ATLL cancer subtypes using machine learning. BMC Cancer 22(1), 1–8 (2022).
Jha, A. & Shankar, R. Employing machine learning for reliable miRNA target identification in plants. BMC Genomics 12, 1–18 (2011).
Stegmayer, G. et al. Predicting novel microRNA: A comprehensive comparison of machine learning approaches. Briefings Bioinform. 20(5), 1607–1620 (2019).
Rahman, M. H. et al. Bioinformatics and machine learning methodologies to identify the effects of central nervous system disorders on glioblastoma progression. Brief. Bioinform. 22(5), bbaa365 (2021).
Wang, C. A modified machine learning method used in protein prediction in bioinformatics. Int. J. Bioautom. 19, 1 (2015).
Le, N. Q. K., Li, W. & Cao, Y. Sequence-based prediction model of protein crystallization propensity using machine learning and two-level feature selection. Brief. Bioinform. 24(5), bbad319 (2023).
Ou, Y.-Y. Identifying the molecular functions of electron transport proteins using radial basis function networks and biochemical properties. J. Mol. Graph. Model. 73, 166–178 (2017).
Xue, C. et al. Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinform. 6, 1–7 (2005).
Lertampaiporn, S., Thammarongtham, C., Nukoolkit, C., Kaewkamnerdpong, B. & Ruengjitchatchawalya, M. Heterogeneous ensemble approach with discriminative features and modified-SMOTEbagging for pre-miRNA classification. Nucleic Acids Res. 41(1), e21–e21 (2013).
Batuwita, R. & Palade, V. microPred: Effective classification of pre-miRNAs for human miRNA gene prediction. Bioinformatics 25(8), 989–995 (2009).
Xuan, P. et al. PlantMiRNAPred: Efficient classification of real and pseudo plant pre-miRNAs. Bioinformatics 27(10), 1368–1376 (2011).
Wei, L. et al. Improved and promising identification of human microRNAs by incorporating a high-quality negative set. IEEE/ACM Trans. Comput. Biol. Bioinform. 11(1), 192–201 (2013).
Blum, A. & Mitchell, T. Combining labeled and unlabeled data with co-training. in Proceedings of the Eleventh Annual Conference on Computational Learning Theory, 1998, pp. 92–100.
He, C. et al. MiRmat: Mature microRNA sequence prediction. PLoS One 7(12), e51673 (2012).
Terai, G., Okida, H., Asai, K. & Mituyama, T. Prediction of conserved precursors of miRNAs and their mature forms by integrating position-specific structural features (2012).
Leclercq, M., Diallo, A. B. & Blanchette, M. Computational prediction of the localization of microRNAs within their pre-miRNA. Nucleic Acids Res. 41(15), 7200–7211 (2013).
Xuan, P., Guo, M., Huang, Y., Li, W. & Huang, Y. MaturePred: Efficient identification of microRNAs within novel plant pre-miRNAs. PLoS One 6(11), e27422 (2011).
Wu, Y., Wei, B., Liu, H., Li, T. & Rayner, S. MiRPara: A SVM-based software tool for prediction of most probable microRNA coding regions in genome scale sequences. BMC Bioinform. 12(1), 1–14 (2011).
Guan, D.-G., Liao, J.-Y., Qu, Z.-H., Zhang, Y. & Qu, L.-H. mirExplorer: Detecting microRNAs from genome and next generation sequencing data using the AdaBoost method with transition probability matrix and combined features. RNA Biol. 8(5), 922–934 (2011).
Li, J. et al. MatPred: Computational identification of mature micrornas within novel pre-MicroRNAs. BioMed Res. Int. 2015, 23 (2015).
Karathanasis, N., Tsamardinos, I. & Poirazi, P. MiRduplexSVM: A high-performing miRNA-duplex prediction and evaluation methodology. PLoS One 10(5), e0126151 (2015).
Peace, R. & Green, J. R. Computational sequence-and NGS-based microRNA prediction. In Signal Processing and Machine Learning for Biomedical Big Data: CRC Press, 2018, pp. 381–410.
Chen, L. et al. Trends in the development of miRNA bioinformatics tools. Brief. Bioinform. 20(5), 1836–1852 (2019).
Page, J., Brenner, M. P. & Kerswell, R. R. Revealing the state space of turbulence using machine learning. Phys. Rev. Fluids 6(3), 034402 (2021).
Paul, T. K. & Iba, H. Prediction of cancer class with majority voting genetic programming classifier using gene expression data. IEEE/ACM Trans. Comput. Biol. Bioinform. 6(2), 353–367 (2008).
Hassan, M. R. et al. A voting approach to identify a small number of highly predictive genes using multiple classifiers. BMC Bioinform. 10, 1–12 (2009).
Li, Y. & Luo, Y. Performance-weighted-voting model: An ensemble machine learning method for cancer type classification using whole-exome sequencing mutation. Quant. Biol. 8, 347–358 (2020).
Zheng, X., Xu, S., Zhang, Y. & Huang, X. Nucleotide-level convolutional neural networks for pre-miRNA classification. Sci. Rep. 9(1), 628 (2019).
Tang, X. & Sun, Y. Fast and accurate microRNA search using CNN. BMC Bioinform. 20(23), 1–14 (2019).
Park, S., Min, S., Choi, H.-S. & Yoon, S. Deep recurrent neural network-based identification of precursor micrornas. Adv. Neural Inf. Process. Syst. 30, 30 (2017).
Amin, N., McGrath, A. & Chen, Y.-P.P. Evaluation of deep learning in non-coding RNA classification. Nat. Mach. Intell. 1(5), 246–256 (2019).
Kleftogiannis, D., Theofilatos, K., Likothanassis, S. & Mavroudi, S. YamiPred: A novel evolutionary method for predicting pre-miRNAs and selecting relevant features. IEEE/ACM Trans. Comput. Biol. Bioinform. 12(5), 1183–1192 (2015).
Suri, J. S. et al. A powerful paradigm for cardiovascular risk stratification using multiclass, multi-label, and ensemble-based machine learning paradigms: A narrative review. Diagnostics 12(3), 722 (2022).
Jamthikar, A. D. et al. Ensemble machine learning and its validation for prediction of coronary artery disease and acute coronary syndrome using focused carotid ultrasound. IEEE Trans. Instrum. Meas. 71, 1–10 (2021).
Tandel, G. S. et al. Role of ensemble deep learning for brain tumor classification in multiple magnetic resonance imaging sequence data. Diagnostics 13(3), 481 (2023).
Wang, H. et al. CL-PMI: A precursor MicroRNA identification method based on convolutional and long short-term memory networks. Front. Genet. 10, 967 (2019).
Tasdelen, A. & Sen, B. A hybrid CNN-LSTM model for pre-miRNA classification. Sci. Rep. 11(1), 1–9 (2021).
Chakraborty, R. & Hasija, Y. Predicting MicroRNA sequence using CNN and LSTM stacked in Seq2Seq architecture. IEEE/ACM Trans. Comput. Biol. Bioinform. 17(6), 2183–2188 (2019).
Ru, X., Cao, P., Li, L. & Zou, Q. Selecting essential MicroRNAs using a novel voting method. Mol. Therapy-Nucleic Acids 18, 16–23 (2019).
Thomas, J., Thomas, S. & Sael, L. DP-miRNA: An improved prediction of precursor microRNA using deep learning model. In 2017 IEEE International Conference on Big Data and Smart Computing (BigComp), 2017: IEEE, pp. 96–99.
Asim, M. N. et al. MirLocPredictor: A ConvNet-based multi-label MicroRNA subcellular localization predictor by incorporating k-Mer positional information. Genes 11(12), 1475 (2020).
Fu, X. et al. Improved pre-miRNAs identification through mutual information of pre-miRNA sequences and structures. Front. Genet. 10, 119 (2019).
Fan, L. et al. Radiotranscriptomics signature-based predictive nomograms for radiotherapy response in patients with nonsmall cell lung cancer: Combination and association of CT features and serum miRNAs levels. Cancer Med. 9(14), 5065–5074 (2020).
Wang, S., Tu, J., Wang, L. & Lu, Z. Entropy-based model for miRNA isoform analysis. PLoS One 10(3), e0118856 (2015).
Thakur, V. et al. Characterization of statistical features for plant microRNA prediction. BMC Genomics 12(1), 1–12 (2011).
He, H., Bai, Y., Garcia, E. A. & Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence), 2008: IEEE, pp. 1322–1328.
Johri, A. M. et al. Deep learning artificial intelligence framework for multiclass coronary artery disease prediction using combination of conventional risk factors, carotid ultrasound, and intraplaque neovascularization. Comput. Biol. Med. 150, 106018 (2022).
Konstantonis, G. et al. Cardiovascular disease detection using machine learning and carotid/femoral arterial imaging frameworks in rheumatoid arthritis patients. Rheumatol. Int. 42(2), 215–239 (2022).
Saba, L. et al. Plaque tissue morphology-based stroke risk stratification using carotid ultrasound: A polling-based PCA learning paradigm. In Vascular and Intravascular Imaging Trends, Analysis, and Challenges, Volume 2: Plaque characterization: IOP Publishing Bristol, UK, 2019, pp. 9–1–9–45.
Araki, T. et al. PCA-based polling strategy in machine learning framework for coronary artery disease risk assessment in intravascular ultrasound: A link between carotid and coronary grayscale plaque morphology. Comput. Methods Prog. Biomed. 128, 137–158 (2016).
Maniruzzaman, M. et al. Statistical characterization and classification of colon microarray gene expression data using multiple machine learning paradigms. Comput. Methods Prog. Biomed. 176, 173–193 (2019).
Suri, J. S. et al. Multicenter study on COVID-19 lung computed tomography segmentation with varying glass ground opacities using unseen deep learning artificial intelligence paradigms: COVLIAS 1.0 validation. J. Med. Syst. 46(10), 62 (2022).
Dubey, A. K. et al. Ensemble deep learning derived from transfer learning for classification of COVID-19 patients on hybrid deep-learning-based lung segmentation: A data augmentation and balancing framework. Diagnostics 13(11), 1954 (2023).
Singh, J., Singh, N., Fouda, M. M., Saba, L. & Suri, J. S. Attention-enabled ensemble deep learning models and their validation for depression detection: A domain adoption paradigm. Diagnostics 13(12), 2092 (2023).
Sanga, P. et al. DermAI 1.0: A robust, generalized, and novel attention-enabled ensemble-based transfer learning paradigm for multiclass classification of skin lesion images. Diagnostics 13(19), 3159 (2023).
Skandha, S. S. et al. A hybrid deep learning paradigm for carotid plaque tissue characterization and its validation in multicenter cohorts using a supercomputer framework. Comput. Biol. Med. 141, 105131 (2022).
Sanagala, S. S. et al. Ten fast transfer learning models for carotid ultrasound plaque tissue characterization in augmentation framework embedded with heatmaps for stroke risk stratification. Diagnostics 11(11), 2109 (2021).
Jain, P. K. et al. Hybrid deep learning segmentation models for atherosclerotic plaque in internal carotid artery B-mode ultrasound. Comput. Biol. Med. 136, 104721 (2021).
Agarwal, M. et al. Wilson disease tissue classification and characterization using seven artificial intelligence models embedded with 3D optimization paradigm on a weak training brain magnetic resonance imaging datasets: A supercomputer application. Med. Biol. Eng. Comput. 59, 511–533 (2021).
Saba, L. et al. Ultrasound-based internal carotid artery plaque characterization using deep learning paradigm on a supercomputer: A cardiovascular disease/stroke risk assessment system. Int. J. Cardiovasc. Imaging 37, 1511–1528 (2021).
Skandha, S. S. et al. 3-D optimized classification and characterization artificial intelligence paradigm for cardiovascular/stroke risk stratification using carotid ultrasound-based delineated plaque: Atheromatic™ 2.0. Comput. Biol. Med. 125, 103958 (2020).
Teji, J. S., Jain, S., Gupta, S. K. & Suri, J. S. NeoAI 1.0: Machine learning-based paradigm for prediction of neonatal and infant risk of death. Comput. Biol. Med. 147, 105639 (2022).
Saxena, S. et al. Fused deep learning paradigm for the prediction of o6-methylguanine-DNA methyltransferase genotype in glioblastoma patients: A neuro-oncological investigation. Comput. Biol. Med. 10, 106492 (2023).
Acharya, U. R. et al. GyneScan: An improved online paradigm for screening of ovarian cancer via tissue characterization. Technol. Cancer Res. Treatm. 13(6), 529–539 (2014).
Umer, S., Dhara, B. C. & Chanda, B. Texture code matrix-based multi-instance iris recognition. Pattern Anal. Appl. 19, 283–295 (2016).
Acharya, U. R. et al. Ovarian tumor characterization using 3D ultrasound. Technol. Cancer Res. Treatm. 11(6), 543–552 (2012).
Acharya, U. R., Faust, O., Sree, S. V., Molinari, F. & Suri, J. S. ThyroScreen system: High resolution ultrasound thyroid image characterization into benign and malignant classes using novel combination of texture and discrete wavelet transform. Comput. Methods Prog. Biomed. 107(2), 233–241 (2012).
Acharya, U. R. et al. Cost-effective and non-invasive automated benign & malignant thyroid lesion classification in 3D contrast-enhanced ultrasound using combination of wavelets and textures: A class of ThyroScan™ algorithms. Technol. Cancer Res. Treatm. 10(4), 371–380 (2011).
Suri, J. S. et al. Symptomatic vs. asymptomatic plaque classification in carotid ultrasound (2011).
Acharya, U. R. et al. Data mining framework for fatty liver disease classification in ultrasound: A hybrid feature extraction paradigm. Med. Phys. 39(7), 4255–4264 (2012).
Shrivastava, V. K., Londhe, N. D., Sonawane, R. S. & Suri, J. S. Reliable and accurate psoriasis disease classification in dermatology images using comprehensive feature space in machine learning paradigm. Expert Syst. Appl. 42(15–16), 6184–6195 (2015).
Acharya, U. R. et al. Evolutionary algorithm-based classifier parameter tuning for automatic ovarian cancer tissue characterization and classification. Ultraschall in der Medizin-Eur. J. Ultrasound 35(03), 237–245 (2014).
Biswas, M. et al. Symtosis: A liver ultrasound tissue characterization and risk stratification in optimized deep learning paradigm. Comput. Methods Prog. Biomed. 155, 165–177 (2018).
Acharya, U. et al. Diagnosis of Hashimoto’s thyroiditis in ultrasound using tissue characterization and pixel classification. Proc. Inst. Mech. Eng. Part H: J. Eng. Med. 227(7), 788–798 (2013).
Rodrigues, P. S., Giraldi, G. A., Provenzano, M., Faria, M. D., Chang, R. F. & Suri, J. S. A new methodology based on q-entropy for breast lesion classification in 3-D ultrasound images. In 2006 International Conference of the IEEE Engineering in Medicine and Biology Society, 2006: IEEE, pp. 1048–1051.
Burgin, M. Inductive complexity and shannon entropy. In Information and Complexity: World Scientific, 2017, pp. 16–32.
Zurek, W. H. Algorithmic randomness and physical entropy. Phys. Rev. A 40(8), 4731 (1989).
Roach, T. N., Nulton, J., Sibani, P., Rohwer, F. & Salamon, P. Entropy in the tangled nature model of evolution. Entropy 19(5), 192 (2017).
Acharya, U. R. et al. Linear and nonlinear analysis of normal and CAD-affected heart rate signals. Comput. Methods Prog. Biomed. 113(1), 55–68 (2014).
Rout, R. K., Hassan, S. S., Sindhwani, S., Pandey, H. M. & Umer, S. Intelligent classification and analysis of essential genes using quantitative methods. ACM Trans. Multimedia Comput. Commun. Appl. (TOMM) 16(1), 1–21 (2020).
Acharya, U. R., Sree, S. V., Ang, P. C. A., Yanti, R. & Suri, J. S. Application of non-linear and wavelet based features for the automated identification of epileptic EEG signals. Int. J. Neural Syst. 22(02), 1250002 (2012).
Li, W. & Kaneko, K. Long-range correlation and partial 1/fα spectrum in a noncoding DNA sequence. Europhys. Lett. 17(7), 655 (1992).
Arneodo, A. et al. What can we learn with wavelets about DNA sequences?. Phys. A Stat. Mech. Appl. 249(1–4), 439–448 (1998).
Carbone, A., Castelli, G. & Stanley, H. E. Time-dependent Hurst exponent in financial time series. Phys. A Stat. Mech. Appl. 344(1–2), 267–271 (2004).
Rout, R. K., Pal Choudhury, P., Maity, S. P., Daya Sagar, B. & Hassan, S. S. Fractal and mathematical morphology in intricate comparison between tertiary protein structures. Comput. Methods Biomech. Biomed. Eng. Imaging Vis. 6(2), 192–203 (2018).
Upadhayay, P. D., Agarwal, R. C., Rout, R. K. & Agrawal, A. P. Mathematical Characterization of Membrane Protein Sequences of Homo-Sapiens. in 2019 9th International Conference on Cloud Computing, Data Science & Engineering (Confluence), 2019: IEEE, pp. 382–386.
Cattani, C. Fractals and hidden symmetries in DNA. Math. Probl. Eng. 20, 10 (2010).
Rout, R. K., Ghosh, S. & Choudhury, P. P. Classification of mer proteins in a quantitative manner. Int. J. Comput. Appl. Eng. Sci. 10, 2 (2014).
Cuadrado-Godia, E. et al. Ranking of stroke and cardiovascular risk factors for an optimal risk calculator design: Logistic regression approach. Comput. Biol. Med. 108, 182–195 (2019).
Jamthikar, A. et al. Cardiovascular/stroke risk prevention: A new machine learning framework integrating carotid ultrasound image-based phenotypes and its harmonics with conventional risk factors. Indian Heart J. 72(4), 258–264 (2020).
Shrivastava, V. K., Londhe, N. D., Sonawane, R. S. & Suri, J. S. Exploring the color feature power for psoriasis risk stratification and classification: A data mining paradigm. Comput. Biol. Med. 65, 54–68 (2015).
Huang, S. et al. Applications of support vector machine (SVM) learning in cancer genomics. Cancer Genomics Proteomics 15(1), 41–51 (2018).
CAS PubMed Google Scholar
Liu, Y., Guo, J., Hu, G. & Zhu, H. Gene prediction in metagenomic fragments based on the SVM algorithm. BMC Bioinform. 14, 1–12 (2013).
Article CAS Google Scholar
Tandel, G. S. et al. Multiclass magnetic resonance imaging brain tumor classification using artificial intelligence paradigm. Comput. Biol. Med. 122, 103804 (2020).
Article PubMed Google Scholar
Devetyarov, D. & Nouretdinov, I. Prediction with Confidence Based on a Random Forest Classifier. In AIAI 37–44 (Springer, 2010).
Google Scholar
Kursa, M. B. Robustness of Random Forest-based gene selection methods. BMC Bioinform. 15, 1–8 (2014).
Article Google Scholar
Goldstein, B. A., Polley, E. C. & Briggs, F. B. Random forests for genetic association studies. Stat. Appl. Genet. Mol. Biol. 10, 1 (2011).
Article MathSciNet Google Scholar
Sharaff, A. & Gupta, H. Extra-tree classifier with metaheuristics approach for email classification. In Advances in Computer Communication and Computational Sciences: Proceedings of IC4S 2018, 2019: Springer, pp. 189–197.
Lanjewar, M. G., Parab, J. S., Shaikh, A. Y. & Sequeira, M. CNN with machine learning approaches using ExtraTreesClassifier and MRMR feature selection techniques to detect liver diseases on cloud. Cluster Comput. 1, 16 (2022).
Google Scholar
Jamthikar, A. D. et al. Multiclass machine learning vs. conventional calculators for stroke/CVD risk assessment using carotid plaque predictors with coronary angiography scores as gold standard: A 500 participants study. Int. J. Cardiovasc. Imaging 37, 1171–1187 (2021).
Article PubMed Google Scholar
Pan, F., Wang, B., Hu, X. & Perrizo, W. Comprehensive vertical sample-based KNN/LSVM classification for gene expression analysis. J. Biomed. Inform. 37(4), 240–248 (2004).
Article CAS PubMed Google Scholar
Li, L., Darden, T. A., Weingberg, C., Levine, A. & Pedersen, L. G. Gene assessment and sample classification for gene expression data using a genetic algorithm/k-nearest neighbor method. Comb. Chem. High Throughput Screen. 4(8), 727–739 (2001).
Article CAS PubMed Google Scholar
Sharma, A. & Paliwal, K. K. Linear discriminant analysis for the small sample size problem: An overview. Int. J. Mach. Learn. Cybern. 6, 443–454 (2015).
Article Google Scholar
Park, C. H. & Park, H. A comparison of generalized linear discriminant analysis algorithms. Pattern Recogn. 41(3), 1083–1097 (2008).
Article ADS Google Scholar
Ahamed, B. S. & Arya, S. LGBM classifier based technique for predicting type-2 diabetes. Eur. J. Mol. Clin. Med. 8(3), 454–467 (2021).
Google Scholar
Liu, T., Zhang, X., Chen, R., Deng, X. & Fu, B. Development, comparison, and validation of four intelligent, practical machine learning models for patients with prostate-specific antigen in the gray zone. Front. Oncol. 13, 1157384 (2023).
Article CAS PubMed PubMed Central Google Scholar
De Ferrari, L. & Aitken, S. Mining housekeeping genes with a Naive Bayes classifier. BMC Genomics 7(1), 1–14 (2006).
Article Google Scholar
Jena, B. et al. Artificial intelligence-based hybrid deep learning models for image classification: The first narrative review. Comput. Biol. Med. 137, 104803 (2021).
Article PubMed Google Scholar
Das, S. et al. An artificial intelligence framework and its bias for brain tumor segmentation: A narrative review. Comput. Biol. Med. 10, 5273 (2022).
Google Scholar
Sharma, N. et al. Segmentation-based classification deep learning model embedded with explainable AI for COVID-19 detection in chest X-ray scans. Diagnostics 12(9), 2132 (2022).
Article PubMed PubMed Central Google Scholar
Divate, M. et al. Deep learning-based pan-cancer classification model reveals tissue-of-Origin specific gene expression signatures. Cancers 14(5), 1185 (2022).
Article CAS PubMed PubMed Central Google Scholar
Liu, Y. & Zheng, Y. F. One-against-all multi-class SVM classification using reliability measures. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., 2005, vol. 2: IEEE, pp. 849–854.
Aly, M. Survey on multiclass classification methods. Neural Netw. 19(1–9), 2 (2005).
Google Scholar
Suri, J. S. et al. COVLIAS 2.0-cXAI: Cloud-based explainable deep learning system for COVID-19 lesion localization in computed tomography scans. Diagnostics 12(6), 1482 (2022).
Article CAS PubMed PubMed Central Google Scholar
Agarwal, M. et al. Eight pruning deep learning models for low storage and high-speed COVID-19 computed tomography lung segmentation and heatmap-based lesion localization: A multicenter study using COVLIAS 2.0. Comput. Biol. Med. 146, 105571 (2022).
Article CAS PubMed PubMed Central Google Scholar
Saba, L. et al. Intra-and inter-operator reproducibility analysis of automated cloud-based carotid intima media thickness ultrasound measurement. J. Clin. Diagn. Res. 12, 2 (2018).
Google Scholar
Biswas, M. et al. Deep learning strategy for accurate carotid intima-media thickness measurement: An ultrasound study on Japanese diabetic cohort. Comput. Biol. Med. 98, 100–117 (2018).
Article PubMed Google Scholar
Huang, S.-F. et al. Analysis of tumor vascularity using three-dimensional power Doppler ultrasound images. IEEE Trans. Med. Imaging 27(3), 320–330 (2008).
Article PubMed Google Scholar
Maniruzzaman, M. et al. Accurate diabetes risk stratification using machine learning: Role of missing value and outliers. J. Med. Syst. 42, 1–17 (2018).
Article Google Scholar
Kamal, M. S. et al. Alzheimer’s patient analysis using image and gene expression data and explainable-AI to present associated genes. IEEE Trans. Instrum. Meas. 70, 1–7 (2021).
Article Google Scholar
Kamal, M. S., Dey, N., Chowdhury, L., Hasan, S. I. & Santosh, K. Explainable AI for glaucoma prediction analysis to understand risk factors in treatment planning. IEEE Trans. Instrum. Meas. 71, 1–9 (2022).
Article Google Scholar
Marcílio, W. E. & Eler, D. M. From explanations to feature selection: Assessing SHAP values as feature selection mechanism. in 2020 33rd SIBGRAPI conference on Graphics, Patterns and Images (SIBGRAPI), 2020: Ieee, pp. 340–347.
Lubo-Robles, D., Devegowda, D., Jayaram, V., Bedle, H., Marfurt, K. J. & Pranter, M. J. Machine learning model interpretability using SHAP values: Application to a seismic facies classification task. In SEG International Exposition and Annual Meeting, 2020: SEG, p. D021S008R006.
Meng, Y., Yang, N., Qian, Z. & Zhang, G. What makes an online review more helpful: An interpretation framework using XGBoost and SHAP values. J. Theor. Appl. Electron. Commerce Res. 16(3), 466–490 (2020).
Article Google Scholar
Cau, R. et al. Machine learning approach in diagnosing Takotsubo cardiomyopathy: The role of the combined evaluation of atrial and ventricular strain, and parametric mapping. Int. J. Cardiol. 373, 124–133 (2023).
Article PubMed Google Scholar
Singh, P. & Sharma, A. Interpretation and classification of arrhythmia using deep convolutional network. IEEE Trans. Instrum. Meas. 71, 1–12 (2022).
Google Scholar
Yousef, M. & Allmer, J. Classification of precursor MicroRNAs from different species based on K-mer distance features. Algorithms 14(5), 132 (2021).
Article MathSciNet Google Scholar
Cao, L. et al. PreLnc: An accurate tool for predicting lncRNAs based on multiple features. Genes 11(9), 981 (2020).
Article CAS PubMed PubMed Central Google Scholar
Gu, C. & Li, X. Prediction of disease-related miRNAs by voting with multiple classifiers. BMC Bioinform. 24(1), 1–17 (2023).
Article Google Scholar
Zhao, B. & Xue, B. Improving prediction accuracy using decision-tree-based meta-strategy and multi-threshold sequential-voting exemplified by miRNA target prediction. Genomics 109(3–4), 227–232 (2017).
Article CAS PubMed Google Scholar
Jiang, L., Zhang, J., Xuan, P. & Zou, Q. BP neural network could help improve pre-miRNA identification in various species. BioMed Res. Int. 2016, 2 (2016).
Article Google Scholar
Amin, N., McGrath, A. & Chen, Y.-P.P. FexRNA: Exploratory data analysis and feature selection of non-coding RNA. IEEE/ACM Trans. Comput. Biol. Bioinform. 18(6), 2795–2801 (2021).
Article CAS PubMed Google Scholar
El-Baz, A. & Suri, J. S. Big Data in Multimodal Medical Imaging (CRC Press, 2019).
Book Google Scholar
Project MinE: Study design and pilot analyses of a large-scale whole-genome sequencing study in amyotrophic lateral sclerosis. Eur. J. Human Genet. 26(10), 1537–1546 (2018).
Moore, A. C., Winkjer, J. S. & Tseng, T.-T. Bioinformatics resources for microRNA discovery. Biomark. Insights 10, 29513 (2015).
Article Google Scholar
Zhang, Z.-Y. et al. iLoc-miRNA: Extracellular/intracellular miRNA prediction using deep BiLSTM with attention mechanism. Brief. Bioinform. 23(5), bbac395 (2022).
Article MathSciNet PubMed Google Scholar
Li, Z., Zhong, T., Huang, D., You, Z.-H. & Nie, R. Hierarchical graph attention network for miRNA-disease association prediction. Mol. Therapy 30(4), 1775–1786 (2022).
Article CAS Google Scholar
Yan, C. et al. PDMDA: Predicting deep-level miRNA–disease associations with graph neural networks and sequence features. Bioinformatics 38(8), 2226–2234 (2022).
Article CAS PubMed Google Scholar
Wan, C. & Jones, D. T. Protein function prediction is improved by creating synthetic feature samples with generative adversarial networks. Nat. Mach. Intell. 2(9), 540–550 (2020).
Article Google Scholar
Lan, L. et al. Generative adversarial networks and its applications in biomedical informatics. Front. Public Health 8, 164 (2020).
Article PubMed PubMed Central Google Scholar
Wei, K., Li, T., Huang, F., Chen, J. & He, Z. Cancer classification with data augmentation based on generative adversarial networks. Front. Comput. Sci. 16, 1–11 (2022).
Article Google Scholar
Wei, R. & Mahmood, A. Recent advances in variational autoencoders with representation learning for biomedical informatics: A survey. Ieee Access 9, 4939–4956 (2020).
Article Google Scholar
Gokhale, M., Mohanty, S. K. & Ojha, A. A stacked autoencoder based gene selection and cancer classification framework. Biomed. Signal Process. Control 78, 103999 (2022).
Article Google Scholar
Betechuoh, B. L., Marwala, T. & Tettey, T. Autoencoder networks for HIV classification. Curr. Sci. 91, 11 (2006).
Google Scholar
Suri, J. S. et al. COVID-19 pathways for brain and heart injury in comorbidity patients: A role of medical imaging and artificial intelligence-based COVID severity classification: A review. Comput. Biol. Med. 124, 103960 (2020).
Article CAS PubMed PubMed Central Google Scholar

Funding

This research received no funding.

Author information

Authors and Affiliations

Department of Computer Science, Graphic Era Deemed to be University, Dehradun, Uttarakhand, India
Jaskaran Singh
Department of Cardiology, Indraprastha APOLLO Hospitals, New Delhi, India
Narendra N. Khanna
Department of Computer Science and Engineering, NIT Srinagar, Hazratbal, Srinagar, India
Ranjeet K. Rout
Department of Food Science, Graphic Era Deemed to be University, Dehradun, Uttarakhand, India
Narpinder Singh
Heart and Vascular Institute, Adventist Health St. Helena, St Helena, CA, USA
John R. Laird
Advanced Cardiac and Vascular Institute, Sacramento, CA, USA
Inder M. Singh
Department of Radiology, Massachusetts General Hospital, Boston, MA, 02115, USA
Mannudeep K. Kalra
Department of Biomedical and Molecular Sciences, Queen’s University, Kingston, ON, Canada
Laura E. Mantella & Amer M. Johri
Laboratory for Molecular Genetics and Radiobiology, University of Belgrade, Belgrade, Serbia
Esma R. Isenovic
Department of Electrical and Computer Engineering, Idaho State University, Pocatello, ID, 83209, USA
Mostafa M. Fouda
Department of Neurology, University of Cagliari, Cagliari, Italy
Luca Saba
Department of Physiology and Biomedical Engineering, Mayo Clinic, Rochester, MN, 55905, USA
Mostafa Fatemi
Stroke Monitoring and Diagnostic Division, AtheroPoint LLC, Roseville, CA, 95661, USA
Jasjit S. Suri

Contributions

Conceptualization, J.S., N.N.K., and J.S.S.; methodology, J.S., R.K.R., J.R.L., L.E.M., and J.S.S.; investigation, N.N.K., J.R.L., I.M.S., L.S., and J.S.S.; resources, M.M.F. and L.S.; writing (original draft preparation), J.S., R.K.R., and M.K.K.; writing (review and editing), J.S., A.M.J., E.R.I., and J.S.S.; visualization, J.S., N.S., E.R.I., L.S., and J.S.S.; supervision, J.R.L., I.M.S., A.M.J., M.M.F., M.F., L.S., and J.S.S. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Jasjit S. Suri.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Singh, J., Khanna, N.N., Rout, R.K. et al. GeneAI 3.0: powerful, novel, generalized hybrid and ensemble deep learning frameworks for miRNA species classification of stationary patterns from nucleotides. Sci Rep 14, 7154 (2024). https://doi.org/10.1038/s41598-024-56786-9

Received: 11 July 2023
Accepted: 11 March 2024
Published: 26 March 2024
DOI: https://doi.org/10.1038/s41598-024-56786-9
