AMP-RNNpro: a two-stage approach for identification of antimicrobials using probabilistic features

Antimicrobials are molecules that inhibit the growth of microorganisms such as bacteria, viruses, fungi, and parasites. Detecting antimicrobial peptides (AMPs) with machine learning and deep learning can accelerate the discovery of AMPs and contribute to the development of effective antimicrobial therapies, especially in the face of increasing antibiotic resistance. This study introduces AMP-RNNpro, an innovative Recurrent Neural Network (RNN) based model for detecting AMPs. It uses eight feature encoding methods, selected according to four criteria (amino acid compositional, grouped amino acid compositional, autocorrelation, and pseudo-amino acid compositional), to represent the protein sequences for efficient identification of AMPs. Our framework performs two-stage prediction. In the first stage, we analyzed 33 models on these feature encodings and selected the best six using rigorous performance metrics. In the second stage, probabilistic features were generated from the six selected models for each feature encoding and aggregated as input to our final meta-model, AMP-RNNpro. This study also reports the 20 most important features identified with SHAP, which are relevant to drug development; among them, the AAC, ASDC, and CKSAAGP features are highly impactful for detection and drug discovery. The proposed framework, AMP-RNNpro, identifies novel AMPs with 97.15% accuracy, 96.48% sensitivity, and 97.87% specificity. We built a user-friendly website that demonstrates AMP prediction based on the proposed approach; it can be accessed at http://13.126.159.30/.

Our model, AMP-RNNpro, significantly outperforms other state-of-the-art methods. We have developed an efficient prediction framework based on our proposed model; the model can be accessed at http://13.126.159.30/.

Workflow of the study
This study introduces a novel approach to identifying AMPs based on a comparatively larger dataset constructed and acquired through a comprehensive literature review. Our procedural methodology is depicted in Fig. 1. We applied CD-HIT to reduce the redundancy of the sequences and obtain a more refined dataset. Eight feature extraction methods were employed on the finalized dataset. We trained and tested 33 machine-learning methods on each of the eight feature encodings. The performance of the models was rigorously tested using independent tests and tenfold cross-validation. To construct the secondary dataset, we selected six models based on their overall performance: K-nearest Neighbor (KNN), Random Forest (RF), Extreme Gradient Boosting Classifier (XGB), Extra-trees Classifier (EX), and two meta-classifiers, Voting Classifier (Voting) and a Recurrent Neural Network (RNN) based approach called AMP-RNNpro. All the models and relevant parametric variables were derived using Scikit-learn, a freely available data-mining library for Python 27,28. Based on the eight feature encoding methods, we generated probabilistic values from the selected models, yielding 48-dimensional (48D) features that are fed into the final predictor. The secondary dataset (48D probabilistic values) contained more positive values than negative ones. Consequently, we used a balancing strategy called the Synthetic Minority Oversampling Technique (SMOTE) for the negative class 29. Afterward, we fed the balanced dataset into the six models. According to the comparison of these models, AMP-RNNpro emerged as our meta-model of choice, given that it takes the 48D features as input and provides the most efficient outcomes. Finally, our methodology incorporates SHapley Additive exPlanations (SHAP) to illustrate the top 20 features 30, which contribute significantly to our model's performance.
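The construction of the 48D secondary dataset described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `probabilistic_features`, the label "RNN" for the sixth model, and the toy constant-probability predictors are assumptions made only to show how 8 encodings and 6 models yield 48 probability values per peptide.

```python
from typing import Callable, Dict, List

ENCODINGS = ["AAC", "ASDC", "CKSAAGP", "DP", "GAAC", "MORAN", "NMBroto", "PseKRAAC"]
MODELS = ["KNN", "RF", "XGB", "EX", "Voting", "RNN"]

# A first-stage predictor maps one encoded peptide to P(AMP) in [0, 1].
Predictor = Callable[[List[float]], float]


def probabilistic_features(
    encoded: Dict[str, List[float]],
    predictors: Dict[str, Dict[str, Predictor]],
) -> List[float]:
    """Concatenate P(AMP) from every (encoding, model) pair: 8 x 6 = 48 values."""
    features = []
    for enc in ENCODINGS:
        for model in MODELS:
            features.append(predictors[enc][model](encoded[enc]))
    return features


# Hypothetical stand-ins: each "model" returns a constant probability.
toy_predictors = {
    enc: {m: (lambda x, p=0.1 * (i + 1): p) for i, m in enumerate(MODELS)}
    for enc in ENCODINGS
}
toy_encoded = {enc: [0.0] for enc in ENCODINGS}

vector = probabilistic_features(toy_encoded, toy_predictors)
print(len(vector))
```

In a real pipeline, each predictor would be a trained first-stage model, and the resulting 48D vectors would be balanced before being passed to the meta-model.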

Dataset description
We collected four datasets for this study. First, we collected the XUAMP data from Xu et al. 3. The authors constructed their dataset by merging samples from several repositories such as DRAMP 11, DRAMP 2.0 31, LAMP 13, and YADAMP 32. They selected 3072 samples with a sequence homology of less than 40%.
We collected the second dataset from Yan et al. 21. The authors created the DBAASP non-redundant independent test dataset by curating positive classes from DBAASPV3 33 and negative classes from the UniProt databases 34. The DBAASP dataset contains 356 samples, with redundancy reduced at a 90% homology threshold for the positive samples and a 40% homology threshold for the negative samples. We also gathered the LAMP 13 and DRAMP 11 datasets. As mentioned, the XUAMP dataset was already built from these databases with a 40% threshold. In the current study, we merged all the datasets and applied the Cluster Database at High Identity with Tolerance (CD-HIT) 35 with an 80% threshold and a word size of 5. This procedure was conducted to reduce redundancy and increase efficiency in both the training and test datasets. This comprehensive selection of datasets supports a thorough evaluation of the capabilities of the proposed technique under various circumstances. Table 1 lists the statistical information of the datasets. Generally, the length of the sequences was not greater than 100 or less than 10. Sequences containing non-conventional amino acids, such as "B, J, O, U, X, Z", are rare 15 and were excluded from our study. The retained peptide sequences consist only of "A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y" and were filtered for further analysis. Figure 2 illustrates the amino acid distribution of the final datasets.
Figure 2 exhibits the compositional distribution of the 20 amino acids, in percent, for both positive and negative cases. The letters in Fig. 2 denote the amino acids. There are nine non-polar amino acids: alanine (A), phenylalanine (F), glycine (G), isoleucine (I), leucine (L), methionine (M), proline (P), valine (V), and tryptophan (W). There are six polar, uncharged amino acids: serine (S), cysteine (C), asparagine (N), glutamine (Q), threonine (T), and tyrosine (Y). Two amino acids belong to the acidic group: glutamic acid (E) and aspartic acid (D). Lysine (K), arginine (R), and histidine (H) are basic amino acids 36. In this study, we observed significant differences in amino acid composition between active antimicrobial peptides (AMPs) and their inactive counterparts (non-AMPs), as demonstrated by the bar graph analysis. In the positive AMPs, the non-polar amino acid proline (P) and the polar amino acid serine (S) were enriched by more than 100%. For non-AMPs, the non-polar amino acids alanine (A) and leucine (L) constituted more than 80% of the total amino acids.
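The sequence filtering described above (length between 10 and 100, standard residues only) can be sketched in a few lines. This is an illustrative sketch; the function name `keep_sequence` and the inclusive length bounds are assumptions, and the example peptides are toy inputs.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def keep_sequence(seq: str, min_len: int = 10, max_len: int = 100) -> bool:
    """Return True if the peptide has an allowed length and only standard residues."""
    seq = seq.strip().upper()
    return min_len <= len(seq) <= max_len and set(seq) <= STANDARD_AA


candidates = [
    "GIGKFLHSAKKFGKAFVGEIMNS",  # standard residues, length 23: kept
    "ACDXEFGHIKLMN",            # contains the non-conventional residue X: dropped
    "KKLL",                     # shorter than 10 residues: dropped
]
kept = [s for s in candidates if keep_sequence(s)]
print(kept)
```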
[I] Amino acid compositional features

AAC
AAC calculates the normalized frequency of each amino acid in a sequence. It provides an overview of the proportion of each residue type in a peptide 39. The mathematical formula is as follows:

$$AAC(k) = \frac{N_k}{N}$$

where k denotes one of the 20 amino acid types, N_k is the number of residues of type k in the sequence, and N is the total number of amino acids in the sequence. In this study, we used the 20D AAC features.
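As a concrete illustration of the AAC encoding, the following sketch counts each residue type and divides by the sequence length. The alphabetical ordering of the 20 dimensions and the function name `aac` are assumptions for illustration; the input is assumed to contain only standard residues.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(seq: str) -> list:
    """20D amino acid composition: count of each residue type divided by length."""
    counts = Counter(seq)
    n = len(seq)
    return [counts[a] / n for a in AMINO_ACIDS]


vec = aac("AAKKLL")
print(len(vec), vec[AMINO_ACIDS.index("A")])
```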

ASDC
ASDC is an adapted version of the dipeptide composition that produces a comprehensive description by considering all pairs of residues, both neighboring residues and residues separated by intervening residues 39. The ASDC feature vector consists of frequencies f_i, where f_i is the frequency of the i-th amino acid pair counted over all pairs with ≤ T − 1 intervening amino acids, and T − 1 bounds the number of intervening amino acids. This study used the 400D ASDC features.
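A minimal sketch of ASDC, assuming the commonly used definition in which every ordered residue pair at any spacing is counted and the 400 counts are normalized by the total number of pairs. The function name `asdc` and the pair ordering are illustrative assumptions, and the input is assumed to contain only standard residues.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def asdc(seq: str) -> list:
    """400D vector: frequency of each residue pair over all gaps in the sequence."""
    counts = dict.fromkeys(PAIRS, 0)
    for i in range(len(seq)):
        for j in range(i + 1, len(seq)):  # adjacent and gapped pairs alike
            counts[seq[i] + seq[j]] += 1
    total = sum(counts.values())
    return [counts[p] / total if total else 0.0 for p in PAIRS]


vec = asdc("AKA")  # pairs: AK, AA, KA
print(vec[PAIRS.index("AK")])
```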

DP
Another feature-encoding method is DP. It is based on the frequencies of k-spaced amino acid pairs, the composition of the protein sequence, and the distance pairs used in PseAAC, which are pairs of amino acids separated by a given number of residues. The Reduced Alphabet Scheme clusters similar amino acids to reduce the dimensionality of the feature vector 40. In the formula, DP(i, j) is the number of distance pairs of residues, N is the length of the sequence, n is the distance between two residues, and w is the weight of the i-th residue in the k-th sequence. In this study, the 20D DP features were used.
[II] Grouped amino acid compositional features

GAAC
GAAC features are split into five groups: the aliphatic group GAVLMI (6 amino acids), the aromatic group FYW (3 amino acids), the positively charged group KHR (3 amino acids), the negatively charged group DE (2 amino acids), and the uncharged group STCPNQ (6 amino acids) 39,41,42. In the formula, k is the amino acid type, G is the group, N is the total number of amino acids, and G(k) is the number of residues that belong to the group. In this study, we used the 5D GAAC features.
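The grouping above translates directly into a 5D composition vector. The sketch below is illustrative; the function name `gaac` and the group labels are assumptions, while the group memberships are those listed in the text.

```python
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}


def gaac(seq: str) -> list:
    """5D vector: fraction of residues falling into each physicochemical group."""
    n = len(seq)
    return [sum(seq.count(a) for a in members) / n for members in GROUPS.values()]


vec = gaac("AFKDS")  # one residue from each group
print(vec)
```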

CKSAAGP
CKSAAGP considers pairs of amino acid groups separated by any k residues, giving a more adaptable way to identify local sequence trends in protein sequences. It evaluates the presence of amino acid group pairs within a specified distance and can reveal significant structural and functional patterns 43.
In the formula, T is the length of the peptide, N is the total number of amino acids, and g1, g2, …, g5 are the groups of amino acids. 100D CKSAAGP-based features have been used in this study.
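A minimal sketch of CKSAAGP under stated assumptions: the five groups are those used for GAAC, the gaps are k = 0, 1, 2, 3 (which gives 4 × 25 = 100 dimensions, consistent with the 100D stated above), and each count is normalized by the number of pairs available at that gap. The labels g1 to g5, their order, and the function name `cksaagp` are illustrative choices.

```python
from itertools import product

GROUPS = {"g1": "GAVLMI", "g2": "FYW", "g3": "KRH", "g4": "DE", "g5": "STCPNQ"}
GROUP_OF = {aa: g for g, members in GROUPS.items() for aa in members}
GROUP_PAIRS = [a + "." + b for a, b in product(GROUPS, repeat=2)]  # 25 group pairs


def cksaagp(seq: str, max_gap: int = 3) -> list:
    """Frequencies of group pairs separated by k = 0..max_gap residues."""
    features = []
    for k in range(max_gap + 1):
        counts = dict.fromkeys(GROUP_PAIRS, 0)
        n_pairs = 0
        for i in range(len(seq) - k - 1):
            counts[GROUP_OF[seq[i]] + "." + GROUP_OF[seq[i + k + 1]]] += 1
            n_pairs += 1
        features.extend(counts[p] / n_pairs if n_pairs else 0.0 for p in GROUP_PAIRS)
    return features


features = cksaagp("AKAK")
print(len(features))
```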

[III] Autocorrelation features

MORAN
This is a correlation-based feature 44 used to evaluate how closely related nearby measurements are in a spatial data collection 45. In this study, we used the 16D MORAN features. In the formula, T is the total quantity of the positions at β_kj, N is the current number of coordinates, a is the normalized value of the a_k and a_j parameters, and β_kj is the dimension of the coordinates.

NMBroto
This is similar to the MORAN feature. However, the two differ in function, normalization, and calculation, as NMBroto is calculated using the frequencies of k-spaced amino acid pairs and the amino acid composition of the protein sequence 46. In the NMBroto formula, k denotes the position of a residue, and A, t, and lag denote the length of the residues and the distance between them. This study used the 16D NMBroto features.
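The two autocorrelation descriptors can be illustrated with their standard textbook definitions for a single physicochemical property and a single lag. This is a sketch under assumptions: the Kyte-Doolittle hydropathy scale is used only as an example property, the property is standardized over the 20 amino acids, and the source does not state which properties and lags produce its 16D vectors.

```python
from statistics import mean, pstdev

# Kyte-Doolittle hydropathy, used here only as an illustrative property scale.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
_values = list(HYDROPATHY.values())
_mu, _sd = mean(_values), pstdev(_values)
Z = {aa: (v - _mu) / _sd for aa, v in HYDROPATHY.items()}  # standardized property


def nmbroto(seq: str, lag: int) -> float:
    """Normalized Moreau-Broto: mean product of property values `lag` apart."""
    n = len(seq)
    if lag >= n:
        return 0.0
    return sum(Z[seq[i]] * Z[seq[i + lag]] for i in range(n - lag)) / (n - lag)


def moran(seq: str, lag: int) -> float:
    """Moran: lagged covariance around the sequence mean, divided by the variance."""
    n = len(seq)
    if lag >= n:
        return 0.0
    values = [Z[a] for a in seq]
    avg = mean(values)
    variance = sum((v - avg) ** 2 for v in values) / n
    if variance == 0:
        return 0.0
    cov = sum((values[i] - avg) * (values[i + lag] - avg) for i in range(n - lag))
    return (cov / (n - lag)) / variance


print(moran("IKIKIK", 1), nmbroto("AAAA", 1))
```

A perfectly alternating hydrophobic/charged peptide gives a Moran value close to -1 at lag 1, which is the kind of periodicity signal these descriptors are meant to capture.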

[IV] Pseudo-amino acid compositional features

PseKRAAC
This is an extension of the Pseudo Amino Acid Composition (PseAAC). This feature has 16 types of clustering methods; in this study, we used the type 7 features, also called multiple clusters, with 4 clustering methods 47. In the formula, w is the weight of the j-th position, n is the length of the tuple, N is the length of the sequence, and f_i is the frequency of the i-th residue.

Our proposed model construction
RNN is one of the most popular deep learning models and is used in various fields for accurate classification 48. RNNs can handle sequential data such as natural language processing (NLP) data. At each step, an RNN processes the current input together with the hidden state from the previous step. This hidden state serves as the network's memory and allows it to capture correlations in sequential input 49. We selected this approach to detect AMPs because RNNs, although mainly used for time series data, can also be applied to sequence data, which makes them appropriate for tasks that require sequential information. RNNs are designed to identify relationships and patterns in sequential data. FASTA sequences may vary in length, and RNNs can handle sequences of varied lengths even though they use a fixed set of input weights. This adaptability is significant in genetics and bioinformatics, where sequences differ in length.
We constructed our meta-model "AMP-RNNpro", shown in Fig. 3, with six layers: an input layer, four hidden layers, and a dense layer. Fifty epochs, three activation functions, and various filter sizes were used in the independent test. The filter sizes of the four hidden layers are 128, 64, 32, and 16. We adopted the ReLU activation function in the first three layers and the tanh function in the fourth layer to handle the complexity. We added dropouts of 0.5, 0.2, 0.2, and 0.2 to reduce over-fitting. Finally, a dense layer contains a single neuron with a sigmoid activation function, whose output is mapped to the binary classes 0 and 1. A test result indicates an AMP if it is greater than 0.5; otherwise, it indicates a non-AMP. This study used the Adam optimizer to adjust the model's internal parameters. The Keras library, a popular tool for developing neural networks, was used to build our model 50. The RNN structure and the ReLU, sigmoid, and tanh functions are specified as:

$$j_t = \alpha\left(W_{hh}\, j_{t-1} + W_{xh}\, x_t + b_h\right)$$

$$R_u = \max(0, u)$$

$$S_a = \frac{1}{1 + e^{-a}}$$

$$H_t = \frac{e^{t} - e^{-t}}{e^{t} + e^{-t}}$$

where W_hh is the weight matrix of the recurrent connections, W_xh is the weight matrix of the input connections, x_t is the input at the current step, b_h is the bias vector, j_t is the current state, j_{t−1} is the previous state, and α is the activation function. R_u denotes the ReLU, which returns the maximum of 0 and the input u. S_a denotes the sigmoid function, where e is the exponential function and the output range is (0, 1). H_t is the tanh function, whose range is (−1, 1); here t denotes the input of the function, the numerator e^t − e^{−t} corresponds to the hyperbolic sine, and the denominator e^t + e^{−t} corresponds to the hyperbolic cosine (each up to a factor of 2).
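To make the recurrence and the activation functions concrete, the following pure-Python sketch implements one recurrent step and the 0.5 decision threshold. It is not the Keras model itself: the helper names `rnn_step` and `classify`, the use of plain lists instead of tensors, and the toy weights are assumptions for illustration.

```python
import math


def relu(u: float) -> float:
    return max(0.0, u)


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def tanh(t: float) -> float:
    return (math.exp(t) - math.exp(-t)) / (math.exp(t) + math.exp(-t))


def rnn_step(j_prev, x_t, W_hh, W_xh, b_h, alpha=tanh):
    """One step j_t = alpha(W_hh j_prev + W_xh x_t + b_h) on plain Python lists."""
    j_t = []
    for r in range(len(b_h)):
        pre = b_h[r]
        pre += sum(W_hh[r][c] * j_prev[c] for c in range(len(j_prev)))
        pre += sum(W_xh[r][c] * x_t[c] for c in range(len(x_t)))
        j_t.append(alpha(pre))
    return j_t


def classify(probability: float, threshold: float = 0.5) -> str:
    """Map the sigmoid output to a label, as in the decision rule above."""
    return "AMP" if probability > threshold else "non-AMP"


# Toy example with a single hidden unit and a single input feature.
state = rnn_step([0.0], [1.0], W_hh=[[0.5]], W_xh=[[2.0]], b_h=[-1.0], alpha=relu)
print(state, classify(sigmoid(state[0])))
```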

Machine-learning models
This study used 33 models, comprising traditional models and meta-models built with stacking and voting classifiers, along with a simple RNN model. We investigated several combinations of voting- and stacking-based models. All the models are listed in the supplementary file (S1). Based on their performance, we selected two meta-classifiers, Voting and AMP-RNNpro, and four distinct classification methods: K-nearest Neighbor (KNN), Random Forest (RF), Extreme Gradient Boosting Classifier (XGB), and Extra-trees Classifier (EX). We tuned several hyper-parameters to obtain better outcomes. These models are described below.
KNN is one of the most widely used classification techniques. In general, KNN assigns the majority class among the K nearest data points in the feature space 51. We set K to 100 so that the 100 nearest neighbors in the dataset are considered. We used the Manhattan distance between data points. We set the weights parameter to "distance" so that closer neighbors have a more substantial impact on the prediction. We also used the "kd tree" algorithm for the neighbor search.
Another classification technique, RF, builds many decision trees during the training phase and predicts the result by voting among them 52. In this study, the RF model is configured with "sqrt" as the number of features considered at each split, which improves the model's robustness and helps prevent over-fitting. The node-splitting criterion was set to "entropy", the random state was set to "100" for reproducible outcomes, and the ensemble consisted of "100" decision trees (DT).
The XGB model combines a highly optimized implementation with gradient boosting, in which trees are built sequentially and each subsequent tree corrects the mistakes made by its predecessors to produce an accurate result 53. This study used "100" estimators (boosting rounds) and a learning rate of "0.1"; a subsample of "1.0" means that all training samples are used in each round. The regularization parameter was set to "30" to help prevent underfitting or overfitting.
The EX classifier builds trees using random splits and produces its result by combining their predictions 54. EX provided its most effective results with the averaging method. This study used "100" estimators for the classification.
Another popular ensemble approach in machine learning is the Voting classifier. We combined the estimated probabilities from multiple baseline models (KNN, RF, XGB, DT, and EX) and used the soft voting parameter to deliver the final classification results.
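Soft voting, as used by the Voting classifier above, averages the class probabilities of the base models and predicts the class with the highest average. The sketch below shows the rule with unweighted averaging; the function name `soft_vote` and the five probability pairs (standing in for KNN, RF, XGB, DT, and EX outputs) are hypothetical.

```python
def soft_vote(probabilities):
    """probabilities: one [P(non-AMP), P(AMP)] pair per base model."""
    n = len(probabilities)
    avg = [sum(p[c] for p in probabilities) / n for c in range(2)]
    label = max(range(2), key=lambda c: avg[c])  # 1 = AMP, 0 = non-AMP
    return avg, label


# Hypothetical outputs of five base models for a single peptide.
avg, label = soft_vote([[0.2, 0.8], [0.6, 0.4], [0.3, 0.7], [0.4, 0.6], [0.1, 0.9]])
print(avg, label)
```

Note that one base model votes for the negative class here, but the averaged probability still favors the positive class, which is the main difference from hard (majority) voting.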

Experimental results
In this study, we used several performance evaluation metrics, as mentioned in the previous section, to assess the performance of the developed models. We compared the performance of several machine learning models with that of our proposed model, AMP-RNNpro. All the results are compared and analyzed in this section, highlighting the performance of the proposed model.
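The evaluation metrics reported in this section (accuracy, sensitivity, specificity, precision, F1-score, MCC, and Cohen's kappa, K) can all be computed from a binary confusion matrix using their standard definitions. The sketch below is illustrative; the function name `metrics` and the example counts are hypothetical and do not come from the study.

```python
import math


def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    acc = (tp + tn) / total
    sn = tp / (tp + fn)            # sensitivity (recall on the positive class)
    sp = tn / (tn + fp)            # specificity (recall on the negative class)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sn / (precision + sn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (acc - p_e) / (1 - p_e)
    return {"acc": acc, "sn": sn, "sp": sp, "precision": precision,
            "f1": f1, "mcc": mcc, "kappa": kappa}


result = metrics(tp=90, fp=20, tn=80, fn=10)  # hypothetical counts
print(result)
```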

Performances of machine learning models
Table 2 reports the independent test results, which were better than those of cross-validation. The cross-validation results and the remaining independent test performances are provided in the supplementary file.
In Table 2, across the various descriptors, the best performance was obtained with the ASDC feature encoding, which makes it a strong candidate among the eight feature encoding techniques. With the AAC features, EX obtained the best outcome across the overall evaluation metrics. With the ASDC features, AMP-RNNpro surpassed the other models, securing 95.58% accuracy. AMP-RNNpro performed remarkably well not only with the ASDC features but also with the other features when all the evaluation metrics are considered. With ASDC, the sensitivity and specificity of this model were 97.65% and 93.85%, respectively, which indicates proficiency in detecting new samples precisely. With the CKSAAGP features, AMP-RNNpro performed considerably better than the other models, obtaining an accuracy of > 90%. With the DP feature encoding, EX performed notably, providing an accuracy of 95.19%, with the other evaluation metrics scoring more than 90%. With the GAAC encoding, AMP-RNNpro performed better than the other models. With the MORAN features, both EX and AMP-RNNpro performed well, with a similar accuracy of 90.18%. On the other evaluation metrics, EX achieved a sensitivity of 96.45% and a specificity of 84.93%, on par with AMP-RNNpro, which achieved 95.95% sensitivity and 85.95% specificity. With the NMBroto and PseKRAAC features, the RF model obtained the highest accuracy. It is evident that ASDC has enormous potential for detecting AMPs and that AMP-RNNpro displayed the most outstanding performance among the classifiers. All models are statistically significant except for AMP-RNNpro with the DP descriptor, where the p-value is greater than 0.01, indicating that there is insufficient evidence to reject the null hypothesis. A p-value of less than 0.05 is taken to indicate a statistically significant difference 58. All the p-values are included in the supplementary file.
In Table 3, we present the analysis of the 48D probabilistic values, obtained by merging all the probabilistic values generated by our six selected machine-learning models. The table shows that AMP-RNNpro performs best overall and achieves excellent results on various evaluation metrics. The model classifies with an accuracy of 97.15%. Its K value, a measure of inter-rater consistency, is 94.30%, which indicates the model's stability, and its MCC is 94.31%. Furthermore, the model captures the positive class with an Sn of 96.48% and detects the negative class with an Sp of 97.87%. The model's balanced performance is indicated by the F1-score and precision, which reach 97.23% and 97.87%. Although KNN and Voting have high precision rates, AMP-RNNpro has the best values on the other metrics with adequate precision: it captured the actual class in more than 97% of cases and balanced the actual and predicted classes more precisely. In Sp, Voting reaches 98.34%, a high ability to distinguish the negative class; nevertheless, our proposed model AMP-RNNpro also detects non-AMPs well. Overall, AMP-RNNpro is a suitable model for identifying antimicrobials from FASTA sequences.
In Fig. 4, the AMP-RNNpro, KNN, RF, and EX classifiers have a remarkable AUC value of 0.98 in subplots F and G. In subplot E, the AMP-RNNpro, KNN, and EX classifiers have an AUC value of 0.98. The KNN and RF classifiers have an AUC score of 0.92 in subplot H. Subplot I shows the outcomes for the probabilistic values: most of the models perform better on this dataset, and AMP-RNNpro obtained an AUC score of 99.61%, demonstrating its proficiency in accurately distinguishing the classes. As a result, Fig. 4 illustrates the overall decent performance of these methods, with the majority identifying AMPs effectively with AUC values over 0.99.
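The AUC values discussed above measure how often a classifier ranks a positive sample above a negative one. A minimal sketch of this pairwise (Mann-Whitney) formulation follows; the function name `auc` and the toy labels and scores are hypothetical, and the quadratic loop is meant for illustration rather than large datasets.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


value = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(value)
```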

Comparison of AMP-RNNpro with others model in the current study
To demonstrate the strengths of probabilistic feature combinations over single-feature encodings, we generated figures based on several performance evaluation metrics. Figure 5 arranges the feature extraction strategies according to performance. For every performance evaluation metric, AMP-RNNpro with the 48D probabilistic values outperforms every single-encoding model. Although XGB and RF demonstrated excellent MCC, Sp, and Sn with single descriptors, AMP-RNNpro achieved the best overall results with the 48D probabilistic values and also performed well with single feature encodings: in terms of accuracy, it performed best with AAC, ASDC, and CKSAAGP, and with the probabilistic features it obtained higher accuracy than the other methods. The model also provided satisfactory results in MCC, Sn, and Sp. Therefore, considering the overall performance, we conclude that our proposed model AMP-RNNpro achieved a better outcome across the evaluation metrics.

Performance comparison of the existing predictor
Figure 6 illustrates a comprehensive comparison of specificity and sensitivity for several models, including our proposed model and existing models such as sAMPpred-GAT, iAMP-2L, AMPlify, iAMPpred, LMpred, AMPFinder, and AMPscanner. The results show that our model, AMP-RNNpro, outperformed all other models. The increased specificity indicates that our algorithm produces few false positives when detecting AMPs.
In Table 4, we compare the performance of our model with several existing prediction tools. Our model achieved higher accuracy and AUC scores than the other models. It takes probabilistic features derived from 8 feature encoding techniques, which possess intrinsic discriminating capability, and delivers a balanced outcome, identifying the negative class with 97.87% specificity and the positive class with 96.48% sensitivity. Moreover, our model obtained a 99.61% AUC score and 97.15% accuracy. It can therefore be concluded that our model distinguishes well between active and inactive AMPs. Compared with the performance of the iAMPpred and iAMP-2L models on the independent test dataset of AMPs, our model improves accuracy by 4% and specificity by 10% over these two models. Based on the independent test analysis, AMP-RNNpro outperformed the AMPlify model by 15% in accuracy and 30% in sensitivity. The difference between the sensitivity and specificity of AMPlify is over 30 percent, which may lead to unbalanced detection on unseen data. Our model is more powerful and more accessible for detecting AMPs than sAMPpred-GAT, a complex GAT-based model that was evaluated using cross-validation. In our study, we evaluated our model on an independent test, as this depicts how well the model is suited to practical application better than cross-validation does. The performance of sAMPpred-GAT is lower than that of AMP-RNNpro, and its sensitivity and specificity differ by over 35%, which may strongly bias the model. LMpred and AMPFinder tested their models on various datasets. AMP-RNNpro outperformed LMpred by 3 percent in accuracy, sensitivity, and specificity. Compared with AMPFinder, AMP-RNNpro achieved 3% higher accuracy. For AMPFinder, the gap between specificity and sensitivity is 10%, whereas for our model it is 1%, which demonstrates more consistent performance in differentiating between AMPs and non-AMPs. Comparing our proposed model with the majority of the state-of-the-art methods, we conclude that it delivers more balanced and accurate results, which will be more efficient for real-life applications.

Adaptability and stability analysis
We conducted experiments with our proposed model on diverse datasets. We used AMPFinder's D1 test dataset and iAMPCN's first-stage test dataset to evaluate the model's capabilities.

Case study 1
We used AMPFinder's D1 dataset 26, which contains 980 active sequences and 982 non-active sequences. On this dataset, AMP-RNNpro obtained 96.73% accuracy, 99.82% sensitivity, and 62.96% specificity. Our model thus performed well in the independent test approach.

Case study 2
We conducted another experiment on the first-stage independent test dataset of the iAMPCN 22 model to validate our model. The authors stated that they organized their dataset by aggregating various data repositories. We collected 2000 negative and positive samples to assess our model. The results showed 96.13% accuracy, 91.16% sensitivity, and 98.46% specificity, demonstrating our model's strong ability to recognize AMPs in this dataset.

Interpretation
AMP-RNNpro has been constructed with optimal probabilistic features from eight feature encoding techniques. Hence, it has delivered a more robust and precise performance compared to the previous predictors. Following recent studies, we interpreted the model using SHAP 30 by illustrating the impact of the probabilistic features on performance. Figure 7 shows the top 20 features based on their overall impact on the outcome of our model.
As a result, further exploration of these features holds great promise for both detection and drug discovery.

Website implementation
We have implemented a website for our model to predict AMPs. The interface of our prediction tool is shown in Fig. 8.
We designed a simple interface that is easy to understand and efficient to use for detecting AMPs. First, an input section allows a user to provide sequences in FASTA format for AMP prediction. Below the input section are two buttons: 'Predict' and 'Example'. After the user clicks the 'Predict' button, the prediction result is shown in the output box. The output is shown in First-In-First-Out (FIFO) order. When the user presses the 'Example' button, some example sequences are placed in the input section. The output is shown as positive for active AMPs and negative for inactive AMPs. Additionally, if the given sequences contain any unnecessary numbers or strings, these are excluded during prediction and the result is provided for the clipped sequences. Our prediction tool can be found at http://13.126.159.30/.
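The input handling described above (FASTA parsing, removal of stray characters, and output in input order) could look roughly like the following sketch. It is not the website's actual implementation: the function names `parse_fasta` and `clean`, the "unnamed" placeholder header, and the rule of dropping every character outside the 20 standard residues are assumptions.

```python
import re


def parse_fasta(text: str) -> list:
    """Return (header, sequence) records in the order they appear in the input."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            if header is None:
                header = "unnamed"  # sequence given without a FASTA header
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def clean(seq: str) -> str:
    """Drop digits, whitespace, and any other non-standard characters."""
    return re.sub(r"[^ACDEFGHIKLMNPQRSTVWY]", "", seq.upper())


raw = ">p1\nGIG12KFL\nHSAK\n>p2\nkk ll\n"
cleaned = [(name, clean(seq)) for name, seq in parse_fasta(raw)]
print(cleaned)
```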

Conclusion
A robust and novel method, named AMP-RNNpro, has been developed for detecting AMPs based on eight features of different criteria, additionally providing insights into the features that play a dominant role in the detection. The proposed model combines compositional, positional, physicochemical, and other properties to detect AMPs with high accuracy and precision. Our method is novel in that the probabilistic features possess more innate ability to distinguish AMPs. Thus, it analyzes AMPs more swiftly, quickly identifying whether they have antimicrobial characteristics and categorizing the features. In healthcare institutions, this is crucial for efficiently and rapidly appraising patient medication. We have built a user-friendly website to predict AMPs with our proposed model. To increase the precision and efficiency of AMP identification, future studies are needed to explore new feature encoding methods and ensemble deep neural network feature selection techniques, which may help in measuring the contribution of each feature encoding technique to discerning AMPs from non-AMPs, and to incorporate larger datasets from the medical field.

Figure 1. Overview of the AMP-RNNpro method. (A) Dataset collection, preprocessing phase, and feature encoding. (B) Applying these feature encodings in independent tests and cross-validation with 33 individual models, then selecting the six best models from the 33. (C) Probability feature construction, deployment of the RNN as the final predictor, and illustration of outcomes.

Figure 4 compares the true positive and false positive rates for six classifiers using eight feature encodings and the probabilistic features (AAC, ASDC, CKSAAGP, DP, GAAC, MORAN, NMBroto, PseKRAAC, and the 48D probability merged dataset). The approaches are labeled A, B, C, D, E, F, G, H, and I. When a thorough analysis is considered, AMP-RNNpro stands out as the best model within the machine learning framework for the feature encodings and the 48D dataset. The RF, AMP-RNNpro, KNN, and EX classifiers each attain a noteworthy AUC value of 0.99 in subplots A, B, and D. In C, the AMP-RNNpro, KNN, and EX classifiers achieved an AUC score of 0.99.

Figure 4. AUC-ROC curve analysis of six classifiers on eight feature encoding methods and the probabilistic merged dataset. The approaches are labeled (A) AAC feature, (B) ASDC feature, (C) CKSAAGP feature, (D) DP feature, (E) GAAC feature, (F) MORAN feature, (G) NMBroto feature, (H) PseKRAAC feature, and (I) 48-dimensional probabilistic features from six classifiers.

In Fig. 7, the best six models based on the AAC, ASDC, and CKSAAGP features make the most significant contribution to the detection of AMPs. The remaining 2 spots of the top 20 are taken by models based on the NMBroto feature encoding technique. This indicates that the compositional features AAC and ASDC play a vital role in the detection and development of medications. Wang et al. previously examined AAC, the amino acid composition, and ASDC, which represents the amino acid chain, and stated that these two features have significant potential for drug discovery and peptide identification 60. Kabir et al. also mentioned that the AAC feature is more impactful in detecting AMPs 61. Park et al. proposed a prediction tool for an antimicrobial function, anticancer activity, and found that CKSAAGP was one of the most important features for anticancer prediction 62.

Figure 6. Comparison of the other proposed models with AMP-RNNpro.

Figure 8. AMP-RNNpro framework's website, showing an input box, an example button, a predict button, and the outcome for the AMPs.

Table 1. Datasets and statistical information.

Table 2. Performance of machine learning classifiers and AMP-RNNpro on feature encoding methods. Significant values are in bold.

Table 3. Performance analysis of probabilistic features frameworks. Significant values are in bold.

Scientific Reports | (2024) 14:12892 | https://doi.org/10.1038/s41598-024-63461-6

Table 4. Performances of AMP-RNNpro and existing AMP prediction tools.