Better understanding and prediction of antiviral peptides through primary and secondary structure feature importance

Chowdhury, Abu Sayed; Reehl, Sarah M.; Kehn-Hall, Kylene; Bishop, Barney; Webb-Robertson, Bobbie-Jo M.

doi:10.1038/s41598-020-76161-8

Published: 06 November 2020

Scientific Reports volume 10, Article number: 19260 (2020)

Abstract

The emergence of viral epidemics throughout the world is of concern due to the scarcity of available effective antiviral therapeutics. The discovery of new antiviral therapies is imperative to address this challenge, and antiviral peptides (AVPs) represent a valuable resource for the development of novel therapies to combat viral infection. We present a new machine learning model to distinguish AVPs from non-AVPs using the most informative features derived from the physicochemical and structural properties of their amino acid sequences. To focus on those features that are most likely to contribute to antiviral performance, we filter potential features based on their importance for classification. These feature selection analyses suggest that secondary structure is the most important peptide sequence feature for predicting AVPs. Our Feature-Informed Reduced Machine Learning for Antiviral Peptide Prediction (FIRM-AVP) approach achieves a higher accuracy than either the model with all features or current state-of-the-art single classifiers. Understanding the features that are associated with AVP activity is a core need to identify and design new AVPs in novel systems. The FIRM-AVP code and standalone software package are available at https://github.com/pmartR/FIRM-AVP with an accompanying web application at https://msc-viz.emsl.pnnl.gov/AVPR.

Introduction

Zoonotic viruses such as Ebola virus, Zika virus, West Nile virus and, more recently, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) can cause life-threatening disease outbreaks due to their high genetic diversity, variety of routes for transmission, and ability to replicate efficiently and to persist in their hosts^1,2,3,4. The control of viral disease continues to be a challenging task due to increased resistance to the limited set of available antiviral therapies and the continual emergence of novel viral pathogens. Antiviral peptides (AVPs) are a subset of antimicrobial peptides and are a potential resource for the development of new potent therapeutics for preventing or treating viral infection. The ability of AVPs to target various aspects of the viral lifecycle, ranging from viral attachment to host cells to viral replication within the cells, has been the subject of multiple studies^{5,6,7,8,9,10,11,12,13}. AVPs can be natural or synthetic, the latter obtained by introducing chemical groups or non-natural amino acids into natural peptide sequences^4,13,14. Considering AVPs in the design of new antiviral therapeutics is advantageous because it allows us to capitalize on their low molecular weight, low toxicity, high specificity and effectiveness, and minor side effects¹⁵. Machine learning is a powerful strategy for identifying AVPs by leveraging the ever-increasing data available in public databases, such as AVP Prediction (AVPpred)¹⁶, Antimicrobial Peptide Database (APD3)¹⁷, Collection of Antimicrobial Peptides (CAMPR3)¹⁸ and HIV inhibitory peptides database (HIPdb)¹⁹.

Researchers have previously developed machine learning models^{16,20,21,22,23,24,25} for predicting AVPs. Thakur et al.¹⁶ developed AVPpred, a web server for collecting and detecting highly effective AVPs. The authors used a support vector machine (SVM) to build two machine learning models based on amino acid composition (AAC) and physicochemical features. This was then extended to a random forest (RF)-based model²⁰, which outperformed the SVM used in AVPpred. The RF models were constructed using AAC, physicochemical properties, aggregation propensities of amino acids and secondary structure. Lissabet et al.²¹ developed a portable software version of the RF method called AntiVPP 1.0 that gives improved prediction accuracy. Qureshi et al.²² introduced a regression-based algorithm, AVP-IC₅₀Pred, to predict the AVP half maximal inhibitory concentration (IC₅₀). Various peptide features such as AAC, binary profile, physicochemical properties and solvent accessibility were considered, and a number of machine learning techniques with individual features and different combinations of features were used to predict the IC₅₀ value of the peptide sequences. Further, based on the assumption that AVPs have low sequence similarity, pseudo amino acid composition (PseAAC)²⁶ was introduced as a set of AVP features in an AdaBoost machine learning model²³. In recent years ensemble-based methods have been introduced, such as Meta-iAVP²⁵ and PEPred-Suite²⁴. The Meta-iAVP approach uses machine learning to transform the feature space into a modified 6-dimensional predicted output vector, which then becomes the input data to the meta-classifier that predicts the class of each peptide in the validation data set. PEPred-Suite is similar to Meta-iAVP, except that an RF is used as both the base and meta classifiers. Both Meta-iAVP and PEPred-Suite use these ensemble strategies to improve the AVP prediction accuracy.

The series of machine learning developments in AVP prediction have to date focused on increasing the number of features that characterize a peptide and making minor modifications to the machine learning algorithm. They have not included feature reduction techniques that would determine the most relevant and non-redundant features from the initial set of input features. The performance of a machine learning model can rely heavily on using the most informative features, and including non-informative features can degrade classifier performance. In the current study we identified the most important features by estimating Pearson's correlation coefficient and the mean decrease of Gini index (MDGI) for all candidate features; MDGI is a metric of feature importance based on the individual decision trees in a random forest model. The candidate features were generated from the physicochemical and secondary structure properties of a library of known AVP and non-AVP sequences. Subsequently, we applied a recursive feature elimination (RFE) algorithm in combination with the SVM to determine the order of importance of the different features. We evaluated multiple machine learning approaches, including SVM, RF and deep learning (DL) via multiple neural network architectures and hyperparameters, for training and testing purposes using our selected feature set. Our SVM-based method achieved the best test accuracy and Matthews correlation coefficient (MCC) values compared to the RF and DL approaches, and also outperformed AVPpred¹⁶ and Chang et al.’s method²⁰. We packaged the resulting approach into a software tool called Feature-Informed Reduced Machine Learning for Antiviral Peptide Prediction (FIRM-AVP).

Methods

Training and testing data

We used the same experimentally validated dataset reported in AVPpred¹⁶ that has been used consistently since its introduction to evaluate AVP prediction models. It consists of a total of 1056 unique peptides. This set was distilled from a starting collection of 1245 peptides by removing peptides with excessive sequence similarity. Of these, 604 sequences are highly effective AVPs (positive samples) and 452 sequences are minimally effective or non-effective AVPs (negative samples). These datasets were used for training and validating the machine learning model. To construct the training and independent test sets and benchmark our results against existing SVM and RF-based models, we followed the same process as described previously^16,20. This yields 544 positive and 407 negative samples in the training dataset, and 60 positive and 45 negative samples in the validation/independent test set, as defined by prior publications to ensure an accurate comparison. This validation set has similar overall viral diversity to the training set. The AVPpred server provides additional peptides for the negative sample set (544 in the training set and 60 in the independent test set); however, these peptides have not been confirmed experimentally and thus are not included here.

Feature generation

We combined several sets of features based on the peptide sequences: a 20D feature vector for AAC, expressed as the percentage representation of each amino acid in a peptide; a 400D feature vector for the dipeptide composition (DC), which represents the fraction of each dipeptide within a peptide sequence; and the PseAAC and amphiphilic pseudo amino acid composition (APseAAC) proposed by Chou^26,27 to incorporate sequence-order information. The dimension of the PseAAC feature vector is $20 + \lambda$, where $\lambda$ is the number of sequence-order correlation factors and $\omega$ is the weight factor applied to the sequence-order information. In our case, we set $\lambda = 5$ and $\omega = 0.05$ by considering the minimum length of our collected AVP and non-AVP sequences. So, in the 25D PseAAC feature vector, the first 20 features are the traditional AAC and the other components are the rank-different correlation factors that represent the sequence-order information. We produced a $20 + 2\lambda$, i.e., 30D, APseAAC feature vector where the first 20 features are the basic AAC and the remaining components indicate the correlation factors for the physicochemical properties of peptides.

We also utilized the composition, transition, and distribution (CTD) model^28,29,30,31 to generate feature vectors for 8 physicochemical properties of peptide sequences: hydrophobicity, normalized van der Waals volume, polarity, polarizability, charge, secondary structure, solvent accessibility and surface tension. In the CTD model, amino acids are classified into three classes based on their physicochemical properties. For composition, we obtained a 3D feature vector that gives the fraction of each encoded class in a peptide sequence. A 3D transition feature vector gives the frequency with which one class is followed by another class or vice versa. We also obtained a 15D feature vector for distribution that indicates the positions in the sequence at which the first, 25%, 50%, 75% and 100% of the residues of each class are located. As we have 8 physicochemical properties, the CTD model gives a $(3 + 3 + 15) \times 8 = 168$D feature vector. Finally, we retrieved features from the secondary structure of peptide sequences. A total of six features were extracted from the location information, spatially consecutive states and segment sequences of the three main types of secondary structure: $\alpha$-helix, $\beta$-strand and $\gamma$-coil. The details of feature extraction from the CTD model and secondary structure information of amino acid sequences were explained in our previous works^32,33,34. In summary, we generated 649 peptide sequence-based features, listed in Table 1, using the R programming language (ver. 4.0.0)³⁵. We utilized the protr (ver. 1.6-2)³⁰ and DECIPHER (ver. 2.14.0)³⁶ packages to extract features from peptide sequences.
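The two simplest feature sets above, AAC and DC, can be illustrated with a minimal sketch. This is not the authors' code (they used the R package protr); it is a stdlib-only Python stand-in, the ordering of the 400 dipeptides is an assumption, and `KLAKLAK` is a hypothetical peptide used only for illustration. It assumes sequences contain only the 20 standard residues and have at least two residues.

```python
from itertools import product

# Amino acid order taken from the text (A, R, N, D, ...).
AMINO_ACIDS = "ARNDCEQGHILKMFPSTWYV"


def aac(seq: str) -> list[float]:
    """Percentage of each of the 20 amino acids in the peptide (20 values)."""
    n = len(seq)
    return [100.0 * seq.count(a) / n for a in AMINO_ACIDS]


def dipeptide_composition(seq: str) -> list[float]:
    """Fraction of each ordered dipeptide among all adjacent pairs (400 values).

    Assumed ordering: first residue varies slowest (AA, AR, AN, ...).
    """
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts: dict[str, int] = {}
    for pair in pairs:
        counts[pair] = counts.get(pair, 0) + 1
    total = len(pairs)
    return [counts.get(a + b, 0) / total
            for a, b in product(AMINO_ACIDS, repeat=2)]


if __name__ == "__main__":
    peptide = "KLAKLAK"  # hypothetical example sequence
    print(len(aac(peptide)), len(dipeptide_composition(peptide)))
```

Together these two vectors account for 420 of the 649 features; the remaining 229 come from PseAAC (25), APseAAC (30), CTD (168) and secondary structure (6).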

Table 1 List of 649 peptide features.

The DC feature vector (dipep_1, dipep_2, …, dipep_400) contains the dipeptide composition (Supplementary Table S1) of the amino acids in the order A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, V. The PseAAC and APseAAC are the feature vectors (pseudo_1, pseudo_2, …, pseudo_25) and (amphipseudo_1, amphipseudo_2, …, amphipseudo_30), respectively. The composition feature vector (comp_1, comp_2, …, comp_24) and transition feature vector (tran_1, tran_2, …, tran_24) hold the composition and transition values in the order physicochemical property 1 (group 1), physicochemical property 1 (group 2), physicochemical property 1 (group 3) and so on. In the distribution feature vector (dist_1, dist_2, …, dist_120), the first 15 features are the group 1, group 2 and group 3 distribution values for the first physicochemical property, and so on. The physicochemical properties and their groups are listed in Supplementary Table S2. Finally, in the 6D secondary structure feature vector, ss_1, ss_2 and ss_3 are the location-oriented features for the $\alpha$-helix, $\beta$-strand and $\gamma$-coil, respectively. The other three features, ss_4, ss_5 and ss_6, give the normalized maximum spatially consecutive $\alpha$-helix and $\beta$-strand in the secondary structure sequence, and the occurrences of the segmented sequence “$\beta$-strand $\alpha$-helix $\beta$-strand” after ignoring $\gamma$-coil states in the secondary structure.
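To make the composition, transition and distribution layout concrete, here is a minimal sketch for a single physicochemical property. Several things are assumptions rather than details from this article: the three hydrophobicity groups (the grouping actually used is in Supplementary Table S2, which is not reproduced here), the rule for picking the residue at each distribution percentile, and the hypothetical peptide `KLAKLAK`. The sketch assumes at least two standard residues.

```python
import math

# Assumed hydrophobicity grouping (polar, neutral, hydrophobic); illustrative only.
GROUPS = ("RKEDQN", "GASTPHY", "CLVIMFW")
CLASS_OF = {aa: i + 1 for i, group in enumerate(GROUPS) for aa in group}


def encode(seq: str) -> list[int]:
    """Replace each residue by its class label 1, 2 or 3."""
    return [CLASS_OF[aa] for aa in seq]


def composition(codes: list[int]) -> list[float]:
    """Fraction of residues in each class (3 values)."""
    return [codes.count(c) / len(codes) for c in (1, 2, 3)]


def transition(codes: list[int]) -> list[float]:
    """Fraction of adjacent pairs that switch between two classes (3 values)."""
    pairs = list(zip(codes, codes[1:]))
    return [sum(1 for p in pairs if p in ((a, b), (b, a))) / len(pairs)
            for a, b in ((1, 2), (1, 3), (2, 3))]


def distribution(codes: list[int]) -> list[float]:
    """Sequence position (as % of length) of the first, 25%, 50%, 75% and
    last residue of each class (15 values). Percentile rule is an assumption."""
    out = []
    for c in (1, 2, 3):
        positions = [i + 1 for i, x in enumerate(codes) if x == c]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not positions:
                out.append(0.0)
                continue
            k = max(1, math.floor(len(positions) * frac))  # k-th occurrence
            out.append(100.0 * positions[k - 1] / len(codes))
    return out


def ctd(seq: str) -> list[float]:
    """3 + 3 + 15 = 21 values for one property."""
    codes = encode(seq)
    return composition(codes) + transition(codes) + distribution(codes)
```

Repeating `ctd` for each of the 8 properties, each with its own three groups, would give the 21 × 8 = 168 values described in the text.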

Machine learning models

We utilized three machine learning approaches to train the AVP prediction model: traditional SVM and RF methods, as well as DL via multiple architectures and hyperparameters, using the machine learning library caret (ver. 6.0-86)³⁷. For the DL, variations on the multi-layer perceptron were the most successful. These binary classification models were then used to classify the test set of peptides. Note that we tuned the SVM and RF models with the training dataset and used the best models for prediction. The SVM model was tuned using the radial basis function kernel with cost values of 4, 8, 16, 32, 64, and 128. The RF model was tuned with ntree values of 50, 100, 200, 300, 400 and 500 and mtry values of 2, 4, 8, 16, and 32. The final SVM model used a cost value of 8, and the final RF model used ntree = 100 and mtry = 32; these were chosen as the best models for the selected features on the training data. We utilized the e1071 (ver. 1.7-3)³⁸ package to tune the models.
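The tuning grids described above can be written down explicitly. The sketch below only enumerates the candidate settings and picks the best one according to a caller-supplied score; `score_fn` is a hypothetical placeholder for whatever training-set performance estimate is used, and nothing here reproduces the e1071 tuning routine itself.

```python
from itertools import product

# Hyperparameter values taken from the text.
SVM_GRID = {"cost": [4, 8, 16, 32, 64, 128]}
RF_GRID = {"ntree": [50, 100, 200, 300, 400, 500],
           "mtry": [2, 4, 8, 16, 32]}


def expand(grid: dict) -> list[dict]:
    """List every combination of hyperparameter values in the grid."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]


def grid_search(grid: dict, score_fn) -> dict:
    """Return the setting with the highest score.

    score_fn is a hypothetical callable mapping a parameter dict to a
    performance estimate (for example, cross-validated accuracy).
    """
    return max(expand(grid), key=score_fn)
```

The SVM grid has 6 candidate settings and the RF grid has 6 × 5 = 30.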

Feature selection

The 649 features may contain redundant information as well as information irrelevant to the classification of AVPs. To reduce the dimensions of the features we calculated the Pearson's correlation coefficient [using Eq. (1)] between two feature vectors $x$ and $y$ across all of the peptides to observe the linear correlation between features. Here $E$, $\mu$ and $\sigma$ are the expectation, mean and standard deviation values, respectively.

$$ \rho = \frac{E\left[ \left( x - \mu_{x} \right)\left( y - \mu_{y} \right) \right]}{\sigma_{x} \sigma_{y}}. \tag{1} $$

If the absolute value of the correlation between two features was greater than a threshold value, one of the two features was removed at random from further consideration. We considered correlation thresholds from 0.7 to 0.95 in increments of 0.05. The threshold was selected to optimize the Area Under a Receiver Operating Characteristic Curve (AUC) associated with the feature selection; this set the threshold to 0.85 and reduced the dataset to 568 features. We utilized the R stats package (ver. 3.6.2) to compute the Pearson correlation values between features.
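The correlation filter can be sketched as follows. This is an illustrative stdlib-only Python version, not the R code used in the study; the pair-visiting order, the fixed `seed` and the handling of constant features are assumptions.

```python
import random
from statistics import fmean, pstdev


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation as in Eq. (1), using population statistics."""
    mx, my = fmean(x), fmean(y)
    cov = fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0  # assumption: treat constant features as uncorrelated
    return cov / (sx * sy)


def correlation_filter(features: dict[str, list[float]],
                       threshold: float = 0.85,
                       seed: int = 0) -> list[str]:
    """Drop one feature at random from every pair with |rho| > threshold.

    `features` maps a feature name to its values across all peptides.
    """
    rng = random.Random(seed)
    names = list(features)
    dropped: set[str] = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if abs(pearson(features[a], features[b])) > threshold:
                dropped.add(rng.choice((a, b)))
    return [n for n in names if n not in dropped]
```

Repeating the filter for each threshold from 0.7 to 0.95 and comparing the resulting AUC would mirror the threshold selection described above.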

As a next step, we computed the mean decrease of Gini index (MDGI) using an RF model for the remaining features. MDGI estimates feature importance by measuring the contribution of each feature to the homogeneity of the nodes and leaves in the RF model³⁹. A node in the RF model is considered purer the closer its Gini index is to 0. The Gini index is calculated using Eq. (2), where we subtract the sum of the squared probabilities of each of the two classes from 1.

$$ Gini = 1 - \sum_{i = 1}^{2} P_{i}^{2}. \tag{2} $$

So, a Gini index of 0 indicates completely homogeneous data, and, for two classes, the maximum value of 0.5 indicates an even mixture of the two classes. To find the feature importance, whenever a feature is used to divide data at a node, we calculated the Gini index at that parent node and at both of its child nodes. The difference between the Gini index of the parent node and the weighted Gini index of the child nodes gives the decrease in Gini index for that split in a decision tree of the RF model²⁰. For each feature, MDGI is the average of these decreases over all the decision trees created in the RF model, and a higher MDGI value indicates higher feature importance. Based on the MDGI, we down-selected to the 169 features with positive MDGI. The randomForest (ver. 4.6-14)⁴⁰ package was used to estimate the MDGI values of the features.
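The Gini calculation for a single split can be worked through in a few lines. The sketch below follows Eq. (2) and the weighted-child-node description above; the class labels in the usage example are hypothetical, and averaging the decreases per feature over all trees (the MDGI itself) is left to the RF implementation.

```python
def gini(labels: list[int]) -> float:
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def gini_decrease(parent: list[int], left: list[int], right: list[int]) -> float:
    """Parent Gini minus the size-weighted Gini of the two child nodes."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


if __name__ == "__main__":
    # Hypothetical node with 3 AVPs (1) and 3 non-AVPs (0), split into two children.
    parent = [1, 1, 1, 0, 0, 0]
    print(gini(parent))                                     # evenly mixed node
    print(gini_decrease(parent, [1, 1, 1, 0], [0, 0]))      # impurity removed by the split
```

For this hypothetical split the parent has a Gini index of 0.5, the left child 0.375 and the right child 0, so the decrease is 0.5 − (4/6 × 0.375) = 0.25.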

Recursive feature elimination

Following reduction of the number of features based on Pearson's correlation coefficient and MDGI values, we applied the RFE technique⁴¹ to the machine learning models using the training data for the reduced feature set to order the features by importance. RFE evaluates the training performance of a machine learning model for a feature set and gives the ranking of the features. We considered 10-fold cross validation with 5 repeats to evaluate the training performance of the machine learning models. We utilized caret (ver. 6.0-86)³⁷ to implement the RFE algorithm.
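The two ingredients of this step, repeated k-fold resampling and backward elimination, can be sketched as follows. This is a generic illustration rather than the caret implementation: `importance_fn` is a hypothetical callable that would refit a model on the remaining features and return one importance score per feature, and removing exactly one feature per round is an assumption.

```python
import random


def repeated_kfold(n_samples: int, k: int = 10, repeats: int = 5, seed: int = 0):
    """Yield (train_idx, test_idx) for `repeats` independently shuffled k-fold splits."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        for fold in range(k):
            test = order[fold::k]          # every k-th sample forms one fold
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            yield train, test


def rfe_ranking(feature_names, importance_fn) -> list[str]:
    """Rank features by repeatedly removing the least important one.

    importance_fn (hypothetical) takes the list of remaining feature names
    and returns a dict mapping each name to an importance score.
    """
    remaining = list(feature_names)
    eliminated = []
    while remaining:
        scores = importance_fn(remaining)
        weakest = min(remaining, key=lambda name: scores[name])
        remaining.remove(weakest)
        eliminated.append(weakest)
    return eliminated[::-1]  # most important feature first
```

With the 951 training peptides (544 positive, 407 negative), 10 folds and 5 repeats give 50 train/test splits over which a performance measure such as AUC can be averaged.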

Performance measurement

We utilized the area under the receiver operating characteristic (ROC) curve (AUC) to measure the training performance of the models via RFE for the reduced feature set. ROC curves use a combination of the true positive rate and false positive rate to provide a summary of the prediction capability of a machine learning model, where a perfect classifier has an AUC of 1.0 and a random binary classifier has an AUC of 0.5. We report the final test performance of our classifiers using the same metrics as previously reported for other AVP prediction algorithms, which include sensitivity, specificity, accuracy and MCC values [Eqs. (3–6)], where TP, TN, FP, and FN are true positives (positives accurately classified), true negatives (negatives accurately classified), false positives (negatives classified as positives), and false negatives (positives classified as negatives), respectively. The MCC is used to evaluate the efficacy of a classifier because the numbers of positive and negative examples in the datasets are imbalanced. Its range is [−1, 1], and a higher MCC value indicates better prediction.

$$ \mathrm{Sensitivity} = \frac{TP}{TP + FN}, \tag{3} $$

$$ \mathrm{Specificity} = \frac{TN}{TN + FP}, \tag{4} $$

$$ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{5} $$

$$ \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{\left( TP + FP \right)\left( TP + FN \right)\left( TN + FP \right)\left( TN + FN \right)}}. \tag{6} $$
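Equations (3–6) translate directly into code. The sketch below computes all four metrics from confusion-matrix counts; the counts in the usage example are hypothetical and are not results from this study, and returning an MCC of 0 when the denominator is zero is a common convention assumed here.

```python
import math


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Sensitivity, specificity, accuracy and MCC as in Eqs. (3-6)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        # Assumed convention: MCC is 0 when any marginal count is zero.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical counts for a test set of 60 positives and 45 negatives.
    for name, value in classification_metrics(tp=50, tn=40, fp=5, fn=10).items():
        print(f"{name}: {value:.3f}")
```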

Data availability

All experimental data are available at https://github.com/pmartR/FIRM-AVP.

Results

AVP prediction performance

The performance of the FIRM-AVP SVM, RF and DL models was compared based on the standard metrics of sensitivity, specificity, accuracy, and MCC on the validation/independent dataset, where a peptide is classified as an AVP if its predicted probability is greater than 0.5 and as a non-AVP if it is less than or equal to 0.5. Evaluating overall discrimination, we observe that the SVM and RF models have very high AUC values, 0.962 and 0.958, respectively. Table 2 details the results of the models for the 169 features based on our feature reduction. The SVM model achieved 92.4% accuracy and 0.84 MCC, which is better than the RF model. Both the SVM and RF machine learning approaches yield a posterior probability that represents the probability that a peptide is an AVP given the data represented as the 169 features, or likewise the probability that a peptide is a non-AVP. We compared the SVM and RF probabilities for the 60 positive AVPs and found that, on average, the SVM probability was larger than the RF probability by ~ 0.02 (paired t-test p-value ~ 0.14). Thus, there is marginal evidence that the SVM yields a more confident identification, but the difference is not statistically significant based on these data at a p-value threshold of 0.05. However, when evaluating the negative class there is a significant improvement gained by the SVM. The average non-AVP peptide is correctly classified with a larger probability by ~ 0.09 (p-value ~ 5E−5). This difference in strength of classification of the non-AVP class is what is largely driving the reduction in false positives for the SVM, which is observed in the specificity values reported in Table 2. The DL approaches were likely sub-optimal because, while multiple nonlinearities exist in these data, the training examples are too few to both describe the nonlinearities and adequately generalize to new data. Evidence for this conclusion is apparent in discrepancies between training and testing loss, even in the presence of regularization. An important direction for future work is to grow and diversify the AVP benchmark dataset, which has not been updated in 8 years; this would aid the application of more recent machine learning approaches.
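The 0.5 decision rule and the paired comparison of SVM and RF probabilities can be sketched as below. The probabilities in the usage example are hypothetical, and only the t statistic and degrees of freedom are computed; converting them to a p-value requires the Student t distribution, which the Python standard library does not provide.

```python
import math
from statistics import fmean, stdev


def classify(prob_avp: float, threshold: float = 0.5) -> str:
    """AVP if the predicted probability exceeds the threshold, else non-AVP."""
    return "AVP" if prob_avp > threshold else "non-AVP"


def paired_t(a: list[float], b: list[float]) -> tuple[float, int]:
    """Paired t statistic and degrees of freedom for per-peptide differences a - b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = fmean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


if __name__ == "__main__":
    # Hypothetical probabilities for the same four peptides from two models.
    svm_probs = [0.90, 0.80, 0.70, 0.95]
    rf_probs = [0.85, 0.80, 0.60, 0.90]
    print([classify(p) for p in svm_probs])
    print(paired_t(svm_probs, rf_probs))
```

Pairing matters here because both models score the same peptides, so the test is applied to the per-peptide differences rather than to two independent samples.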

Table 2 Performance comparison of our models with existing models on independent validation data.

For the independent test set, we then compared the performance of our FIRM-AVP SVM model with that of the same model with no feature reduction (AVP-649D), as well as with AVPpred¹⁶ and Chang et al.’s RF approach (RFcompo + structure + agg)²⁰ (Table 2). There is a clear increase in accuracy with the reduced feature set relative to the full 649 features; for example, our best-performing SVM model increased the MCC from 0.79 to 0.84 by reducing to the highest-importance features. In terms of prior analyses, AVPcompo and AVPphysico are the AVPpred models based on AAC and physicochemical features, respectively, whereas RFcompo + structure + agg is Chang et al.’s RF method, which uses AAC, secondary structure and aggregation features. Chang et al.’s RF method outperforms AVPpred with an accuracy of 89.5% and an MCC of 0.79. However, our FIRM-AVP SVM model, which is built on an optimized feature set, performed better than either of these two methods in terms of accuracy and MCC, and the FIRM-AVP RF model performed similarly to prior models. The most accurate model is Meta-iAVP²⁵, which is based on an ensemble of machine learning algorithms. This, however, makes it harder to interpret the model and to gain insight into the features that are driving antiviral activity, which was the goal of FIRM-AVP. When the same validation set is run on each of the 6 machine learning algorithms separately, the MCC values range from 0.34 to 0.73, well below that of FIRM-AVP, which uses a single classifier on the optimized feature set.

Recursive feature rankings

We performed RFE on the SVM model with the training data, using the 169 features from the initial feature selection, with repetition, to measure the training performance of the SVM (in terms of AUC) via the RFE algorithm. Note that the AUC values gradually decreased as features were removed from the model, as depicted in Supplementary Fig. S1, and we obtained the highest AUC values of 0.89 and 0.92 for the SVM and RF models, respectively, by including all 169 features. This indicates that no further feature reduction is needed, and thus we used the RFE results only to sort the features by importance. Table 3 lists the top-5 features found after RFE analysis. Secondary structure, composition and PseAAC features are all among the top-5 features for both machine learning models. Peptide secondary structure features are identified as top-ranked features in the SVM and RF methods. All rankings of the selected features for both SVM and RF models are listed in Supplementary Table S3.

Table 3 Top-5 features obtained in SVM and RF methods from RFE analysis.

Software tool and user's manual

We developed the standalone software tool FIRM-AVP based on the SVM algorithm. The open source software is available at https://github.com/pmartR/FIRM-AVP. Additionally, a web-based version of the software is available at https://msc-viz.emsl.gov/AVPR/. To use the web application, users provide either a single peptide sequence or a FASTA file of peptide sequences to be analyzed, and the returned predictions include the probability that each peptide sequence is antiviral (Fig. 1). As previously mentioned, a current limitation in AVP prediction is the scale of the data available on which to build predictive models. To make the software more useful for those working on improving the algorithm by collecting additional training data, the software provides an option to add new known AVP and non-AVP sequences to retrain the machine learning model. A simple page refresh will reset the model. The graphical user interface and the options the web application provides are shown in Fig. 1. The feature generation and selection components of the software were implemented using R. The graphical user interface was designed and implemented using the R web application framework shiny (ver. 1.4.0.2)⁴².

Discussion

Identifying potential AVPs is of great importance for the discovery of new drugs to treat viral infection. In this work, we introduced a machine learning model for predicting AVPs using a core set of 169 features identified via correlation and machine learning analyses. Our SVM and RF models were developed based on the features generated from the AAC, DC, PseAAC, APseAAC, CTD, and predicted secondary structure properties of peptide sequences. To verify the effectiveness of our best feature sets, we tested the performance of our models using the same validation/independent dataset as prior methods^16,20. We achieved higher accuracies and MCC values relative to single classifier models that did not include feature reduction, as well as existing published models, demonstrating the effectiveness of the feature selection approach. The software tool FIRM-AVP based on our approach is publicly available for users, with flexible options not only to make predictions but also to update the underlying prediction model. The need for more training data was a limiting factor for the DL approach, which had lower overall accuracy than the SVM and RF approaches.

We evaluated multivariate feature importance using our selected feature sets via RFE. Secondary structure and distribution features were identified as top ranked features in our SVM and RF models, respectively. Location-oriented features for $\alpha$-helix conformation and distributional features associated with positive charge were the most important features of the machine learning models. The PseAAC features for leucine and lysine were also important in distinguishing AVP and non-AVP sequences. The location-oriented feature for $\alpha$-helix and the PseAAC features for leucine and lysine support the abundance of $\alpha$-helix structure and of leucine and lysine residues in AVPs that was claimed for the RF-based method²⁰ and HIPdb¹⁹. The observed significance of α-helical structure is consistent with the fact that many known antimicrobial peptides exhibit varied degrees of helical conformation and spatial partitioning of cationic and hydrophobic residues⁴³. Here, both the SVM and RF approaches establish helix distributional features that are associated with antiviral peptides^44,45. How these properties factor into peptide antiviral activity is not clear; however, they are known to contribute to peptide interactions with cell membranes.

The discovery of new antiviral therapies is imperative to address the challenge of new viral epidemics, and AVPs can be a valuable resource for the development of novel therapies to combat viral infection. A core need is not only to improve the accuracy of AVP prediction models, but also to build explainable models that can aid in understanding the fundamental multivariate properties that are associated with antiviral activity. This is a necessary step in the design of AVPs for novel viral systems.

References

Domingo, E. Mechanisms of viral emergence. Vet. Res. 41, 38 (2010).
Article Google Scholar
Nichol, S. T., Arikawa, J. & Kawaoka, Y. Emerging viral diseases. Proc. Natl. Acad. Sci. 97, 12411–12412 (2000).
Article ADS CAS Google Scholar
Phan, T. Genetic diversity and evolution of SARS-CoV-2. Infect. Genet. Evol. 81, 104260 (2020).
Article CAS Google Scholar
Qureshi, A., Thakur, N., Tandon, H. & Kumar, M. AVPdb: A database of experimentally validated antiviral peptides targeting medically important viruses. Nucleic Acids Res. 42, D1147–D1153 (2014).
Article CAS Google Scholar
Gleenberg, I. O., Avidan, O., Goldgur, Y., Herschhorn, A. & Hizi, A. Peptides derived from the reverse transcriptase of human immunodeficiency virus type 1 as novel inhibitors of the viral integrase. J. Biol. Chem. 280, 21987–21996 (2005).
Article Google Scholar
Gleenberg, I. O., Herschhorn, A. & Hizi, A. Inhibition of the activities of reverse transcriptase and integrase of human immunodeficiency virus type-1 by peptides derived from the homologous viral protein R (Vpr). J. Mol. Biol. 369, 1230–1243 (2007).
Article CAS Google Scholar
Littler, E. & Oberg, B. Achievements and challenges in antiviral drug discovery. Antiviral Chem. Chemother. 16, 155–168 (2005).
Article CAS Google Scholar
Louis, J. M., Dyda, F., Nashed, N. T., Kimmel, A. R. & Davies, D. R. Hydrophilic peptides derived from the transframe region of Gag-Pol inhibit the HIV-1 protease. Biochemistry 37, 2105–2110 (1998).
Article CAS Google Scholar
Pang, W., Tam, S.-C. & Zheng, Y.-T. Current peptide HIV type-1 fusion inhibitors. Antiviral Chem. Chemother. 20, 1–18 (2009).
Article CAS Google Scholar
Rausch, D. et al. Peptides derived from the CDR3-homologous domain of the CD4 molecule are specific inhibitors of HIV-1 and SIV infection, virus-induced cell fusion, and postinfection viral transmission in vitro. Implications for the design of small peptide anti-HIV therapeutic agents. Ann. N. Y. Acad. Sci. 616, 125–148 (1990).
Article ADS CAS Google Scholar
Reusser, P. Antiviral therapy: Current options and challenges. Schweizerische medizinische Wochenschrift 130, 101–112 (2000).
CAS PubMed Google Scholar
Prusoff, W. H., Lin, T., August, E. M., Wood, T. G. & Marongiu, M. E. Approaches to antiviral drug development. Yale J. Biol. Med. 62, 215 (1989).
CAS PubMed PubMed Central Google Scholar
Qureshi, A., Kaur, G. & Kumar, M. AVC pred: An integrated web server for prediction and design of antiviral compounds. Chem. Biol. Drug Des. 89, 74–83 (2017).
Article CAS Google Scholar
Boas, L. C. P. V., Campos, M. L., Berlanda, R. L. A., de Carvalho Neves, N. & Franco, O. L. Antiviral peptides as promising therapeutic drugs. Cell. Mol. Life Sci. 76, 1–18 (2019).
Castel, G., Chtéoui, M., Heyd, B. & Tordo, N. Phage display of combinatorial peptide libraries: Application to antiviral research. Molecules 16, 3499–3518 (2011).
Article CAS Google Scholar
Thakur, N., Qureshi, A. & Kumar, M. AVPpred: Collection and prediction of highly effective antiviral peptides. Nucleic Acids Res. 40, W199–W204 (2012).
Article CAS Google Scholar
Wang, G., Li, X. & Wang, Z. APD3: The antimicrobial peptide database as a tool for research and education. Nucleic Acids Res. 44, D1087–D1093 (2016).
Article CAS Google Scholar
Waghu, F. H., Barai, R. S., Gurung, P. & Idicula-Thomas, S. CAMPR3: A database on sequences, structures and signatures of antimicrobial peptides. Nucleic Acids Res. 44, D1094–D1097 (2016).
Qureshi, A., Thakur, N. & Kumar, M. HIPdb: A database of experimentally validated HIV inhibiting peptides. PLoS ONE 8, e54908 (2013).
Chang, K. Y. & Yang, J.-R. Analysis and prediction of highly effective antiviral peptides based on random forests. PLoS ONE 8, e70166 (2013).
Lissabet, J. F. B., Belén, L. H. & Farias, J. G. AntiVPP 1.0: A portable tool for prediction of antiviral peptides. Comput. Biol. Med. 107, 127–130 (2019).
Qureshi, A., Tandon, H. & Kumar, M. AVP-IC50Pred: Multiple machine learning techniques-based prediction of peptide antiviral activity in terms of half maximal inhibitory concentration (IC50). Pept. Sci. 104, 753–763 (2015).
Zare, M., Mohabatkar, H., Faramarzi, F. K., Beigi, M. M. & Behbahani, M. Using Chou’s pseudo amino acid composition and machine learning method to predict the antiviral peptides. Open Bioinform. J. 9, 13–19 (2015).
Wei, L., Zhou, C., Su, R. & Zou, Q. PEPred-Suite: Improved and robust prediction of therapeutic peptides using adaptive feature representation learning. Bioinformatics 35, 4272–4280 (2019).
Schaduangrat, N., Nantasenamat, C., Prachayasittikul, V. & Shoombuatong, W. Meta-iAVP: A sequence-based meta-predictor for improving the prediction of antiviral peptides using effective feature representation. Int. J. Mol. Sci. 20, 5743 (2019).
Chou, K. C. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins Struct. Funct. Bioinform. 43, 246–255 (2001).
Chou, K.-C. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 21, 10–19 (2005).
Dubchak, I., Muchnik, I., Holbrook, S. R. & Kim, S.-H. Prediction of protein folding class using global description of amino acid sequence. Proc. Natl. Acad. Sci. 92, 8700–8704 (1995).
Dubchak, I., Muchnik, I., Mayor, C., Dralyuk, I. & Kim, S. H. Recognition of a protein fold in the context of the SCOP classification. Proteins Struct. Funct. Bioinform. 35, 401–407 (1999).
Xiao, N., Cao, D.-S., Zhu, M.-F. & Xu, Q.-S. protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences. Bioinformatics 31, 1857–1859 (2015).
Li, Z.-R. et al. PROFEAT: A web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res. 34, W32–W37 (2006).
Chowdhury, A. S., Call, D. R. & Broschat, S. L. Antimicrobial resistance prediction for Gram-negative bacteria via game theory-based feature evaluation. Sci. Rep. 9, 1–9 (2019).
Chowdhury, A. S., Khaledian, E. & Broschat, S. L. Capreomycin resistance prediction in two species of Mycobacterium using a stacked ensemble method. J. Appl. Microbiol. 127, 1656–1664 (2019).
Chowdhury, A. S., Call, D. R. & Broschat, S. L. PARGT: A software tool for predicting antimicrobial resistance in bacteria. Sci. Rep. 10, 1–7 (2020).
R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, Vienna, Austria, 2020).
Wright, E. S. Using DECIPHER v2.0 to analyze big biological sequence data in R. R J. 8, 352–359 (2016).
Kuhn, M. Building predictive models in R using the caret package. J. Stat. Softw. 28, 1–26 (2008).
e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien (2019).
Calle, M. L. & Urrea, V. Letter to the editor: Stability of random forest importance measures. Brief. Bioinform. 12, 86–89 (2011).
Liaw, A. & Wiener, M. Classification and regression by randomForest. R News 2, 18–22 (2002).
Guyon, I., Weston, J., Barnhill, S. & Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389–422 (2002).
shiny: Web Application Framework for R (2020).
Huang, Y., Huang, J. & Chen, Y. Alpha-helical cationic antimicrobial peptides: Relationships of structure and function. Protein Cell 1, 143–152. https://doi.org/10.1007/s13238-010-0004-3 (2010).
Tossi, A., Sandri, L. & Giangaspero, A. Amphipathic, alpha-helical antimicrobial peptides. Biopolymers 55, 4–30. https://doi.org/10.1002/1097-0282(2000)55:1%3c4::AID-BIP30%3e3.0.CO;2-M (2000).
Zelezetsky, I. & Tossi, A. Alpha-helical antimicrobial peptides–using a sequence template to guide structure-activity relationship studies. Biochim. Biophys. Acta 1758, 1436–1449. https://doi.org/10.1016/j.bbamem.2006.03.021 (2006).


Acknowledgements

This work was supported by the U.S. Army Medical Research Acquisition Activity, through the Accelerating Innovation in Military Medicine program under Award No. W81XWH-18-1-0801. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the Department of Defense or the U.S. Army. We are grateful to the St. Augustine Alligator Farm Zoological Park for their collaboration on this project. Computational work was completed at Pacific Northwest National Laboratory (PNNL). PNNL is operated by Battelle Memorial Institute for the Department of Energy under contract DEAC05-76RLO1830.

Author information

Authors and Affiliations

Biological Sciences Division, Pacific Northwest National Laboratory, J4-18, P.O. Box 999, Richland, WA, 99354, USA
Abu Sayed Chowdhury & Bobbie-Jo M. Webb-Robertson
Computing and Analytics Division, Pacific Northwest National Laboratory, P.O. Box 999, Richland, WA, 99354, USA
Sarah M. Reehl
School of Systems Biology, George Mason University, Manassas, VA, 20110, USA
Kylene Kehn-Hall
National Center for Biodefense and Infectious Diseases, George Mason University, Manassas, VA, 20110, USA
Kylene Kehn-Hall
Department of Biomedical Sciences and Pathobiology, Virginia Tech, Blacksburg, VA, 24061, USA
Kylene Kehn-Hall
Department of Chemistry and Biochemistry, George Mason University, Manassas, VA, 20110, USA
Barney Bishop

Authors

Abu Sayed Chowdhury
Sarah M. Reehl
Kylene Kehn-Hall
Barney Bishop
Bobbie-Jo M. Webb-Robertson

Contributions

A.S.C., S.M.R., and B.J.W.R. developed and evaluated the machine learning algorithms. B.B. and K.K.H. provided feature evaluation and interpretation. All authors reviewed and edited the manuscript.

Corresponding author

Correspondence to Bobbie-Jo M. Webb-Robertson.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Chowdhury, A.S., Reehl, S.M., Kehn-Hall, K. et al. Better understanding and prediction of antiviral peptides through primary and secondary structure feature importance. Sci Rep 10, 19260 (2020). https://doi.org/10.1038/s41598-020-76161-8


Received: 17 July 2020
Accepted: 20 October 2020
Published: 06 November 2020
DOI: https://doi.org/10.1038/s41598-020-76161-8

This article is cited by

Deepstacked-AVPs: predicting antiviral peptides using tri-segment evolutionary profile and word embedding based multi-perspective features with deep stacking model
- Shahid Akbar
- Ali Raza
- Quan Zou
BMC Bioinformatics (2024)
BaPreS: a software tool for predicting bacteriocins using an optimal set of features
- Suraiya Akhter
- John H. Miller
BMC Bioinformatics (2023)
A separable temporal convolutional networks based deep learning technique for discovering antiviral medicines
- Vishakha Singh
- Sanjay Kumar Singh
Scientific Reports (2023)
Data-mining unveils structure–property–activity correlation of viral infectivity enhancing self-assembling peptides
- Kübra Kaygisiz
- Lena Rauch-Wirth
- Tanja Weil
Nature Communications (2023)
Recent Advances in Machine Learning-Based Models for Prediction of Antiviral Peptides
- Farman Ali
- Harish Kumar
- Fawaz Khaled Alarfaj
Archives of Computational Methods in Engineering (2023)

