Abstract
Machine learning models are revolutionizing our approaches to discovering and designing bioactive peptides. Because they rely heavily on sequence data, these models often lack awareness of protein structure. They excel at identifying sequences of a particular biological nature or activity, but they frequently fail to capture the peptides’ intricate mechanism(s) of action. To address both problems at once, we studied the mechanisms of action and the structural landscape of antimicrobial peptides as (i) membrane-disrupting peptides, (ii) membrane-penetrating peptides, and (iii) protein-binding peptides. By analyzing critical features such as dipeptides and physicochemical descriptors, we developed models with high accuracy (86–88%) in predicting these categories. However, our initial models (1.0 and 2.0) exhibited a bias towards α-helical and coiled structures that influenced their predictions. To address this structural bias, we implemented subset selection and data reduction strategies. The former gave three structure-specific models for peptides likely to fold into α-helices (models 1.1 and 2.1), coils (1.3 and 2.3), or mixed structures (1.4 and 2.4). The latter depleted over-represented structures, leading to structure-agnostic predictors 1.5 and 2.5. Additionally, our research highlights the sensitivity of important features to the different structure classes across models.
Introduction
Machine learning (ML) models are gradually emerging as cost-effective, time-saving, and informed strategies to accelerate the discovery and design of peptides and proteins. They contribute to our understanding of the sequence-structure–function relationships by capitalizing on abundant protein sequences and more limited structural or functional information. Predictive ML models infer tailored properties or functions like tridimensional structure, protein–protein interactions, target binding affinity, stability, and solubility. Generative ML models create novel biological modalities with the desired properties, e.g., de novo protein design, antibody or enzyme engineering. Combining predictive and generative models minimizes the need for extensive experiments and resources while increasing our chances of achieving successful outcomes, e.g., drug hits or leads. In biofuel production, material design, and drug development, the main applications of ML models include designing peptides with one or more objectives (e.g., cell-penetrating property1, antiviral activity2, antimicrobial activity3,4, anticancer activity5,6,7,8), protein binders9,10,11, monoclonal antibodies12,13, protein families14,15,16, and enzymes17,18.
Machine learning models predominantly rely on peptide and protein sequences due to the wealth of information hidden in the linear chains of amino acids. The biological sequences are abundant and easily accessible in public and private repositories, owing to the advancements in genome sequencing, offering a robust basis for training ML algorithms. In addition, sequence-based models are computationally efficient and generalizable, making them an attractive choice for processing sizable datasets to predict the properties of new peptides and proteins. Nevertheless, understanding the sequence-structure–function relationships also requires structural information, a resource existing ML models often lack. This scarcity arises from the elevated costs and resources associated with X-ray crystallography or nuclear magnetic resonance. The last fifteen years of model development for antimicrobial peptides (AMPs)19,20,21,22,23,24,25 perfectly encapsulate the reliance of machine learning models on peptide and protein sequences.
AMPs, commonly called host-defense peptides (HDPs), are the first line of defense of many organisms against pathogens, acting via direct microbicidal activity or indirect stimulation of the host’s immune responses. They promise to combat global health threats and antimicrobial resistance to conventional antibiotics26,27,28. Most AMPs/HDPs are small amphipathic proteins, generally between 12 and 50 residues, with a net charge between +2 and +9 at physiological pH. Despite these common characteristics, the peptides are significantly diverse in sequence and structure. We recently recommended a fast and robust approach to estimate the structural landscape(s) of medium-large datasets for fold discovery prior to ML modeling. Our predictions identified loose α-helices as the main structure class (65.1%), followed by random coils (17.8%), with β-stranded and mixed structures accounting for the rest of the large AMP dataset29. Consequently, current AMP models (predictors and generators) might favor these dominant structure classes.
Antimicrobial peptides have long been thought to kill pathogens solely by interacting with their phospholipids, leading to cell death (membranolytic AMPs)30,31. Some AMPs/HDPs translocate through cell membranes before reaching intracellular targets (non-membranolytic AMPs)28. Their global cationic and amphipathic characteristics are essential to their electrostatic interactions and hydrogen bonding with bacterial and eukaryotic membranes and are inherently linked to their functional promiscuity32,33. Current ML models excel at identifying sequences of antimicrobial nature or activity, but they rarely capture their intricate mechanism(s) of action. In 2016, Lee and co-workers reported a seminal ML model to understand the relationships between peptide sequences and their interactions with cell membranes34. Two years later, Brand and co-workers classified membrane-active peptides by combining the results of differential scanning calorimetry and circular dichroism experiments with unsupervised ML methods35. Both original studies, based on α-helical peptides, uncovered that physicochemical properties (i.e., amphiphilicity, helical propensity) could help predict the peptides’ antimicrobial nature and membrane activity. In addition, most biophysical experiments and molecular simulations supporting our understanding of their interactions with lipid membranes predominantly use α-helical probes36,37,38,39,40. While these mechanisms apply to most AMPs, the models may not generalize to other structures, whose mechanisms of action remain mainly unresolved.
In the present study, we developed predictive models capable of distinguishing between (i) membrane-active peptides that induce bacterial membrane disruption (MDPs), (ii) those that solely penetrate the membrane to reach one or more intracellular target(s) (MPPs), and (iii) peptides binding to larger proteins (PBPs). Aware of the possible over-representation of α-helical AMPs in our training datasets, we studied the predictive power of our models against different secondary structures and identified a structural bias. We reduced our training sets to tackle the imbalanced structural classes and construct models that could predict the three mechanisms of action (i–iii) irrespective of their structural diversity.
Methods
Figure 1 summarizes the general workflow to predict the membrane or protein activity of peptide sequences with or without taking into account their structural information. In sequence-first modeling, we first collected 1057 peptide sequences from multiple public databases before measuring 8537 features (i.e., compositions in amino acids/dipeptides/tripeptides, global physicochemical properties) for each sequence. Most features were filtered out, and only the most important properties were used to construct machine learning predictors. We evaluated the performances of 12 algorithms for binary classification (to distinguish MDPs from MPPs) and 9 algorithms for ternary classification (to distinguish between MDPs, MPPs, and PBPs). After optimizing our models with tenfold cross-validation, oversampling methods, and hyperparameter tuning, we tested them with an external validation set. We explored the structural landscapes of our datasets and discovered a strong bias towards a specific structural class: α-helices. We revised our training sets to either limit our predictive model to that structural class or develop a more generalized model by generating native-like sequences from minor structural classes.
Datasets
Model datasets
We searched primary peptide sequences from four publicly available databases: DBAASP v3 (Database of Antimicrobial Activity and Structure of Peptides, https://dbaasp.org/)41, APD3 (Antimicrobial Peptide Database)42, PDBe (Protein Data Bank in Europe)43, and CPPsite 2.0 (Cell-Penetrating Peptides site)44, using the specified keywords. All peptides were in their monomeric forms, without unusual amino acid modifications, targeted “gram-positive and gram-negative bacteria”, and were expected to fold into “α-helix, β-sheet, random coil, and mixtures thereof”. We categorized and assigned all sequences into subsets (labels) based on their mechanisms of action: membranolytic/membrane-disrupting peptides (MDPs) target only the “lipid bilayer(s)” of the aforementioned binding targets, while non-membranolytic/membrane-penetrating peptides (MPPs) bind to “DNA, RNA, cytoplasmic protein”. MPPs also include sequences reported as cell-penetrating peptides (CPPs) targeting microbial membranes. MDPs and MPPs together form the membrane-active peptides (MAPs). The last subset comprises protein-binding peptides (PBPs) that bind to larger eukaryotic protein receptors. For MDPs and MPPs, we utilized the DBAASP v3 and APD3 databases, selecting sequences with activities against both “anti-Gram+ and anti-Gram− bacteria”, ensuring that peptides were under 10 kDa, and using keywords like “cell-penetrating peptide”, “DNA”, and “RNA” for binding targets. The CPPsite 2.0 database, whose peptides share a distinct uptake mechanism, was employed for certain MPPs, applying filters like sub-cellular localization type “nucleus” or “cytoplasm”. We only considered peptides without N- or C-terminal modifications, with L-type chirality, and without chemical modifications. Finally, using the PDBe interface, we searched for peptides with terms such as “neuropeptides” and “immunomodulators”, focusing on those that bind to proteins. This comprehensive search also included a literature review to verify activities and mechanisms, thus enhancing our collection of sequences. In total, we gathered 1057 peptides, including 415 MDPs, 334 MPPs (749 MAPs), and 308 PBPs.
External validation dataset
We collected 262 peptide sequences that belong to one of the aforementioned classes (72 MDPs, 57 MPPs, or 133 PBPs) but were absent from the model datasets.
Features
Amino acid composition (AAC)
The amino acid composition (AAC) is the fraction of each of the 20 amino acid types within a peptide sequence of length N, i.e., AAC(i) = Ni/N, where Ni counts the residues of type i. The AAC values were measured using the iFeature web server45.
Dipeptide and Tripeptide composition (AAs and AAAs)
We measured a total of 400 dipeptide features and 8000 tripeptide features from the peptide sequences using the iFeature web server45, as sketched below.
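Although we computed these compositions with the iFeature web server45, they are straightforward to reproduce. The following minimal Python sketch (the magainin 2 example is illustrative) returns the 20 AAC fractions and the 400 dipeptide or 8000 tripeptide frequencies, assuming sequences contain only the 20 standard residues.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumes the 20 standard residues only

def aac(seq: str) -> dict:
    """Amino acid composition: fraction of each residue type in a sequence of length N."""
    return {aa: seq.count(aa) / len(seq) for aa in AMINO_ACIDS}

def kmer_composition(seq: str, k: int = 2) -> dict:
    """Dipeptide (k=2, 400 features) or tripeptide (k=3, 8000 features) composition."""
    counts = {"".join(p): 0 for p in product(AMINO_ACIDS, repeat=k)}
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    total = max(len(seq) - k + 1, 1)
    return {kmer: n / total for kmer, n in counts.items()}

# Example with magainin 2, an α-helical AMP
features = {**aac("GIGKFLHSAKKFGKAFVGEIMNS"),
            **kmer_composition("GIGKFLHSAKKFGKAFVGEIMNS", k=2)}
```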
Physicochemical properties (PCPs)
We measured a total of 117 PCPs from the primary peptide sequences: 76 properties with the R package Peptides (v.2.4.1)46, 33 with the Python package modlAMP (v.3.7.3)47, and 8 with DBAASP v341. For the definitions of all PCPs, see Lists S1–S3.
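As an illustration of the Python contribution, the sketch below assumes the GlobalDescriptor API of modlAMP v3.x; the two sequences are placeholders, and the pH/amide settings are assumptions rather than the exact parameters used in this study.

```python
from modlamp.descriptors import GlobalDescriptor

sequences = ["GIGKFLHSAKKFGKAFVGEIMNS", "RQIKIWFQNRRMKWKK"]  # placeholder peptides
desc = GlobalDescriptor(sequences)
desc.calculate_all(ph=7.4, amide=False)  # assumed physiological pH, free C-terminus

print(desc.featurenames)  # names of the computed global physicochemical properties
print(desc.descriptor)    # one row of descriptor values per input sequence
```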
Data pre-processing
All peptide sequences with duplicated information and/or missing values were removed. All model datasets were normalized as X values, while the external validation and α-helical datasets were normalized as Xvalidation values relative to the model dataset in use for predictions—see Supporting Information Eqs. (S1)–(S2).
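The exact normalization formulas are given in Eqs. (S1)–(S2); the sketch below assumes min-max scaling and only illustrates the key point that the scaler is fit on the model dataset, so the external validation set is transformed relative to the model data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_model = np.random.rand(1057, 53)      # placeholder feature matrix (model dataset)
X_validation = np.random.rand(262, 53)  # placeholder feature matrix (external set)

scaler = MinMaxScaler().fit(X_model)           # statistics from the model dataset only
X_model_scaled = scaler.transform(X_model)     # X values
X_val_scaled = scaler.transform(X_validation)  # Xvalidation relative to the model data
```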
Feature elimination
Multicollinearity
We reduced the number of variables/features (e.g., physicochemical properties) associated with each class to keep only the most informative and non-redundant ones by addressing multicollinearity48. The multicollinearity (MC) filter excludes all highly correlated features to keep only non-redundant properties. In our study, we eliminated redundant properties using a Pearson correlation coefficient cut-off of 0.90.
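A minimal pandas sketch of this filter: for each pair of features with an absolute Pearson coefficient above 0.90, one member of the pair is dropped.

```python
import numpy as np
import pandas as pd

def drop_collinear(features: pd.DataFrame, cutoff: float = 0.90) -> pd.DataFrame:
    """Remove one feature from every pair with |Pearson r| > cutoff."""
    corr = features.corr(method="pearson").abs()
    # inspect each pair once by keeping the upper triangle of the correlation matrix
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant = [col for col in upper.columns if (upper[col] > cutoff).any()]
    return features.drop(columns=redundant)
```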
Statistical tests
For statistical analysis of the datasets, we used the methodology described by our group in 202049 to assess and compare the statistical distributions of physicochemical features between model datasets. We measured the normality of dataset distributions for each class of binary classification models using Shapiro–Wilk and Lilliefors tests, according to the size of the samples, before evaluating which dataset(s) had the same distribution in both groups. We determined the variance with either the F-test for a normally distributed dataset (ND) or the Fligner–Killeen test for a non-normally distributed dataset (AD). We compared the means of physicochemical properties between the two classes by applying the three respective statistical tests: (1) Welch’s t-test to NDs with different variances, (2) the Wilcoxon test (also known as Wilcoxon rank-sum) to ADs with the same variance, and (3) the Kolmogorov–Smirnov test to ADs with different variances, using a significance level α of 0.05. We controlled the false discovery rate with the Benjamini–Hochberg method using the same value α. All tests were performed using R (v.3.6.3)50 and RStudio51. The statistical pipeline is shown in Figs. S3 and S4.
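We ran this pipeline in R; the sketch below maps it onto Python/scipy equivalents for illustration. The sample-size cutoff for switching normality tests and the use of a Student’s t-test for normal distributions with equal variances are our assumptions, as those branches are not spelled out above.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors
from statsmodels.stats.multitest import multipletests

ALPHA = 0.05

def is_normal(x):
    # Shapiro-Wilk for small samples, Lilliefors otherwise (cutoff assumed)
    p = stats.shapiro(x)[1] if len(x) <= 50 else lilliefors(x)[1]
    return p > ALPHA

def equal_variance_f(x, y):
    # two-sided F-test for equality of variances (normal data)
    f = np.var(x, ddof=1) / np.var(y, ddof=1)
    dfx, dfy = len(x) - 1, len(y) - 1
    p = 2 * min(stats.f.sf(f, dfx, dfy), stats.f.cdf(f, dfx, dfy))
    return p > ALPHA

def compare_property(x, y):
    """p-value comparing one physicochemical property between two classes."""
    if is_normal(x) and is_normal(y):
        if equal_variance_f(x, y):
            return stats.ttest_ind(x, y)[1]               # Student's t (assumed branch)
        return stats.ttest_ind(x, y, equal_var=False)[1]  # Welch's t-test
    if stats.fligner(x, y)[1] > ALPHA:                    # Fligner-Killeen variance test
        return stats.ranksums(x, y)[1]                    # Wilcoxon rank-sum
    return stats.ks_2samp(x, y)[1]                        # Kolmogorov-Smirnov

# Benjamini-Hochberg control of the false discovery rate over all properties:
# pvals = [compare_property(class_a[prop], class_b[prop]) for prop in properties]
# significant = multipletests(pvals, alpha=ALPHA, method="fdr_bh")[0]
```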
Building classifiers
Machine learning algorithms
We evaluated 12 binary classification algorithms to discriminate between membrane-disrupting and membrane-penetrating peptides, and 9 ternary (multi-label) classification algorithms to additionally distinguish protein-binding peptides. The binary classification algorithms include RFC: Random Forest Classifier52, GBC: Gradient Boosting Classifier53, ABC: Adaptive Boosting Classifier54, LDA: Linear Discriminant Analysis55, LR: Logistic Regression56, DT: Decision Tree57, K-NN: K-Nearest Neighbors58, GNB: Gaussian Naïve Bayes59, and SVC: Support Vector Classifier (with the 4 kernels: linear, radial basis function, polynomial, sigmoid)60. The ternary classifiers include ETC: Extra-Trees Classifier61, MNB: Multinomial Naïve Bayes62, and RNC: Radius Neighbors Classifier63, in addition to the aforementioned RFC, GBC, LDA, DT, KNN, and GNB. All models were computed using the Python package scikit-learn (v.0.23.1)64.
Performance metrics
We used different metrics to compare the performance of our classification models: accuracy (Acc.), precision (Prec.) or positive predictive value (PPV), recall or true positive rate (TPR), F1 score, Matthews correlation coefficient (MCC), Cohen’s kappa statistic (CK or κ), and the area under the Receiver Operating Characteristic curve (AUC-ROC)—see Supporting Information Eqs. (S3–S8).
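All of these metrics are also available in scikit-learn; a minimal sketch with placeholder labels and class probabilities:

```python
from sklearn import metrics

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # actual classes (1: MDP, 0: MPP), placeholder values
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # predicted class memberships
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3]  # class probabilities P(MDP)

print("Acc. ", metrics.accuracy_score(y_true, y_pred))
print("Prec.", metrics.precision_score(y_true, y_pred))  # PPV
print("TPR  ", metrics.recall_score(y_true, y_pred))     # recall/sensitivity
print("F1   ", metrics.f1_score(y_true, y_pred))
print("MCC  ", metrics.matthews_corrcoef(y_true, y_pred))
print("CK   ", metrics.cohen_kappa_score(y_true, y_pred))
print("AUC  ", metrics.roc_auc_score(y_true, y_prob))    # AUC-ROC
```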
Class membership and class probabilities
Each peptide sequence was assigned a class membership and a class probability P. For binary classifiers, the class membership is either class 0 (MPP) or class 1 (MDP), and the probability P of belonging to that class varies between 0.00 and 1.00 (e.g., PMDP = 0.78). For multiclass classifiers with 3 classes, the class membership is class 0 (MPP), 1 (MDP), or 2 (PBP), and the probability P of belonging to that class varies between 0.00 and 1.00. For each sequence, the class probabilities in each model sum to 1 (i.e., binary: PMPP + PMDP = 1; ternary: PMPP + PMDP + PPBP = 1).
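A small sketch with a toy random forest (features and labels are synthetic placeholders) illustrates the class memberships and probabilities returned by scikit-learn classifiers:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((30, 5))     # placeholder feature vectors
y = rng.integers(0, 3, 30)  # placeholder classes (0: MPP, 1: MDP, 2: PBP)

clf = RandomForestClassifier(random_state=0).fit(X, y)
memberships = clf.predict(X)          # one class membership per sequence
probabilities = clf.predict_proba(X)  # one column per class: [P_MPP, P_MDP, P_PBP]
assert np.allclose(probabilities.sum(axis=1), 1.0)  # probabilities sum to 1 per sequence
```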
Tenfold cross-validation
All model datasets were split into two subsets: one training dataset (80%) used for model building and one smaller testing dataset (20%) used for internal validation. We evaluated the performances of our classification models using tenfold (k = 10) cross-validation, where the training sequences are randomly divided into 10 subsets (folds); nine folds train the models, and the remaining fold is used for evaluation, rotating until each fold has served once for evaluation.
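A sketch of the split and cross-validation with scikit-learn; the feature matrix, labels, and stratified splitting are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

rng = np.random.default_rng(0)
X, y = rng.random((1057, 49)), rng.integers(0, 2, 1057)  # placeholder features/labels

# 80/20 split for model building and internal validation (stratification assumed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)

# tenfold cross-validation on the training subset
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X_train, y_train, cv=cv)
print(f"cross-validated training accuracy: {scores.mean():.3f}")
```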
Oversampling methods
The three subsets (MDPs, MPPs, and PBPs) form imbalanced classes in their respective model datasets. We implemented three oversampling methods, SMOTE (Synthetic Minority Oversampling Technique)65, ROSE (Random Over-Sampling Examples)66, and ADASYN (Adaptive Synthetic Sampling)67, to correct the imbalance by duplicating or generating synthetic sequences from the minority class(es).
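SMOTE and ADASYN are available in the Python package imbalanced-learn; ROSE is an R package, so the sketch below uses imbalanced-learn’s RandomOverSampler as a rough stand-in (ROSE additionally smooths the resampled points). X_train and y_train come from the previous sketch.

```python
from collections import Counter
from imblearn.over_sampling import ADASYN, SMOTE, RandomOverSampler

# each method returns a rebalanced training set
X_ros, y_ros = RandomOverSampler(random_state=0).fit_resample(X_train, y_train)
X_smo, y_smo = SMOTE(random_state=0).fit_resample(X_train, y_train)
X_ada, y_ada = ADASYN(random_state=0).fit_resample(X_train, y_train)
print(Counter(y_train), "->", Counter(y_smo))  # minority class(es) now balanced
```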
Secondary structure prediction
We predicted the tridimensional structures of all peptide sequences (model and external validation datasets) using ColabFold68, a handy interface implementing the AlphaFold2 (AF2)69 technology within the Google Colab environment—version 1.4. The batch mode allows the prediction of multiple peptide sequences in a single session. We requested a single model per sequence through 3 recycles. We kept the AF2 predictions with the highest pLDDT (predicted local distance difference test) and pTM (predicted template modeling) scores. Most models presented pLDDT scores above 80. We submitted all resulting models (.pdb files) to STRIDE70 to assign the secondary structure states—% helix (H), % sheet (E), and % coil (C).
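A sketch of the assignment step: STRIDE is run on each AF2-predicted .pdb file, and the per-residue codes are collapsed into the three states. The stride binary name, the position of the code on the ASG lines, and the three-state mapping (helix: H/G/I; sheet: E/B/b; coil: everything else) are assumptions to be checked against the local STRIDE installation.

```python
import subprocess

HELIX, SHEET = set("HGI"), set("EBb")  # assumed three-state collapse of STRIDE codes

def hec_fractions(pdb_path: str) -> tuple:
    """Run STRIDE on a .pdb model and return (%H, %E, %C)."""
    out = subprocess.run(["stride", pdb_path], capture_output=True, text=True).stdout
    # per-residue assignments sit on "ASG" lines; the sixth field holds the SS code
    codes = [line.split()[5] for line in out.splitlines() if line.startswith("ASG")]
    n = len(codes)
    h = sum(c in HELIX for c in codes) / n
    e = sum(c in SHEET for c in codes) / n
    return 100 * h, 100 * e, 100 * (1 - h - e)
```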
Structural landscape representation and segmentation
We displayed the collective secondary structures of all peptide sequences as single points on a ternary representation—the structural landscape, as previously described29. The three axes represent the three secondary structure states—% helix (H), % sheet (E), and % coil (C). We used the ggtern and ggplot2 libraries71 with R (v.4.3.1) in the RStudio environment51. In addition to the plots, we quantified the distribution of the secondary structures using a kernel density estimate. To ease readership and quantification, we subdivided the structural landscape into arbitrary regions based on their percentage composition of secondary structure states. In the present study, we created four regions: (I) predominantly α-helical peptides, (II) predominantly stranded (β-sheet) peptides, (III) predominantly coiled peptides, and (IV) mixed structures.
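A sketch of the segmentation logic; the 50% dominance cutoff is an illustrative assumption, not our exact region boundaries.

```python
def structural_region(h: float, e: float, c: float) -> str:
    """Assign a (%H, %E, %C) point of the ternary plot to one of four regions."""
    if h >= 50:
        return "I (predominantly helical)"
    if e >= 50:
        return "II (predominantly stranded)"
    if c >= 50:
        return "III (predominantly coiled)"
    return "IV (mixed structures)"

print(structural_region(70.0, 0.0, 30.0))  # -> "I (predominantly helical)"
```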
Results
Identifying key features between membrane-active and protein-binding peptides
The performance of machine learning (ML) models primarily relies on the quality of their input data and independent features. First, we carefully curated 1,057 peptide sequences from four public databases (DBAASP v341, APD342, PDBe43, CPPsite 2.044) before labeling each sequence with one of the following classes: membrane-disrupting peptides (MDPs, n = 415), membrane-penetrating peptides (MPPs, n = 334), and protein-binding peptides (PBPs, n = 308). The first two classes formed the membrane-active peptides (MAPs, N = 749). Before building models, it was essential to identify and select critical properties that may distinguish between membrane-active and protein-binding peptides. Therefore, we measured 8,537 sequence-derived properties for all peptides: 20 amino acid fractions, 400 dipeptides, 8000 tripeptides, and 117 physicochemical properties from different sources.
We first compared the differences in amino acid composition (AAC) between MDPs, MPPs, and PBPs. In Fig. 2A, membrane-penetrating peptides (MPPs, in peach yellow) showed higher contents of proline (0.12) and arginine (0.22) than membrane-disrupting peptides (MDPs, in teal green), which are characteristic of peptides with non-lytic mechanisms, such as cell-penetrating peptides. Both MPPs and MDPs displayed elevated levels of lysine, reminiscent of MAPs. Membrane-disrupting and protein-binding peptides (PBPs, raspberry red) contained many small and aliphatic residues, i.e., alanine, glycine, leucine, and valine. PBPs are enriched in polar and negatively charged residues, i.e., serine, threonine, glutamic and aspartic acids. These polar and charged amino acids indicate the hydrogen bonds and ionic interactions the peptides form with larger protein domains. In contrast, lysine and arginine in MDPs and MPPs suggested that the peptides primarily interact with negatively charged lipid heads across cell membranes. We repeated the exercise by comparing the compositions of dipeptides (DPC) and tripeptides (TPC) across MDPs, MPPs, and PBPs. Given the large numbers of dipeptides (400) and tripeptides (8000), we only kept the crucial differences between membrane-active and protein-binding peptides, as illustrated in Fig. 2B. The presence of the positively charged dipeptides “KK”, “RP”, and “RR” in MPPs (in peach yellow) further supported their roles in interacting with lipid membranes. Proline-rich dipeptides “RP” and “PP” are also distinctive features of these membrane-active peptides. Membrane-disrupting peptides (in teal green) are characterized by high levels of alanine-rich and leucine/isoleucine-rich dipeptides—“AG”, “AL”, “IL”, “KI”, “LA”. Finally, protein-binding peptides (in raspberry red) presented the polar and negatively charged dipeptides—“EA”, “EE”, “EK”, “EL”, “ER”, “GS”, “LE”, “LQ”, “LT”, “PS”, “SD”, “SE”, “SS”, “TL”, “TP” and “TS”—crucial to their interactions with larger protein domains. The same analysis was conducted across the 8000 tripeptides; none discriminated between the three peptide classes.
Considering the differences in amino acid and dipeptide compositions between the three classes, we measured 117 global physicochemical properties for the 1,057 sequences: 76 properties with the R package Peptides (v.2.4.1)46, 33 with the Python package modlAMP (v.3.7.3)47, and 8 with DBAASP v341. The properties are defined in Lists S1–S3. Many of these features encoded the same peptide property (e.g., hydrophobicity); therefore, we performed correlation analyses based on the Pearson correlation coefficient across the datasets from 2 or 3 classes (Figs. S1, S2). Several properties, like hydrophobicity scales, presented strong positive (in red) or negative (in blue) correlations with one another, suggesting highly redundant information. We eliminated all properties showing a correlation coefficient > 0.90, reducing the number of global physicochemical properties to 53 for the two classes and 57 for the three classes. Moreover, we eliminated irrelevant properties by keeping those that differ between membrane-active and protein-binding peptides. We conducted pairwise statistical analyses across two dataset pairs (MDPs and MPPs, MAPs and PBPs)—see Fig. S3. The first pair should indicate properties linked to membrane activity, whereas the second pair should lead to properties for membrane or protein recognition.
We identified 49 significant properties between MDPs and MPPs and 56 significant properties between MAPs and PBPs, summarized in Table S1. Two of the 49 properties (i.e., the hydrophobic moment and the hydrophobicity index on the Wilson scale) are exclusively associated with membrane activity, distinguishing between MDPs and MPPs—see properties 1 and 2 in Table S1. The other 47 properties played a role in discerning between membrane-active and protein-binding peptides. Nine properties (i.e., the prevalence of basic and aromatic residues, differences in charged residues, the linear moment, the isoelectric point (pI), and hydrophobicity indices from different scales) participated in differentiating between MAPs and PBPs—see properties 3–11 in Table S1. We illustrated some of these differences in Figs. 2 and S4. The higher isoelectric points, electrophilicity indices, and net charges of membrane-active peptides evoked their abundant basic residues and dipeptides. Membrane-disrupting peptides (in teal green) are generally more hydrophobic than membrane-penetrating peptides (in peach yellow); they displayed higher means for most indices or scales linked to hydrophobicity. These observations are consistent with the nature of membrane-disrupting peptides, which reside longer within cell membranes before aggregating to form pores. Finally, protein-binding peptides (in raspberry red) are heavier and contain more acidic, aliphatic, and aromatic residues than MPPs, explaining their lower pI (Fig. 2). They also showed higher means for penetration depth and in vitro aggregation propensity (Fig. S4).
Building baseline binary and ternary classification models
Multiple global physicochemical properties or residues and dipeptides could differentiate between membrane-active and protein-binding peptides, endorsing the development of machine-learning predictors for membrane disruption, membrane penetration, and protein recognition. We selected 12 binary classification algorithms to predict the membrane activity of peptide sequences between MDPs and MPPs and 9 ternary classification algorithms to distinguish between the MAP classes and PBPs. We also corrected the imbalance between the two or three classes (MDPs, MPPs, and PBPs) by duplicating or generating synthetic sequences from the minority class(es) using the three oversampling methods ROSE (Random Over-Sampling Examples)66, SMOTE (Synthetic Minority Oversampling Technique)65, and ADASYN (Adaptive Synthetic Sampling)67. Our best initial results are summarized in Table 1. Additional performances of all classifiers under the three oversampling methods are listed in Supporting Information Tables S2–S4 for binary classification and Tables S5–S7 for ternary classification.
Overall, binary and ternary models based on the Random Forest Classification (RFC) algorithm outperformed all other classification models, irrespective of the oversampling technique employed. Table 1 shows that combining RFC with the ROSE method led to the highest prediction accuracies to distinguish between the two classes (88.0% and 83.3%) and three classes (86.7% and 83.5%) for training and testing datasets, respectively. Other classifiers based on tree-based algorithms, including Gradient Boosting (GBC), Adaptive Boosting (ABC), Extra-Trees (ETC), and Decision Tree (DT), followed suit. GBC-based predictors achieved the second-highest performances with cross-validated training accuracies of 86.7% and 85.5% for binary and ternary classification. DT-based models showed the lowest performances among classifiers using tree-based algorithms. ABC-based binary and ETC-based ternary classifiers displayed intermediate accuracies. Among the other algorithms, models based on K-nearest neighbors (KNN) and Support Vector Classification (SVC) with a polynomial kernel demonstrated relatively good performances in binary classification tasks, with training accuracies ranging between 82.7 and 85.2%. These observations support the recent statement that the tree-based models RFC and GBC perform very well in classifying tabular data72. Previous studies have demonstrated that tree-based models outperformed other algorithms in classification or regression tasks using modlAMP descriptors49,73,74,75 or other features76.
Oversampling methods ROSE, SMOTE, and ADASYN are widely used to address class imbalance in classification tasks. ROSE randomly selects sequences from the minority class (e.g., MPPs in binary classifiers) and generates new synthetic samples in their vicinity66. SMOTE generates synthetic sequences interpolated from the minority class65. ADASYN is an extension of SMOTE that adapts the number of synthetic sequences generated from the minority class based on varying degrees of imbalance67. The performances of our classifiers consistently followed the same order: RFC-based models outperformed all other classifiers, followed by GBC-based and KNN-based models, regardless of the classification task and oversampling method. Binary models (Tables S2–S4) and ternary models (Tables S5–S7) using the oversampling method ROSE yielded the best results through tenfold cross-validation. The performances of classifiers using SMOTE and ADASYN remained relatively close. For example, our best binary classifier using the RFC algorithm showed cross-validated training accuracies of 88.0% with ROSE and 86.5% with both SMOTE and ADASYN methods (Tables S2–S4). The randomness of ROSE helps mitigate biases in the target minority class(es)66 that may arise with the other two oversampling methods, and it was sufficient to achieve a good balance here. In general, our classifiers showed a good fit, with training accuracy slightly higher than testing accuracy. In some cases, the models using Logistic Regression (LR) and SVC with radial basis function, linear, or sigmoid kernels presented training accuracies lower than testing accuracies under any oversampling method (Table 1, Tables S2–S7). Therefore, we selected the binary and ternary classifiers using the RFC algorithm as our models of choice to further advance our study.
Feature importances extracted from tree-based algorithms are essential for model interpretability and improvement in predictive science. In our hands, they provided valuable insights into the underlying relationships between peptide descriptors and their mechanisms of action (i.e., classes). The most relevant descriptors for our binary and ternary RFC models are visualized in Fig. 5 and Tables S8, S9. In Fig. 5A (model 1.0), elements of hydrophobicity (hydrophobic ratio, H. index), size (molecular weight), amphiphilicity (aliphatic amino acids, % tiny residues, flexibility, ABHPRK), and charge (charge density, electrophilicity) were among the key features separating membrane-disrupting peptides (MDPs) from membrane-penetrating peptides (MPPs). This is consistent with recent studies highlighting the role played by amphiphilicity in the membrane activity of α-helical AMPs34,35. In Fig. 5B (model 2.0), molecular weight and charge density were the most critical descriptors to differentiate the two classes mentioned above and PBPs. Differences in isoelectric points and hydrogen bonding played minor roles between the two or three classes. These observations are reminiscent of the differences in Figs. 2C and S4.
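A sketch of how such importance scores are extracted from a fitted scikit-learn random forest (features, labels, and descriptor names are placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 49))                    # placeholder physicochemical descriptors
y = rng.integers(0, 2, 200)                  # placeholder classes (MDP vs. MPP)
names = [f"property_{i}" for i in range(49)] # hypothetical descriptor names

clf = RandomForestClassifier(random_state=0).fit(X, y)
ranking = np.argsort(clf.feature_importances_)[::-1][:10]  # ten most important features
for i in ranking:
    print(f"{names[i]}: {clf.feature_importances_[i]:.3f}")  # importance scores (IS)
```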
Auditing the datasets for structural bias
Aware of the possible over-representation of α-helical peptides in our models, we evaluated the structural diversity of the peptide datasets. We recently developed a fast and reliable approach to estimate the structural landscape of any sizable peptide dataset using the protein structure predictors PEP2D and AlphaFold2 (AF2)29. Many of the sequences in our datasets contained 50 or more residues, guiding our preference for AF2. Thus, we predicted the tridimensional structures of all peptide sequences using the ColabFold environment with AF2 in batch mode through 3 recycles68. A few structures were incorrectly predicted and were ignored, leading to a final model dataset of 412 MDPs, 326 MPPs, and 307 PBPs. The resulting predicted structures were submitted to STRIDE70 to assign the secondary structure states—% helix (H), % sheet (E), and % coil (C). We displayed the global and class-specific structural compositions of our model and external validation datasets in Fig. 3. To ease readership and quantification, we divided the ternary plots into 4 structural regions, namely (I) predominantly helical peptides, (II) predominantly stranded (β-sheet) peptides, (III) predominantly coiled peptides, and (IV) mixed structures. The sizes of peptide subsets across structural regions and classes are summarized in a table (Fig. 4A).
In Fig. 3A, we observed the structural landscapes of the model dataset with 1046 peptides (purple) and the external validation dataset, including 262 AF2 + STRIDE predictions (orange). Both datasets are distributed across three of the four structural regions (I, III, and IV), with neither displaying predicted stranded (β-sheet) peptides (II). For the model dataset, most predictions assumed helical structures with varying coiled levels in region (I), i.e., 564 peptides (53.9%), followed by 346 coiled peptides (III: 33.1%) and 136 mixed structures (IV: 13.0%). The external validation set presented nearly equal proportions of α-helices (I: 91, 34.7%) and coils (III: 102, 38.9%), with mixed structures accounting for the rest (IV: 69, 26.4%). In Fig. 3B,C, we reported the respective structural compositions of the model and external validation datasets per class—MDPs (teal green), MPPs (peach yellow), and PBPs (raspberry red). Ternary plots in Fig. 3B confirmed that most sequences in the model dataset would fold into α-helical peptides (564, 53.9%) and coils (346, 33.1%). A minority of sequences adopted mixed structures (136, 13.0%). In Fig. 4A, structural region (I) presented α-helices among its three classes with 265 MDPs (64.3%), 114 MPPs (34.9%), and 185 PBPs (60.3%). A quarter of MDP sequences (108, 26.2%), more than half of MPPs (188, 57.7%), and some PBPs (50, 16.3%) would be predicted as coiled structures (III). Finally, the three classes also included sequences that would fold into mixed structures: 39 MDPs (9.5%), 25 MPPs (7.7%), and 72 PBPs (23.4%).
These observations suggested that our classification models (Table 1) might predict the mechanisms of action of peptide sequences that would likely fold into α-helices or coils with greater accuracy. We tested this hypothesis by splitting the external validation dataset according to its secondary structures (Fig. 3C) and comparing the performances of RFC models against the whole dataset and its structural subsets (I, III, and IV). The external validation dataset also contained sequences that would adopt folds located in regions (I) and (III), counting 91 α-helices (34.7%) and 102 coiled structures (38.9%) among its three classes (Fig. 4A). In contrast, the dataset was devoid of mixed structures with MDP activity. Most mixed structures were within the PBP class (64, 92.7%), with a few MPPs (5). Figure 4B,C shows the confusion matrices resulting from the binary and ternary classifications. We defined the misclassification rate as the fraction of incorrectly labeled sequences. For example, in Fig. 4B, our binary classifier (model 1.0) correctly classified 88 peptides as either MDPs (54) or MPPs (34), leading to a misclassification rate of 0.318 (41 out of 129) on the complete dataset. In the same figure, the misclassification rate among α-helical peptides (subset I) was lower, with a value of 0.265, whereas the fraction of misclassified coiled sequences (subset III) reached 0.422. Four-fifths of mixed structures were correctly classified in subset IV. With our ternary classifier (model 2.0) in Fig. 4C, roughly a third (0.305) of all sequences were misclassified. The misclassification rate peaked at 0.560 in subset I, predominantly from α-helical MPPs (21 out of 51 misclassified peptides). Among coiled and mixed structures (subsets III and IV), most PBPs (52 out of 57, 45 out of 64) were correctly labeled; the misclassification rates of 0.421 and 0.347 resulted from misclassified membrane-active peptides. The structural imbalance between model subsets partly explained these values; most MDPs were α-helices (265), and half of the mixed structures were PBPs (72), as depicted in Figs. 3B and 4A. Consequently, our models correctly assigned most α-helical MDPs and most coiled and mixed PBPs from the external datasets (Figs. 4B and 4C). In contrast, most coiled peptides in the external validation set were misclassified (Fig. 4B: 13 out of 24, Fig. 4C: 20 out of 24) despite 188 coiled MPPs in the model sets. Both imbalances, among structures and classes, affected the model performances.
Mitigating the structural bias by subset selection and data reduction
To tackle the effects that imbalanced structural regions have on the performances of our models, we developed new binary and ternary classifiers that could predict the three mechanisms of action (MDPs, MPPs, and PBPs) irrespective of their structural diversity. These models were either trained on specific structural subsets—predicted α-helices (1.1 and 2.1), predominantly coils (1.3 and 2.3), and mixed structures (1.4 and 2.4)—or trained on sets in which the structural subsets I, III, and IV were balanced out, giving the new training sets V and the models 1.5 and 2.5. In the latter, we proceeded by randomly reducing the number of sequences folding into (loose) α-helices from the primary peptide class(es), as sketched below. We repeated the procedure five times; the final classes were selected by majority vote (mode). All three classes were balanced using the ROSE oversampling method. The performances of all models are summarized in Table 2 (binary models 1.1–1.5) and Table 3 (ternary models 2.1–2.5). We added the performance metrics of our reference models 1.0 and 2.0 for direct comparison.
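A sketch of this data-reduction strategy as we describe it above, with illustrative names and depletion size: five models are trained on five random depletions of the α-helical subset (region I), and the final class of each sequence is the mode of the five predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def mode_vote_predict(X_train, y_train, helix_idx, X_new, n_repeats=5, keep=0.5):
    """Train one RFC per random depletion of region I and vote on the final classes."""
    rng = np.random.default_rng(0)
    predictions = []
    for _ in range(n_repeats):
        drop = rng.choice(helix_idx, size=int(len(helix_idx) * (1 - keep)), replace=False)
        mask = np.ones(len(y_train), dtype=bool)
        mask[drop] = False  # deplete part of the over-represented α-helical subset
        clf = RandomForestClassifier(random_state=0).fit(X_train[mask], y_train[mask])
        predictions.append(clf.predict(X_new))
    votes = np.vstack(predictions).astype(int)
    # majority vote (mode) over the five models, per sequence
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```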
We noted that most structure-specific models outperformed their unspecific parent classifiers (models 1.0 and 2.0). For example, the α-helix-specific binary classifier (1.1) presented respective training and testing accuracies of 95.5% and 89.5%, far better than model 1.0 with 88.0% and 83.3% accuracy values. This improvement was also observed across the other performance metrics, i.e., precision, recall, F1, MCC, CK, and ROC AUC values. Likewise, the coil-specific binary classifier (1.3) was slightly improved, with training and testing accuracies of 90.5% and 83.3%. On the contrary, model 1.4, trained on a handful of peptide sequences with AF2-predicted mixed structures, demonstrated poorer performance—see Table 2. Looking at model 1.5, removing representative sequences from subset I (AF2-predicted α-helices) at random led to information loss in the training process, translating to lower accuracies of 85.4% and 83.2%. However, its higher recall indicated stronger sensitivity. The other classification metrics also supported this observation. The structure-specific models 2.1–2.4 outperformed their parent ternary classifier 2.0—see Table 3. Unlike model 1.4, the ternary model 2.4 was trained on many PBP sequences with AF2-predicted mixed structures, leading to better classification metrics.
A benefit of building predictive models from random forest and other tree-based algorithms is the built-in estimation of feature importances. We hypothesized that importance scores of physicochemical properties as features would be sensitive to the structural awareness of our models. For example, the critical features involved in classifying α-helical sequences would differ from those classifying coiled or β-stranded membrane-active peptides. We measured the importance scores of 49 physicochemical properties for binary models 1.0–1.5 and 56 properties for ternary models 2.0–2.5 in Tables S8 and S9, respectively. To ease readership, we showed the 10 most common features and their importance scores (colored circles) in Fig. 5A,B.
In Fig. 5A, we recall that hydrophobicity (hydrophobic ratio, H. index), size (molecular weight), amphiphilicity (aliphatic amino acids, % tiny residues, flexibility, ABHPRK), and charge (charge density, electrophilicity) were among the features distinguishing membrane-disrupting peptides (MDPs) from membrane-penetrating peptides (MPPs). With model 1.1, the two global peptide descriptors cougar and charge density were essential to classify α-helical MDPs and MPPs, with importance scores (IS) of 0.053 and 0.043—Table S8. In contrast, flexibility was the main physicochemical property to classify coiled membrane-active peptides (model 1.3—ISflexibility 0.094). Finally, hydrophobicity, size, and amphiphilicity played important roles for membrane-active peptides with AF2-predicted mixed structures. Structure-agnostic model 1.5 shared at least the 10 most important features with parent model 1.0, with importance scores in the same order of magnitude, except for the ABHPRK property (ISABHPRK 0.030 vs. 0.044, Table S8). Diminishing α-helical sequences in the training process reduced the predictive power of model 1.5, but it did not induce changes in feature importance. In other words, structure-specific models and their feature importances are sensitive to the (partial or complete) proportions of structural subsets (I–IV), supporting our hypothesis.
In Fig. 5B, ternary model 2.0 estimated that molecular weight and charge density were the two most essential descriptors to differentiate MDPs, MPPs, and PBPs, with respective importance scores of 0.069 and 0.052 (Table S9). Other elements related to charge (electrophilicity, hydrogen bonding, charge density), hydrophobicity (Janin scale, ASHR), and amphiphilicity played minor roles among the three classes. Not surprisingly, molecular weight (0.065) and charge density (0.084) were crucial for classifying α-helical MDPs, MPPs, and PBPs in model 2.1, with ~54% of sequences in model 2.0 predicted to adopt α-helices. Hydrophobicity and elements of charge and molecular weight (to a lesser degree of importance—0.035) distinguished coiled membrane-active and protein-binding peptides (model 2.3). Finally, size (molecular weight) is the leading property separating membrane-active and protein-affine mixed structures (model 2.4—ISMW 0.071). Like model 1.5, structure-agnostic model 2.5 shared the same essential features as its parent model (2.0), such as molecular weight, acidic residues, and charge density, with respective scores of 0.059, 0.047, and 0.039 (Table S9). Unlike model 1.5, removing α-helical sequences in the training process did not induce a loss in predictive power.
We evaluated the performances of the structure-specific models (1.1–1.4, 2.1–2.4) and structure-agnostic models (1.5 and 2.5) against structural subsets of the external validation dataset (I, III, and IV) and its “balanced” form (V), as detailed in Fig. 5C,D. From the confusion matrices, we derived the five key performance indicators—accuracy, precision, recall, specificity, and F1 score, as per Eqs. (S3)–(S6) and (S9), and we compiled the results in Table S10. Overall, our analysis of the external validation subsets revealed three distinct trends in class predictions based on the structure awareness of the models: an increase, a decrease, or no change.
Narrowing the training process to sequences folded into α-helices improved the predictive power of model 1.1 compared to model 1.0 (Table 2). Applying these models to the α-helical external validation subset (I) also increased all binary classification metrics, indicative of the correct assignment of MDPs and MPPs, i.e., precision, recall, and F1 scores of 0.804, 0.891, and 0.845 (model 1.1—Fig. 5C and Table S10). Similarly, the ternary model 2.4 learned to classify peptides folding into AF2-predicted mixed structures, particularly PBP sequences. Consequently, the many PBP sequences within the external validation subset (IV) were correctly identified with a classification accuracy of 80.4% (model 2.4—Fig. 5D and Table S10). These two examples illustrate that training and external validation subsets must share structures and classes to ensure the correct classification of sequences.
Without shared structures and classes, the external validation subsets would score identical metrics or lead to misclassification. For example, applying binary model 1.4 to subset IV resulted in the same accuracy of 80.0% despite an improved training process to classify AF2-predicted mixed structures (models 1.0 and 1.4—Tables 2 and S10). This result is associated with the quasi-absence of MDP/MPP sequences within the external validation subset (IV); see Figs. 3C and 4A. Likewise, many sequences from subsets (I) and (III) were misclassified, as indicated by the lower accuracies and recall values in Table S10, despite the improved performances of structure-specific models 1.3, 2.1, and 2.3. The other classification metrics (precision, specificity, and F1) varied with the imbalance between structural subsets and classes. As such, applying model 1.3 to subset (III) incorrectly assigned MDPs (actual positives) and misclassified many MPPs (actual negatives) as MDPs, leading to higher precision, lower specificity, and lower F1 scores from the underrepresented coiled MDPs. Model 2.3 struggled to distinguish between the MDP and MPP classes in subset (III) due to the abundance of coiled PBPs, leading to lower precision, recall, specificity, and F1 scores—see Fig. 5D and Table S10. Conversely, the abundance of α-helical MDPs over the other two classes in subset (I) drove higher precision, specificity, and F1 scores. In other words, few MDPs (positives) were misidentified, so more α-helical sequences from the underrepresented MPPs and PBPs could be predicted correctly (model 2.1—Table S10).
Finally, removing random α-helical sequences during training resulted in sub-optimal models 1.5 and 2.5 with similar or reduced performances—see Tables 2 and 3. Applying these models to the balanced subset (V) showed identical or worsened classification metrics (models 1.5 and 2.5—Table S10). As Fig. 3B suggests, randomly removing several α-helical sequences would lead to more balanced yet less informative structural landscapes of MDPs and MPPs, whereas mixed structures would dominate among PBPs. Consequently, removing sequences accentuated class and structural imbalances. Our ternary model 2.5 is thus less effective at discriminating between the three classes and structural subsets (I–IV).
Discussion
Our study primarily aimed to identify critical features distinguishing membrane-active peptides (MAPs) from protein-binding peptides (PBPs). Our research extends two seminal statistical analyses deciphering the relationships between peptides and their behaviors when interacting with cell membranes34,35. In the first study, Lee and co-workers developed a sequence-first Support Vector Classifier for antimicrobial peptides (AMPs). The authors identified that their model predicted the membrane activity of AMPs rather than their antimicrobial nature34,77. Two years later, Brand and co-workers classified membrane-active peptides by combining differential scanning calorimetry results and circular dichroism experiments using unsupervised learning35. Although limited to α-helical AMPs, both original studies uncovered that physicochemical properties (i.e., amphiphilicity, helical propensity) could help predict the peptides’ antimicrobial nature and membrane activity. In our hands, membrane-active peptides presented higher levels of lysine and positively charged dipeptides, suggesting that the peptides primarily interact with negatively charged lipids across cell membranes. Regarding their global physicochemical properties, MAPs were characterized by higher amphiphilicity and net charges than protein-binding peptides, corroborating the findings above. Membrane-disrupting peptides exhibited higher levels of hydrophobicity, correlating with their prolonged residence within cell membranes before aggregation and pore formation. In contrast, protein-binding peptides showed higher penetration-depth indices and a greater tendency to aggregate in vitro.
Encouraged by our initial findings, we developed machine learning classifiers to predict membrane or protein activity (recognition) between the two or three classes. We employed 12 binary classification algorithms to distinguish between MDPs and MPPs and 9 ternary classification algorithms to differentiate between the MAP classes and PBPs. Overall, the Random Forest Classification (RFC) emerged as the most effective algorithm for both binary and ternary classifiers, achieving the highest accuracies (86.7–88.0% for training and 83.3–83.5% for testing) when combined with the oversampling method ROSE. Our study further supported the evidence that tree-based models outperformed other algorithms in classification tasks using tabular data such as modlAMP descriptors49,73,74,75 or other features76.
Aware of the possible over-representation of α-helices among AMPs29, we estimated the structural landscapes of our model datasets, revealing a majority of peptides assuming α-helical and loose structures (subset I—53.9%), two minor subsets of coiled structures (III—33.1%) and mixed structures (IV—13.0%), and the complete absence of β-stranded peptides (II). An objective evaluation of the structural awareness demonstrated that our preliminary models were biased toward sequences likely to form α-helical structures, assigning them high class probabilities. In other words, our models were structurally biased. Notably, most ML models are sequence-based and have not incorporated structural information in their predictions for peptide design19,20,21,22,23,24,25. Some researchers have, however, reduced their training sequences to a specific amino acid distribution, inducing involuntary structural constraints. For instance, Dean and co-workers developed tree-based regression models that predict the minimum inhibitory concentration (MIC) against strains such as E. coli, S. aureus, and P. aeruginosa, specifically for peptides devoid of cysteine and proline that exclusively fold into α-helices73. Our study is the first assessment of structural bias in predictive models and of the structural effects upon prediction. Additional independent studies are strongly encouraged.
In the latter part of our study, we tackled the structural bias in our predictions by developing new classification models tailored to specific structural subsets, denoted as structure-specific models 1.1–1.4 and 2.1–2.4. We rectified the datasets by adjusting class proportions with the oversampling method ROSE. Additionally, we explored the impact of reducing the number of sequences from the dominant subset, i.e., loose α-helices (I), leading to structure-agnostic models 1.5 and 2.5. Generally, the structure-specific models surpassed their non-specific parent classifiers (1.0 and 2.0) when the training sets contained abundant sequences folding into those structures. However, it became apparent that the sequences to be tested must adopt the same structures to ensure accurate predictions. Notably, models 1.1 and 2.4 exhibited significant increases in training and testing accuracies compared to models 1.0 and 2.0, with similar improvements observed across other performance metrics. Conversely, the absence of sequences sharing the same structural landscape hindered model performance. Similarly, the random removal of α-helical sequences within the over-represented structural subset (I) exacerbated class and structural imbalances, resulting in underperforming structure-agnostic models. While balancing the training data is essential, it must be done thoughtfully to avoid information loss and preserve the model’s predictive power. In terms of feature importance, our investigation revealed distinct features that played crucial roles in differentiating between structure-specific models. The models built on α-helical sequences prioritized hydrophobicity, size, and charge, whereas those tailored for coiled structures valued flexibility. These findings prompt us to reconsider whether an amphiphilic nature is an exclusive characteristic of membrane-active peptides with α-helical folds.
Conclusion
This study delved into the critical features (i.e., dipeptides and global physicochemical properties) distinguishing between membrane-disrupting peptides, membrane-penetrating peptides, and protein-binding peptides, leading to the development of robust machine-learning classifiers. While our focus was on predicting these mechanisms of action, our investigation also demonstrated a rapid assessment of structural landscapes in peptide datasets and unveiled a structural bias in our initial models, which were inclined to accurately predict the mechanisms of action of sequences forming α-helical structures. To address this bias, we explored two strategies: creating new predictive models trained on specific structural subsets (referred to as structure-specific models) or utilizing structurally balanced datasets (referred to as structure-agnostic models). Overall, the structure-specific models tended to outperform the non-specific ones, particularly when the training and test sequences shared similar structural landscapes. Conversely, removing α-helical sequences randomly worsened the model performances, indicating the need for thoughtful data balancing to avoid information loss. Moreover, our analysis highlighted the sensitivity of important features across these models to the structure classes, underlining the significance of considering structural nuances in peptide classification. This work is one of the first studies to assess structural bias in predictive models and to suggest that structural effects are crucial for accurate predictions, an issue not commonly addressed in sequence-based machine learning models. It is a significant step forward in advancing our understanding of the sequence-structure–function relationships and the design of peptide-based therapeutics using artificial intelligence.
Data availability
Supporting data in this article are provided in Supporting Information. Data and scripts can be downloaded from the public GitHub repository https://github.com/Puga8Ma/Structure-aware-ML-for-AMP-discovery.
References
de Oliveira, E. C. L., da Costa, K. S., Taube, P. S., Lima, A. H. & Junior, C. de S. de S. Biological membrane-penetrating peptides: computational prediction and applications. Front. Cell. Infect. Microbiol. 12, (2022).
Ali, F., Kumar, H., Alghamdi, W., Kateb, F. A. & Alarfaj, F. K. Recent advances in machine learning-based models for prediction of antiviral peptides. Arch. Comput. Methods Eng. 30, 4033–4044 (2023).
Melo, M. C. R., Maasch, J. R. M. A. & de la Fuente-Nunez, C. Accelerating antibiotic discovery through artificial intelligence. Commun. Biol. 4, 1–13 (2021).
Aguilera-Puga, M. d. C., Cancelarich, N. L., Marani, M. M., De La Fuente-Nunez, C. & Plisson, F. Accelerating the discovery and design of antimicrobial peptides with artificial intelligence. In Computational Drug Discovery and Design (Springer, 2023).
Grisoni, F. et al. Designing anticancer peptides by constructive machine learning. ChemMedChem 13, 1300–1302 (2018).
Hwang, J. S. et al. Development of anticancer peptides using artificial intelligence and combinational therapy for cancer therapeutics. Pharmaceutics 14, 997 (2022).
Zakharova, E., Orsi, M., Capecchi, A. & Reymond, J.-L. Machine learning guided discovery of non-hemolytic membrane disruptive anticancer peptides. ChemMedChem 17, e202200291 (2022).
Martinez-Hernandez, C., Del Carmen Aguilera-Puga, M. & Plisson, F. Deconstructing the potency and cell-line selectivity of membranolytic anticancer peptides. ChemBioChem 24, e202300058 (2023).
Guo, Z. & Yamaguchi, R. Machine learning methods for protein-protein binding affinity prediction in protein design. Front. Bioinf. 2, 1065703 (2022).
Gupta, A. et al. Generative recurrent networks for de novo drug design. Mol. Inform. 37, (2018).
Bennett, N. R. et al. Improving de novo protein binder design with deep learning. Nat. Commun. 14, 2625 (2023).
Akbar, R. et al. In silico proof of principle of machine learning-based antibody design at unconstrained scale. mAbs 14, 2031482 (2022).
Kim, J., McFee, M., Fang, Q., Abdin, O. & Kim, P. M. Computational and artificial intelligence-based methods for antibody development. Trends Pharmacol. Sci. 44, 175–189 (2023).
Yang, K. K., Wu, Z. & Arnold, F. H. Machine-learning-guided directed evolution for protein engineering. Nat. Methods 16, 687–694 (2019).
Clifton, B. E., Kozome, D. & Laurino, P. Efficient exploration of sequence space by sequence-guided protein engineering and design. Biochemistry 62, 210–220 (2023).
Madani, A. et al. Large language models generate functional protein sequences across diverse families. Nat. Biotechnol. 41, 1099–1106 (2023).
Mazurenko, S., Prokop, Z. & Damborsky, J. Machine learning in enzyme engineering. ACS Catal. 10, 1210–1223 (2020).
Feehan, R., Montezano, D. & Slusky, J. S. G. Machine learning for enzyme engineering, selection and design. Protein Eng. Des. Sel. 34, gzab019 (2021).
Fjell, C. D. et al. Identification of novel antibacterial peptides by chemoinformatics and machine learning. J. Med. Chem. 52, 2006–2015 (2009).
Fjell, C. D., Hiss, J. A., Hancock, R. E. W. & Schneider, G. Designing antimicrobial peptides: Form follows function. Nat. Rev. Drug Discov. 11, 37–51 (2012).
Yoshida, M. et al. Using evolutionary algorithms and machine learning to explore sequence space for the discovery of antimicrobial peptides. Chem 4, 533–543 (2018).
Cardoso, M. H. et al. Computer-aided design of antimicrobial peptides: Are we generating effective drug candidates?. Front. Microbiol. 10, 1–15 (2020).
Xu, J. et al. Comprehensive assessment of machine learning-based methods for predicting antimicrobial peptides. Brief. Bioinform. 22, bbab083 (2021).
Wang, G., Vaisman, I. I. & van Hoek, M. L. Machine learning prediction of antimicrobial peptides. In Computational Peptide Science (ed. Simonson, T.) vol. 2405 1–37 (Springer US, New York, NY, 2022).
Fernandes, F. C. et al. Geometric deep learning as a potential tool for antimicrobial peptide prediction. Front. Bioinf. 3, 1216362 (2023).
Hancock, R. E. W., Haney, E. F. & Gill, E. E. The immunology of host defence peptides: Beyond antimicrobial activity. Nat. Rev. Immunol. 16, 321–334 (2016).
Haney, E. F., Straus, S. K. & Hancock, R. E. W. Reassessing the host defense peptide landscape. Front. Chem. 7, 43 (2019).
Mookherjee, N., Anderson, M. A., Haagsman, H. P. & Davidson, D. J. Antimicrobial host defence peptides: Functions and clinical potential. Nat. Rev. Drug Discov. 19, 311–332 (2020).
Aldas-Bulos, V. D. & Plisson, F. Benchmarking protein structure predictors to assist machine learning-guided peptide discovery. Digit. Discov. 2, 981–993 (2023).
Hancock, R. E. W. & Sahl, H. G. Antimicrobial and host-defense peptides as new anti-infective therapeutic strategies. Nat. Biotechnol. 24, 1551–1557 (2006).
Zasloff, M. Mysteries that still remain. Biochim. Biophys. Acta Biomembr. 1788, 1693–1694 (2009).
Torrent, M., Andreu, D., Nogués, V. M. & Boix, E. Connecting peptide physicochemical and antimicrobial properties by a rational prediction model. PLoS ONE 6, e16968 (2011).
Torrent, M., Valle, J., Nogués, M. V., Boix, E. & Andreu, D. The generation of antimicrobial peptide activity: A trade-off between charge and aggregation? Angew. Chem. Int. Ed. 50, 10686–10689 (2011).
Lee, E. Y., Fulan, B. M., Wong, G. C. L. & Ferguson, A. L. Mapping membrane activity in undiscovered peptide sequence space using machine learning. Proc. Natl. Acad. Sci. 113, 13588–13593 (2016).
Brand, G. D., Ramada, M. H. S., Genaro-Mattos, T. C. & Bloch, C. Towards an experimental classification system for membrane active peptides. Sci. Rep. 8, 1194 (2018).
Brogden, K. A. Antimicrobial peptides: Pore formers or metabolic inhibitors in bacteria? Nat. Rev. Microbiol. 3, 238–250 (2005).
Sengupta, D., Leontiadou, H., Mark, A. E. & Marrink, S.-J. Toroidal pores formed by antimicrobial peptides show significant disorder. Biochim. Biophys. Acta Biomembr. 1778, 2308–2317 (2008).
Wimley, W. C. Describing the mechanism of antimicrobial peptide action with the interfacial activity model. ACS Chem. Biol. 5, 905–917 (2010).
Hollmann, A., Martinez, M., Maturana, P., Semorile, L. C. & Maffia, P. C. Antimicrobial peptides: Interaction with model and biological membranes and synergism with chemical antibiotics. Front. Chem. 6, 204 (2018).
Juhl, D. W., Glattard, E., Aisenbrey, C. & Bechinger, B. Antimicrobial peptides: Mechanism of action and lipid-mediated synergistic interactions within membranes. Faraday Discuss. 232, 419–434 (2021).
Pirtskhalava, M. et al. DBAASP v3: Database of antimicrobial/cytotoxic activity and structure of peptides as a resource for development of new therapeutics. Nucleic Acids Res. 49, D288–D297 (2021).
Wang, G., Li, X. & Wang, Z. APD3: The antimicrobial peptide database as a tool for research and education. Nucleic Acids Res. 44, D1087–D1093 (2016).
Armstrong, D. R. et al. PDBe: Improved findability of macromolecular structure data in the PDB. Nucleic Acids Res. 48, D335–D343 (2020).
Agrawal, P. et al. CPPsite 2.0: A repository of experimentally validated cell-penetrating peptides. Nucleic Acids Res. 44, D1098–D1103 (2016).
Chen, Z. et al. iFeature: A Python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics 34, 2499–2502 (2018).
Osorio, D., Rondón-Villarreal, P. & Torres, R. Peptides: A package for data mining of antimicrobial peptides. R J. 7, 4 (2015).
Müller, A. T., Gabernet, G., Hiss, J. A. & Schneider, G. modlAMP: Python for antimicrobial peptides. Bioinformatics 33, 2753–2755 (2017).
Alin, A. Multicollinearity. Wiley Interdiscip. Rev. Comput. Stat. 2, 370–374 (2010).
Plisson, F., Ramírez-Sánchez, O. & Martínez-Hernández, C. Machine learning-guided discovery and design of non-hemolytic peptides. Sci. Rep. 10, 16581 (2020).
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. (2020).
RStudio Team. RStudio: Integrated development for R. RStudio, PBC, Boston, MA (2020).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001). https://doi.org/10.1023/A:1010933404324.
Friedman, J. H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 29, 1189–1232 (2001).
Freund, Y. & Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 119–139 (1997).
Fisher, R. A. The use of multiple measurements in taxonomic problems. Ann. Eugen. 7, 179–188 (1936).
Cramer, J. S. The origins of logistic regression. SSRN Electron. J. https://doi.org/10.2139/ssrn.360300 (2003).
Breiman, L., Friedman, J. H., Stone, C. J. & Olshen, R. A. Classification and Regression Trees. Wadsworth Statistics/Probability Series (Taylor & Francis, 1984). https://doi.org/10.1201/9781315139470.
Cunningham, P. & Delany, S. J. k-nearest neighbour classifiers—a tutorial. ACM Comput. Surv. 54, 1–25 (2022).
Current Trends in Knowledge Acquisition (IOS Press, Amsterdam, 1990).
Cortes, C. & Vapnik, V. Support-vector networks. Mach. Learn. 20, 273–297 (1995).
Geurts, P., Ernst, D. & Wehenkel, L. Extremely randomized trees. Mach. Learn. 63, 3–42 (2006).
Manning, C. D., Raghavan, P. & Schütze, H. Introduction to Information Retrieval. (Cambridge University Press, 2008). https://doi.org/10.1017/CBO9780511809071.
Bentley, J. L. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 509–517 (1975).
Pedregosa, F. et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002).
Lunardon, N., Menardi, G. & Torelli, N. ROSE: A package for binary imbalanced learning. R J. 6, 79 (2014).
He, H., Bai, Y., Garcia, E. A. & Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence) 1322–1328 (IEEE, Hong Kong, China, 2008). https://doi.org/10.1109/IJCNN.2008.4633969.
Mirdita, M. et al. ColabFold: Making protein folding accessible to all. Nat. Methods 19, 679–682 (2022).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Heinig, M. & Frishman, D. STRIDE: A web server for secondary structure assignment from known atomic coordinates of proteins. Nucleic Acids Res. 32, W500–W502 (2004).
Hamilton, N. E. & Ferry, M. ggtern: Ternary diagrams using ggplot2. J. Stat. Softw. 87 (2018).
Grinsztajn, L., Oyallon, E. & Varoquaux, G. Why do tree-based models still outperform deep learning on tabular data? Preprint at https://doi.org/10.48550/arXiv.2207.08815 (2022).
Dean, S. N., Alvarez, J. A. E., Zabetakis, D., Walper, S. A. & Malanoski, A. P. PepVAE: Variational autoencoder framework for antimicrobial peptide generation and activity prediction. Front. Microbiol. 12, 725727 (2021).
Grafskaia, E. N. et al. Non-toxic antimicrobial peptide Hm-AMP2 from leech metagenome proteins identified by the gradient-boosting approach. Mater. Des. 224, 111364 (2022).
Sequeira, A. M., Lousa, D. & Rocha, M. ProPythia: A Python package for protein classification based on machine and deep learning. Neurocomputing 484, 172–182 (2022).
Bhadra, P., Yan, J., Li, J., Fong, S. & Siu, S. W. I. AmPEP: Sequence-based prediction of antimicrobial peptides using distribution patterns of amino acid properties and random forest. Sci. Rep. 8, 1697 (2018).
Lee, E. Y., Lee, M. W., Fulan, B. M., Ferguson, A. L. & Wong, G. C. L. What can machine learning do for antimicrobial peptides, and what can antimicrobial peptides do for machine learning? Interface Focus 7, 20160153 (2017).
Acknowledgements
The authors disclose support for this research from the Mexican research council Consejo Nacional de Humanidades Ciencias y Tecnologías (CONAHCYT) Basic Science [grant number A1-S-32579, 2020-2023]. F.P. discloses support from the Rosenkranz Medical Research Award 2021 [biotechnology category] (FunSalud–Fundación para la Salud, A.C., and Roche, AG). Mariana d. C. Aguilera-Puga (M.d.C.A.-P.) discloses support from a national CONAHCYT PhD scholarship.
Author information
Contributions
M.d.C.A.-P. carried out data curation, methodology, and programming of the algorithms. F.P. conceptualized the investigation, developed the methodology, and acquired funding. Both authors prepared Figs. 1–5, Tables 1–3, and the supporting information; both wrote, edited, and reviewed the manuscript text and carried out the literature search.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Aguilera-Puga, M.D.C., Plisson, F. Structure-aware machine learning strategies for antimicrobial peptide discovery. Sci Rep 14, 11995 (2024). https://doi.org/10.1038/s41598-024-62419-y