Introduction

Cancer therapy faces the challenge of resistance to chemotherapeutics and receptor-targeted anticancer agents. Several cell resistance mechanisms, such as drug inactivation or efflux, target protein alteration, DNA damage repair and signaling cascade alteration have been identified1,2. Moreover, the indiscriminate action of most chemotherapeutics towards all rapidly dividing cells causes a variety of severe side effects3,4. Membranolytic anticancer peptides (ACPs) represent a new class of potential cancer therapeutics. Their receptor-independent mechanism of action may hinder the development of cellular resistance3,4,5. Nevertheless, the underlying structure-activity relationship that explains the membranolytic properties of these peptides is not completely understood. Peptide amphipathicity, moderate overall hydrophobicity, and a positive net charge are known requirements for ACP activity6,7,8,9. However, no simple combination of these properties has been found sufficient to fully explain the activity and selectivity of ACPs towards cancer cells10. Producing novel peptides lacking toxicity against nonneoplastic cells also remains challenging11. Various machine learning methods have been successfully applied to guide the rational design of both ACPs12,13,14,15,16,17,18 and antimicrobial peptides (AMPs)19,20, as well as other membrane-active peptides21. The lack of a systematic annotation of the selectivity of ACPs towards cancer cells in the literature and in peptide databases has hindered the development of predictive models that take selectivity into account. There is a need for innovative methods that do not require selectivity data for peptide optimization.

Simulated molecular evolution (SME) is a stochastic optimization algorithm pioneered in the 1990s for computational peptide design22,23,24. SME belongs to the class of evolutionary algorithms, which also includes genetic algorithms, and enables the optimization of peptide properties that are encoded in a theoretical fitness function or in combination with an experimental fitness evaluation when structure-activity relationships cannot be determined a priori. We have recently applied this design concept to generate innovative membrane-targeting peptides25,26. Here, we present a peptide design approach that is based on a novel ACP prediction model and on SME for the optimization of ACP selectivity for cancer cells. The predictive machine learning model led to the discovery of four novel synthetic ACPs with low-micromolar activity (1–20 µM) against A549 lung cancer and MCF7 breast cancer cells. One of these peptides was then subjected to SME. After the first iteration of the optimization process, we obtained a novel ACP that showed micromolar activities against a range of cancer cell types with significantly reduced activity towards human dermal microvascular cells (HDMEC) and human erythrocytes. The results of this study advocate for machine-learning models in combination with computational sequence generators for designing and optimizing functional peptides in silico.

Results and Discussion

ACP classifier model

We developed a machine learning model to classify peptides into ACPs and non-ACPs based on their amino acid sequence representations. The machine-learning classifier was trained on “positive” (ACPs, active) and “negative” (non-ACPs, inactive) peptides. We retrieved alpha-helical ACPs from the CancerPPD database27 as positive examples (N = 339). For the negative class, we retrieved alpha-helices from nontransmembrane proteins in the PDB database28 (N = 680). All amino acid sequences were represented numerically in a computer-readable form by the use of molecular descriptors. For this purpose, we utilized a combination of PEPCATS pharmacophore feature descriptors29 and four global properties, namely, Eisenberg’s hydrophobicity, Eisenberg’s hydrophobic moment30, charge density, and peptide length (number of residues). The PEPCATS descriptor represents the amino acid sequences as binary vectors indicating cross-correlated pharmacophore features of the individual amino acids (hydrophobic, aromatic, hydrogen-bond acceptor, hydrogen-bond donor, positively ionizable, negatively ionizable). The cross-correlation of pharmacophoric feature pairs is determined within a sliding sequence window encompassing seven residues. The resulting 151-dimensional descriptor vector was reduced to an 18-dimensional feature vector by covariance elimination and sequential feature elimination (Fig. S1, Supplementary Information). The dataset was split into a training set (2/3) and an independent test set (1/3) by stratified sampling, preserving the proportion between the positive and negative classes. Two machine learning algorithms were considered for model development: random forests31 and support vector machines (SVM)32. We optimized the SVM model’s hyperparameter by 10-fold cross-validation on the training data and chose a linear kernel for SVM training to enable straightforward feature interpretation. The performance of both classifiers exceeded 0.9 for both the training and the test data for all calculated metrics (Table 1). The SVM model was selected for further analysis due to the robustness of its decision function, which is determined solely by the support vectors and therefore unaltered by the addition of new data points that lie outside the decision margin32. Additionally, an analytical decision function as a linear combination of the model features can be extracted from linear support vector machines, whose weights indicate feature importance for the classification problem32.

Table 1 Performance of support vector machine and random forest models for ACP prediction.

We then compared the performance on the test dataset for our SVM model to online available ACP prediction tools, specifically the AntiCP models 1 and 213, the iACP model33, and the MLACP model18. These ACP prediction models are also based on an SVM classifier but utilize different descriptors and training data (Table S1, Supplementary Information). The prediction performance of the four classifiers and our SVM model was assessed on the independent test dataset (Table 2). In this experiment, the performance of our SVM model on the independent test set was superior to all four publicly available ACP prediction models in terms of all performance metrics, except for precision. The MLACP model showed higher precision but lower Matthews correlation coefficient (MCC), accuracy and recall than the other models. Therefore, the MLACP model is better at avoiding false positives but retrieves a higher number of false negatives compared to the SVM model developed in this study.

Table 2 Comparison of the model performance (Test score) with other online available ACP prediction tools calculated by using the independent test set. The Matthews correlation coefficient (MCC), accuracy, precision and recall were used as metrics (Methods Eqs 14).

Feature importance for ACP activity

We analyzed the feature weights of the SVM classifier to gain an understanding of important discriminatory features for distinguishing between ACPs and non-ACPs (Table 3, Fig. S2, Supplementary Information). Features were ranked by their absolute weight values as a measure of their relative importance for ACP classification. The global hydrophobicity (H), hydrophobic moment (µH) and the frequency of positively charged amino acid pairs separated by one residue (PPd2) were identified as important features of the classifier (weight values w = 1.65, w = 0.5 and w = 0.39, respectively). This finding is in accordance with previous reports on ACPs that highlight the relevance of the hydrophobicity, the hydrophobic moment and a net positive charge for anticancer activity7,34. The peptide length was also identified as a discriminatory feature (w = 0.4), indicating that longer peptides were considered more likely to be active. Two features that take into account the frequency of amino acids with hydrogen-bond donor and acceptor groups (ADd0, DDd0) were identified as bearing the greatest absolute weights (w = −1.94 and w = 1.67, respectively), emphasizing their role in distinguishing ACPs from inactive peptides (Table 3).

Table 3 The 18 features obtained after covariance elimination and sequential feature selection.

De novo design of ACPs

To make use of the SVM model for the in silico design of novel ACPs, we generated three virtual peptide libraries of 100,000 peptides each, based on different design principles (Fig. S3, Supplementary Information):

  1. (1)

    The Helical library contains peptides with the position-dependent amino acid distribution of alpha-helical ACPs11.

  2. (2)

    The Amphipathic Arc library contains amphipathic peptides with differently sized hydrophobic arcs and a high probability of being cationic.

  3. (3)

    The Gradient library contains amphipathic peptides that possess a linear hydrophobic gradient.

We predicted the activity of the peptides from each library with our SVM model (Fig. S4, Supplementary Information). More than 80% of the peptides from the Amphipathic Arc and Gradient libraries and more than 60% of the peptides from the Helical library received an SVM score >0.5, indicating potential actives. In contrast, only 10% of peptides with random sequences were predicted to be active. The design principles, therefore, enriched the libraries with potentially active peptides in contrast with random peptide generation.

The similarity of the peptides in the training data was analyzed to consider the applicability domain of the SVM model35; this domain is the chemical space in which the model predictions may be considered reliable. The SVM model was utilized to estimate the pseudo-probabilities (i.e., the probabilities predicted by the model) of the peptides to belong to the active and inactive classes. These scores were subsequently weighted by the similarity to the training data to obtain similarity-weighted scores that consider the model’s applicability domain (ϕACP, ϕNeg, Eqs 5 and 6).

From each peptide library, we selected the two peptides with the highest ϕACP and ϕNeg scores. None of the peptides were found in the training data or the CancerPPD database. No peptides were retrieved from the CancerPPD database with >95% similarity to the selected ones, as determined by the CD-HIT program36. We finally synthesized the 12 peptides and determined their half-effective concentration (EC50) values against the MCF7 and A549 cancer cell lines. For 10 of the 12 synthesized peptides, the predictions were correct (Table 4). All of the peptides predicted to be inactive did not kill more than 50% of the cancer cells at a concentration of 50 µM. Of the six peptides predicted to be active, two were determined to be false positives (inactive at 50 µM) (Figs S10 and S11, Supplementary Information). Of the four correctly predicted active peptides, three were active in a low-micromolar range against both of the tested cancer cell lines, and the fourth (Gradient2) showed activity solely against MCF7 cells (Table 4).

Table 4 Experimental validation of the SVM prediction model.

The AmphiArc2 peptide, the shortest peptide of the low micromolar active peptides, has a high hydrophobic moment (µH = 0.87) and a 180° arc of hydrophobic residues in an idealized helical structure (Fig. 1a). As determined by circular dichroism (CD) spectroscopy, the AmphiArc2 peptide is unstructured in pure water but adopts an alpha-helical structure in a hydrophobic environment (in 50% v/v water:2,2-trifluoroethanol, TFE) (Fig. 1b). Helix formation in a hydrophobic, membrane-like environment has been shown to be a characteristic of certain alpha-helical AMPs and ACPs37,38. To further investigate its membranolytic action, we observed the activity of AmphiArc2 on single MCF7 cells entrapped in a microfluidic chip. Video recordings showed morphological changes in the cell membrane and leakage of the cytosolic components as early as 30 seconds after initial contact with the peptide in the cells (Fig. 1c, Supplementary Information, Video SV1). After 95 seconds, the dye encapsulated in the cancer cell had leaked out, and the cell membrane showed deformations and blebbing.

Figure 1
figure 1

Characterization of the AmphiArc2 peptide. (a) Helical wheel plot of the peptide sequence with annotated hydrophobic moment direction and magnitude (µH). Polar residues are shown in light blue, positively charged residues in dark blue, hydrophobic residues in yellow, and aromatic residues in orange. (b) Circular dichroism spectra of the peptide in water (blue) and in a 50% v/v TFE:water solution (red). (c) Time sequence of cell death of a single MCF7 cell trapped in a microfluidic chamber after exposure to the AmphiArc2 peptide. The cells were fluorescently labeled with calcein-AM dye in the cytosol, and their membrane was stained with fluorescently labeled EpCAM antibody. The scale bar represents 10 µm. (d) EC50 values of the peptide activity against the A549 and MCF7 cancer cells, noncancer HDMEC primary cells and the hemolytic activity (HC50) value of the peptide activity against human erythrocytes are shown. Error bars show the standard deviation of N = 3 independent experiments.

After characterizing the anticancer activity of the AmphiArc2 peptide, we tested its cell-type selectivity. We determined its EC50 value against the noncancer HDMEC primary cell line and half-effective hemolytic concentration (HC50) against human erythrocytes (Fig. 1d). Both values were found to be in the same low-micromolar range as the EC50 against cancer cell lines, indicating toxicity of this peptide against noncancer cells.

Selectivity optimization of a de novo designed ACP

We applied the SME algorithm to improve the selectivity of the AmphiArc2 peptide towards noncancer cells. SME contained a variation (mutation) and a selection operator (Fig. 2a). By variation, a series of offspring was generated from a parent sequence. The fittest offspring of a generation was selected and used as a parent in the next SME iteration. In this study, parents were selected among the offspring that maintained anticancer activity but showed enhanced selectivity for cancer cells (selection operator). The mutations in the sequence variation step were performed according to a normalized Gaussian probability distribution of pairwise amino acid similarity (dij) (Fig. 2b). As a similarity measure, we utilized the Grantham matrix, which takes into account the atom composition, the polarity and the molecular volume of the residues39. The probability of substitution of residue i to residue j decreases with decreasing pairwise amino acid similarity. The degree of similarity of the offspring peptides to the parent sequence (offspring diversity) was controlled via the sigma (σ) parameter (Fig. 2b). A higher sigma value allowed the generation of sequences further away from the parent peptide (Fig. S5, Supplementary Information).

Figure 2
figure 2

Peptide selectivity optimization by simulated molecular evolution (SME). (a) Principle of the iterative variation or mutation and selection steps in SME, starting with the model parent peptide “ANTICANCER”. (b) Probability of the mutation of amino acid residue i in the parent sequence to residue j in the offspring as a function of the amino acid pairwise similarity (dij). The sigma (σ) parameter controls the sequence diversity among the offspring. (c) Comparison of the 10 generated offspring sequences and their Euclidean distance to the parent sequence according to the Grantham similarity matrix. The [0, 1] normalized Shannon entropy (in bit in the graph) of each residue position is shown below. Residue coloring is as follows: light blue: polar, dark blue: positively ionizable, red: negatively charged, yellow: hydrophobic, orange: aromatic, green: proline. (d) Peptide activity towards the A549 and MCF7 cancer cell lines (EC50), the noncancer HDMEC primary cells (EC50), and the human erythrocytes (HC50). The error bars give the standard deviation of N = 2 independent measurements with six technical replicates each.

We performed a total of three SME iterations, starting from the AmphiArc2 peptide. In the first iteration, we generated 10 offspring peptides with σ = 0.1 (Fig. 2c). The mutations introduced by this sigma value were conservative amino acid changes that maintained the overall amphipathicity of the peptide. We synthesized and tested all ten offspring peptides of the three SME generations against the MCF7 and A549 cancer cell lines to determine their anticancer activity. For selectivity assessment, we tested their activity against the noncancer HDMEC primary cell line and measured their hemolytic activity on human erythrocytes (Fig. 2d).

The results obtained demonstrate that small conservative amino acid replacements affected the activity and selectivity of these ACPs while conserving their overall amphipathic helical structure in a lipophilic environment. Offspring n.2 (Off2) maintained the low-micromolar activity of the AmphiArc2 peptide against the A549 and MCF7 cancer cells but showed a 12-fold reduction of hemolytic activity against human erythrocytes and a ten-fold reduction of activity against HDMEC cells (Fig. 2d). Therefore, we selected Off2 as the parent for the next SME iteration (Fig. 3), in which ten new peptides were generated (Off2.1 to Off2.10).

Figure 3
figure 3

Characterization of the parent peptides and the most selective offspring peptides from three subsequent SME generations. (a) Amino acid sequences; red residues denote sequence changes from the respective parent sequence. (b) Circular dichroism spectra in water (blue) and a mixture of 50% v/v TFE:water. (c) Helical wheel plots with hydrophobic moment direction and magnitude (µH). Residue coloring: polar residues in light blue, positively ionizable residues in dark blue, hydrophobic residues in yellow, and aromatic residues in orange. (d) Peptide activity towards the A549 and MCF7 cancer cell lines (EC50), the noncancer HDMEC primary cells (EC50), and the human erythrocytes (HC50). The error bars represent the standard deviation of three independent measurements. **p-value < 0.01, ***p-value < 0.001 of the mean differences (Welch t-test).

The second generation of peptide variation did not achieve meaningful selectivity improvements with respect to HDMEC cells (Fig. S7, Supplementary Information). Five of the offspring peptides (Off2.1, Off2.3, Off2.4, Off2.9, Off2.10) were inactive. This loss of activity correlated with the introduction of a proline residue in the sequence (Fig. S7, Supplementary Information). Prolines affect alpha-helical conformation by introducing helix kinks and breaks40. We corroborated this secondary structure disruption with circular dichroism analysis of Off2.1, Off2.3, Off2.4 and Off2.9 (Fig. S9, Supplementary Information).

In the third SME generation (Off2.2.1 – Off2.2.10), we actively omitted proline residues and reduced the sigma value from 0.1 to 0.06 to explore close analogs of Off2 and Off2.2 (Fig. S8, Supplementary Information). Off2.2.10 showed decreased activity towards the noncancer HDMEC primary cells (Fig. 3d). This increase in selectivity was accompanied by a decreased activity against both the A549 and MCF7 cell lines.

The most active, but nonselective, AmphiArc2 parent peptide and the most cancer-cell selective Off2.2.10 peptide possess several differences and commonalities in their physicochemical properties. Even though both peptides display a hydrophobic arc of 180°, the hydrophobic moment of Off2.2.10 (µH = 0.64) is lower than that of AmphiArc2 (µH = 0.87) (Fig. 3c). The parent peptide bears eight positive charges, while Off2.2.10 contains seven positively ionizable residues caused by the N-terminal K1Q mutation. This moderate reduction of both the hydrophobic moment and the net positive charge improved the peptide selectivity for cancer cells and reduced the risk of killing non-transformed cells. To further explore these sequence features, we analyzed the ratios of the EC50 in the noncancer cells and in the cancer cell lines of all tested peptides. The more selective peptides (higher EC50 ratio) are characterized by moderate hydrophobic moments and charge densities (Supplementary Information, Fig. S10), suggesting a guideline for optimizing the cancer-cell selectivity of ACPs. This observation is in accordance with reports stating that decreasing the hydrophobic moment of helical ACPs reduces both their hemolytic potential and anticancer activity7,8,9.

NCI-60 cancer cell panel testing

The ACP candidates AmphiArc2 (parent), Off2 and Off2.2.10 were tested on the NCI-60 cancer cell panel41. The three tested peptides inhibited the growth of all the cancer cell lines in the NCI-60 panel at a low micromolar concentration (Table 5, Supplementary Information Table S3). This result corroborated the wide-spectrum effect of the anticancer peptides across a range of cancer types. Both the activity of Off2 and Off2.2.10 peptides on the cell lines tested were significantly lower than the anticancer activity of the AmphiArc2 peptide (p-value = 4.9 × 10−13 and 1.7 × 10−12, respectively, Welch two sample t-test), suggesting that the initial increased cancer cell selectivity comes at a cost of an activity loss. No significant anticancer activity difference was found between Off2 and Off2.2.10 peptides (p-value = 0.66, Welch two sample t-test), indicating that the additionally improved anticancer selectivity does not affect the average anticancer activities of these two peptides.

Table 5 Cellular growth inhibition of 60 cell lines in the NCI-60 cancer cell test for the AmphiArc2 (Parent), Off2 and Off2.2.10 peptides.

Conclusions

In this study, the combination of a machine learning model and the SME algorithm resulted in ACPs with low-micromolar potency against a wide variety of cancer cells (NCI-60 panel) and selectivity with respect to non-transformed cells (HDMEC) and human erythrocytes. The machine-learning classifier alone was able to identify active peptides but was insufficient to identify cancer cell selective peptides. Virtual screening of computationally designed peptide libraries with the implemented machine-learning classifier led to the discovery of four novel ACPs as the starting point for selectivity optimization by SME. In the first design-synthesize-test cycle, peptide hemolysis was reduced ten-fold, and after three cycles, peptide activity towards noncancer cells was reduced more than 20-fold while retaining anticancer activity compared to the parent peptide (AmphiArc2). The results of this study advocate for the SME method for experiment-guided peptide design and for exploration of the ACP structure-activity landscape. SME is applicable to all kinds of experimental readouts and provides an alternative to more conventional peptide optimization techniques, e.g., alanine scanning. At the same time, the results suggest that additionally increased cancer cell selectivity of membranolytic ACPs might come at the price of reduced peptide potency. This working hypothesis provides a basis for future study.

Methods

Machine learning model

Both machine learning models were constructed in Python v2.7 using the Scikit-Learn v0.18 library. For model training, the peptide dataset was split into 2/3 training and 1/3 testing subsets. Random forest classifier: the number of trees (“n_estimators”) was set to 500, and the number of features to be considered by each tree (“max_features”) was set to the squared root of all features (“sqrt”). SVM classifier: a linear kernel was employed and hyperparameter C was optimized by a ten-fold cross-validation in which the model is trained on 90% of the training data and validated on the remaining 10% in ten repetitions of training. The obtained mean of the 10 repetitions (cross-validation MCC score) was used to evaluate the performance of the models. The test scores were obtained with the independent test set.

Scoring metrics

The Matthews correlation coefficient (MCC, Eq. 1), accuracy (Eq. 2), precision (Eq. 3) and recall (Eq. 4) were calculated. TP, FP, TN and FN correspond to the number of true positives, false positives, true negatives and false negatives predicted by the model, respectively.

$$MCC=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
(1)
$$Accuracy=\frac{TN+TP}{TN+TP+FN+FP}$$
(2)
$$Precision=\frac{TP}{TP+FP}$$
(3)
$$Recall=\frac{TP}{TP+FN}$$
(4)

Data weighted scoring functions

To appropriately consider the applicability domain of the SVM classifier, the final scoring function for ACPs (ϕACP, Eq. 5) and inactive (negative) peptides (ϕNeg, Eq. 6) considers both the pseudo-probability of the peptide to be an ACP (PACP) as predicted by the SVM model and the similarity of the predicted peptides to the training data (Sim. score). k-means clustering with k = 3 was performed with Python v2.7 and the Scikit-Learn v0.18 library package. The similarity score is calculated as the inverse of the Euclidean distance in descriptor space of the peptides to the three centroids.

$${\varphi }_{ACP}=\frac{{P}_{ACP}+Sim.\,score}{2}$$
(5)
$${\varphi }_{Neg}=\frac{(1-{P}_{ACP})+Sim.\,score}{2}$$
(6)

Virtual peptide libraries

Three virtual peptide libraries were generated according to three different design principles. For each library, the peptide length was restricted to a range of 11 to 30 amino acids, as peptides able to fold in an alpha-helix are typically inside this range42. Duplicate sequences were eliminated, and the similarity of the sequences was restricted with the CD-HIT36 program to a threshold of 0.8 similarity. A total of 106 peptides were selected from each of the libraries.

  • Helical library. The Helical library was generated with the position-dependent amino acid distributions of 62 anuran and hymenopteran alpha-helical ACPs11 in amino acid positions 1–18 (exactly 5 helical turns). For longer peptides, the pattern was repeated. The method to generate this library is included in the modlAMP43 Python package (modlamp.sequences.HelicesACP).

  • Amphipathic Arc library. The design principle of the Amphipathic Arc library was amphipathic peptide sequences, which would potentially be alpha-helical with a preference for positively charged amino acids in the polar phase of the helix and varying hydrophobic arcs in the range 100–260°. The method to generate this library was included in the python package modlAMP as the class AmphipathicArc (modlamp.sequences.AmphipathicArc).

  • Gradient library. The Gradient library was designed using the same procedure as the Amphipathic Arc library but with an additional hydrophobic gradient in the peptide structure from the N- to the C-terminus. For this, the amino acids in the C-terminal third of the peptide sequence were substituted with hydrophobic amino acids. In the modlAMP package, this was achieved by the method make_H_gradient in the modlamp.sequences.Amphipathic Arc class.

Simulated molecular evolution

The simulated molecular evolution (SME) algorithm is based on the (1, λ) evolution strategy44 in which λ mutated sequences (offspring) are generated from a parent sequence22,23,25. The offspring was scored according to a fitness function, which was defined as the experimentally determined peptide anticancer activity and selectivity with respect to non-transformed cells. The best offspring were selected as a parent for the following optimization iteration. The amino acid mutations were generated according to an amino acid similarity matrix that has been row-normalized (dij) to allow for a pseudo-probability calculation of the amino acid transitions (Eq. 7). Here, the Grantham amino-acid similarity matrix was utilized39. The amino acids cysteine and methionine were excluded from the mutation matrix to avoid potential peptide cyclization and facilitate peptide synthesis.

$$P\,(i\to j)=exp(-\frac{{d}_{ij}^{2}}{2{\sigma }^{2}})/\sum _{j}\,exp(-\frac{{d}_{ij}^{2}}{2{\sigma }^{2}}).$$
(7)

where σ is a strategy parameter that controls the distance of the offspring sequences to the parent sequence and, thus, the sequence diversity among the offspring. The σ strategy parameter was set to 0.1 for the two initial SME iterations. Sequence diversity was characterized by the Shannon entropy45 (H) of the residue distribution among the offspring (Eq. 8), where pi corresponds to the frequency of amino acid i in a certain sequence position. The Shannon entropy values were normalized to [0, 1]. The simulated molecular evolution strategy and Shannon entropy calculation were programmed with Python v2.7.

$$H=\sum _{i=1}^{20}\,{p}_{i}lo{g}_{2}{p}_{i}.$$
(8)