In silico design and optimization of selective membranolytic anticancer peptides

Membranolytic anticancer peptides represent a potential strategy in the fight against cancer. However, our understanding of the underlying structure-activity relationships and the mechanisms driving their cell selectivity is still limited. We developed a computational approach as a step towards the rational design of potent and selective anticancer peptides. This machine learning model distinguishes between peptides with and without anticancer activity. This classifier was experimentally validated by synthesizing and testing a selection of 12 computationally generated peptides. In total, 83% of these predictions were correct. We then utilized an evolutionary molecular design algorithm to improve the peptide selectivity for cancer cells. This simulated molecular evolution process led to a five-fold selectivity increase with regard to human dermal microvascular endothelial cells and more than ten-fold improvement towards human erythrocytes. The results of the present study advocate for the applicability of machine learning models and evolutionary algorithms to design and optimize novel synthetic anticancer peptides with reduced hemolytic liability and increased cell-type selectivity.


Supplementary figures and tables
Freq. of amino acids. Amino acid pairs. Ordered amino acid composition of 5-20 residues in the N and C-terminal positions.
iACP 6 ACPs in APD database Non-secretory proteins from Uniprot 5 database Composition of amino acid pairs separated by a gap of 0 to 4 residues, and a feature selection procedure. MLACP 7 ACPs in CAMP and DADP databases Random peptides from SwissProt database Composition of amino acids and amino acid pairs, atomic composition and amino acid physicochemical properties.

Feature importance analysis
The surface normal vector to the hyperplane of an SVM with a linear kernel can be analytically described as a linear combination of the input features. The magnitude of the weights attributed to each feature corresponds to their importance for the classification problem, and their positive or negative sign indicates whether the feature is found in the positive or the negative data class, respectively. Positive weights were assigned by the model for global hydrophobicity (H), hydrophobic moment (µH) and positive charge (PPd2). The peptide length was also positively weighted by the model. The two features with the greatest weight values take into account the frequency of amino acids with hydrogen-bond donor and acceptor groups (ADd0, DDd0). Figure S2. Feature importance of the 18 features obtained after covariance elimination and sequential feature selection. The weight magnitudes and signs of the applied to the feature matrix as obtained from the SVM classifier model are plotted. Red: weight attributed to the negative class, blue: weight attributed to the positive class (here: ACP). The abbreviations are explained in Table S2. Characterization of virtual peptide libraries Figure S3. Amino acid distribution and physicochemical properties of the three generated virtual peptide libraries (Amphipathic Arc, Helical and Gradient) in comparison with a peptide library with a random amino acid distribution. Hydrophobicity and hydrophobic moment were calculated with the Eisenberg hydrophobicity scale and the charge density at pH 7 are shown. Descriptor calculations were performed with the modlAMP package 9 .

Helical
Helical Helical Helical Distribution of ACP scores for the virtual peptide libraries Figure S4. Distribution of the data-weighted scores (sACP) for the three generated virtual peptide libraries (Amphipathic Arc, Helical and Gradient) in comparison with the scores for a peptide library generated with a random amino acid distribution (Random). Influence of the strategy parameter on the offspring diversity Figure S5. Influence of sigma in the diversity of the offspring sequences. Three simulated molecular evolution iterations were performed from the same parent peptide "ANTICANCER", using three different values for the strategy parameter (s) (a = 0.05, b = 0.1, c =0.5). For higher sigma values, the position Shannon entropy increases, indicating higher offspring diversity. Comparison of the ten generated offspring sequences (σ = 0.1) and their Euclidean distance to the parent sequence according to the Grantham similarity matrix. The Shannon entropy variation (in bit) of each residue position is shown below. Color coding for the amino acids is the following: hydrophobic (yellow), aromatic (orange), polar (blue), positively charged (dark blue), negatively charged (red). b. Anticancer activity of the peptides on the A549 and MCF7 cancer cell lines, and the HDMEC non-cancer primary cell line (EC50). Hemolytic activity on human erythrocytes (HC50). The error bars represent the standard deviation of N = 2 independent measurements with six technical replicates each for the anticancer activity determination and three technical replicates for hemolysis determination.  Comparison of the ten generated offspring sequences (σ = 0.06) and their Euclidean distance to the parent sequence according to the Grantham similarity matrix. The [0,1] normalized Shannon entropy (bit) of each residue position is shown below. Color coding for the amino acids is the following: hydrophobic (yellow), aromatic (orange), polar (blue), positively charged (dark blue), negatively charged (red). Proline residues were additionally removed from the mutation matrix to avoid secondary structure disruption. b. Anticancer activity of the peptides on the A549 and MCF-7 cancer cell lines, and the HDMEC non-cancer primary cell line (EC50). Hemolytic activity on human erythrocytes (HC50). The error bars represent the standard deviation of N = 2 independent measurements with six technical replicates each for the anticancer activity determination and three technical replicates for hemolysis determination. Figure S9. Circular dichroism recording for the offspring peptides in the three simulated molecular evolution generations. The circular dichroism spectra are shown for measurements in water (blue) and a 50% v/v water:2,2-trifluoroethanol (TFE) solution (red) for the peptides of the first (a), second (b), and third (c) iterations. The bar plots show the intensity of the maxima and minima for each of the peptides. Wavelength (λ)

Circular dichroism measurements
Off2. Wavelength (λ) Off2. Wavelength (λ) Off2.4 Off2.9  60 60 TFE max TFE min Water min Figure S10. Dependence of ACP selectivity on the peptide hydrophobic moment (µH) and charge density. The ratio of anticancer activity (EC50) on the MCF7 and A549 cell lines with respect to hemolysis of human erythrocytes (HC50) (a and c), or HDMEC primary non-cancer cells (b and d) is represented by a color gradient. The peptides with higher ratios (red), and therefore more selective for cancer cells, cluster together in a region of moderate hydrophobic moment and charge density.

Data set construction and feature calculation Data set construction
• Positive dataset. ACP sequences were retrieved from the CancerPPD 10 database (database accessed 02/06/2016) and duplicates were eliminated, obtaining a total of 539 peptides. Only linear, cysteine-free sequences were kept avoiding potential cyclization and dimerization in vitro. This set was further filtered to restrict the peptide length to 7-30 amino acids and to eliminate peptides containing non-canonical amino acids. The final set consisted of 339 ACPs. • Negative dataset. Alpha-helical sequences from non-transmembrane proteins contained in the Protein Data Bank (PDB) 11 were used as the negative dataset for machine learning. The PDB IDs from "all-alpha" proteins representatives at 70% pairwise similarity were downloaded (PDB accessed on 10/12/2015), excluding transmembrane proteins. The individual alpha helices from these proteins were extracted with an R script 12 , using the R packages bio3d 13 and seqinr 14 . The representative helices with a similarity threshold of 0.8 in cd-hit 15 were kept. The set was filtered in the same way as the positive data set to restrict the peptide length from seven to 30 amino acids and to eliminate peptides containing noncanonical amino acids and cysteine residues. From the resulting set of 12,277 peptides, 680 samples were randomly selected to obtain a data set with double the number of negative peptides than positive peptides. 25 in-house peptides that had previously been synthesized and tested inactive against cancer cells in our laboratory were included in the negative data set, resulting in a total of 705 negative, i.e. inactive peptides.

Feature calculation
The peptide descriptor matrix was calculated in Python v2.7 (www.python.org) by the use of the package modlAMP 9 . The following descriptors were included to obtain a 151-dimensional vector of molecular descriptors.
• pepCATS descriptor. The pepCATS descriptor 16 is based on a binary vector representing each canonical amino acid in terms of lipophilic (L), aromatic (R), hydrogen-bond acceptor (A), hydrogen-bond donor (D), positively (P) and negatively (N) ionizable states at physiological pH. These features are convoluted by the use of the Moreau-Broto crosscorrelation 17 over a sliding window of seven amino acids to achieve a length-independent peptide descriptor. For the six pharmacophore features and a window of seven amino acids, the descriptor vector contains 147 elements. • Hydrophobicity. The peptide hydrophobicity was calculated as the sum of the hydrophobicity values of the individual amino acids as given by the Eisenberg hydrophobicity scale 8 . • Hydrophobic moment. The largest hydrophobic moment for any window of 11 amino acids of the peptide was calculated as previously described by Eisenberg 8 . • Charge density. The peptide charge density was calculated as the net charge at pH 7 divided by the molecular weight. • Length. The number of amino acid residues in a peptide.

Feature selection
A series of feature selection steps were sequentially applied to the descriptor matrix to reduce its feature redundancy and dimensionality, as described below.
• Variance threshold. Features with zero variance in the training set were eliminated using the "VarianceThreshold" class from scikit-learn 18 . This feature selection step reduced the dimensionality from 151 to 146 dimensions. • Covariance elimination. Features with a covariance > 0.5 were identified, and for each group one feature was retained that contributed the most to the variance of the first principal component (PC1) of a principal component analysis (PCA) 19 . The covariance elimination method was programmed in Python v2.7, employing the class "decomposition.PCA" for the covariance matrix calculation and to assess the variance contribution of each features to PC1. This selection step reduced the feature matrix dimensionality to 38. • Sequential forward and backward feature elimination. In sequential forward selection, starting from an empty feature matrix, a feature is added at a time, maximizing the model score. In sequential backward selection, starting from the full feature matrix, a feature is eliminated at a time, which minimally reduces the model MCC score. Sequential forward and backward selection were both applied to the feature matrix resulting from the covariance elimination step, and the features selected by both methods were combined to form the 18-dimensional final feature matrix.

Circular dichroism
The circular dichroism (CD) spectra of peptides was measured with a Chirascan spectrometer (Applied Photophysics, UK). Peptide concentrations of 30 μm were measured in water and in a 1:1 v/v TFE:water mixture (TFE: 2,2,2-trifluoroethanol, 99.8% extra pure, AcrosOrganics, USA) using a 1 mm thick glass cuvette (Helma Analytics, Type No. 110-QS). CD was measured in a wavelength range from 180 to 260 nm with a step size of 1 nm. Triplicate measurements were averaged, the solvent baseline absorption was subtracted and the curves were smoothed with a window size of 4 in the Pro-Data Viewer software (Applied Photophysics, UK, version 4.2.15).

Cancer cell cultures
Michigan Cancer Foundation 7 (MCF7) human breast adenocarcinoma and A549 human adenocarcinoma alveolar basal epithelial cancer cells were provided by Prof. Dr. Cornelia Halin-Winter at the Institute of Pharmaceutical Sciences, ETH Zurich 21 . Cells were grown in Dulbecco's Modified Eagle Media (DMEM) containing 4.5 gL −1 D-glucose supplemented with 10% fetal bovine serum and 1% penicillin-streptomycin (Thermo Fisher Scientific, Waltham, MA, USA). The cultures were kept at 37°C in a 5% CO2 atmosphere incubator (Thermo Fisher Scientific, Waltham, MA, USA) and split every two to three days to new culture flasks. Cells no older than 20 passages were used for experiments. For splitting, the cell culture media was aspirated, the cells were washed with 0.01 M phosphate buffered saline (PBS, Gibco, Thermo Fisher Scientific, Waltham, MA, USA) and trypsin-EDTA (trypsin -ethylene diamine tetraacetic acid, Fisher scientific, UK) was added to the cell layer to detach them from the flask surface. After incubation for 2 min at 37°C, cells were resuspended with an equal volume of DMEM media and centrifuged at 310 g for 5 min at 4°C. After removal of the supernatant the cells were resuspended in DMEM cell media and split into new cell culture flasks at a 1/3 to 1/5 ratio.

Cell viability assay
Determination of the anticancer peptides half effective concentration (EC50) was performed by a 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide (MTT) assay, as described elsewhere 21 . Briefly, 10 4 cells were seeded in each well of a 96-well plate and left to adhere and grow for 24 h. For EC50 determination, peptides were dissolved in PBS with 1% dimethyl sulfoxide (DMSO, Fisher Scientific, UK) at an intermediate concentration of 1 mM. The peptides were then diluted in a 2-fold dilution series starting at 100 μM with DMEM culture media. Cells were incubated with the peptide solutions for 24 h. After this time, the cell media was removed and the cells were incubated for 1 h with a 0.5 mg mL −1 MTT solution to allow enough time for formazan crystal formation (Sigma Aldrich, Steinheim, Germany). The supernatant was removed and 100 μL of DMSO were added into each well to resuspend the formazan crystals. After manually shaking the plates to completely dissolve the crystals, the absorbance was measured at 540 nm in a Tecan Infinity M1000 spectrophotometer. The cell viability was determined based on the quantification of the color intensity in each culture well, considering the absorbance obtained by the control media without peptides as a 100% survival. The resultant sigmoidal curves were analyzed with the drc 22 R package using a log-logistic model with 3 fixed parameters to obtain the peptide concentration values that corresponded to 50% cell survival (EC50).

Hemolysis assay
Human erythrocytes were freshly ordered at the Blutspende Zürich. Erythrocytes were washed 3 times with a 10 mm phosphate buffered saline (PBS, Gibco, Thermo Fisher Scientific, Waltham, MA, USA) solution centrifuging at 800 g for 10 min or until the supernatant appeared clear to remove the blood plasma. A 5% v/v erythrocytes : PBS solution was prepared for direct use. For HC50 determination, peptides were dissolved in PBS at an intermediate concentration of 1 mM. A dilution series of the peptide from 200 to 6.25 μM was plated in a round-bottomed 96-well plate in triplicates, with a volume of 50 μL per well. PBS buffer and a solution of 0.1% Triton-X-100 (Fisher Chemical, Loughborough, UK) in PBS were employed as negative and positive controls, respectively. 50 μL of the 5% erythrocyte solution were added in each well and the plates were incubated at 37°C for 1 h. After this time, 100 μL PBS were added to each well and plates were centrifuged at 800 g for 10 min. 100 μL of the supernatant were transferred to a flat-bottomed 96-well plate and the absorbance was measured at 540 nm in a Tecan Infinity M1000 spectrophotometer plate reader. The hemolysis percentage was calculated, considering the absorbance values of the Triton and PBS wells as 100% and 0% hemolysis, respectively.
The resultant sigmoidal curves were analyzed with the drc 22 R package using a log-logistic model with 3 fixed parameters to obtain the HC50 values.

Microfluidics chip experiments
Single cell ACP action was observed on MCF7 cells entrapped in a microfluidic chip following the previously described methodology 23 . For visualization, cells were stained with calcein-AM at a 1 µM concentration. The calcein-AM penetrates into the cytosol of the cells and it is converted there by intracellular esterases to a fluorescent form that remains encapsulated in the cytosol. After calcein staining, cells were incubated with a fluorescently-labelled anti-EpCAM (human CD326) antibody (Mouse IgG2b, κ Isotype Ctrl with Alexa Fluor 647 dye, Biolegend, San Diego, CA, USA) at a ratio of 5 μL per one million cells. Cells were incubated in the dark for 30 min. Afterwards, cells were twice with complete DMEM medium by centrifugation at 500 g for 5 min. Cells were stored until use at 37°C under constant rotation on a MACSmix tube rotator (12 rpm, Miltenyi Biotec, Bergisch Gladbach, Germany) for a maximum of 2 h. Prior to insertion to the chip, the cells were filtered with a 20 μm gravity filter (Partec GmbH, Jettingen-Scheppach, Germany) to remove large cell agglomerates.

NCI-60 screening procedure
The NCI-60 screening procedure was performed by the Developmental Therapeutics Program of the National Cancer Institute. The procedure was performed as previously described 24 in 96well microtiter plates. Each peptide was tested in a 5-concentration assay, starting at 10 -4 M and diluting 10-fold to the lowest 10 -8 M concentration. The peptides were incubated for 48 h and the cell concentration was assayed by using the sulforhodamine B assay 25 and optical density measurements. The growth inhibition (GI50) value 24 , measures the growth inhibitory power of the drug agent. The GI50 value is the concentration of the test drug for which the condition in Eq. S1 is met, where T is the optical density of the test well after a 48-h exposure to the peptide, T0 the optical density before applying the peptide drug and C the control optical density of a well where no peptide was applied.

In vitro dose-response data
Dose-response data for peptides in Table 1 Helical1 AmphiArc1 AmphiArc2 Gradient2 Figure S11. Dose-response curves for the peptides in Table 1  AmphiArc2 Gradient2 Figure S12. Dose-response curves for the peptides in Table 1  Dose-response data for the SME offspring Dose-response data on the NCI-60 cell panel