HAPPENN is a novel tool for hemolytic activity prediction for therapeutic peptides which employs neural networks

The growing prevalence of resistance to antibiotics motivates the search for new antibacterial agents. Antimicrobial peptides are a diverse class of well-studied membrane-active peptides which function as part of the innate host defence system, and form a promising avenue in antibiotic drug research. Some antimicrobial peptides exhibit toxicity against eukaryotic membranes, typically characterised by hemolytic activity assays, but currently, the understanding of what differentiates hemolytic and non-hemolytic peptides is limited. This study leverages advances in machine learning research to produce a novel artificial neural network classifier for the prediction of hemolytic activity from a peptide’s primary sequence. The classifier achieves best-in-class performance, with cross-validated accuracy of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$85.7\%$$\end{document}85.7% and Matthews correlation coefficient of 0.71. This innovative classifier is available as a web server at https://research.timmons.eu/happenn, allowing the research community to utilise it for in silico screening of peptide drug candidates for high therapeutic efficacies.

www.nature.com/scientificreports/ Model validation. It is critical that any classifier model created by machine learning is thoroughly validated. For that reason, tenfold cross-validations and validation by an external test set were employed to evaluate the performance of all models presented herein. The HAPPENN dataset was split into twelve parts, ten of which were used for cross-validation, whereby one of the subsets was selected for use in validation while the other nine were employed for training. The resultant models were ensembled and evaluated with an independent test set, which consists of the remaining two of the twelve parts. To avoid possible bias arising from the choice of a randomly selected test set, the procedure is repeated six times in total, allowing for a rigorous assessment of the model's overall performance.
Validation comparison with HemoPI and HemoPred. The different dataset construction and validation procedure employed by HAPPENN compared to other available tools, namely HemoPI and HemoPred, prevents a direct comparison of their respective validation statistics. To facilitate a more direct comparison with the HemoPI and HemoPred classifiers, a model was trained and tested under equivalent conditions. An altered dataset, HAPPENN-HemoPI3-equiv was created, wherein all the peptide sequences present in the HAPPENN dataset that form part of the HemoPI-3 test dataset were set aside as the test dataset, and the remaining non-test set sequences were used for training and validation as part of a fivefold cross-validation.
Amino acid composition analysis. An analysis of the amino acid composition of the hemolytic and nonhemolytic peptides was carried out, completed by an analysis of peptides randomly extracted from proteins in Swiss-Prot 39 . The analysis comprises the peptides' full sequences, the 10 N-terminal residues, and the C-terminal 10 residues.
Residue position preference analysis. Enrichment depletion logos (EDLogo) 40 were created to identify preferences for certain amino acid residues at certain positions in the hemolytic peptides' sequences. The logo plots were constructed using the experimentally validated non-hemolytic peptide sequences as the baseline.

Motif analysis.
Motif analysis was carried out on the HAPPENN dataset to identify motifs occurring exclusively in hemolytic and non-hemolytic peptides. Motifs with a length between 2-5 amino acids which occurred in at least ten peptides were considered.
features extraction. A large selection of features was extracted from the peptides' primary sequences, which can be divided into two subcategories, amino acid composition based features and physicochemical descriptors.
Physicochemical descriptors. The modlAMP 41 , ChemoPy 42 and RDKit packages were used for the calculation of global physicochemical descriptors, as well as amino acid scale-based descriptors. Global physicochemical descriptors include sequence length, molecular formula, molecular weight, sequence charge, charge density, isoelectric point, instability index, aromaticity index 43 , aliphatic index 44 , Boman index 45 and the hydrophobic ratio.
Additionally, physicochemical descriptors were calculated from the amino acid properties in the AAindex 66 . Secondary structure related descriptors were calculated based on the turn 67 , helical 47,68 , coil 69 and amphiphilic 70 propensities. The sequence hydrophobicity was quantified using the amino acids' hydropathies 49,71 , Table 1. Criteria for designating a peptide as hemolytic or non-hemolytic. Peptides which satisfied neither or both of these criteria were excluded.  [78][79][80] and retention coefficients in HPLC 81,82 . Similarly, the sequence hydrophilicity was characterised using properties based on the amino acids' hydrophilicity 50 , charges 83 , polarities 84,85 , free energies of solution in water 77,86 , numbers of hydrogen bond donors 87 and fractions of site occupied by water 88 . Descriptors relating to the amino acids' sterics were calculated based on their residue volume 89-92 , residue flexibility 54 , steric hindrance 93 , bulkiness 52 , 8Å and 14Å contact numbers 94,95 , average  reduced side-chain distance 96 and accessible molar fractions ratio 97 properties. As membrane interaction plays  an important role in peptides' hemolysis mechanism of action, features based on membrane-propensities 98 ,  membrane-buried preference parameters 47,99 and the side-chain 100,101 and electron-ion interactions 102,103 were calculated. Descriptors were also calculated based on the number of full non-bonding orbitals 87 , SWIGM index 51 , and the IFH 104 and z1 105 scales.
Composition descriptors. Amino acid, dipeptide, and tripeptide compositions were calculated for the conventional 20-amino acid alphabet, as well as the reduced alphabets of Veltri et al. 106 , Thomas and Dill 107 , and the conjoint alphabet 108 . To account for the three-dimensional structure of the peptides, g-gap dipeptide and tripeptide compositions were calculated 109 . Finally, pseudo amino acid composition 110 , conjoint triad, composition, transition and distribution 111 descriptors were also calculated.
Machine learning approaches. Support vector machine (SVM) 112 , random forest (RF) 113 , principal component analysis (PCA) 114 , t-distributed Stochastic Neighbour Embedding (t-SNE) 115 and dense fully connected neural networks 116 are employed in this study. Both a linear and non-linear (RBF) kernel were employed with SVMs. SVM and RF hyperparameters were tuned using a grid search in conjunction with the previously described cross-validation. feature selection. Only features which were non-zero for at least 100 samples were retained. Furthermore, features were selected for retention by SVM and random forest.
Features importances were calculated individually for each of the splits during tenfold cross-validation using both support vector machines and random forests. Features which had SVM absolute weights near-zero ( < 0.05 ) were excluded, as practised by Brank et al. 117 . Features which an ensemble of random forests decided were important (importance > 0.0005 ) were included. neural network architecture. All input features are scaled to have minimum and maximum values of 0 and 1, respectively.
Both a randomized grid search and genetic algorithm were employed to identify the optimal neural network architecture and hyperparameters. The optimized neural network applies a Gaussian noise layer with a standard deviation of 0.03 to the input layers, which mitigates overfitting. The first hidden layer has 1024 nodes and the second hidden layer has 64 nodes. Batch normalization 118 is applied before the ReLU activation function. Each hidden layer is followed by a Dropout layer, with a rate of 0.93, which aids in the prevention of overfitting 119 .
The final output layer consisted of a single node with a sigmoid activation function. A summary of the overall architecture described is shown in Fig. 2. implementation. The neural network was implemented with Keras, a popular deep learning framework, using a Tensorflow 120 back-end. The binary cross-entropy loss function was employed, which is defined as: whereby y i is the true value of the i th sample, and ŷ i is the predicted value of the i th sample.
This loss function is commonly used in binary classification problems. As the predicted labels of all training data approach their respective true values, the value of the function approaches zero.
The optimizer employed is Adaptive Momentum (Adam), which updates the neural network weights according to the following formula 121 : whereby the m t and v t are the bias-corrected estimates of the mean and the variance of the gradients, respectively.
The neural network was trained for 600 epochs, without stopping criteria. The model with the highest validation accuracy encountered during training was saved for each of the cross-validation splits.
During training, the loss function was weighted to adjust for the slightly unequal number of positive and negative samples. performance evaluation. The robustness of the predictor is evaluated by a number of standard parameters, namely accuracy (Acc), sensitivity (Sn), specificity (Sp), the Matthews correlation coefficient (MCC), and by the receiver operating characteristic (ROC) curve.
The first four of these are defined by the following equations: www.nature.com/scientificreports/ • TP = True positives: the number of correctly predicted positive (hemolytic) peptides.
• FP = False positives: the number of non-hemolytic peptides incorrectly predicted as being hemolytic.
• TN = True negatives: the number of correctly predicted negative (non-hemolytic) peptides.
• FN = False negatives: the number of hemolytic peptides incorrectly predicted as being non-hemolytic.

Results
The HAPPENN dataset was constructed from peptide sequences whose hemolytic activity, or lack thereof, has previously been evaluated. Peptides were separated into a positive (hemolytic) class and a negative (nonhemolytic) class based on criteria outlined in Table 1. The peptide sequences were subjected to an amino acid composition analysis, residue position preference analysis and motif analysis. Peptides were then represented by feature vectors composed of physicochemical and compositional descriptors of the peptides. The feature vectors are visualised using principal component analysis (PCA) and t-stochastic neighbour embedding (t-SNE) plots, both of which show an incomplete separation of the positive and negative classes. Finally, the feature vectors are used to train support vector machine, random forest and neural network hemolytic activity classifiers, the prediction results of which are evaluated.
Amino acid composition analysis. To determine whether a preference exists for certain residues in hemolytic peptides compared to non-hemolytic peptides, an amino acid residue composition analysis was performed, the results of which are shown in Fig. 3. It is apparent that hemolytic peptides are most enriched in the hydrophobic leucine and isoleucine residues, and to a lesser extent phenylalanine, tryptophan and glycine. Meanwhile, non-hemolytic peptides are enriched in the positively charged lysine and arginine residues. Interestingly, both hemolytic and non-hemolytic peptides are depleted in the negatively charged aspartic and glutamic acid residues compared to the sequences randomly extracted from Swiss-Prot, with the hemolytic peptides exhibiting greater depletion.
Residue position preference analysis. To ascertain whether or not there exists a preference for certain residues at certain positions in the peptide sequence, an enrichment-depletion logo plot was created ( Fig. 4) for the hemolytic peptide class, with the non-hemolytic peptide class serving as the baseline for the plot. Enriched residues, therefore, are those which are more common at that position in hemolytic peptides, relative to the nonhemolytic class, and depleted residues are those which are less common. The first inspection of the EDlogo plot reveals information that is consistent with the amino acid composition analysis: hemolytic peptides are enriched in hydrophobic residues, and predominantly depleted in negatively charged residues. On further inspection, position-specific enrichments become apparent. Hemolytic peptides are enriched in the negatively charged aspartic acid residue at position 4, and at the last, third-and fourth-and eleventh-last position, despite being depleted in this residue for the remainder of the sequence. Hemolytic peptides are depleted in the positively charged arginine residue throughout the sequence, but enriched in lysine at positions 7, 8, 11, 12 and 15. A preference exists at the positions 2 and 3 for tryptophan, position 3 for proline, and positions 3 and 4 for serine. A notable preference exists at position 14 for proline, which is indeed a common feature in the brevinin-1 family of peptides, which do possess hemolytic activity 122 . Hemolytic peptides are also enriched in glutamine exclusively at their C-terminus, while being depleted in glutamine throughout the remainder of the sequence.

Data visualisation. Principal component analysis (PCA). Principal component analysis (PCA) was un-
dertaken for the full computed dataset, the dataset with only the physicochemical features and the dataset with only composition descriptors (Fig. 5). Inspection of the results of all three indicates that while a separation exists between the hemolytic and non-hemolytic classes, the separation is not clear-cut and a significant overlap exists between the classes. The overlap between classes is most significant for the set consisting of only the physicochemical descriptors, while a greater separation between classes exists in the composition descriptor plot.

T-distributed stochastic neighbour embedding (t-SNE).
Similarly to the aforementioned PCA analysis, a t-distributed Stochastic Neighbour Embedding (t-SNE) analysis was undertaken for the full computed dataset, the dataset with only the physiochemical features and the dataset with only composition descriptors (Fig. 6). As is the case with the PCA results, there exists an incomplete separation between the hemolytic and non-hemolytic classes in all three datasets. In many cases, positive and negatives peptides are near-coincident in the plots, and appear, therefore, to be physicochemically and compositionally similar.
Hemolytic activity prediction. This novel study employed a number of popular machine learning classifiers for predicting peptides' hemolytic activity on the basis of features calculated from their primary sequence. The predictive power was evaluated using tenfold cross-validation, and the final ensemble of ten neural networks was evaluated by means of external validation. Accuracy, sensitivity, specificity, Matthews correlation coeffi- www.nature.com/scientificreports/ cient statistical parameters are reported with their confidence intervals. A receiver operating characteristic curve (ROC) with a calculated area under the curve (AUC) is also reported.
To the authors' knowledge, three machine-learning based classifiers for the prediction of hemolytic activity peptides are described in the literature, namely HemoPI 123 , HemoPred 124 , and HemoPImod 125 . The former two predict the hemolytic activity of natural amino-acid-based peptides, while the latter specializes in predicting the hemolytic potency of chemically modified peptides. The results of the present study are compared to those of the former two classifiers, HemoPI and HemoPred. As HemoPImod specifically addresses chemically modified peptides, and therefore differs in its aims to HAPPENN, it is excluded from the comparisons.  Table 2.
The SVM hyperparameters were optimised using a grid search. The linear kernel SVM achieved its highest performance with the regularization parameter C = 0.1 . The non-linear RBF kernel SVM achieved its highest performance with the regularization parameter C = 10 and the kernel coefficient γ = 2 × 10 −4 . Both the RBF and linear kernel SVM approaches achieve the worst level of performance of the three methods studied, with a validation accuracies of 78% and 81%, and MCCs of 0.54 and 0.61, respectively.
The RF hyperparameters were also optimised using a grid search. The highest performance, with an accuracy and MCC of 83% and 0.65 was achieved with the number of estimators set to be 1024, with unrestricted tree depth. The optimal value for max_features was found to be 70.
The neural network approach, meanwhile, achieves the highest accuracy and MCC score, with scores of 86% and 0.71, respectively, marking it as the most capable predictor. Furthermore, the neural network approach www.nature.com/scientificreports/ achieves the best balance between sensitivity and specificity. As it was the most capable, the neural network approach was selected as the classifier of choice for the prediction of hemolytic activity. The predictive power of HAPPENN was further evaluated by means of the receiver operating characteristic (ROC) curve, and its associated area under the curve (AUC) (Fig. 7), which is equivalent to the probability that the predictor will rank a randomly selected positive instance higher than a negative one. We note that the performance is nearly excellent on both the validation and test sets, with both yielding an AUC of 0.90.
Comparison to HemoPI and HemoPred. HemoPI and HemoPred are in silico peptide hemolytic activity prediction models previously reported in the literature, to which the HAPPENN model is compared.
The former approach employs a support vector machine (SVM) trained on a combination of single residue-, dipeptide-and property-based features, while the latter employs a random forest (RF) trained on a combination of amino acid and dipeptide composition features. Both models achieve similar cross-validated accuracies and MCC scores not exceeding 78% and 0.56 when trained on the HemoPI-2 and HemoPI-3 datasets.   In order to facilitate a more direct comparison, an altered dataset was created, termed HAPPENN-HemoPI3equiv, wherein the test dataset consists exclusively of HAPPENN dataset peptides which also form part of the HemoPI-3 test set. The remaining non-test set peptides were used for training and validation. Using this altered dataset, a neural network sharing the architecture and hyperparameters of the main HAPPENN neural network was trained under fivefold cross-validation, achieving a test set accuracy of 84.96% and an MCC of 0.70. While these results again exceed those of the available classifiers, it should be noted that the test set in this case is not truly independent as its constituent peptides had been previously used in optimising the hyperparameters of the main HAPPENN neural network.  www.nature.com/scientificreports/ Relationship between prediction and hemolytic activity values. Peptides were classified as hemolytic or nonhemolytic based on criteria given in Table 1, which relates hemolytic activity (H) to concentration (c). A peptide's concentration can be expressed as a multiple (x) of the threshold concentration c. For instance, a peptide which exhibits 65% hemolytic activity at 195 μM can be said to have x = 0.5 , as 0.5 × 390 μM = 195 μM. As x is less than 1, the concentration is lower than the threshold concentration (390 μM for 65% hemolytic activity), and the peptide is considered hemolytic. The neural network's output is obtained from the sigmoid activation function of its final layer. As the sigmoid function produces values ranging between 0 and 1, these output values can be interpreted as the probability of a peptide being non-hemolytic (0) and hemolytic (1).
The relationship between the neural network's output values and x, the multiple of the threshold concentration, are shown in Fig. 8. The upper left quadrant shows the true positives, the lower right quadrant shows the true negatives, and the upper right and lower left quadrant show the false negatives and false positives, respectively.
It can be seen from Fig. 8, that not many peptides are found at the y = 0.5 hemolytic-non-hemolytic prediction boundary. While there are many peptides at the x = 1.0 activity boundary, most peptides are seen to be correctly predicted.
Descriptor-set specific results. Several approaches were trialled for constructing the input feature space (Table 4), namely dipeptide and tripeptide composition, the corresponding g-gap compositions, N-and C-terminus composition and physicochemical features.
Dipeptide and tripeptide composition. Dipeptide composition is defined as the proportion of a given dipeptide in the sequence, while similarly the tripeptide composition is defined as the proportion of a given tripeptide in the sequence. These composition features capture both the chemical nature of the peptide composition while retaining information about the local sequence order. The models achieved respectable accuracies of 82.56% and 82.62%, respectively, and MCC values of 0.65 and 0.64, respectively. g-gap composition. g-gap dipeptide composition is described as the proportion of a given pair of amino acids separated by 1, 2 or 3 residues, which corresponds to residues which are adjacent in three-dimensional space, especially in regular secondary structures where such non-adjoining residues may be connected by hydrogen bonds. Interestingly, models trained on these features perform better than those trained on the more conventional dipeptide and tripeptide compositions. This can be attributed to these features better capturing the chemical environment that the peptide exposes to the membrane upon contact. For instance, the g-gap feature can represent the spatial adjacency separated by one turn of the α-helix, which has a turn of 3.6 residues.
Termini. A feature set was created to capture the information on the residues specific location in the sequence. This feature set consists of a binary profile for the first and last 15 residues of each peptide. Each peptide is there- Of the single feature-set approaches trialled, none outperform the model trained on the full feature set. The compositional descriptor-based models are seen to benefit from sequence similarity to an extent, and exhibit somewhat reduced performance on the redundancy-reduced dataset. The physicochemical descriptor-based model, conversely, maintains a comparable performance even on the redundancy-reduced dataset. feature importance analysis. Random forest feature importance. Random forests have the advantage of being easily interpretable and provide an easy method of ranking the importance of input features. The most useful features as determined by cross-validated random forests are the Eisenberg direction of the hydrophobic moment (EISD860103) 77 , the Eisenberg normalized hydrophobicity scale (EISD840101) 73 , hydropathy (NADH010102, NADH010103, NADH010104) 71 , the hydrophobic parameter pi (FAUJ830101) 72 , the Boman index 45 , apparent partition energies (GUYH850105) 80 , membrane-propensity (PUNT030102) 98 , transmembrane propensity 63 , side-chain hydrophobicity values (BLAS910101) 76 and Hopp-Woods hydrophobicity (HOPT810101) 50 .
Effectively all of these features directly or indirectly quantify hydrophobicity, which points to it being important for hemolytic activity.
Analysis of neural network weights using Garson's method. In order to understand the basis for the neural network's predictive power, we analysed the importance assigned to various input features by Garson's method 126 , iteratively reducing the feature input space by approximately halving the number of input features, until 300 composition features were retained in each split. 24 features were identified as having large weights in each of the 10 splits.
The occurrences of the FS, LH, KIK, VAK, VLK dipeptides and tripeptides in the peptide sequence were found to be important. Additionally, the LLL Veltri reduced alphabet tripeptide and the RGV and VCR Thomas and Dill (length 3) reduced alphabet tripeptides were found to contribute strongly to the final classification. www.nature.com/scientificreports/ Additionally, the occurrence of g-gap i,i+3 residue pairs FG, FL, LK, WV, the occurrence of g-gap i, i+4 residue pair FK and the occurrence of g-gap i, i+2 residue pairs AR, FL, FS, LF were found to be meaningful.
Furthermore, the occurrence of the g-gap i,i+3 Thomas and Dill (10) reduced alphabet residue pairs PS and WW and g-gap i,i+3 Veltri reduced alphabet residue pair QQ were also important.
Finally, the network weights associated with the EstateVSA3 and EstateVSA4 (MOE-type descriptors using Estate indices and surface area contributions), Geary autocorrelation-lag8 weight by atomic polarizabilities, and the acetylation of the N-terminus inputs were also large.
Interestingly, the predictive power of the reduced feature space neural network is nearly as strong as the main network.

Discussion
A decreasing number of drug approvals and a rising research and development cost base has contributed to a resurgence of interest in peptide therapeutics. An ideal peptide drug should possess a high therapeutic index, specifically high activity against the biological target and limited toxicity. The therapeutic potential of peptides, however, is highly dependent on it possessing little to no hemolytic activity. Minimizing hemolytic activity is important for improving the therapeutic index of a peptide.
Many research groups have studied the structures of natural peptides as well as engineered peptide analogues in order to characterise how their structure determines their biological activities [127][128][129] . A comprehensive understanding of the relationship between structure and function, however, remains elusive. A computational method that can provide information about a peptide's biological activity from its primary structure prior to chemical synthesis, however, would allow for rapid and efficient exploration of the chemical space and present a significant cost and-time saving.
To accelerate the lead molecule design and optimization pipeline, this study aimed to create an in silico method for classifying therapeutic peptides as hemolytic or non-hemolytic based on their primary sequence. The prediction task is challenging, however, as it requires distinguishing between desirable activity at the peptide's target, the prokaryotic plasma membrane in the case of most antimicrobial peptides, and activity at the membrane of eukaryotic erythrocytes. The task is further complicated by the varying extent to which many peptides display hemolytic activity, which makes arbitrarily classifying them as hemolytic or non-hemolytic challenging. Indeed, there is limited consensus on the most appropriate metric to quantify hemolysis, with many articles reporting only one metric recorded at a single concentration, which consequently precludes a regression approach instead of a classification approach. The topic is complicated further by a lack of consensus on the definition of a key metric, MHC. Most studies define it as the minimum hemolytic concentration, but differ on the specific criteria, with different studies defining it as the concentration at which 5%, 10%, 50% or even 100% hemolysis occurs. Some even define it as the maximum concentration that does not cause any hemolysis 130 . Ideally, studies would present the analysis of hemolytic activity as a series of measurements undertaken at several concentrations, which would allow for a fuller understanding of the toxicity-concentration profile. Until such a time, however, using the MHC values for training classifiers requires investigating its actual meaning on a case by case basis. In the course of verifying the activity of the sequences in our dataset, and comparing our dataset to the HemoPI dataset, we identified a number of instances of misclassified sequences, sequences whose hemolytic activities were not clear, and sequences whose presence in the literature we were unable to independently verify. These sequences were not included in the HAPPENN dataset.
The success and validity of a machine learning classifier are predicated on the correct definition of the problem at hand, which in this case encompasses the definition of positive and negative datasets. Both the positive and negative datasets consist of experimentally validated peptide sequences which exhibit antimicrobial or other biological activities. Unlike HemoPI and HemoPred, we chose not to conduct a machine learning experiment where the negative dataset consists of peptides randomly extracted from proteins in Swiss-Prot, as a machine learning classifier is most dependable when only one property of interest is varied. Using randomly extracted sequences as the negative dataset, and hemolytic antimicrobial peptides as the positive set, likely results in the classifier learning to predict general membrane activity, rather than specifically activity against eukaryotic erythrocytes. The authors believe that the HAPPENN dataset represents a major improvement on the HemoPI datasets, both in terms of size and reliability, as it contains 3738 peptides with confirmed biological activities, compared to the 904 and 1623 sequences present in the HemoPI-2 and HemoPI-3 datasets, and therefore has been made available for download both as supplementary information and on the server's website.
Once a reliable dataset was constructed, the peptide sequences were translated into vectors of physicochemical and composition features, and a number of different machine learning approaches, namely support vector machines, random forests and neural networks, were trialled for relating the peptides' features to their hemolytic activities. The neural network approach proved most promising, and was therefore retained, further optimised, and had its predictive power thoroughly evaluated by means of tenfold cross-validation and external validation on an independent test set.
The final neural network model achieved a tenfold cross-validated accuracy, sensitivity and specificity of 85.66%, an MCC of 0.71, and an AUC of 0.90. The validation statistics demonstrated that the model is capable of discriminating between hemolytic and non-hemolytic peptides, and that it exhibits minimal bias towards one class or another. The model performs very well compared to the existing methods, with a 35.3% decrease in cross-validated error relative to HemoPI and HemoPred. The model's residual prediction error rate can likely be attributed to a limited sample size for neural networks to accurately learn from, as well as the fine boundary between the definition of hemolytic and non-hemolytic peptides combined with the margin of error associated with the experimental determination of hemolytic activity. Further improvements to the predictive power of www.nature.com/scientificreports/ the neural network approach are possible and indeed expected, as the number of peptides in the literature with characterised hemolytic activity increases. The main HAPPENN dataset was not redundancy-reduced, as even a single amino acid substitution can affect a peptide's bioactivity. Nonetheless, to determine to what extent sequence similarity contributes to the model's performance, the experiments were repeated with a redundancy-reduced dataset, which achieved a crossvalidated accuracy of 82.73% and MCC of 0.65. While these values are lower than the non-redundancy-reduced dataset, they illustrate that the majority of the predictive power of the model is not derived from sequence similarity, especially considering that neural networks perform best when trained with larger datasets, and redundancy reduction significantly reduces the amount of data available for training. Finally, to ascertain the model's power in distinguishing between similar peptides with different hemolytic activities, a model was trained on the HAPPENN-hard dataset, which contains the positive examples from the HAPPENN-RR90 dataset, and for each positive example, the most compositionally similar non-hemolytic peptide, as measured by the Euclidean distance between their amino acid composition vectors. Despite the similarity between the positive and negative peptides, the model achieves a respectable accuracy of 77.54% and MCC of 0.55.
Interestingly, the results of training a neural network with reduced feature spaces are in close agreement with the unsupervised principal component analysis (PCA) (Fig. 5) and t-distributed Stochastic Neighbour Embedding (t-SNE) (Fig. 6) analysis. The neural network trained on just the physicochemical features had the least predictive power among the networks trained, which coincides with the physicochemical features' PCA plot having the least separation between the hemolytic and non-hemolytic classes. The relatively lower predictive power of the physicochemical descriptors highlights the need for the development of novel, peptide-specific descriptors that account for their capacity to adopt complex three-dimensional structures. When trained on the redundancy-reduced dataset, the difference in predictive power between the compositional and physicochemical descriptors is reduced.
To ascertain the source of misclassification of the wrongly predicted peptides, the main model's false positives and false negatives were highlighted on the PCA and t-SNE plots. It is apparent that many of the misclassifications occur due to the peptides' compositional and/or physicochemical similarity to peptides with differing hemolytic activity. To gain further insight into the source of misclassification, the peptide with the most similar percentage amino acid composition but opposite hemolytic activity was identified for each wrongly predicted peptide. For 16% of misclassified peptides, a compositionally identical peptide with opposite hemolytic activity was identified, compared with just 5% for correctly classified peptides. Overall, misclassified peptides had a smaller Euclidean distance to their most compositionally similar opposite-activity peptide than correctly classified peptides did.
To gain insight into which features were most important for hemolytic activity, the importance assigned to features by random forests was investigated. Hydrophobicity, as quantified by a selection of different metrics, appears to be critical for hemolytic activity, with more hydrophobic sequences generally being found to be more hemolytic than less hydrophobic sequences. These findings are not surprising, and are consistent with the available literature 131,132 . A number of compositional descriptors were also found to be indicative of hemolytic propensity, with FS, LH, KIK, VAK and VLK being ranked as important.
HAPPENN's power is demonstrated by an alanine scan applied to maximin 3, a non-hemolytic peptide 127,133 . Interestingly, the classifier predicts maximin 3 and all of its alanine scan mutants to be non-hemolytic, with the single exception of [E20A]maximin 3, which is consistent with the literature, which acknowledges the relationship between a peptide's net charge and its hemolytic activity 134 .
This study presents a significant improvement in the area of in silico hemolytic activity classification, with its results forming the new state-of-the-art. The novel application of a neural network combined with the HAPPENN dataset's superior data quality and quantity has facilitated a 35% decrease in classification error, compared to the results achieved by the best currently available tools.
To conclude, accurate prediction of hemolytic activity of antimicrobial peptides can facilitate in silico design of novel peptide-based therapeutics, thereby accelerating the design phase and reducing its cost. HAPPENN distinguishes itself from existing methods through its focus on antimicrobial peptides, more accurate prediction and incorporation of novel features.
Although HAPPENN displays advantages compared to competing methods, it is limited by the lower interpretability of the neural network's hidden layers. Prediction of hemolytic activity from primary sequence remains a challenging problem, as it is characterised by a complex interplay between numerous features, which also contribute to the desirable antimicrobial activities. Nonetheless, HAPPENN possesses an error rate 35% lower than the most accurate existing classifiers, and we believe that this work will aid future studies focused on the identification and design of novel peptide therapeutics.

Web server implementation
To best serve the scientific community, we have made the classifier algorithm available online at https ://resea rch.timmo ns.eu/happe nn in the form of an easy to use web-server, which is available for free use by academic researchers. The web server is capable of predicting the hemolytic activity of peptides' based on their primary sequence, as well as the presence or absence of N-terminal acetylation or C-terminal amidation modifications. Prediction is limited to peptides composed of the 20 natural amino acids; non-natural amino acids are not supported. The web server possesses many features. Neural network models trained on the HAPPENN, HAPPENN-RR90 and HAPPENN-hard datasets are available for prediction.
Hemolytic activity prediction. Hemolytic activity prediction is available for both single and multiple sequences. The user should submit the peptide sequence or sequences in FASTA format, select the neural network model they wish to use for prediction, and the server will return the probability of the peptide being hemo-Scientific RepoRtS | (2020) 10:10869 | https://doi.org/10.1038/s41598-020-67701-3 www.nature.com/scientificreports/ lytic, based on the neural network's prediction. This probability is on a scale of 0-1, where 0 is most probably non-hemolytic and 1 is most probably hemolytic.

Mutation analysis.
Mutation analysis is available for single sequences, provided in FASTA format. After inputting the sequence, the user should select the mutation analysis option, input the residue number that they wish to mutate, and run the prediction. The server will predict the hemolytic activity of each of the peptide's mutants attained by substituting the residue at the selected position with each of the other natural 20 amino acids.
Residue scan. A residue scan, for instance an alanine-scan, is available for single sequences provided in FASTA format. After inputting the sequence, the user should select the residue scan option, choose the residue they wish to scan with and run the prediction. The server will predict the hemolytic activity of each of the peptide's mutants attained by substituting successive residue positions with the selected residue.

Data availability
All data generated or analysed during this study are included in this published article's supplementary data sets.