Predicting cell-penetrating peptides using machine learning algorithms and navigating in their chemical space

Cell-penetrating peptides (CPPs) are naturally able to cross the lipid bilayer membrane that protects cells. These peptides share common structural and physicochemical properties and show different pharmaceutical applications, among which drug delivery is the most important. Due to their ability to cross the membranes by pulling high-molecular-weight polar molecules, they are termed Trojan horses. In this study, we proposed a machine learning (ML)-based framework named BChemRF-CPPred (beyond chemical rules-based framework for CPP prediction) that uses an artificial neural network, a support vector machine, and a Gaussian process classifier to differentiate CPPs from non-CPPs, using structure- and sequence-based descriptors extracted from PDB and FASTA formats. The performance of our algorithm was evaluated by tenfold cross-validation and compared with those of previously reported prediction tools using an independent dataset. The BChemRF-CPPred satisfactorily identified CPP-like structures using natural and synthetic modified peptide libraries and also obtained better performance than those of previously reported ML-based algorithms, reaching the independent test accuracy of 90.66% (AUC = 0.9365) for PDB, and an accuracy of 86.5% (AUC = 0.9216) for FASTA input. Moreover, our analyses of the CPP chemical space demonstrated that these peptides break some molecular rules related to the prediction of permeability of therapeutic molecules in cell membranes. This is the first comprehensive analysis to predict synthetic and natural CPP structures and to evaluate their chemical space using an ML-based framework. Our algorithm is freely available for academic use at http://comptools.linc.ufpa.br/BChemRF-CPPred.


Results and discussion
In this study, we proposed BChemRF-CPPred, an ML-based framework that applies an artificial neural network, a support vector machine, and a Gaussian process classifier to predict CPP structures using structure-based descriptors (physicochemical and structural properties) related to the permeability of these structures into cell membranes and the presence of polar charged groups [62][63][64]; and sequence-based descriptors obtained from the primary structure of the peptides. We compared the overall performance of our proposed framework with four state-of-the-art methods and validated the results using statistical analysis to evaluate the feature correlation, spatial distribution of peptide properties, and information gain of the applied properties. Moreover, we evaluated the chemical space of these peptides using statistical methods and correlated them with previous conventional filters applied to predict cell permeability.
Cell-penetrating peptides present chemical space beyond the intervals dictated by conventional filters. Over the years, the pharmaceutical industry and medicinal chemists have determined principles for drug-like molecules and predicted their permeability in biological membranes 42,[65][66][67] . Efficiency in membrane permeation has been pointed out as a crucial factor for the bioavailability of therapeutic molecules 68 . Different studies have demonstrated that physicochemical and structural properties of peptides are outside the traditional chemical space present in the approved drugs [69][70][71] . These findings have helped to drive the design and discovery of novel compounds that occupy the chemical space beyond the intervals dictated by the Lipinski rules-of-five (RO5) filter 42,64 .
The structural flexibility of compounds might influence their translocation in the mobile aqueous phase due to the reduced entropic environment of the cell membranes 62 . In contrast, the flexibility might increase the entropic barriers of molecules, impairing or decreasing their affinity with the molecular targets, when compared with their restrained and cyclic counterparts 63,72 . High molecular weight (MW), topological polar surface area (tPSA), and the number of rotatable bonds (NRB) have been reported as the main limitations of some molecules to cross the cell membrane by passive permeation due to the increased molecular volume, and complexation with water molecules 62,65 .
Comparing our results with those of clinically approved peptides for oral use, we identified that CPPs have an increased MW (331.48-3750.51) and tPSA (101.29-1782.83) 71 . Because different mechanisms of cell membrane penetration have been reported, these discrepancies could reflect mechanisms other than passive diffusion, such as pore formation or endocytosis, representative of TP-10 and caveolin-1, respectively [73][74][75] . The MW and tPSA values found for the analyzed CPP structures are better correlated with values previously found for linear and cyclic pentapeptides 69 .
The tPSA is correlated with the H-bond pattern of an investigated molecule in an aqueous solvent 76 . [...] passive diffusion, such as pAntp, have been described with some chameleon-like properties, i.e., they can change their conformation by exposing polar groups in an aqueous medium but hiding them when traversing the cell membranes 78 . It is interesting to note that a previous study identified that highly permeable peptoids and peptides showed average tPSA values of 335.30 Å² and 358.80 Å², respectively 79 . These results differ from those found for our analyzed CPP datasets, which showed an average of 852.42 Å². It has been demonstrated that flexible molecules can form intrachain H-bond interactions, thus adaptively reducing their polar surface and improving the permeation into the cell membranes 80 . In this study, the molecular flexibility and complexity were measured by two structural properties: the fraction of sp3-hybridized carbon atoms (Fsp3) and the number of rotatable bonds (NRB) (Table S1). Recently, Doak et al. (2014) extended the NRB value previously found by Veber rules and indicated that bioavailable drugs present NRB < 20 62,64 . Our analyses demonstrated that CPPs exceed the maximum NRB value indicated for oral drugs and peptides 69,71 , showing a range from 9 to 137 (90th percentile equal to 98.60, Table S2). Regarding Fsp3, studies have demonstrated that it is an important molecular property related to both solubility in the aqueous phase and melting point 63 . We identified that for CPPs, Fsp3 is not lower than 0.37 and does not exceed 0.84 (90th percentile 0.784). Our results are consistent with orally available peptides, which showed a 90th percentile equal to 0.79 71 .
Regarding lipophilicity, we investigated this property using the 1-octanol/water partition coefficient (cLogP). High cLogP values are related to high lipophilicity of the molecule, thus indicating better cell membrane penetration. Doak et al. (2014) indicated that cLogP in available drugs varies in the range −2 ≤ cLogP ≤ 10. Here, we found that the evaluated CPP dataset showed −42.12 ≤ cLogP ≤ 2.97, which is consistent with previous findings for cyclic pentapeptides.
Hydrogen bond acceptors (HBA) and hydrogen bond donors (HBD) are relevant factors for cell permeability according to the RO5. Our results showed a consistent correlation with previous values found for linear and cyclic pentapeptides 69 . However, regarding HBD, CPPs showed a high discrepancy relative to clinically approved drugs (see Table S1) 64 .
Regarding the number of aromatic rings (NAR), our study found a 95th percentile equal to 6, with minimum and maximum limits equal to 0 and 10, respectively. Although previous studies did not report its value in analyses of the chemical space of peptides 69,71 , it is a relevant structural property related to the lipophilicity of compounds, and studies have demonstrated that the addition of an aromatic ring usually results in a statistically significant increase in the cLogP value of the compound 81 . This value represents a statistically significant component of a molecule's overall properties in the context of membrane permeability (the average NAR in oral drugs is equal to 1.6) 81 . Furthermore, this property is present in some molecular filters that analyze the permeability and drug-likeness of compounds 82,83 .
Analyzing the 90th percentile calculated for the physicochemical properties, the results reinforce that the CPP structures lie beyond the previously established chemical rules. This indicates that the molecular intervals applied to predict the permeability of peptides into the cell membrane by passive diffusion could not be correctly applied to this class of peptides, consequently leading to recognition bias and hindering the computer-aided design of CPP-like structures. The histograms of these structure-based descriptor distributions for all analyzed CPP structures are shown in Supporting Information Figure S1.
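To make the comparison with conventional filters concrete, the following minimal Python sketch checks a set of descriptor values against the Lipinski rule-of-five limits and against the extended limits quoted above from Doak et al. (NRB < 20, −2 ≤ cLogP ≤ 10). The function name, the dictionary layout, and the idea of combining the reported extremes into one profile are illustrative assumptions and not part of BChemRF-CPPred.

```python
# Limits are stored as (low, high); None means "no bound on that side".
RO5_LIMITS = {
    "MW": (None, 500.0),
    "cLogP": (None, 5.0),
    "HBD": (None, 5),
    "HBA": (None, 10),
}

# Extended limits as quoted in the text from Doak et al. (2014).
# NRB < 20 is encoded as an inclusive upper bound of 19 because NRB is an integer count.
DOAK_LIMITS = {
    "NRB": (None, 19),
    "cLogP": (-2.0, 10.0),
}


def violations(descriptors, limits):
    """Return the sorted names of descriptors that fall outside their interval.

    Descriptors that are missing from the input are skipped, not flagged.
    """
    flagged = []
    for name, (low, high) in limits.items():
        if name not in descriptors:
            continue
        value = descriptors[name]
        too_low = low is not None and value < low
        too_high = high is not None and value > high
        if too_low or too_high:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical profile built from extremes reported in the text
# (maximum MW, maximum NRB, minimum cLogP); these values need not
# belong to the same peptide.
cpp_extremes = {"MW": 3750.51, "NRB": 137, "cLogP": -42.12}

print(violations(cpp_extremes, RO5_LIMITS))   # ['MW']
print(violations(cpp_extremes, DOAK_LIMITS))  # ['NRB', 'cLogP']
```

Keeping each filter as a plain dictionary makes it easy to add further intervals (for example a tPSA bound) without changing the checking function.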
BChemRF-CPPred performed better using an optimized combination of structure- and sequence-based descriptors. In the present study, we investigated two classes of molecular descriptors: (1) structure-based descriptors, which include structural and physicochemical properties related to the permeation of molecules into biological membranes and are obtained from the molecular structures of peptides (MW, tPSA, Fsp3, cLogP, HBA, HBD, NAR, NRB, and net charge (NetC)) 64,84 , as well as some properties related to the polar charged groups that could influence their permeability (primary amine groups (NPA), number of guanidine groups (NG), and number of negatively charged amino acid groups (NNCAA)); and (2) sequence-based descriptors, i.e., information calculated from the primary structure of the peptide: amino acid composition (AAC), pseudo-amino acid composition (PseAAC), and dipeptide composition (DPC) 29,33,85 . Regarding the sequence-based descriptors, two amino acid compositions related to arginine (f[Arg]) and lysine (f[Lys]) fractions were analyzed in our algorithm due to their relevance in the characterization of this class of peptides 14,15 . We also analyzed two other descriptors in the ML-based framework: the DPC, to evaluate the presence of motifs in the CPP sequences that are relevant to their mechanism of uptake into the cell 86,87 ; and the PseAAC, to predict the overall peptide attributes 29,33,61 . The PseAAC is a theoretical molecular descriptor formed by a combination of discrete sequence correlation factors and the twenty components of the conventional amino acid composition 88 . Our algorithm accepts as input both primary and tertiary structures of peptides, in FASTA and PDB formats, respectively.
To train the ML-based frameworks that use the tertiary structure of the peptides (PDB format), we selected two datasets: a training dataset (600 peptide structures) obtained from curated databases and an independent test dataset (150 structures) obtained from the literature. In contrast, to train the ML-based frameworks that use the primary structure of the peptides (FASTA format), we considered only peptides containing natural residues in the training dataset, which accounted for a total of 241 CPPs and 300 non-CPPs; for the independent test, we considered only the natural peptides from the original dataset, which account for 60 CPPs and 75 non-CPPs.
To understand the influence of structure- and sequence-based descriptors on framework performance, we first formed four feature compositions (FCs): FC-1 containing only sequence-based descriptors (AAC, PseAAC, and DPC); FC-2 containing only structure-based descriptors (structural and physicochemical properties); FC-3 containing the best correlated sequence-based descriptors and structure-based descriptors; and FC-4 containing an optimized selection of structure- and sequence-based descriptors according to Kendall's correlation analysis (see Figures S4): AAC, PseAAC, the 10 best-correlated DPC descriptors, and the 9 best-correlated structure-based descriptors (excluding tPSA, NRB, and HBD). Second, we evaluated the prediction performance of BChemRF-CPPred and its classifiers (ANN, GPC, and SVM) using tenfold cross-validation in the training dataset (Fig. 1). The hyper-parameters of each classifier by FC are listed in Table S3.
In Fig. 1, we observed the performance of each estimator using tenfold cross-validation analyses. FC-1 and FC-2 gave the worst results: BChemRF-CPPred obtained an average accuracy of 86.5%, while the individual ML algorithms achieved values between 85.5 and 86.5% for FC-1 and between 82.6 and 88% for FC-2.
The framework that used FC-3 obtained an average accuracy of 87.83%, while ANN, GPC, and SVM achieved 88%, 86.5%, and 88.83%, respectively. Considering FC-4, BChemRF-CPPred achieved an accuracy of 87.66%, and the same classifiers obtained average accuracies of 87.5%, 84.16%, and 89%, respectively. Although FC-3 reached a slightly better average accuracy than FC-4, the Kruskal-Wallis H test (p value = 0.820) showed no statistically significant difference between the accuracies obtained by the frameworks that used these FCs. Furthermore, the framework that uses FC-4 (43 descriptors) is less complex than the one that uses FC-3 (73 descriptors).
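The Kruskal-Wallis H test used here to compare fold-wise accuracies can be sketched in plain Python for the two-group case. This is a minimal illustration, not the authors' script: the fold accuracies are invented placeholders, the function names are hypothetical, and the closed-form p value relies on the chi-square distribution with one degree of freedom, which applies only when exactly two groups are compared.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_two_groups(a, b):
    """Return (H, p) for two samples, using the standard tie correction."""
    pooled = list(a) + list(b)
    n = len(pooled)
    ranks = average_ranks(pooled)
    rank_sum_a = sum(ranks[:len(a)])
    rank_sum_b = sum(ranks[len(a):])
    h = 12.0 / (n * (n + 1)) * (rank_sum_a ** 2 / len(a) + rank_sum_b ** 2 / len(b))
    h -= 3 * (n + 1)
    # Tie correction: divide H by 1 - sum(t^3 - t) / (n^3 - n) over tied groups.
    counts = {}
    for value in pooled:
        counts[value] = counts.get(value, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    if correction == 0:  # every observation is identical
        return 0.0, 1.0
    h = max(h / correction, 0.0)  # guard against tiny negative rounding errors
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(h / 2))
    return h, p


# Placeholder fold accuracies for two feature compositions (NOT the study's data).
fc_a = [0.88, 0.87, 0.90, 0.86, 0.89, 0.88, 0.87, 0.91, 0.85, 0.88]
fc_b = [0.87, 0.88, 0.89, 0.86, 0.88, 0.90, 0.86, 0.89, 0.87, 0.88]
h_stat, p_value = kruskal_wallis_two_groups(fc_a, fc_b)
print(f"H = {h_stat:.3f}, p = {p_value:.3f}")
```

A large p value, as reported above for FC-3 versus FC-4, means the test gives no evidence that the two sets of accuracies come from different distributions.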
It is important to note that, although FC-1 (containing only sequence-based descriptors) and FC-2 (only structure-based descriptors) have shown relevant correlation to CPP prediction according to Kendall's correlation analysis, these descriptors in isolation do not provide enough information to satisfactorily predict the permeability of these peptides into the cell membranes. Our results showed that the optimized combination of structure- and sequence-based descriptors (FC-4) predicts natural and synthetic CPPs better than the other analyzed FCs.
The receiver operating characteristic (ROC) curves and their area under the curve (AUC) metric revealed the impact of each descriptor composition in our proposed framework (Fig. 2B). The molecular properties showed a satisfactory contribution in FC-2 and FC-3, which reached AUC values of 0.9372 and 0.933, respectively, compared with FC-1, which has only AAC, DPC, and PseAAC and obtained an AUC value of 0.8985. The descriptors present in FC-4 achieved an AUC value of 0.9536, providing more information for BChemRF-CPPred to predict the cell membrane permeation of CPPs.
The behavior of the ROC curves observed in Fig. 2B corroborates previous results, since the curve associated with the FC-4 based framework (orange curve) is closer to the upper left corner of the graph, which indicates a higher true positive rate and a lower false-positive rate in the prediction of CPPs and non-CPPs compared with the other FCs. Table 1 shows a detailed analysis of FC-4 in terms of accuracy, sensitivity, specificity, and Matthews correlation coefficient (MCC). These results show that the framework has an improved ability to correctly differentiate non-CPPs from CPPs. Furthermore, the highest MCC and one of the greatest accuracies and F1-scores, with values of 0.813, 0.906, and 0.905, respectively, show that BChemRF-CPPred is the best classifier among the four analyzed ones.
To compare our FC-4 based framework with state-of-the-art methods for CPP prediction, we divided this analysis into two experiments. The first one compares our method with tools that were trained with only natural peptides, such as MLCPP 31 , CPPred-RF 33 , and SkipCPP-Pred 89 . This group was analyzed with 60 CPPs (chemically unmodified peptides) and 75 non-CPPs from the independent test dataset. The second experiment compared our framework with Kelm-CPPpred 29 , an algorithm trained with synthetic (chemically modified) peptides, using the original independent dataset. Table 2 compares the performance of previous ML-based frameworks trained with and without synthetic peptides. These results show that, using an imbalanced dataset (first experiment) with only natural peptides, BChemRF-CPPred obtained an accuracy of 89.62%, while MLCPP, CPPred-RF, and SkipCPP-Pred reached 86.66%, 68.88%, and 62.58%, respectively.
Moreover, our framework obtained the highest values of F1-score and MCC when compared with other tools, which indicates that the structure-based descriptors provided more information to predict cell membrane permeability of natural peptides compared with sequence-based tools. The second experiment also revealed that the proposed ML-based framework achieved better outcomes in terms of accuracy, F1-score, and MCC when compared with Kelm-CPPpred, which demonstrates a high-performance prediction of CPPs by BChemRF-CPPred, including the synthetic (chemically modified) peptides containing methyl, glycyl, and other chemical groups. An accuracy of 90.66% demonstrated that the proposed framework, using an optimized combination of structure- and sequence-based descriptors, satisfactorily differentiated CPPs and non-CPPs of natural and synthetic origins.
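The metrics used in these comparisons (accuracy, sensitivity, specificity, F1-score, and MCC) all follow from the four confusion-matrix counts. The sketch below uses the standard textbook definitions; the function name and the example counts are hypothetical and are not taken from Tables 1 to 3.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion-matrix counts.

    tp/fn count CPPs predicted as CPP/non-CPP; tn/fp count non-CPPs
    predicted as non-CPP/CPP. Undefined ratios are returned as 0.0.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    sensitivity = ratio(tp, tp + fn)          # true positive rate (recall)
    specificity = ratio(tn, tn + fp)          # true negative rate
    precision = ratio(tp, tp + fp)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical counts for a balanced test set of 75 CPPs and 75 non-CPPs.
for name, value in binary_metrics(tp=60, fp=5, tn=70, fn=15).items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, the MCC uses all four counts in both numerator and denominator, which is why it remains informative on the imbalanced natural-peptide subset (60 CPPs versus 75 non-CPPs).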
Assessing the performance of BChemRF-CPPred using FASTA as input format. To evaluate the performance of BChemRF-CPPred in the prediction of CPPs using chemical data obtained from the primary and tertiary structures, we used FASTA and PDB formats, respectively, to calculate the four FCs using tenfold cross-validation (Fig. 3). To train the framework using FASTA format, we considered only peptides containing natural residues in the training dataset, which accounted for a total of 241 CPPs and 300 non-CPPs. Figure 3 shows the performance of each classifier using the FASTA format as input according to cross-validation analyses. The framework that used FC-3 reached the best performance, with an average accuracy of 86.9%, while FC-1, FC-2, and FC-4 achieved values between 84.13 and 86.71%. When compared with the performance of BChemRF-CPPred with PDB input, the cross-validation of the framework with FASTA input showed a lower performance for FC-2, FC-3, and FC-4, whose accuracy values for PDB format were 86.5%, 87.83%, and 87.66%, respectively. Our analyses of the performance of BChemRF-CPPred in the independent test revealed that, using only natural peptides in FASTA format as input, FC-1 (composed only of sequence-based features) obtained an accuracy of 86.56%, while FC-2, FC-3, and FC-4 achieved values of 85.07%, 85.82%, and 85.2%, respectively (Fig. 4). Table 3 compares the performance between the FASTA-input-based framework, using all FCs (FC-1 to FC-4), and the PDB-input-based one with FC-4. This independent test uses the same testing dataset described in experiment 1 (see Table 2), which has only natural peptides.
Table 2. Comparison of the performance of previous ML-based frameworks (MLCPP, CPPred-RF, and SkipCPP-Pred) and FC-4 based BChemRF-CPPred using only natural peptides from the independent dataset (1st experiment), as well as the evaluation of the performance of Kelm-CPPpred and FC-4 based BChemRF-CPPred on the whole independent dataset (2nd experiment).
The framework that uses FC-1 obtained the best prediction results in the independent test using the FASTA format as input, i.e., the framework trained only with the sequence-based features showed higher values of accuracy, F1-score, and MCC when compared with the other frameworks that used FC-2, FC-3, and FC-4. It is important to note that both the framework based on FC-4 that uses PDB as input and the BChemRF-CPPred based on FC-1 that uses FASTA as input performed better in the prediction of natural CPPs than the previous tools CPPred-RF and SkipCPP-Pred, which reached accuracy values between 62.5 and 68.8%, F1-score values between 73.7 and 75.3%, and MCC values between 49.5 and 52.5% (Table 2).
The results also revealed that, when the framework based on FC-4 that uses PDB as input was compared with the framework based on FC-1 that uses FASTA, the Kruskal-Wallis H test (p value = 0.622) showed no statistically significant difference between the accuracies obtained by these two frameworks in the tenfold cross-validation. However, the PDB-based model achieved better performance in the independent test for all the metrics (Table 3).

An optimized combination of structure- and sequence-based descriptors improved the prediction of CPP structures.
To analyze the influence of the sequence-based (AAC, DPC, and PseAAC) and structure-based (MW, tPSA, Fsp3, cLogP, HBA, HBD, NAR, NRB, NPA, NG, NetC, and NNCAA) descriptors on the performance of CPP prediction in our ML-based framework, we extracted information entropy using the extremely randomized trees (ERT) algorithm and applied principal component analysis (PCA) to all peptide datasets.
The presence of cationic residues, such as lysine and arginine, in peptide sequences has been shown to play an important role in membrane permeation. These residues form non-covalent interactions with the anionic groups of the membrane surface. The highly basic polar groups from these residues remain protonated under physiological pH conditions, acting as hydrogen-bond donors in CPP-lipid interactions 90,91 .
Our study demonstrated that AAC, DPC, and PseAAC provided 0.650 and 0.664 of normalized cumulative information entropy (CIE), while the structure-based descriptors supplied 0.350 and 0.336 of CIE for training and independent test, respectively (Fig. 5).
Although the sequence-based features have several descriptors better correlated according to Kendall's correlation when compared to structure-based descriptors, the CIE of physicochemical and structural properties showed a better contribution to CPP prediction than the AAC and DPC contributions taken together. Structural and physicochemical properties give significant information for ML algorithms, which can be verified by the accuracies achieved in the independent test by the framework that used FC-4. The 3D PCA analyses of all datasets showed that FC-1 (Fig. 6A) and FC-2 (Fig. 6B) did not provide a clear differentiation between the CPPs and non-CPPs, which can be verified by the high level of overlap between the two groups of peptides. The normalized Bhattacharyya coefficient (BC) values were 0.361 (PC1), 0.234 (PC2), and 0.130 (PC3) for FC-1 and 0.033 (PC1), 0.374 (PC2), and 0.045 (PC3) for FC-2.
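The overlap measure reported above can be illustrated with the textbook histogram form of the Bhattacharyya coefficient, BC = Σ sqrt(p_i q_i), computed along one principal component. The text does not state how the coefficient was binned or normalized, so the shared equal-width bins, the bin count, and the function name below are assumptions made only for this sketch.

```python
import math


def bhattacharyya_coefficient(x, y, bins=10):
    """Histogram-based Bhattacharyya coefficient between two 1D samples.

    Both samples are binned on a shared, equal-width grid spanning their
    joint range. The result is 1.0 for identical histograms and 0.0 when
    the histograms share no occupied bin.
    """
    low = min(min(x), min(y))
    high = max(max(x), max(y))
    if high == low:  # all values identical: complete overlap
        return 1.0
    width = (high - low) / bins

    def normalized_histogram(values):
        counts = [0] * bins
        for value in values:
            index = min(int((value - low) / width), bins - 1)  # clamp the maximum
            counts[index] += 1
        return [count / len(values) for count in counts]

    p = normalized_histogram(x)
    q = normalized_histogram(y)
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))


# Placeholder PC1 scores for two peptide groups (NOT the study's data).
cpp_pc1 = [-1.2, -0.8, -0.5, -0.1, 0.3, 0.4, 0.9]
non_cpp_pc1 = [-0.2, 0.1, 0.5, 0.8, 1.1, 1.6, 2.0]
print(round(bhattacharyya_coefficient(cpp_pc1, non_cpp_pc1, bins=5), 3))
```

Because the value depends on the bin grid, coefficients are only comparable when every pair of groups is binned in the same way.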
The Kruskal-Wallis H test applied to the three principal components of each 3D PCA also showed that there is no significant difference between FC-3 and FC-4: the hypothesis tests comparing the distribution of samples in PC1, PC2, and PC3 yielded p values of 0.826, 0.920, and 0.101, respectively, which indicates that the three PCs have similar distributions. These results confirmed that the optimized composition of structure- and sequence-based descriptors (FC-4) provided more significant information when compared with the other FCs, which directly impacted their cell membrane permeability prediction.
In contrast to previous ML-based approaches 31,34 , our findings demonstrated that the combination of sequence- and structure-based descriptors related to molecule bioavailability improved the prediction of CPP structures. Structural factors, such as the presence of cyclic chains 92,93 , the secondary structure composition 94 , as well as the shape, structural complexity, and 3D pattern of constituting atoms 95 , have been shown to have a considerable influence on membrane penetration. Our analyses demonstrated that the membrane penetration of CPPs is better predicted using a hybrid feature composition containing structural and physicochemical properties as well as information from the primary structure.

Conclusions
We demonstrated that the proposed BChemRF-CPPred, with an FC composed of an optimized combination of sequence- and structure-based properties, has superior accuracy compared to FCs composed of only sequence- or only structure-based descriptors. The accuracy achieved by the proposed framework, using PDB input and sequence- and structure-based features (FC-4), was 90.66% in the independent test with natural and non-natural peptides, while in the test with only natural peptides, the models based on FASTA input, which used only sequence-based descriptors (FC-1), and on PDB input, which used FC-4, achieved accuracy values of 86.5% and 89.6%, respectively. These performances were better than those reached by some other ML-based tools that used only the sequence-based properties of the peptides as input data. However, the framework based on PDB input and FC-4 achieved better performance than the model based on FASTA input and FC-1 in the prediction of natural peptides as CPPs in the independent test. These results proved not only that our tool has a greater ability to correctly predict CPPs, but also that the optimized combination of the analyzed properties carries more significant information for the ML-based algorithms applied to the CPP prediction problem than sequence- or structure-based descriptors analyzed separately. Finally, in addition to the Trojan horse metaphor applied to CPPs in drug delivery, in the present study we demonstrated that these peptides, due to highly diverse mechanisms of membrane permeation that include pore formation and endocytosis, also break some well-established chemical rules applied to predict the bioavailability of drugs, just as the mythical Trojan horse broke the rules of war.

Material and methods
Datasets of CPPs and non-CPPs structures. Our datasets of peptide structures were obtained from two curated and validated CPP databases. The CPP structures were obtained from CPPsite2.0, a chemo-structural database with more than 1700 experimentally validated CPPs with different structural properties (linear/cyclic; and modified/non-natural residues) and a wide range of applications for cargo transport into the cell 96 . Moreover, 411 CPPs and 411 non-CPPs were obtained from the C2Pred server 35 . Additionally, we obtained 112 CPP and 37 non-CPP structures from previously published works and pharmaceutical catalogs 32,97,98 . The BChemRF-CPPred algorithm was trained and tested with datasets composed of primary and tertiary structures of peptides in FASTA (only natural peptides) and PDB (natural and synthetic peptides) formats, respectively. For peptides without resolved structures in the PDB, the structures were predicted using the PEP-FOLD3 server 99 , and the peptides' features were extracted to compose the CPP and non-CPP datasets. PEP-FOLD has been reported to predict peptide structures with high accuracy, obtaining lowest-energy conformations that differ by 3.3 Å RMSD-Cα from the experimental structures 99 . In addition, it is important to highlight that the structure-based descriptors (NRB, NAR, cLogP, HBA, HBD, etc.) analyzed in the present study are not related to the peptide folding, i.e., the formation of secondary (α-helices and β-strands) and tertiary structures.
In the pre-processing stage, the general dataset was filtered by peptide length, which was limited to between 5 and 30 amino acid residues, and duplicate and outlier structures (z-score ≥ 3 in peptide features) were removed using the Pandas data analysis library for Python 100 . Finally, we organized a training dataset with 300 CPPs and 300 non-CPPs and an independent test dataset with 75 CPPs and 75 non-CPPs (Tables S5 and S6). Both datasets were balanced with a random selection of the structures.
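The three pre-processing filters described above (length window, duplicate removal, and z-score outlier removal) can be sketched as follows. The study used Pandas; this illustration swaps in the Python standard library so that it stays self-contained. The record layout, the function name, the use of a single feature, the absolute z-score, and the population standard deviation are all assumptions of the sketch.

```python
from statistics import mean, pstdev


def preprocess(records, feature="MW", min_len=5, max_len=30, z_max=3.0):
    """Filter peptide records by length, duplication, and z-score.

    records: list of dicts with a 'sequence' string and numeric features.
    Returns the records that pass all three filters, in input order.
    """
    # 1) Keep peptides with 5 to 30 residues (inclusive).
    in_range = [r for r in records if min_len <= len(r["sequence"]) <= max_len]

    # 2) Drop duplicate sequences, keeping the first occurrence.
    seen = set()
    unique = []
    for record in in_range:
        if record["sequence"] not in seen:
            seen.add(record["sequence"])
            unique.append(record)

    # 3) Remove outliers whose absolute z-score on the feature is >= z_max.
    if not unique:
        return unique
    values = [r[feature] for r in unique]
    center = mean(values)
    spread = pstdev(values)  # population standard deviation (assumption)
    if spread == 0:
        return unique
    return [r for r in unique if abs(r[feature] - center) / spread < z_max]


# Hypothetical records; the molecular weights are placeholders.
example = [
    {"sequence": "GRKKRRQRRRPPQ", "MW": 1719.0},
    {"sequence": "GRKKRRQRRRPPQ", "MW": 1719.0},  # duplicate, removed
    {"sequence": "RRR", "MW": 486.0},             # too short, removed
]
print(len(preprocess(example)))  # 1
```

With very small datasets the z-score filter can never trigger, because the largest attainable z-score grows with the sample size; the filter is meaningful only on datasets of realistic size.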
Calculation of sequence- and structure-based descriptors. The molecular properties related to cell membrane permeation were calculated for the CPP and non-CPP libraries using both PDB and FASTA formats.
We selected the following twelve structure-based descriptors: MW, tPSA, Fsp3, cLogP, HBA, HBD, NAR, NRB, NPA, NG, NetC, and NNCAA. [...] The second one refers to 22 descriptors of the pseudo-amino acid composition (PseAAC) 88 , which are related to the hydrophobicity (H1), hydrophilicity (H2), and side-chain mass (M) along with the local sequence order, and can be calculated according to Eqs. (5) and (6), where L is the total number of residues in the peptide, λ is the correlation factor that reflects the sequence order of all the most contiguous residues along a protein chain, and R_i is the ith amino acid. These properties were selected based on the general composition of CPP sequences 14 . Table 4 shows how all the descriptors were grouped into four different feature compositions, named FC-1 to FC-4. FC-1 grouped only amino acid composition and sequence-based descriptors, FC-2 used the twelve structure-based properties, FC-3 is the grouping of all analyzed descriptors, and FC-4 grouped the best-correlated sequence- and structure-based descriptors, according to Kendall's correlation.
The sequence- and structure-based descriptors were calculated with the RDKit 101 package for Python, except for the DPC and PseAAC, which were calculated using the PyBioMed 102 package, and the NetC, which was extracted from the structures using the Biopython package 103 .
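The simplest sequence-based descriptors, the arginine and lysine fractions of Eqs. (2) and (3) and the dipeptide composition of Eq. (4), can be illustrated without any external package. The sketch below is a plain-Python stand-in and does not reproduce the RDKit or PyBioMed APIs; the function names are hypothetical, and the DPC denominator is taken to be the L − 1 overlapping dipeptides of the sequence, which is one common reading of Eq. (4) and an assumption here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 natural residues, one-letter codes


def residue_fraction(sequence, residue):
    """Fraction of positions in the sequence occupied by the given residue."""
    return sequence.count(residue) / len(sequence)


def dipeptide_composition(sequence):
    """Return the 400 dipeptide frequencies of a sequence.

    Each frequency is the count of one overlapping dipeptide divided by the
    number of dipeptides in the sequence (L - 1). Dipeptides containing a
    non-natural symbol are ignored in the counts.
    """
    if len(sequence) < 2:
        raise ValueError("a dipeptide composition needs at least two residues")
    total = len(sequence) - 1
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(total):
        pair = sequence[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    return {pair: count / total for pair, count in counts.items()}


# Tat (48-60), a well-known arginine-rich CPP, used only as an example input.
tat = "GRKKRRQRRRPPQ"
print(f"f[Arg] = {residue_fraction(tat, 'R'):.3f}")  # f[Arg] = 0.462
print(f"f[Lys] = {residue_fraction(tat, 'K'):.3f}")  # f[Lys] = 0.154
print(f"DPC[RR] = {dipeptide_composition(tat)['RR']:.2f}")  # DPC[RR] = 0.25
```

The high f[Arg] and the enrichment of the RR dipeptide in this example reflect the cationic character that the text highlights as typical of CPPs.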
To calculate some structure-based descriptors from PDB or FASTA format, RDKit constructs a molecular structure of the peptide by reading the file information. For PDB format, the package reads the atoms, the sequence numbers, and the coordinates present in the file to form a graph with atomic bonds and dihedral angles that represents the molecule as a computational object, in which each vertex of the graph is an atom and each edge is a bond. To construct the 3D representation of the peptide from FASTA format, RDKit reads the primary structure of the peptide and applies graph theory using a list of predefined structures that match the conformation of the residues and their neighbors. This information can be consulted in the RDKit API documentation at www.rdkit.org/docs/cppapi/ROMol_8h_source.html.
$$\mathrm{Fsp}^3 = \frac{\text{number of } sp^3 \text{-hybridized carbons}}{\text{total carbon count}} \tag{1}$$
$$f[\mathrm{Arg}] = \frac{\text{number of arginine residues}}{\text{total residue count}} \tag{2}$$
$$f[\mathrm{Lys}] = \frac{\text{number of lysine residues}}{\text{total residue count}} \tag{3}$$
$$\mathrm{DPC}_j = \frac{\text{number of dipeptides}(j)}{\text{total number of all possible dipeptides}} \tag{4}$$
BChemRF-CPPred to predict CPP permeability. Each ML-based algorithm received structure- and sequence-based descriptors to predict CPP and non-CPP structures using a probability scale that ranges from 0 to 1, where values > 0.5 were assigned to CPPs and values ≤ 0.5 to non-CPPs. The voting classifier calculates the average of the estimated probabilities, and the result provides a prediction of CPPs using binary labels, where 0 corresponds to non-CPPs and 1 to CPPs (Fig. 7). The hyper-parameters of the ML algorithms were tuned using Grid Search, a method that optimizes parameters by cross-validated exhaustive search over a parameter grid. This method was applied to each algorithm by FC to obtain the best classifier model for the tenfold cross-validation and independent tests (Fig. 8). The range of the search parameters adjusted for each ML-based algorithm and their best models are shown in Tables S3 and S4, respectively. All frameworks and their configuration processes were implemented using the Scikit-learn package for Python.
Calculation of information gain. The data mining process used to explore the information gain provided by each FC in the peptide dataset was based on the extremely randomized trees 107 and principal component analysis 106,108 algorithms. Extremely randomized trees are ensembles of unpruned decision trees that split nodes at randomly generated cut-points. This technique computes the importance of features using the information entropy criterion. The higher the entropy, the greater the amount of information provided by the data.
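The entropy criterion behind this feature-importance calculation can be illustrated with a single randomly cut split. The sketch below is not the Scikit-learn implementation (in that library the corresponding estimator is presumably an extremely randomized trees classifier configured with an entropy criterion); it only shows, with hypothetical function names and placeholder data, how the information gain of one random cut-point on one feature is obtained from Shannon entropy.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(values, labels, cut):
    """Entropy reduction obtained by splitting the samples at values <= cut."""
    left = [label for value, label in zip(values, labels) if value <= cut]
    right = [label for value, label in zip(values, labels) if value > cut]
    total = len(labels)
    weighted_child_entropy = sum(
        len(part) / total * entropy(part) for part in (left, right) if part
    )
    return entropy(labels) - weighted_child_entropy


# Placeholder feature values (e.g. an arginine fraction) and labels
# (1 = CPP, 0 = non-CPP); NOT the study's data.
feature = [0.05, 0.10, 0.12, 0.30, 0.35, 0.46]
label = [0, 0, 0, 1, 1, 1]

# An extremely randomized tree draws the cut-point at random within the
# feature's range instead of searching for the best threshold.
rng = random.Random(0)
cut_point = rng.uniform(min(feature), max(feature))
print(f"cut = {cut_point:.3f}, gain = {information_gain(feature, label, cut_point):.3f}")
```

Averaging such gains over many random cuts and many trees gives a per-feature importance, which can then be summed by descriptor group to compare sequence-based and structure-based contributions.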
Principal component analysis is an unsupervised machine learning technique used to reduce a high-dimensional dataset to a lower-dimensional representation, whose axes are called principal components (PCs). This representation makes the distribution of samples in space easier to understand.
ERT and PCA were implemented using the Scikit-learn package and applied to the CPP structure library containing the peptides from the training and independent test datasets.