Identifying anticancer peptides by using improved hybrid compositions

Cancer is one of the major threats to human life. Identifying anticancer peptides is important for developing effective anticancer drugs. In this paper, we developed an improved predictor to identify anticancer peptides. The amino acid composition (AAC), the auto covariance of the average chemical shift (acACS) and the reduced amino acid composition (RAAC) were selected as features, and the support vector machine (SVM) was used as the classifier. The overall prediction accuracy reached 93.61% in the jackknife test. The results indicate that the combined parameter is helpful for predicting anticancer peptides.


Results
The prediction of anticancer peptides. To predict anticancer peptides, it is important to choose a suitable classifier and a reasonable set of information parameters derived from the protein sequence. In this paper, the local amino acid composition (AAC), the auto covariance of the average chemical shift (acACS) and the reduced amino acid composition (RAAC) were selected to predict the anticancer peptides.
The acACS vectors were formed from the protein sequence, and the best λ and i were then selected. To obtain the best performance in predicting anticancer peptides, the combination of chemically shifted atoms and the correlation factor λ were optimized for maximum accuracy. Fig. 1 shows that the accuracy was highest when λ = 5, and Fig. 2 shows that the prediction result was best when the combination mode of chemically shifted atoms was $^{1}\mathrm{H}_{N} + {}^{13}\mathrm{C}_{\alpha} + {}^{15}\mathrm{N}$. Therefore, the combination mode $^{1}\mathrm{H}_{N} + {}^{13}\mathrm{C}_{\alpha} + {}^{15}\mathrm{N}$ was selected and the correlation factor λ was set to 5 for generating the acACS feature vectors.
To facilitate comparison, the benchmark dataset (see Equation (5)) generated by Hajisharifi et al. [10] was employed. The predictive results for anticancer peptides based on AAC, acACS, RAAC, AAC + RAAC, AAC + acACS, RAAC + acACS and AAC + RAAC + acACS, obtained using the SVM with the jackknife test, are recorded in Table 1. The results show that the combined parameter AAC + RAAC + acACS performed better than the other parameters. With this combined parameter and the SVM, the overall predictive accuracy ($Q_A$) and Matthew's correlation coefficient (MCC) in the jackknife test were 93.61% and 0.867, respectively. The results indicate that the combined parameter is helpful for predicting anticancer peptides.
To estimate the effectiveness of the new prediction method, an independent dataset (see Equation (6)) generated by Chen et al. [11] was employed. An independent dataset is not strictly needed for validating a predictor via the jackknife or K-fold cross-validation, since the outcome obtained via the jackknife or K-fold cross-validation on the benchmark dataset is in effect a combination of many different independent dataset tests. The combined parameter AAC + RAAC + acACS was selected to identify the anticancer peptides in the independent dataset. Using the SVM, the overall predictive accuracy ($Q_A$) and Matthew's correlation coefficient (MCC) in the jackknife test were 89.33% and 0.787, respectively. These results show that the new predictive method not only achieves higher overall success rates but is also more stable.
Evaluation of the predictive performances. To evaluate the predictive capability and reliability of the algorithm, the sensitivity ($S_n$), specificity ($S_p$), overall predictive accuracy ($Q_A$) and Matthew's correlation coefficient (MCC) are defined by

$$S_n = \frac{TP}{TP + FN}$$

$$S_p = \frac{TN}{TN + FP}$$

$$Q_A = \frac{TP + TN}{TP + FN + TN + FP}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP denotes the number of correctly recognized positives, FN denotes the number of positives recognized as negatives, FP denotes the number of negatives recognized as positives, and TN denotes the number of correctly recognized negatives.
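The four measures can be computed directly from the confusion-matrix counts. The following is a minimal sketch that assumes the standard definitions of $S_n$, $S_p$, $Q_A$ and MCC; the function name `evaluate` and the example counts are hypothetical and are not taken from the paper.

```python
import math


def evaluate(tp, fn, fp, tn):
    """Return (Sn, Sp, QA, MCC) from confusion-matrix counts.

    tp: positives predicted positive, fn: positives predicted negative,
    fp: negatives predicted positive, tn: negatives predicted negative.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    sn = tp / (tp + fn)                      # sensitivity
    sp = tn / (tn + fp)                      # specificity
    qa = (tp + tn) / (tp + fn + fp + tn)     # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the matrix is empty;
    # returning 0.0 in that case is a convention assumed here.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, qa, mcc


# Hypothetical counts, for illustration only.
sn, sp, qa, mcc = evaluate(tp=8, fn=2, fp=1, tn=9)
```

MCC is reported alongside $Q_A$ because it stays informative when the two classes differ in size, as they do in the benchmark dataset.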

Jackknife cross-validation test. In statistical prediction, the following three cross-validation methods are often used to examine a predictor for its effectiveness in practical application: the independent dataset test, the sub-sampling test, and the jackknife cross-validation test [12]. Among these three methods, the jackknife test is deemed the most objective and has been used to examine the performance of various predictors [13-16]. Hence the jackknife test was used to evaluate the performance of our method. During the jackknife test, each peptide is singled out in turn for testing and the remaining peptides are merged for training.
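The jackknife procedure amounts to a leave-one-out loop. The sketch below illustrates it with a nearest-centroid classifier as a stand-in; that classifier is an assumption chosen to keep the example dependency-free and is not the SVM used in this work. All function names and the toy data are hypothetical.

```python
def jackknife_predictions(features, labels, train_fn, predict_fn):
    """Hold out each sample once, train on the rest, predict the held-out one."""
    predictions = []
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train_fn(train_x, train_y)
        predictions.append(predict_fn(model, features[i]))
    return predictions


# Stand-in classifier (NOT the SVM of the paper): nearest class centroid.
def train_centroids(xs, ys):
    centroids = {}
    for label in set(ys):
        rows = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, x):
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(x, center))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Toy two-class data, for illustration only.
toy_x = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
toy_y = [0, 0, 0, 1, 1, 1]
preds = jackknife_predictions(toy_x, toy_y, train_centroids, predict_centroid)
```

Because every sample is tested exactly once on a model that never saw it, the procedure yields a single, deterministic result for a given dataset, which is why it is regarded as objective.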

Discussion
Hajisharifi et al. [10] used SVM-based classification built on their PseAAC parameter, with the local alignment kernel as a string kernel method, in conjunction with a 5-fold cross-validation test on the same benchmark dataset (see Equation (5)). To compare the predictive capability of our method, the results of Hajisharifi's and Chen's methods on the same dataset are listed in Table 2. The comparison shows that our method outperforms Hajisharifi's method: the $S_n$, $S_p$, $Q_A$ and MCC of our method are about 5.40%, 3.92%, 4.49% and 0.095 higher, respectively, than those of Hajisharifi's method with the 5-fold cross-validation test. Although only the $S_n$ obtained by our method is higher than that of Chen's method with the 5-fold cross-validation test, our method captures the features of anticancer peptides more comprehensively. At the least, our method can play a complementary role to the existing methods in this area. The predictive results indicate that the combined parameter AAC + RAAC + acACS is effective for predicting anticancer peptides. From the discussion above, our method has the advantages of more comprehensive features and higher predictive success rates than Hajisharifi's method. The combined parameter AAC + RAAC + acACS successfully enhances the prediction quality for anticancer peptides. This method may have broad application in protein and DNA motif identification.

Materials and Method
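Datasets. The benchmark dataset was generated by Hajisharifi et al. [10] and can be expressed as

$$S = S_{\text{anticancer}} \cup S_{\text{non-anticancer}} \qquad (5)$$

where $S_{\text{anticancer}}$ consists of 138 anticancer peptides and $S_{\text{non-anticancer}}$ consists of 206 non-anticancer peptides. To estimate the effectiveness of the new prediction method, an independent dataset was employed and can be expressed as

$$S' = S'_{\text{anticancer}} \cup S'_{\text{non-anticancer}} \qquad (6)$$

where $S'_{\text{anticancer}}$ consists of 150 anticancer peptides and $S'_{\text{non-anticancer}}$ consists of 150 non-anticancer peptides. The samples in $S'_{\text{anticancer}}$ and $S'_{\text{non-anticancer}}$ were taken from the dataset used by Chen et al. [11], and none of the sequences in $S'$ is identical to any sequence in $S$.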

Support Vector Machine (SVM). The support vector machine (SVM) is a widely used classification method based on statistical learning theory [18-25]. The SVM is particularly attractive for biological sequence analysis because of its ability to handle noise, large datasets and large input spaces. An SVM model represents the examples as points in a space, mapped by a kernel function so that the examples of the two classes are separated by a gap that is as wide as possible. New examples are mapped into the same space and classified according to which side of the gap they fall on. The radial basis function (RBF) kernel was used to obtain the best classification hyperplane. The regularization parameter C and the kernel width parameter γ were tuned by grid search. For a brief formulation of the SVM and how it works, see refs. [26,27]. In this paper, the LibSVM package [28], which can be downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm/, was used to predict the anticancer peptides.
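The RBF kernel and the grid search over C and γ can be sketched as follows. This is an illustration under stated assumptions rather than the configuration actually used: the exponent ranges mirror the defaults of the `grid.py` script shipped with LIBSVM (the ranges searched in this work are not stated), and `toy_score` is a hypothetical stand-in for the jackknife accuracy of an SVM trained with a given (C, γ).

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def grid_search(score_fn, log2c_values, log2g_values):
    """Evaluate score_fn(C, gamma) on a log2 grid; return (best_score, C, gamma)."""
    best = None
    for log2c in log2c_values:
        for log2g in log2g_values:
            c, gamma = 2.0 ** log2c, 2.0 ** log2g
            score = score_fn(c, gamma)
            if best is None or score > best[0]:
                best = (score, c, gamma)
    return best


# Hypothetical scoring surface with its maximum at C = 2^3, gamma = 2^-5.
# In practice this would be the cross-validated accuracy of the classifier.
def toy_score(c, gamma):
    return -((math.log2(c) - 3) ** 2 + (math.log2(gamma) + 5) ** 2)


# Assumed ranges: log2(C) in -5..15 and log2(gamma) in 3..-15, step 2.
best_score, best_c, best_gamma = grid_search(
    toy_score, range(-5, 16, 2), range(3, -16, -2)
)
```

Searching on a logarithmic grid is the usual choice because classifier performance tends to vary with the order of magnitude of C and γ rather than with small absolute changes.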
The local amino acid composition (AAC). The information parameters are very important for prediction algorithms. In a sequence-based predictor, the most important issue is how to extract features from the primary sequences of proteins [29-32]. The primary sequences of proteins are composed of 20 amino acids, and the occurrence frequencies of these 20 amino acids in a protein are important features. Hence, the occurrence frequencies of the 20 amino acids in the protein sequence are taken as the information parameters of a protein and can be defined as

$$F = [f_1, f_2, \ldots, f_{20}]^{T}$$

where $f_i$ ($i = 1, 2, \ldots, 20$) is the occurrence frequency of the $i$-th native amino acid, calculated by

$$f_i^{m} = \frac{1}{k_m} \sum_{j=1}^{k_m} \frac{n_{i,j}^{m}}{N_j^{m}}$$

where $n_{i,j}^{m}$ is the number of occurrences of the $i$-th amino acid in the $j$-th protein of the $m$-th group, $N_j^{m}$ denotes the total number of amino acids in the $j$-th protein of the $m$-th group, and $k_m$ denotes the number of samples in the $m$-th group (here $k_1 = 138$, $k_2 = 206$). For a single peptide, the frequency reduces to the ratio $n_{i,j}^{m} / N_j^{m}$. We calculated the average amino acid compositions of anticancer peptides and non-anticancer peptides using Equation (9). The calculated results indicate that the amino acid compositions of anticancer peptides and non-anticancer peptides differ. Hence the amino acid compositions are suitable as features for distinguishing anticancer peptides from non-anticancer peptides. The different distributions of the amino acid compositions in anticancer peptides and non-anticancer peptides are shown in Fig. 3.
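The amino acid composition of a peptide, and its average over a group of peptides, can be computed as in the following minimal sketch. The function names, the alphabetical ordering of the 20 residues, and the choice to ignore non-standard residue codes are assumptions made for illustration; the example sequences are made up.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 native amino acids, assumed order


def amino_acid_composition(sequence):
    """Return the 20 occurrence frequencies n_i / N for one peptide."""
    sequence = sequence.upper()
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)  # non-standard codes (e.g. X) are not counted
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [count / total for count in counts]


def group_average_composition(sequences):
    """Average the per-peptide compositions over a group of k peptides."""
    compositions = [amino_acid_composition(seq) for seq in sequences]
    k = len(compositions)
    return [sum(column) / k for column in zip(*compositions)]


# Made-up sequences, for illustration only.
single = amino_acid_composition("AAKK")
average = group_average_composition(["AAKK", "AAAA"])
```

Comparing the two group averages residue by residue is what reveals the compositional differences between anticancer and non-anticancer peptides.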
Auto covariance of the average chemical shift (acACS) algorithm. In a predictor, the most important issue is how to extract features from the primary sequences of proteins. To this end, the acACS algorithm was proposed, which uses simple secondary structure information to represent a protein sample [33]. The average chemical shift of a protein has an intrinsic correlation with the protein's secondary structure and function, so the average chemical shift can be expected to carry information about both protein structure and function. Accordingly, the acACS algorithm has been widely applied to the prediction of protein attributes, such as protein submitochondrial localization [34], the subcellular locations of mycobacterial proteins and DNA-binding proteins [35], the identification of acidic and alkaline enzymes [36], and the discrimination between bioluminescent and nonbioluminescent proteins [37]. The acACS algorithm is available from the web server at http://202.207.14.87:8032/fuwu/acacs/index.asp.
For a protein $P = j_1 j_2 \cdots j_L$, where $L$ is the length of the protein sequence and each $j_l$ ($l = 1, 2, \ldots, L$) represents one of the 20 native amino acid residues, $P$ is expressed by the correlation factors $\theta_i(\lambda)$, where $\theta_i(\lambda)$ is the correlation factor of the average chemical shift for $j_l$ with the average chemical shift for $j_{l+\lambda}$ along the protein sequence. The factor $\lambda$ ($0 < \lambda < L$) reflects the rank of correlation. The factor $i$ can be any combination of the chemically shifted atoms, such as $^{15}$N. To obtain the best result, appropriate values for the factors λ and i should be determined according to the prediction results.
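The idea of correlating the average chemical shift at position $l$ with that at position $l + \lambda$ can be illustrated with a generic auto covariance transform. The sketch below assumes the standard mean-centred auto covariance, which may differ in detail from the exact formula of the acACS algorithm. The function names and the layout of `shift_tables` (a mapping from atom name to per-residue average chemical shift) are hypothetical, and no real chemical shift values are included.

```python
def auto_covariance(values, lag):
    """Mean-centred auto covariance of a numeric track at the given lag."""
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(values)")
    mean = sum(values) / n
    total = sum(
        (values[l] - mean) * (values[l + lag] - mean) for l in range(n - lag)
    )
    return total / (n - lag)


def acacs_features(sequence, shift_tables, max_lag):
    """Concatenate auto covariances for lags 1..max_lag over each atom type.

    shift_tables: {atom_name: {residue_letter: average_chemical_shift}},
    a hypothetical input layout assumed for this sketch.
    """
    features = []
    for atom in sorted(shift_tables):
        # Replace each residue by its average chemical shift for this atom.
        track = [shift_tables[atom][residue] for residue in sequence]
        features.extend(
            auto_covariance(track, lag) for lag in range(1, max_lag + 1)
        )
    return features
```

Under this assumed layout, three atom types and a maximum lag of 5 would give 15 values per peptide; the sequence must be longer than the largest lag for the transform to be defined.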

The reduced amino acid composition (RAAC). It has been demonstrated that the patterns of hydrophobic and hydrophilic residues have major significance in defining the global protein structure. To capture the hydropathy characteristics, the amino acids were divided into groups according to the ranges of their individual hydropathies on the hydropathy scale. A protein sequence over the 20 amino acids can therefore be represented by a sequence over 6 characters according to the following scheme: strongly hydrophilic or polar (R, D, E, N, Q, K, H), strongly hydrophobic (L, I, V, A, M, F), weakly hydrophilic or weakly hydrophobic (S, T, Y, W), proline (P), glycine (G) and cysteine (C) [38,39]. The dipeptide composition of the six characters was chosen and represented as follows: