Prediction of HIV-1 and HIV-2 proteins by using Chou’s pseudo amino acid compositions and different classifiers

Human immunodeficiency virus (HIV) is the retroviral agent that causes acquired immune deficiency syndrome (AIDS). The number of HIV-caused deaths was about 4 million in 2016 alone, and an estimated 33 million to 46 million people worldwide were living with HIV. HIV disease is especially harmful because the progressive destruction of the immune system impairs the ability to form specific antibodies and to maintain efficacious killer T cell activity. Successful prediction of HIV proteins is therefore important for understanding their biological and pharmacological functions. In this study, based on the concept of Chou’s pseudo amino acid (PseAA) composition and the increment of diversity (ID), support vector machine (SVM), logistic regression (LR) and multilayer perceptron (MP) classifiers were used to predict HIV-1 proteins and HIV-2 proteins. In the jackknife test, the highest overall accuracy and CC value, 0.9909 and 0.9763, respectively, were obtained by both the SVM and the MP, indicating that the classifiers presented in this study are suitable for predicting the two groups of HIV proteins.


Results
Comparison on 20 amino acid compositions. The amino acid (AA) compositions of protein sequences have been widely used in the classification of various groups of proteins in recent years 28,29,44-47. Some studies indicated that the biological function of a protein depends mainly on its amino acid composition. In this study, the overall frequencies of the 20 amino acids for the 242 HIV-1 proteins and 86 HIV-2 proteins were plotted (Fig. 1). Figure 1 shows that Glu (E), Lys (K), Gln (Q), Arg (R), Ala (A), Ile (I), Leu (L), Val (V), Ser (S), Thr (T), Pro (P) and Gly (G) occurred at high frequencies (frequency > 5%) in both HIV-1 proteins and HIV-2 proteins. To further study the difference in amino acid usage, we compared the percentage of each amino acid between the HIV-1 proteins and the HIV-2 proteins (Table 1). The Wilcoxon tests revealed that the usage frequencies of Arg (R), Phe (F), Ile (I), Val (V), Thr (T), Tyr (Y) and Pro (P) differed significantly between the two groups. Among these amino acids, Arg (R), Ile (I), Val (V), Thr (T) and Pro (P) had high frequencies (frequency > 5%) in both HIV-1 proteins and HIV-2 proteins. In addition to amino acid usage, the protein lengths of the two groups were analyzed (Fig. 2). The median length of the 242 HIV-1 proteins was greater than that of the 86 HIV-2 proteins, and the difference was significant (193 versus 174, P-value = 1.80E-2; Wilcoxon test).
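The frequency comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, the toy sequences are invented, and pooling all residues of a group before computing frequencies is an assumption, since the text does not say whether the overall frequencies were pooled or averaged per protein.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def aa_composition(sequence):
    """Return the fraction of each of the 20 standard amino acids in a sequence."""
    counts = Counter(sequence.upper())
    # Non-standard symbols (for example X) are left out of the denominator.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def frequent_residues(sequences, threshold=0.05):
    """Pool a group of sequences and list residues whose frequency exceeds the threshold."""
    pooled = aa_composition("".join(sequences))
    return sorted(aa for aa, freq in pooled.items() if freq > threshold)


# Toy sequences invented for illustration (not taken from the HIV dataset).
toy_group = ["MKRLLVA", "GGAVKE"]
print(frequent_residues(toy_group))
```

The 5% threshold mirrors the "frequency > 5%" criterion used in the text; on real data the function would be called once per protein group.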
F-scores of 20 amino acid compositions. In this study, the F-scores of the 20 amino acid compositions for HIV-1 proteins and HIV-2 proteins were also calculated to roughly evaluate the differences between the amino acid compositions (Fig. 3). The larger the F-score, the more discriminative the feature is likely to be. As illustrated in Fig. 3, Val (V) was the most discriminative feature, whereas Met (M) was the least discriminative, which agrees with the P-values of the Wilcoxon test for Val (V) and Met (M). We also found that most of the F-scores of the 20 amino acids were low. This is easy to understand, as most of the differences in amino acid usage between HIV-1 proteins and HIV-2 proteins were marginally significant or not significant. We hope that the F-scores illustrated in Fig. 3 may provide some quantitative indices for discriminating HIV-1 proteins from HIV-2 proteins. However, it should be kept in mind that the F-score only roughly estimates the discriminative power of each property, and further investigations will be required to prove the reliability and usefulness of this method.
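The text does not write out the F-score formula. The sketch below assumes the widely used two-class definition: the squared deviations of the two group means from the overall mean, divided by the sum of the two within-group sample variances. The function name and the toy values are hypothetical.

```python
def f_score(pos, neg):
    """F-score of one feature from its values in two groups (assumed definition)."""
    n_pos, n_neg = len(pos), len(neg)
    if n_pos < 2 or n_neg < 2:
        raise ValueError("each group needs at least two values")
    mean_pos = sum(pos) / n_pos
    mean_neg = sum(neg) / n_neg
    mean_all = (sum(pos) + sum(neg)) / (n_pos + n_neg)
    # Numerator: how far each group mean lies from the overall mean.
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    # Denominator: sum of the two within-group sample variances.
    var_pos = sum((x - mean_pos) ** 2 for x in pos) / (n_pos - 1)
    var_neg = sum((x - mean_neg) ** 2 for x in neg) / (n_neg - 1)
    within = var_pos + var_neg
    if within == 0:
        raise ValueError("feature is constant within both groups")
    return between / within


# Toy values invented for illustration: one feature measured in two groups.
print(f_score([1, 2, 3], [4, 5, 6]))
```

Under this definition a feature whose group means coincide scores 0, and the score grows as the means separate relative to the spread within each group.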
Prediction of HIV-1 proteins and HIV-2 proteins by the ID algorithm. In this study, the 20 amino acid compositions, 400 dipeptide compositions, 6 amino acid hydropathy compositions and 36 hydropathy dipeptide compositions were selected as the input parameters of the ID algorithm. The jackknife test was applied to examine the ID algorithm, and its performance for the prediction of HIV-1 proteins and HIV-2 proteins is listed in Table 2. The best predictive results were obtained by selecting the 400 dipeptide compositions as the input parameters. For HIV-1 protein prediction, the jackknife test gave a sensitivity, specificity and CC value of 82.23%, 99.00% and 0.7215, respectively. For HIV-2 protein prediction, the corresponding values were 97.67%, 66.14% and 0.7215.
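The ID formula itself is not given in this section. The sketch below assumes the commonly used definition: a diversity measure D = N ln N - sum(n_i ln n_i) over a vector of counts, and ID(X, S) = D(X + S) - D(X) - D(S) between a query X and a class source S, with the query assigned to the class that gives the smallest ID. The function names and the toy counts are hypothetical.

```python
import math


def diversity(counts):
    """Diversity measure D = N*ln(N) - sum(n_i*ln(n_i)), taking 0*ln(0) as 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(n * math.log(n) for n in counts if n > 0)


def increment_of_diversity(x, s):
    """ID(X, S) = D(X + S) - D(X) - D(S) for two count vectors of equal length."""
    if len(x) != len(s):
        raise ValueError("count vectors must have the same length")
    merged = [a + b for a, b in zip(x, s)]
    return diversity(merged) - diversity(x) - diversity(s)


def predict(x, class_sources):
    """Assign x to the class whose pooled count vector gives the smallest ID."""
    return min(class_sources, key=lambda label: increment_of_diversity(x, class_sources[label]))


# Toy two-feature counts invented for illustration.
sources = {"HIV-1": [30, 10], "HIV-2": [10, 30]}
print(predict([3, 1], sources))
```

Under this definition the ID is zero when the query counts are proportional to the class counts and grows as the two compositions diverge, which is why the smallest ID marks the most similar class.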

Prediction of HIV-1 proteins and HIV-2 proteins by three different classifiers.
In order to improve the prediction accuracy, the SVM, LR and MP were also applied to predict the HIV-1 proteins and HIV-2 proteins. As before, the 20 amino acid compositions, 400 dipeptide compositions, 6 amino acid hydropathy compositions and 36 hydropathy dipeptide compositions were selected as the input parameters of the ID algorithm, and four kinds of ID values were calculated. These four kinds of ID values were combined and used as the input parameters of the SVM, LR and MP. All the predictive results are shown in Table 3. Compared with the ID algorithm alone, the predictive results improved when the ID values were used as the input parameters of the SVM, LR and MP: in general, better sensitivity, accuracy and CC values were obtained for both HIV-1 and HIV-2 protein prediction. Based on the ID values, the 242 HIV-1 proteins and 86 HIV-2 proteins were predicted by the jackknife test. When ID(A2), ID(A1) and ID(H2) were used as the input parameters of the SVM, an overall accuracy of 0.9909 and a CC value of 0.9763 were obtained, the highest in this study. The same prediction results were also obtained by using ID(A2), ID(H1) and ID(H2) as the input parameters of the MP. With ID(A2), ID(A1) and ID(H2) as the input parameters of the SVM, the sensitivity (Sn) and specificity (Sp) in the jackknife test were 99.59% and 99.18% for HIV-1 proteins, and 97.67% and 98.82% for HIV-2 proteins. All of the predictive results presented in Table 3 clearly indicate that the success rates of the SVM, LR and MP were higher than those of the ID algorithm, and that the SVM, LR and MP are suitable for predicting the two groups of HIV proteins.
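The performance measures quoted above can be checked with a short calculation. The confusion counts below are not given in the text: they are back-calculated from the reported SVM percentages and the group sizes (242 and 86), and the definition of specificity as TP/(TP+FP) is likewise an inference, chosen because it is consistent with the reported values. The function name is hypothetical.

```python
import math


def evaluate(tp, fn, fp, tn):
    """Return Sn, Sp (taken here as TP/(TP+FP)), overall accuracy and Matthews CC."""
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    acc = (tp + tn) / (tp + fn + fp + tn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, acc, cc


# Inferred counts with HIV-1 as the positive class:
# 241 of 242 HIV-1 and 84 of 86 HIV-2 proteins correctly assigned.
sn, sp, acc, cc = evaluate(tp=241, fn=1, fp=2, tn=84)
print(f"Sn={sn:.2%} Sp={sp:.2%} Acc={acc:.4f} CC={cc:.4f}")
```

With these inferred counts the function gives 99.59%, 99.18%, 0.9909 and 0.9763, matching the reported SVM values. Swapping the roles of the two classes, `evaluate(84, 2, 1, 241)`, gives the 97.67% and 98.82% reported for HIV-2 proteins.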

Discussion
The amino acid compositions of protein sequences have been widely used in the classification of various groups of proteins in recent years. In this study, we used the amino acid compositions as the input parameters of the increment of diversity (ID) to predict HIV-1 proteins and HIV-2 proteins. Before using these parameters, we wanted to show the differences in the overall frequencies of the 20 amino acids between the 242 HIV-1 proteins and the 86 HIV-2 proteins. The frequencies and P-values of the 20 amino acids for HIV-1 proteins and HIV-2 proteins are therefore given in Table 1.
In this study, the 20 amino acid compositions, 400 dipeptide compositions, 6 amino acid hydropathy compositions and 36 hydropathy dipeptide compositions were selected as the input parameters of the ID algorithm. Table 2 lists the sensitivity, specificity, accuracy and correlation coefficient for predicting the HIV-1 proteins and HIV-2 proteins by the jackknife test. The table clearly shows that the best prediction results were obtained with the 400 dipeptide compositions. We therefore combined the ID values of the 400 dipeptide compositions with the ID values of the three other compositions as the input parameters of the SVM, LR and MP to predict the two groups of HIV proteins. The same four kinds of compositions were used as input parameters in previous work on predicting groups of proteins 27,32,48-52, and the results of that work clearly indicated that better prediction quality was obtained with the 400 dipeptide compositions than with the three other parameters. Compared with the 20 amino acid compositions, which count single residues only, the 400 dipeptide compositions take into account the sequence coupling effect 49 and therefore reflect the structure of a protein sequence more accurately, which can improve prediction quality. Compared with the 6 amino acid hydropathy compositions and 36 hydropathy dipeptide compositions, which have only 6 and 36 features, respectively, the 400 dipeptide compositions have more features and thus contain more information. This may be why better prediction results were obtained with the 400 dipeptide compositions than with the 6 amino acid hydropathy compositions and 36 hydropathy dipeptide compositions.
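The four kinds of compositions discussed above can be illustrated as count vectors of length 20, 400, 6 and 36. This is a sketch under stated assumptions: the six-class hydropathy grouping below is one commonly used scheme and is not listed in the text, the single-letter class labels and function names are hypothetical, and dipeptides are counted as overlapping adjacent pairs.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed six-class hydropathy grouping (the text does not list the classes).
HYDROPATHY_CLASSES = {
    "a": "RDENQKH",  # strongly hydrophilic
    "b": "LIVAMF",   # strongly hydrophobic
    "c": "STYW",     # weakly hydrophilic or weakly hydrophobic
    "d": "P",        # proline
    "e": "G",        # glycine
    "f": "C",        # cysteine
}
TO_CLASS = {aa: label for label, members in HYDROPATHY_CLASSES.items() for aa in members}


def kmer_counts(sequence, alphabet, k):
    """Count overlapping k-mers over an alphabet; windows with other symbols are skipped."""
    counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window in counts:
            counts[window] += 1
    return counts


def feature_sets(sequence):
    """Return the four count vectors (20, 400, 6 and 36 entries) for one sequence."""
    seq = sequence.upper()
    # Translate residues into hydropathy class labels; unknown symbols become "?".
    reduced = "".join(TO_CLASS.get(aa, "?") for aa in seq)
    classes = "".join(HYDROPATHY_CLASSES)
    return {
        "amino_acid": kmer_counts(seq, AMINO_ACIDS, 1),
        "dipeptide": kmer_counts(seq, AMINO_ACIDS, 2),
        "hydropathy": kmer_counts(reduced, classes, 1),
        "hydropathy_dipeptide": kmer_counts(reduced, classes, 2),
    }


# Toy sequence invented for illustration.
features = feature_sets("MKVLAAGP")
print({name: len(vector) for name, vector in features.items()})
```

The sequence coupling effect shows up directly in these vectors: the toy sequences "AG" and "GA" have identical single-residue counts, but the first contributes to the dipeptide AG and the second to GA.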
To compare the SVM, LR and MP with other machine learning algorithms, the naïve Bayes (NB), IBK, J48, random forest (RF) and random tree (RT) classifiers implemented in Weka (version 3.8.0) were used. ID(A2), ID(A1) and ID(H2) were used as the input parameters of these algorithms for predicting the HIV-1 proteins and HIV-2 proteins. The performance of these classifiers was evaluated by jackknife tests, and all the overall accuracies are shown in Fig. 4. As illustrated in this figure, the overall accuracies of the SVM, LR and MP were higher than those of the NB, IBK, J48, RF and RT. Based on this, we conclude that the SVM, LR and MP may be more suitable for predicting HIV-1 proteins and HIV-2 proteins.
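The jackknife test used throughout can be sketched as a leave-one-out loop: each protein is held out once, the classifier is trained on the remaining proteins, and the held-out prediction is scored. The sketch below is generic and does not use Weka. The nearest-centroid classifier is only a stand-in so that the example runs, and all names and toy feature vectors are hypothetical.

```python
def jackknife_accuracy(samples, labels, fit, predict):
    """Leave-one-out test: hold out each sample once and train on all the others."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        correct += predict(model, samples[i]) == labels[i]
    return correct / len(samples)


# Stand-in classifier (nearest class centroid), not one of the classifiers in the study.
def fit_centroids(xs, ys):
    centroids = {}
    for label in set(ys):
        rows = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, x):
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(x, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


# Toy feature vectors invented for illustration (for example, two ID values per protein).
X = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.3], [0.9, 0.8], [1.0, 0.7], [0.8, 0.9]]
y = ["HIV-1", "HIV-1", "HIV-1", "HIV-2", "HIV-2", "HIV-2"]
print(jackknife_accuracy(X, y, fit_centroids, predict_centroid))
```

Passing `fit` and `predict` as arguments keeps the loop independent of the classifier, so the same procedure can score any of the algorithms compared in Fig. 4.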
The successful prediction of HIV-1 proteins and HIV-2 proteins indicates that the algorithms presented in this study are promising approaches. The results above indicate that the 400 dipeptide compositions and the increment of diversity (ID) are suitable for predicting the HIV-1 proteins and HIV-2 proteins. The 400 dipeptide compositions may be used to improve prediction quality: their predictive results were significantly higher than those obtained with the other parameters. This is also evidence that primary sequences contain important information that determines higher-order protein structure. In addition, we found that using the ID values as the input parameters of the SVM, LR and MP reduces the dimension of the input vectors, improves computational efficiency and extracts important classification information. We hope these algorithms will be helpful for the identification of HIV proteins in the future.
With the explosive growth of biological sequences in the post-genomic era, one of the most important but also most difficult problems in computational biology is how to express a biological sequence with a discrete model or a vector while still keeping considerable sequence-order information or key pattern characteristics. This is because all the existing machine-learning algorithms can only handle vectors, not sequence samples. However, a vector defined in a discrete model may completely lose the sequence-pattern information. To avoid this loss for proteins, the pseudo amino acid composition (PseAAC) 54,55 was proposed. Ever since the concept of PseAAC was proposed, it has been widely used in nearly all areas of computational proteomics 56-60.
Encouraged by the successes of using PseAAC to deal with protein/peptide sequences, the concept of PseKNC (Pseudo K-tuple Nucleotide Composition) 61 was developed for generating various feature vectors for DNA/RNA sequences, and it has been found very useful in genome analysis as well 34,62. In particular, a very powerful web-server called 'Pse-in-One' 63 and its updated version 'Pse-in-One2.0' 64 have recently been established; they can be used to generate any desired feature vectors for protein/peptide and DNA/RNA sequences according to the needs of users' studies. As pointed out in the work of Chou and Shen 65 and demonstrated in a series of recent publications 34-43,66, user-friendly and publicly accessible web-servers represent the future direction for developing practically more useful prediction methods and computational tools. Indeed, many practically useful web-servers have increased the impact of the relevant methods on medical science 56, driving medicinal chemistry into an unprecedented revolution 56. We shall therefore make efforts in our future work to provide a web-server for the prediction method presented in this paper.

Materials and Methods
The HIV protein dataset. The dataset was downloaded from Swiss-Prot (version 57.0) (http://www.uniprot.org/) 5. This dataset contained 381 HIV-1 protein sequences and 109 HIV-2 protein sequences. The sequence identity was analyzed with the culling program PISCES (http://dunbrack.fccc.edu/PISCES.php) 67,68, and the distribution of the sequence identity percentages is shown in Table 4. In order to retain a sufficient number of protein sequences, a 90% identity cutoff was used: redundant protein sequences with more than 90% identity were removed with PISCES, leaving HIV-1 and HIV-2 datasets with ≤90% identity. The final HIV-1 dataset consisted of 242 non-redundant protein sequences and the final HIV-2 dataset consisted of 86 non-redundant protein sequences.
Classifiers. In this study, the increment of diversity (ID) 26, support vector machine (SVM) 25, logistic regression (LR) and multilayer perceptron (MP) were used to classify the HIV-1 proteins and HIV-2 proteins. The ID algorithm was implemented in C++, and the SVM, LR and MP algorithms were run in the Weka package 69.
Protein sample representation. Appropriate parameters are also important for the classifiers. Here, the 20 amino acid compositions, 400 dipeptide compositions, 6 amino acid hydropathy compositions and 36 hydropathy dipeptide compositions were selected as the input parameters of the ID algorithm 44,45.
Statistical analysis. In this study, the F-score 70 was used to quantify the observed difference between the 20 amino acid compositions of the HIV-1 proteins and those of the HIV-2 proteins. The Wilcoxon rank-sum test was carried out to calculate the P-values for the 20 amino acid compositions between the two HIV protein groups. A difference was considered significant if the P-value was < 0.05.
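The rank-sum comparison can be sketched as follows. The text does not say which software or which variant of the test was used, so this is an assumption-laden illustration: it uses the normal approximation with a tie correction and no continuity correction, which is reasonable for groups as large as 242 and 86 but not for very small samples. The function name and the toy values are hypothetical.

```python
import math


def rank_sum_test(a, b):
    """Two-sided Wilcoxon rank-sum test via the normal approximation; returns (z, p)."""
    n1, n2 = len(a), len(b)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups must be non-empty")
    n = n1 + n2
    # Pool the values, remembering which group (0 or 1) each one came from.
    pooled = sorted((value, group) for group, values in enumerate((a, b)) for value in values)
    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the ranks i+1 .. j shared by tied values
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    mean = n1 * (n + 1) / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return 0.0, 1.0  # all values identical, so there is no evidence of a difference
    z = (rank_sum_a - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail area of the standard normal
    return z, p


# Toy frequencies of one amino acid in two groups, invented for illustration.
z, p = rank_sum_test([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
print(z, p, p < 0.05)
```

Applied once per amino acid, with the per-protein frequencies of the two groups as `a` and `b`, this yields one P-value per residue that can be compared against the 0.05 threshold stated above.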