Identification of S-nitrosylation sites based on multiple features combination

Protein S-nitrosylation (SNO) is a typical reversible, redox-dependent post-translational modification that involves the covalent attachment of nitric oxide (NO) to the thiol group of cysteine residues. Numerous experiments have shown that SNO plays a major role in cell function and pathophysiology. In order to rapidly analyze large datasets, computational methods for identifying SNO sites are considered necessary auxiliary tools. In this study, multiple features, including Parallel correlation pseudo amino acid composition (PC-PseAAC), Basic kmer1 (kmer1), Basic kmer2 (kmer2), General parallel correlation pseudo amino acid composition (PC-PseAAC_G), Adapted Normal distribution Bi-Profile Bayes (ANBPB), Double Bi-Profile Bayes (DBPB), Bi-Profile Bayes (BPB), Incorporating Amino Acid Pairwise (IAAPair) and Position-specific Tri-Amino Acid Propensity (PSTAAP), were employed to extract sequence information. To remove information redundancy, information gain (IG), defined as the information entropy of the class minus the conditional entropy given an amino acid, was applied to evaluate the importance of amino acids. The prediction performance for SNO sites was evaluated by cross-validation and independent tests. In addition, we calculated four commonly used performance measurements, i.e. Sensitivity (Sn), Specificity (Sp), Accuracy (Acc), and the Matthews correlation coefficient (MCC). For the training dataset, the overall Acc was 83.11% and the MCC was 0.6617; for an independent test dataset, Acc was 73.17% and MCC was 0.3788. The results indicate that our method is likely to complement existing prediction methods and is a useful tool for the effective identification of SNO sites.

Feature selection via IG. To further improve the prediction performance, the features were optimized using the above-mentioned IG optimization method. Four types of features, PC-PseAAC, kmer1, kmer2, and PC-PseAAC_G, are mainly related to the frequency of amino acids and are independent of position in the protein sequence. Hence, we optimized these four types of features based on the IG scores of the amino acid residues. Firstly, we sorted the amino acid composition (AAC) and the amino acid pair composition (i.e. kmer2) by IG score, and then applied incremental feature selection to find the feature subset that maximizes prediction performance. According to the final performance evaluation, the effect of the IG score was especially pronounced for kmer2. The detailed prediction performances for different numbers of combined features under 10-fold cross-validation are shown in Fig. 1. When the dimension of the feature vector was set to 180, the predictive performance achieved its highest value, with Sn of 72.79%, Sp of 74.64%, Acc of 73.71%, and MCC of 0.4741. However, there was no obvious improvement for the other three feature types, PC-PseAAC, kmer1, and PC-PseAAC_G. This could be due to the low dimensions of these three feature types (less than 50). In contrast, the dimension of kmer2 was 400 and its feature matrix was extremely sparse, so IG produced a clear benefit. The IG score rankings of the amino acid residues and dipeptides are displayed in Fig. 2 and Fig. 3, respectively, and the detailed results are shown in Supplementary Table S2. It is noteworthy that the amino acid residues K, M, and C and the dipeptides MG, VK, and ML contributed greatly to the prediction performance. Fig. 2 and Fig. 3 show that the highest IG scores reached 0.0156, 0.0112 and 0.0043 for the amino acid residues K, M, and C, respectively, while the highest IG scores reached 0.0062, 0.0057 and 0.0047 for the dipeptides MG, VK, and ML, respectively. Before feature selection, the prediction performance was Sn of 73.46%, Sp of 74.94%, and Acc of 74.28%. After removing the irrelevant features and determining the optimal combination of features, we obtained the best prediction performance, with Sn of 73.60%, Sp of 75.93% and Acc of 74.82%, respectively. All three measurements improved slightly, but the prediction performance was still not satisfactory, and we will make further improvements in future work. The results for the best predictive performance are shown in Table 2. An improved predictive Acc was seen for the models trained with the optimized features compared with the model using non-optimized features. As given in Supplementary Table S2, 425 features were regarded as the optimal feature set for the selected model. Based on these 425 features, the predictive Sn, Sp, and Acc were 73.60%, 75.93% and 74.82%, respectively. These results indicate that the key amino acid residues and key dipeptides used in optimizing the models can enhance the prediction performance for SNO sites. Consequently, the features combined with the key amino acid residues were applied to implement a novel and high-performance tool for identifying cysteine S-nitrosylated sites.
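The IG-ranked incremental feature selection described above can be sketched as follows. This is a minimal illustration, not the authors' code: the SVM classifier, the step size, and the function names are our own assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def incremental_feature_selection(X, y, ig_scores, step=10):
    """Sort features by IG score (descending), then grow the subset in
    `step`-sized increments and keep the size that maximizes 10-fold
    cross-validated accuracy."""
    order = np.argsort(ig_scores)[::-1]        # highest-IG feature first
    best_acc, best_k = 0.0, 0
    for k in range(step, X.shape[1] + 1, step):
        acc = cross_val_score(SVC(), X[:, order[:k]], y, cv=10).mean()
        if acc > best_acc:
            best_acc, best_k = acc, k
    return order[:best_k], best_acc
```

Applied to the 400-dimensional kmer2 matrix, this kind of sweep would evaluate subsets of increasing size and, per Fig. 1, peak at 180 features.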
Comparison with other feature selection methods. In this paper, different feature selection methods were compared. We evaluated the performance of IG against Max-Relevance-Max-Distance (MRMD)31, another feature selection method. MRMD contains two components: max distance and max relevance. Max distance selects the new feature with the least redundancy among the remaining features, while max relevance selects the feature with the strongest relevance to the target class.
We used the four distance methods of MRMD, namely ED, COS, TC and Mean, to find the best feature vector combination through 10-fold cross-validation. The detailed predictive performances are shown in Fig. 4. When the distance function ED was adopted, the best Acc achieved was 73.32% with 356 features. When the distance functions were COS, TC and Mean, the predictive performance was highest with 397, 398 and 43 features, respectively, with corresponding Acc values of 73.14%, 73.35% and 73.34%. Given that the total dimension of the feature vector is 400, the effect of dimension reduction is not obvious when the distance functions ED, COS and TC are used (the predictive performance is best with 356, 397 and 398 features, respectively). However, the effect of dimension reduction is prominent when the Mean distance function is used (the predictive performance is best with 43 features), which causes a great deal of information in the feature vector to be lost. The best performances for the different feature selection methods are listed in Supplementary Tables S8-S11. From Fig. 4A, we can see that although the performances of IG and the four variants of MRMD are almost identical on the same datasets, the Acc of IG is generally higher than that of the MRMD variants ED, COS, TC and Mean. Moreover, IG is better able to reduce the dimensionality of high-dimensional feature vectors while maintaining high Acc.

Comparison with other methods. To make a fair and fast comparison, we compared the prediction performance of our predictor with GPS-SNO2, iSNO-PseAAC20, iSNO-ANBPB21, PSNO22, and iSNO-AAPair19 on the Xu training dataset by running the 10-fold cross-validation test 10 times. The results are shown in Table 3.
Our constructed model exhibits the best prediction performance, with an Acc of 83.11%, which is 1.41% higher than the previous best-performing predictor, iSNO-AAPair, and 7.44% higher than the Acc achieved by PSNO. Our predictor also gave an MCC of 0.6617, which is 0.0317 higher than that of iSNO-AAPair and 0.1498 higher than that of PSNO. Furthermore, the Sn of our predictor was 83.33%, which is 3.73% higher than the Sn of iSNO-AAPair and 9.18% higher than that of PSNO. This comparison indicates that the proposed model is indeed promising and could at least play a role that complements the existing state-of-the-art methods in this field. In addition, we compared the predictive power of our model with those of SNOSite32, iSNO-AAPair19, iSNO-PseAAC20, and iSNO-ANBPB21 on the Li test dataset; we also compared our model with the GPS-SNO2, iSNO-PseAAC20, iSNO-AAPair19, and PSNO22 methods on the Xu test dataset. The performances of the above-mentioned models on the two test datasets are summarized in Supplementary Tables S3 and S4. On the Li independent test dataset, our model captured proteins O00429 (site 367), P13221 (site 83), and P43235 (site 139) as S-nitrosylation sites, while the methods iSNO-AAPair and iSNO-PseAAC incorrectly predicted them as non-S-nitrosylation sites. On the Xu independent test dataset, our model captured proteins O70572 (site 176), P51174 (site 342), Q8VDG5 (site 308), Q9WVQ5 (site 146), and P55060 (site 344) as S-nitrosylation sites, while the models iSNO-PseAAC and GPS-SNO incorrectly predicted these S-nitrosylation sites as non-S-nitrosylation sites. To show the prediction results clearly, we summarize the Sn, Sp, Acc and MCC achieved by each model in Table 4. As can be seen, our predictor achieved Sn of 60.47%, Sp of 77.69% and Acc of 73.17% on the Li test dataset.
Among the other five methods, the best prediction performance was achieved by the method of Li et al., with Sn of 51.16%, Sp of 69.42% and Acc of 64.63%; our method is clearly superior on this dataset. However, on the Xu test dataset, our predictor achieved Sn of 64.20%, Sp of 75.00%, and Acc of 70.17%, which is only better than iSNO-PseAAC (Sn of 50.2%, Sp of 75.1% and Acc of 62.8%). The results show that our predictor outperformed previous methods in terms of precision, but the results on the Xu test set are not ideal, which may be caused by not considering physicochemical properties. In future work, we will consider more comprehensive features and further optimize the feature combination approaches.
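The four measurements used throughout these comparisons follow directly from the confusion-matrix counts; a small helper (the function name is our own). For example, on the Li test set, Sn = 60.47% and Sp = 77.69% correspond to 26/43 true positives and 94/121 true negatives.

```python
def performance_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy and Matthews correlation
    coefficient from confusion-matrix counts."""
    sn = tp / (tp + fn)                                  # sensitivity
    sp = tn / (tn + fp)                                  # specificity
    acc = (tp + tn) / (tp + fp + tn + fn)                # accuracy
    mcc = (tp * tn - fp * fn) / (
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return sn, sp, acc, mcc
```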

Conclusion
The prediction of SNO sites is essential for a better understanding of basic biological theory, clinical diagnosis, and pharmaceuticals. In this study, we introduce IG as a tool for analyzing the importance of the amino acids and positions used in feature extraction. Here, we focus on the characteristics of the amino acids within the sequence. Four out of the nine feature types in the combination are related to amino acid residues. These four feature types were screened using the IG method, and the best dimension of the feature vector was selected. Among these features, 180 important features were screened from kmer2, whose dimension was reduced from 400 to 180, giving the best prediction performance: Sn and Acc reached 83.33% and 83.11%, respectively. Theoretically, the protein structure carries much more information than the simple sequence, and this will be considered in future work. With the development of the internet and the coming of the big data era, constructing databases33-40 and establishing powerful webservers are the direction of bioinformatics, making these tools convenient for most experimental scientists.

Material and Methods
Datasets. The datasets were constructed from those of Li et al.22 and Xu et al.19,20 (henceforth named the Li dataset and Xu dataset, respectively). As described previously19,20,24, these datasets were derived on the basis of experimentally verified protein S-nitrosylation sites. The Xu training dataset consisted of 731 SNO sites as positive samples and 810 non-SNO cysteine sites as negative samples, drawn from 438 proteins with ≤40% sequence similarity. These samples were used for training our prediction model. The Xu test dataset consisted of 81 SNO sites and 100 non-SNO sites, and the Li test dataset included 43 SNO sites and 121 non-SNO sites. In this study, the Xu and Li test datasets were applied to test the prediction performance of our model.
Consider a protein peptide sample P in our datasets, which can be generally formulated as

P = R_{-t} R_{-(t-1)} … R_{-1} C R_{1} … R_{t-1} R_{t}    (1)

where the subscript t is an integer, R_{-t} is the t-th upstream amino acid residue from the cysteine (C), R_{t} is the t-th downstream residue, and so forth. The peptide was termed an SNO or non-SNO peptide depending on whether its center is an SNO or non-SNO site, respectively; P therefore belongs to one of two categories, viz. SNO sites (positive data) or non-SNO sites (negative data). In the current study, we selected t = 10. If the upstream or downstream flank in a protein contained fewer than 10 residues, the lacking residues were filled with the dummy code X. Thus, the training dataset S was formulated as (∪: the union of datasets in set theory)

S = S⁺ ∪ S⁻

where the positive dataset S⁺ consisted of 731 SNO cysteine sites, while the negative dataset S⁻ contained 810 non-SNO cysteine sites. The test datasets T_Li and T_Xu were formulated as

T_Li = T_Li⁺ ∪ T_Li⁻,    T_Xu = T_Xu⁺ ∪ T_Xu⁻

where the positive datasets T_Li⁺ and T_Xu⁺ contained 43 and 81 SNO peptide fragments, respectively, while the negative datasets T_Li⁻ and T_Xu⁻ contained 121 and 100 non-SNO peptide fragments, respectively. For the reader's convenience, the three datasets used in this study are given in Supplementary Tables S5-S7. The schematic flowchart of our work is shown in Fig. 5.
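The 21-residue window construction (t = 10, X-padding) can be sketched as follows; the function name and 0-based indexing are our own choices.

```python
def extract_peptide(seq, pos, t=10):
    """Return the (2t+1)-residue peptide centred on the cysteine at
    0-based position `pos`; flanks shorter than t residues are padded
    with the dummy code 'X'."""
    assert seq[pos] == 'C', "SNO sites are cysteine residues"
    left = seq[max(0, pos - t):pos]        # up to t upstream residues
    right = seq[pos + 1:pos + 1 + t]       # up to t downstream residues
    return 'X' * (t - len(left)) + left + 'C' + right + 'X' * (t - len(right))
```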

Features extraction. Parallel correlation pseudo amino acid composition (PC-PseAAC). PC-PseAAC41 is a feature extraction approach that incorporates the contiguous local and the global sequence-order information to obtain the feature vector of a protein sequence. Given a protein peptide P (Eq. 1), the PC-PseAAC feature vector of P is given by

P = [x_1, x_2, …, x_20, x_{20+1}, …, x_{20+λ}]^T

with

x_u = f_u / (Σ_{i=1}^{20} f_i + w Σ_{j=1}^{λ} Θ_j),   1 ≤ u ≤ 20
x_u = w Θ_{u-20} / (Σ_{i=1}^{20} f_i + w Σ_{j=1}^{λ} Θ_j),   20+1 ≤ u ≤ 20+λ

where f_i (i = 1, 2, …, 20) is the normalized occurrence frequency of the 20 amino acids in the protein P; the parameter λ is an integer representing the highest counted rank (or tier) of the correlation along the protein sequence; w is the weight factor, ranging from 0 to 1; and Θ_j (j = 1, 2, …, λ) is the j-tier correlation factor reflecting the sequence-order correlation between all the j-th most contiguous residues along the protein chain, defined as

Θ_j = (1/(L − j)) Σ_{i=1}^{L−j} Θ(R_i, R_{i+j}),   j = 1, 2, …, λ

where the correlation function is given by

Θ(R_i, R_j) = (1/3) {[H_1(R_j) − H_1(R_i)]² + [H_2(R_j) − H_2(R_i)]² + [M(R_j) − M(R_i)]²}    (7)

and H_1(R_i), H_2(R_i) and M(R_i) are the hydrophobicity value, hydrophilicity value, and side-chain mass, respectively, of the amino acid R_i. It should be noted that before substituting the values of hydrophobicity, hydrophilicity, and side-chain mass into Eq. 7, they are all subjected to a standard conversion: each index is centered by its mean over the 20 native amino acids and divided by the corresponding standard deviation.

General parallel correlation pseudo amino acid composition (PC-PseAAC_G). The PC-PseAAC_G approach42 not only incorporates the comprehensive built-in indices extracted from the AAindex43, but also allows users to upload their own indices to generate the PC-PseAAC_G feature vector. For a given protein peptide P (Eq. 1), the PC-PseAAC_G feature vector of P takes the same form as above, where f_i (i = 1, 2, …, 20) is the normalized occurrence frequency of the 20 amino acids in the protein P; λ is the highest counted rank (or tier) of the correlation along the protein sequence; w is the weight factor ranging from 0 to 1; and Θ_j (j = 1, 2, …, λ) is the j-tier correlation factor. In this case, the correlation function is given by

Θ(R_i, R_j) = (1/μ) Σ_{u=1}^{μ} [H_u(R_j) − H_u(R_i)]²    (14)

where μ is the number of physicochemical indices considered; H_u(R_i) is the u-th physicochemical index value of the amino acid R_i, and H_u(R_j) is the u-th physicochemical index value of the amino acid R_j. It should be noted that before substituting the physicochemical index values into Eq. 14, they were likewise subjected to the standard conversion described above.

Basic kmer (kmer). Basic kmer44 is the simplest approach to represent proteins as numerical vectors, in which a protein sequence is represented by the occurrence frequencies of k neighboring amino acids45. Given a protein sequence P (Eq. 1), the kmer feature vectors of P are formulated as

P_kmer1 = [x_1, x_2, …, x_20]^T,   P_kmer2 = [y_1, y_2, …, y_400]^T

where x_i and y_i are the normalized occurrence frequencies of the 20 amino acid residues and the 400 dipeptides in the protein P, respectively.
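A minimal sketch of the basic kmer encoding (kmer1 gives the 20 amino-acid frequencies, kmer2 the 400 dipeptide frequencies). Skipping windows that contain the dummy code X is our own assumption, not stated in the original.

```python
from itertools import product

AA = "ACDEFGHIKLMNPQRSTVWY"

def kmer_features(peptide, k=1):
    """Normalized occurrence frequencies of all 20**k length-k words:
    k=1 -> 20-dimensional kmer1, k=2 -> 400-dimensional kmer2."""
    words = [''.join(w) for w in product(AA, repeat=k)]
    counts = dict.fromkeys(words, 0)
    total = 0
    for i in range(len(peptide) - k + 1):
        w = peptide[i:i + k]
        if w in counts:                    # windows containing 'X' are skipped
            counts[w] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in words]
```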
Bi-Profile Bayes (BPB). BPB26 comprehensively considers the information contained in both the positive and negative samples; it has been successfully applied in many fields of bioinformatics and has produced effective predictions26,46-48. Given a protein peptide P (Eq. 1), the BPB feature vector of P is defined as

P = (p_1, p_2, …, p_n, p_{n+1}, …, p_{2n})^T

where p_1, …, p_n are the posterior probabilities of each amino acid at each position in the positive dataset, p_{n+1}, …, p_{2n} are the corresponding posterior probabilities in the negative dataset, and n is the peptide length.

Double Bi-Profile Bayes (DBPB). DBPB is an improvement of BPB that was proposed by Shao et al.24. As mentioned above, BPB uses the posterior probability of each single amino acid at each position in the positive and negative datasets, while DBPB uses the posterior probability of each pair of adjacent amino acids at each position in the positive and negative datasets. Given a protein sequence P (Eq. 1), the DBPB feature vector of P is defined analogously, where V is the posterior probability vector of adjacent amino acid pairs.

Adapted Normal distribution Bi-Profile Bayes (ANBPB). ANBPB21,49 improves BPB in another respect. Given a protein sequence P (Eq. 1), the ANBPB feature vector of P is defined with the posterior probabilities transformed by ϕ(x), the standard normal distribution function; the detailed description of the formula is given in refs 21,49.
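The BPB idea can be sketched as follows. Estimating the position-specific posterior probabilities as plain relative frequencies over the training peptides is a simplifying assumption of this sketch (the published method may estimate the posteriors differently), and the function names are our own.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"

def position_profile(peptides):
    """Position-specific residue frequencies from equal-length training
    peptides (rows: positions, columns: the 20 residues)."""
    L = len(peptides[0])
    prof = np.zeros((L, 20))
    for p in peptides:
        for i, a in enumerate(p):
            if a in AA:                    # dummy 'X' residues are ignored
                prof[i, AA.index(a)] += 1
    return prof / max(len(peptides), 1)

def bpb_vector(peptide, pos_prof, neg_prof):
    """BPB feature: probability of each observed residue at each position
    under the positive profile, then under the negative profile (2L dims)."""
    cols = [AA.index(a) if a in AA else None for a in peptide]
    take = lambda prof: [prof[i, j] if j is not None else 0.0
                         for i, j in enumerate(cols)]
    return take(pos_prof) + take(neg_prof)
```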
Incorporating Amino Acid Pairwise (IAAPair). In IAAPair, the posterior probability of every two adjacent amino acids and every two next-nearest amino acids at each position in the positive peptide sequence dataset is subtracted from that in the negative peptide sequence dataset19. Given a protein sequence P (Eq. 1), the IAAPair feature vector of P is defined accordingly.

Information gain (IG). P(x_i | y_j) is the posterior probability of X taking the value x_i given the value y_j of Y. The amount by which the entropy of X decreases reflects the additional information about X provided by Y and is called the information gain:

IG(X | Y) = H(X) − H(X | Y)    (34)

According to this measure, Y has a stronger correlation with X than Z does if IG(X | Y) > IG(X | Z). Clearly, Y represents the amino acid type when extracting the IG score for positions; on the other hand, Y represents the amino acid frequency when extracting the IG score for amino acids.
Calculating the IG scores of positions and amino acid residues: (1) The importance of positions: the 20 amino acid residues (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y) were coded into digits from 1 to 20, and each query sequence segment was coded into a digital sequence. (2) The importance of amino acid residues: the amino acid frequencies in the sequence surrounding the query site (the site itself not counted) were calculated, and each query sequence was coded into a 20-dimensional feature vector. (3) The IG score for positions was calculated from the coding in (1), and the IG score for amino acid residues from the coding in (2). We then ranked the positions and amino acid residues by their IG scores and selected the key positions and key amino acid residues.
In this work, we used the IG score to calculate the importance of amino acid residues. Equation (36) is divided into two parts: the former is the entropy H(X) of the class X, and the latter is the conditional entropy H(X | Y) of X given an amino acid Y. Suppose the number of training samples is N. Initially, we count over each training sample; subsequently, each occurrence of a characteristic y_j in a training sample x is counted once.

Max-Relevance-Max-Distance (MRMD). MRMD31 is a feature selection method for reducing dimensionality, which can be divided into two aspects.
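A self-contained sketch of the IG computation, IG(X|Y) = H(X) − H(X|Y), over discretized feature values (the function names are our own):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(X) of a sequence of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(X|Y): class entropy minus the conditional entropy of the class
    given each observed value of the feature."""
    n = len(labels)
    h_cond = 0.0
    for v in set(feature_values):
        sub = [x for f, x in zip(feature_values, labels) if f == v]
        h_cond += len(sub) / n * entropy(sub)
    return entropy(labels) - h_cond
```

A feature that perfectly separates the two classes attains an IG equal to H(X), while a feature independent of the class attains an IG of 0; ranking features by this score gives the ordering used in the incremental selection above.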
(1) One aspect is the relevance between the sub-feature set and the target class. Here, Pearson's correlation coefficient is exploited to measure the relevance: as Pearson's correlation coefficient increases, the relevance between the feature and the target class also increases. The features with the largest sum of relevance and distance are chosen as the ultimate sub-feature set. Finally, the sub-feature set generated by MRMD has low redundancy and strong relevance with the target class.
In order to describe the algorithm clearly, we define some notation below. Given an input dataset with N instances, M features F = {f_i, i = 1, …, M} and the target class C, the aim is to find a subspace of the M-dimensional original feature space that makes the greatest contribution to classifying the target class C.

Max-relevance (MR).
Making the greatest contribution to classifying the target class usually requires maximal relevance to the target class C on the subspace, which means selecting a feature set with the highest relevance to the target class C. Pearson's correlation coefficient measures both positive and negative correlation; because it is suitable for continuous variables and easy to implement, it is adopted as the measure of relevance between a feature and the target class C.
The value of MR for feature i is defined as MR_i = |PCC(F_i, C)|, where F_i is the vector composed of the i-th feature of each instance and C is the vector whose elements are the target-class values of each instance; PCC(F_i, C) is their Pearson's correlation coefficient,

PCC(F_i, C) = cov(F_i, C) / (σ_{F_i} · σ_C)

Max-Distance (MD). MRMD realizes max distance through a distance function that measures the level of similarity between two feature vectors. Three types of distance functions can be chosen: Euclidean distance, cosine similarity, and the Tanimoto coefficient. Euclidean distance is the easiest to calculate; compared with it, cosine similarity focuses on the angle between two vectors. The Tanimoto coefficient is also called the Jaccard coefficient in the broad sense, and under binary conditions it reduces to the Jaccard coefficient. For each feature, the value of its distance is based on the three types of distance functions mentioned above: for feature i (1 ≤ i ≤ M) we obtain the values ED_i, COS_i and TC_i, respectively. By MD, we obtain the top m features, which are considered the sub-feature set with minimal redundancy.
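The three candidate distance functions can be written compactly. Note that cosine similarity and the Tanimoto coefficient are similarity measures (larger means more similar), whereas Euclidean distance grows with dissimilarity; the function names in this sketch are our own.

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def tanimoto(a, b):
    """Generalized Tanimoto (extended Jaccard) coefficient; for binary
    vectors it coincides with the Jaccard coefficient."""
    ab = float(np.dot(a, b))
    return ab / (float(np.dot(a, a)) + float(np.dot(b, b)) - ab)
```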
MRMD. The criterion combining the two constraints above is called "Max-Relevance-Max-Distance" (MRMD). After the above preparations, we can start to select the feature subspace; the algorithm optimizes the combined condition of relevance and distance. For a specific problem, the feature selection condition may need to weight MR and MD differently; therefore, the weight variables w_r