Characterization and Identification of Lysine Succinylation Sites based on Deep Learning Method

Succinylation is a type of protein post-translational modification (PTM) that plays important roles in a variety of cellular processes. Due to the increasing number of site-specific succinylated peptides obtained from high-throughput mass spectrometry (MS), various tools have been developed for computationally identifying succinylation sites on proteins. However, most of these tools predict succinylation sites based on traditional machine learning methods. Hence, this work aimed to carry out succinylation site prediction based on a deep learning model. The abundance of MS-verified succinylated peptides enabled the investigation of the substrate site specificity of succinylation sites through sequence-based attributes, such as position-specific amino acid composition, the composition of k-spaced amino acid pairs (CKSAAP), and the position-specific scoring matrix (PSSM). Additionally, maximal dependence decomposition (MDD) was adopted to detect the substrate signatures of lysine succinylation sites by dividing all succinylated sequences into several groups with conserved substrate motifs. According to the results of ten-fold cross-validation, the deep learning model trained using PSSM and informative CKSAAP attributes reached the best predictive performance and also performed better than traditional machine-learning methods. Moreover, an independent testing dataset that did not overlap with the training dataset was used to compare the proposed method with six existing prediction tools. The testing dataset comprised 218 positive and 2621 negative instances, and the proposed model yielded a promising performance with 84.40% sensitivity, 86.99% specificity, 86.79% accuracy, and an MCC value of 0.489. Finally, the proposed method has been implemented as a web-based prediction tool (CNN-SuccSite), which is freely accessible at http://csb.cse.yzu.edu.tw/CNN-SuccSite/.

where n_x(k) represents the number of occurrences of a specific amino acid k. Referring to the method of the positional weight matrix (PWM) of amino acids around sulfation sites 43, the position-specific amino acid composition (PspAAC) around the succinylated sites was determined using the non-homologous training datasets. The PspAAC specifies the relative frequency of the twenty amino acids at each position surrounding the succinylation sites and was utilized to encode the fragment sequences. A matrix of m × w elements was used to represent the PspAAC of a training dataset, where m stands for the 20 types of amino acids and w is the window size ranging from −15 to +15; the matrix thus contains 20 × 30 features.

Composition of k-spaced amino acid pairs (CKSAAP). The CKSAAP encoding has been extensively applied in analyses of protein functions 28,33,41,44-49. This study transformed all training sequences into numeric vectors based on the CKSAAP encoding method. Given k values ranging from zero to five, the number of occurrences of each k-spaced amino acid pair (KSAAP) can be determined from the target sequences. If k is set as one, [A_i x A_j] represents the pair of amino acids A_i and A_j (i and j = 1, ..., 20, corresponding to the 20 amino acids) separated by one residue x of any amino acid type; if k is set as two, the pair is written as [A_i xx A_j], and so on. To identify the difference in occurrence frequency of a KSAAP between positive and negative sequences, the diversity of, for instance, a one-spaced AAP [A_i x A_j] can be obtained from:

C([A_i x A_j]) = f_pos([A_i x A_j]) − f_neg([A_i x A_j])

where f_pos and f_neg denote the occurrence frequencies of the pair in the positive and negative datasets, respectively. A larger positive value of C([A_i x A_j]) indicates a more significant attribute in the positive dataset; conversely, a smaller negative value of C([A_i x A_j]) reveals a more abundant attribute in the negative dataset. Among a total of 2400 KSAAPs (20 × 20 pairs for each of the six spacer lengths), we utilized a feature selection approach, minimum redundancy-maximum relevance (mRMR), to generate an index score for each KSAAP 50. A KSAAP with minimum redundancy and maximum relevance was regarded as the best attribute for classifying succinylated and non-succinylated sequences.
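As a concrete sketch of the CKSAAP counting described above (the function and variable names here are illustrative, not taken from the published implementation):

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def cksaap(seq, k_max=5):
    """Count k-spaced amino acid pairs in a sequence window.

    For each spacer length k = 0..k_max, every ordered pair (a, b)
    separated by exactly k residues is counted, giving
    20 * 20 * (k_max + 1) features (2400 for k_max = 5).
    """
    features = {}
    for k in range(k_max + 1):
        # initialize all 400 ordered-pair counts for this spacer length
        counts = {(a, b): 0 for a, b in product(AMINO_ACIDS, repeat=2)}
        for i in range(len(seq) - k - 1):
            pair = (seq[i], seq[i + k + 1])
            if pair in counts:  # skip non-standard residues
                counts[pair] += 1
        for (a, b), c in counts.items():
            features[f"{a}{'x' * k}{b}"] = c
    return features

feats = cksaap("AKAKA", k_max=1)
```

Running over k = 0..5 for the 20 standard residues yields the 2400-dimensional vector mentioned above; dividing each count by the number of windows would give the occurrence frequencies used in the diversity measure.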
The scoring function of mRMR was described as:

max over x_j in S_n of [ M(x_j, c) − (1/m) Σ_{x_i in S_m} M(x_j, x_i) ]

in which S_m and S_n = S − S_m are the already-selected and candidate attribute sets drawn from the full attribute set S (m and n are their sizes), and c is a classification variable with two possible classes. Additionally, the mutual information M(x, y) was defined as:

M(x, y) = ∫∫ p(x, y) log [ p(x, y) / (p(x) p(y)) ] dx dy

where p(x, y), p(x), and p(y) are the joint and marginal probability density functions of attributes x and y. In addition, sequential forward selection (SFS) was employed to select a final set of the 400 most discriminating KSAAPs according to the ranking of mRMR index scores.
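A minimal sketch of greedy mRMR selection for discrete features follows; the helper names and toy data are illustrative assumptions, since the paper's actual implementation is not shown:

```python
import math
from collections import Counter

def mutual_info(xs, ys):
    """Discrete mutual information M(x, y) in nats."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def mrmr_select(features, labels, n_select):
    """Greedily pick the candidate maximizing relevance M(x, c)
    minus mean redundancy with the already-selected attributes."""
    selected, candidates = [], list(features)
    while candidates and len(selected) < n_select:
        def score(name):
            relevance = mutual_info(features[name], labels)
            if not selected:
                return relevance
            redundancy = sum(mutual_info(features[name], features[s])
                             for s in selected) / len(selected)
            return relevance - redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

labels = [0, 0, 1, 1]
features = {"good": [0, 0, 1, 1], "noise": [0, 1, 0, 1]}
picked = mrmr_select(features, labels, n_select=2)
```

On the toy data the feature identical to the labels is picked first, since the first pick carries no redundancy penalty and that feature has maximal relevance.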
Position specific scoring matrix (PSSM). From a structural viewpoint, several amino acid residues can be mutated without changing a protein's tertiary structure, and two proteins may have similar structures with different amino acid compositions 51. PSSM profiles, which have been extensively utilized in protein secondary structure prediction, subcellular localization, and other bioinformatics analyses 51-54, were therefore adopted herein. As presented in Fig. 2, the PSSM profile of each training sequence was generated by performing PSI-BLAST 55 against the database of non-homologous succinylated peptides. The PSSM profile is a matrix with w × m elements, where w stands for the sequence length (positions −15 to +15) and m represents the 20 types of amino acids; the matrix is row-centered at the modified site.
Then, the w × m matrix was transformed into a matrix with 20 × 20 features S_x(i, j), where i and j range from 1 to 20, by summing up the rows corresponding to the same type of amino acid i.
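The row-summing step, together with the scaling and sigmoid normalization applied afterwards, might look roughly like this (a numpy-based sketch; function and argument names are illustrative):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def condense_pssm(pssm, sequence):
    """Condense a w x 20 PSSM profile into a 20 x 20 feature matrix.

    Rows whose window residue is the same amino acid type are summed,
    then every element is divided by the window length w and squashed
    into (0, 1) with a sigmoid.
    """
    w = len(sequence)
    s = np.zeros((20, 20))
    for row, residue in zip(pssm, sequence):
        s[AMINO_ACIDS.index(residue)] += row  # sum rows of the same residue type
    return 1.0 / (1.0 + np.exp(-s / w))       # divide by w, then sigmoid

out = condense_pssm(np.zeros((5, 20)), "AKAKA")
```

Amino acid types absent from the window simply leave their row at zero, which the sigmoid maps to 0.5.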
Finally, each element in the 20 × 20 matrix was divided by the window length w and then normalized using a sigmoid function:

S'_x(i, j) = 1 / (1 + e^(−S_x(i, j)/w))

Characterization of substrate site signatures. To investigate the substrate-site specificity of succinylated sites, the maximal dependence decomposition (MDD) 40 was employed to divide the positive training sequences into several groups with potentially conserved motifs. MDD has been reported to enhance the predictive effectiveness of computationally identifying substrate sites for different PTM types 31,35,56. For this purpose, a chi-squared test χ²(P_i, P_j) is adopted to examine the intrinsic interdependence between two positions, P_i and P_j, in the neighboring upstream and downstream regions of succinylation sites. The 20 amino acids are categorized into five groups based on their physicochemical properties: polar, acidic, basic, hydrophobic, and aromatic. Given two positions P_i and P_j, the occurrence frequency of each amino acid group is determined to fill the elements of a 5 × 5 contingency table. The chi-squared test is defined as:

χ²(P_i, P_j) = Σ_{m=1..5} Σ_{n=1..5} (K_mn − Q_mn)² / Q_mn

where K_mn is the number of positive training sequences containing amino acids of group m at position P_i and amino acids of group n at position P_j, for each pair (P_i, P_j) with i ≠ j. The expectation value Q_mn is obtained from the corresponding row and column totals, Q_mn = K_m· × K_·n / K, where K is the total number of sequences. The dependence between P_i and P_j is regarded as significant if χ²(P_i, P_j) is larger than 34.3, based on a p-value of 0.01 with 16 degrees of freedom 57. When performing MDD on the dataset of all positive training sequences, the parameter of maximum cluster size should be specified with an appropriate cutoff value. The MDD clustering process terminates when all group sizes are less than the specified maximum cluster size.
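A sketch of the positional chi-squared test is below. Note that the exact five-group residue assignment is not spelled out in the excerpt, so the grouping used here is an assumption for illustration only:

```python
import numpy as np

# Five physicochemical groups for the 5 x 5 contingency table
# (this particular residue assignment is assumed, not from the paper)
GROUPS = {"polar": "STCNQ", "acidic": "DE", "basic": "KRH",
          "hydrophobic": "AVLIMPG", "aromatic": "FWY"}

def group_of(residue):
    for name, members in GROUPS.items():
        if residue in members:
            return name
    return None

def chi_squared(seqs, pos_i, pos_j):
    """Chi-squared dependence between two window positions.

    K[m, n] counts sequences with group m at pos_i and group n at pos_j;
    Q[m, n] is the expectation row_total * column_total / K.
    """
    names = list(GROUPS)
    K = np.zeros((5, 5))
    for s in seqs:
        gi, gj = group_of(s[pos_i]), group_of(s[pos_j])
        if gi and gj:
            K[names.index(gi), names.index(gj)] += 1
    Q = K.sum(axis=1, keepdims=True) * K.sum(axis=0, keepdims=True) / K.sum()
    with np.errstate(invalid="ignore", divide="ignore"):
        terms = np.where(Q > 0, (K - Q) ** 2 / Q, 0.0)
    return terms.sum()
```

Per the paper, a value above 34.3 (p = 0.01, 16 degrees of freedom) would flag positions P_i and P_j as significantly interdependent.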

Construction of deep neural networks.
This study involved a binary classification of lysine residues into succinylated and non-succinylated sites. Following the emergence of deep learning methods in bioinformatics 58, we utilized a deep convolutional neural network (CNN), an extension of the artificial neural network (ANN) with multiple hidden layers between the input and output layers (Supplementary Fig. S2). With the increasing size and complexity of high-throughput biological datasets, CNNs can decipher more complicated patterns and relationships within the investigated attributes than a traditional ANN, which includes only one hidden layer. As the amount of MS/MS-identified protein succinylation data grows, the number of potential patterns, and thus the number of neurons required in each layer, increases rapidly. Hence, this work exploited a CNN to learn predictive models using various types of sequence-based attributes. In recent years, ANNs have been extended into CNNs that incorporate convolution and pooling strategies in the hidden layers to reduce the number of weights and the complexity of calculations, respectively, when generating the network structure. When implementing a CNN model, it is necessary to determine the number of convolution and pooling layers and to choose a classification function for the output layer. As presented in Fig. 3, the first layer of the CNN is the input layer. The AAPC attribute, represented as a matrix with 20 × 20 elements, is used as an example for constructing the CNN model below.
When developing a CNN model, the convolution layer is the core layer; it functions as a pattern scanner and involves two major parameters: filters (or kernels) and stride. Each filter, which can be regarded as a small pattern with a specified matrix size (e.g. 3 × 3 in this work), is convolved across the width (20) and height (20) of the input data, computing the dot product between the filter elements and the overlapping input elements to create new feature maps. We specified the stride as 1, moving the filter one element at a time, so input data with a 20 × 20 matrix size is transformed into a new feature map with a matrix size of (20 − 3 + 1) × (20 − 3 + 1) = 18 × 18. The number of filters controls the depth (the number of neurons) of the convolution layer, with each filter able to detect a specific type of pattern in the input data. In addition to filters and stride, zero padding is a convenient approach that pads the border of the input matrix with zeros and can be used to control the matrix size of the output.

A total of eight layers were implemented in this work to learn a predictive model with a two-node output layer: one input layer, two convolution layers, two max pooling layers, two fully connected layers, and one output layer. For each dense layer, the ReLU activation function was applied to avoid gradient diffusion. In addition, a dropout step was conducted in the hidden layers in an attempt to reduce overfitting. Finally, the output layer is composed of two nodes corresponding to the classification results based on a softmax function.

(2019) 9:16175 | https://doi.org/10.1038/s41598-019-52552-4

The pooling layers, which comprise another critical part of a CNN model, usually immediately follow the convolution layers. Max pooling is a non-linear down-sampling strategy used frequently in CNN construction.
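The convolution output-size arithmetic described above follows the standard formula (W − F + 2P)/S + 1, which can be checked in a few lines:

```python
def conv_output_size(w, f, padding=0, stride=1):
    """Spatial output size of a convolution: (W - F + 2P) // S + 1."""
    return (w - f + 2 * padding) // stride + 1

# 20 x 20 input, 3 x 3 filter, stride 1, no padding -> 18 x 18 feature map
size_no_pad = conv_output_size(20, 3)
# zero padding of 1 preserves the 20 x 20 size for a 3 x 3 filter
size_same = conv_output_size(20, 3, padding=1)
```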
Typically, the max pooling layer splits the input matrix into a set of non-overlapping rectangles and forms a smaller matrix containing the maximal output of each sub-region. Two major parameters used in max pooling are kernel size and stride, usually set as 2 × 2 and 2, respectively, so the 2 × 2 kernel moves 2 elements at a time along the width or height, discarding 75% of the activations. For instance, a feature map with a matrix size of 18 × 18 in the convolution layer is transformed into a smaller feature map with a matrix size of 9 × 9 in the following max pooling layer. The function of max pooling is to reduce the amount of computing time in a CNN model and to examine whether the patterns extracted by the corresponding convolution layer exist in the input data 59.
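A numpy sketch of the 2 × 2, stride-2 max pooling described above (illustrative, not the authors' code):

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """2-D max pooling over non-overlapping windows (size == stride here)."""
    h, w = x.shape
    out = np.empty((h // stride, w // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # keep only the maximum activation of each 2 x 2 sub-region
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

pooled = max_pool(np.zeros((18, 18)))
```

An 18 × 18 feature map shrinks to 9 × 9, retaining only each window's maximum.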
After the two convolution and max pooling layers, the CNN modeling was completed by fully connected layers. Before entering the fully connected layer, a flattening step (flatten layer) is necessary to convert the input matrix into a vector; the flattening process is therefore applied immediately prior to the fully connected layer. In a general CNN model, neurons in a fully connected layer have full links to all activations in the previous layer, as shown in Fig. 3. Thus, all the activations in the previous layer can be summarized by matrix multiplication with a set of weight values on the links. Because most of the network's neurons reside in the fully connected layers, an over-fitting problem can easily occur during CNN model construction. Herein, a dropout layer was adopted to randomly mask a specified portion of neurons in order to prevent the CNN model construction from over-fitting 60. The dropout layer drops neurons with a specified probability P and retains them with probability 1 − P; P ranges from 0 to 1, and we searched for the P value that optimizes predictive performance. The result is a reduced network in which the incoming and outgoing links of the dropped-out neurons are also eliminated.
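The drop/retain step can be sketched as follows; the rescaling of survivors by 1/(1 − P) is the common "inverted dropout" convention and is my assumption, since the paper only specifies the drop and retain probabilities:

```python
import numpy as np

def dropout(activations, p, rng):
    """Zero each neuron with probability p, retain with probability 1 - p.

    Survivors are rescaled by 1 / (1 - p) (inverted dropout, assumed here)
    so the expected activation matches between training and inference.
    """
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

rng = np.random.default_rng(0)
out = dropout(np.ones(1000), p=0.5, rng=rng)
```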
As for the binary classification between succinylated and non-succinylated sites, the output layer comprised two neurons corresponding to the classification results based on a softmax function. The two nodes in the output layer were fully connected to the neurons of the previous layer. The softmax function can be regarded as part of the loss function, specifying how to penalize the difference between the predicted and true classes. The softmax function (or normalized exponential function) is a generalization of the logistic function that represents a probability distribution over K different categories. In this work, K was set as two for the succinylated and non-succinylated classes. Given a sample vector x and weight vectors w_1, ..., w_K, the predicted probability of the j-th class by the softmax function is defined as:

P(y = j | x) = exp(x^T w_j) / Σ_{k=1..K} exp(x^T w_k)

This can be regarded as the probability of x belonging to the j-th class against the composition of K linear functions x^T w_k. Additionally, ReLU is frequently used as the activation function when generating a CNN model, adding a nonlinear property without a significant penalty to generalization accuracy 61. In this work, the ReLU function was also employed to avoid gradient diffusion during CNN construction. The ReLU function is defined as:

f(x) = max(0, x)

Performance evaluation of predictive models. In the generation of CNN models, k-fold cross-validation was employed to evaluate their predictive performance. When implementing k-fold cross-validation, all the training data, including positive and negative sequences, were randomly partitioned into k equal-sized subgroups. Of these k subgroups, k − 1 were regarded as the training sample and the remaining subgroup was considered the validation sample. In a complete round of k-fold cross-validation, each of the k subgroups is used as the validation sample exactly once.
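The k-fold partitioning just described can be sketched as an illustrative helper (not the authors' code):

```python
import random

def kfold_splits(n_samples, k=10, seed=0):
    """Randomly partition sample indices into k near-equal subgroups and
    yield (train, validation) index lists; each subgroup validates once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, val

splits = list(kfold_splits(25, k=5))
```

With k = 10 this reproduces the ten-fold protocol used throughout the paper.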
Sensitivity (Sn), specificity (Sp), accuracy (Acc), and the Matthews correlation coefficient (MCC) were used as the metrics to determine the performance of the generated models. The four metrics are defined as:

Sn = TP / (TP + FN)
Sp = TN / (TN + FP)
Acc = (TP + TN) / (TP + FN + TN + FP)
MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

where TP, FN, TN, and FP denote the numbers of true positives, false negatives, true negatives, and false positives, respectively. Due to the unbalanced positive and negative training datasets in this work, we chose the MCC value as the major benchmark for achieving a relatively balanced sensitivity and specificity. After evaluating the results of k-fold cross-validation, the CNN model reaching the best predictive performance was further evaluated on an independent testing dataset that was not included at all in the training dataset.

Independent testing. Due to the potential over-fitting issue originating from the training dataset, the predictive power of the generated models might be overestimated. Thus, an independent testing dataset was necessary to further evaluate real-world performance. In this study, the independent testing dataset was mainly collected from dbPTM 44,45,62. Before the extraction of positive and negative testing sequences, the experimentally verified succinylated proteins in the testing dataset were compared with the training dataset in order to eliminate homologous protein sequences between the two datasets. Because sequence fragments were extracted using the same window length as used in constructing the training dataset, fragmented sequences might still overlap between the two datasets; hence, the CD-HIT software was used again to remove fragmented sequences sharing more than 30% similarity. After that, the final dataset for independent testing contained 218 succinylated and 2621 non-succinylated entries. Moreover, the testing dataset was utilized to compare the proposed deep-learning models with other machine learning schemes in terms of predictive performance.
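The four metrics can be computed directly from confusion-matrix counts. The counts used below are reconstructed (as an assumption) from the rates reported for the independent test set of 218 positives and 2621 negatives; they are not stated explicitly in the text:

```python
from math import sqrt

def metrics(tp, fn, tn, fp):
    """Sn, Sp, Acc, and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sn, sp, acc, mcc

# Counts reconstructed (assumed) from Sn 84.40% and Sp 86.99%
# on 218 positive / 2621 negative test instances
sn, sp, acc, mcc = metrics(tp=184, fn=34, tn=2280, fp=341)
```

These assumed counts reproduce the reported 84.40% sensitivity, 86.99% specificity, 86.79% accuracy, and MCC of about 0.49.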
Another cause of over-fitting might be the training process of the CNN itself. To mitigate this, we used only two convolution layers with fewer filters, reducing the complexity of our model by minimizing the number of trainable parameters 59.

Results and Discussion
Substrate site signatures of lysine succinylation. The amino acid composition (AAC) was a feasible scheme to explore potential motifs of conserved residues around the succinylation sites based on fragments with a 31-mer sequence length. By comparing the AAC between the positive and negative datasets, residues showing significant differences could be regarded as useful attributes for succinylation site prediction. Supplementary Fig. S3 shows that, for succinylated sites, the positively charged lysine (K) residue appeared with the highest frequency around the substrate sites. In addition to AAC, the position-specific AAC neighboring the succinylation sites can be displayed by frequency plots of WebLogo 63. As illustrated in Fig. 4A, no single amino acid has a significantly high frequency near the succinylation sites, but the slightly prominent residues include leucine (L), lysine (K), alanine (A), and valine (V). With no conserved motif observed in the frequency plot, the TwoSampleLogo 64 program was further applied to compare the differences in position-specific AAC between succinylated and non-succinylated sequences. As displayed in Fig. 4B, when compared with the sequence logo of non-succinylated sites (Fig. 4C), the most conserved motifs appeared to be associated with charged residues, in particular the positively charged K and arginine (R) residues at positions −11 to −4 and +3 to +12. Additionally, negatively charged amino acids, such as glutamic acid (E), were located at positions −2, +1, and +2.
A hierarchical clustering analysis was performed to detect motif signatures by categorizing all positive training sequences into seven subgroups that possess statistically significant dependencies of amino acid composition around the substrate sites. The MDD-clustered subgroups with motif signatures for the 5842 non-homologous succinylated sites are presented in Fig. 5 as a tree-like structure. The motif in Group1 (933 sequences) is a significant occurrence of basic amino acids (K, R, and H) at position −5, with the highest dependence value among all subgroups. The remaining 4909 sequences were then further analyzed based on the maximal dependency in the occurrence of amino acids neighboring the substrate sites. Group2 (466 sequences) possesses a similar motif of basic amino acids at position −4. Additionally, Group3 (398 sequences) and Group4 (832 sequences) also have motifs of basic amino acids at positions +4 and +1, respectively. This investigation demonstrates that the detected motif signatures are consistent with the observations in the two-sample logo, with positively charged residues conserved in the upstream and downstream regions of succinylated sites. On the other hand, Group5 (905 sequences) has a conserved motif of acidic residues at position +1. Group6 likewise shows a prominent position +1, containing a motif signature of polar, uncharged amino acids. The remaining data in Group7 contain only a slightly significant signature at position +1.

Performance evaluation of CNN models trained with single attributes.
In an attempt to examine the optimal window size for yielding the best performance, various window size values were adopted to extract the training sequences for model construction. After comprehensive analyses of performance comparisons, the window size of 31 (−15 to +15; with the succinylated residue in the center) achieved the best prediction performance, which is consistent with the difference of position-specific AACs between positive and negative training sequences. Based on the investigated features, their corresponding CNN models were built to determine the effectiveness of those features in identifying succinylation sites. As shown in Table 2, the CNN model trained with PspAAC reached an accuracy of 73.36% and an MCC value of 0.371. The AAPC model performed slightly better than the PspAAC model, which yielded an accuracy of 76.48% and an MCC value of 0.428. In the investigation of k-spaced amino acid pairs, the CNN model trained with the composition of one-spaced amino acid pairs (K = 1) provided the best performance at 77.95% sensitivity, 76.63% specificity, 76.85% accuracy, and MCC value of 0.432. After extracting the top 400 k-spaced amino acid pairs (K = 1−5) based on mRMR, the performance of the CNN model trained with the selected CKSAAP (top400) showed remarkable improvement, reaching a sensitivity of 85.35%, specificity of 83.49%, accuracy of 83.79%, and MCC value of 0.569. Among these CNNs, the model trained with the PSSM feature performed best for discriminating between succinylated and non-succinylated lysine residues. The PSSM model yielded a sensitivity, specificity, accuracy, and MCC value of 85.51%, 84.16%, 84.38%, and 0.579, respectively. Additionally, the ROC curve was generated to compare the predictive performance and stability of different CNN models (Supplementary Fig. S4). 
Regarding the comparison among single features, the CNN model trained from the PSSM feature gave the best predictive power, which is consistent with the results reported in PSSM-Suc 65.

Performance evaluation of CNN models trained with hybrid attributes. In addition to the comparison of predictive power among single attributes, we also considered hybrids of multiple attributes for generating the predictive model. Based on the results of the single-attribute performance tests, PSSM, which yielded the best performance, was selected as the principal attribute to combine with other single attributes. Consequently, a total of three hybrids, namely PSSM + PspAAC, PSSM + CKSAAP(top400), and PSSM + PspAAC + CKSAAP(top400), were further evaluated to uncover their predictive capabilities in succinylation site identification. As presented in Table 2, the CNN model trained using the hybrid of PSSM and PspAAC attributes reached a performance comparable to that trained using the single PSSM attribute. The CNN model trained using the hybrid of PSSM and CKSAAP(top400) performed best, with a sensitivity of 86.94%, specificity of 85.43%, accuracy of 85.68%, and MCC value of 0.608. However, the CNN model trained with the combination of all features performed slightly worse than that trained with the hybrid of PSSM and CKSAAP(top400). Additionally, Supplementary Fig. S4 reveals that the CNN model trained using the hybrid of PSSM and CKSAAP(top400) outperforms the other CNN models in terms of ROC curves; its AUC value is 0.886.

Performance comparison between CNN and other machine learning methods.
To demonstrate the effectiveness of the deep learning method in PTM prediction, the predictive performance of this CNN model was compared with that of three popular machine learning methods: decision tree (DT), support vector machine (SVM), and random forest (RF). As summarized in Supplementary Table S1, the SVM and RF algorithms have been widely utilized to identify protein succinylation sites. In this work, Classification and Regression Trees (CART) was employed to generate binary DTs for classifying positive and negative instances. Based on the scikit-learn package 66, the function 'DecisionTreeClassifier' was used to construct a classification tree by top-down recursion. During the construction process, the best feature was selected at each node to split the training tuples, and the CART program specified the 'Gini index' as the feature selection criterion. For the construction of RFs, CART was again adopted to generate multiple trees with the 'bootstrap aggregation' (bagging) of data sampling. In the scikit-learn package, the function 'RandomForestClassifier' was applied to measure the importance of training features and to generate the RF models. More specifically, the Gini importance of a feature is its average decrease in impurity across all trees, where the Gini impurity measures how mixed the class labels are in the given data. Moreover, the function 'svm.SVC' in the scikit-learn package was used to train the binary SVM classifiers. The 'radial basis function' (RBF) was selected as the kernel function of the SVM to transform the training data into a higher-dimensional vector space, in an attempt to search for a linearly optimal separating hyperplane.
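The scikit-learn calls named above can be wired together roughly as follows; the random toy data stands in for the actual encoded sequence features, so this is a sketch of the workflow, not the published experiment:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 40))            # stand-in feature vectors
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # stand-in labels

models = {
    # CART-style tree split on the Gini index, as in the paper
    "DT": DecisionTreeClassifier(criterion="gini", random_state=0),
    # bagged CART trees; importances come from mean impurity decrease
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    # RBF kernel maps the features into a higher-dimensional space
    "SVM": SVC(kernel="rbf"),
}
scores = {name: m.fit(X[:150], y[:150]).score(X[150:], y[150:])
          for name, m in models.items()}
```

In the actual study the same three classifiers would be fed the PspAAC, CKSAAP, and PSSM encodings and evaluated by ten-fold cross-validation.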
According to the predictive performance of previous studies that incorporated SVM or RF into their model construction, SVM or RF models trained with combinatorial attributes can deliver reliable prediction accuracies. Based on the evaluation of ten-fold cross-validation among the sequence-based attributes, this investigation revealed that the DT model trained with PspAAC performed better than those trained with other attribute types (Supplementary Table S2). Unlike with the PSSM attribute, both the SVM and RF methods reached a better performance using the composition of the top 400 k-spaced amino acid pairs; the RF model performed slightly better than the SVM model in terms of MCC value. In addition to the comparison of models trained using a single attribute type, hybrids of multiple attribute types were further considered in the generation of predictive models. Table 3 shows the ten-fold cross-validation comparison between the deep learning method and the other three learning methods on the basis of combined attributes. Based on these sequence-based features, this investigation revealed that the CNN model trained using PSSM and CKSAAP(top400), which yields sensitivity, specificity, accuracy, and MCC values of 86.94%, 85.43%, 85.68%, and 0.608, respectively, outperforms the other three learning methods. However, it is noteworthy that the RF model trained using the hybrid of PspAAC, PSSM, and CKSAAP(top400) attributes yields a performance (83.08% accuracy) comparable to the CNN model. In conclusion, the proposed CNN model outperforms the three other popular machine-learning methods in the comparison of predictive performances based on ten-fold cross-validation.

Performance evaluation using an independent testing dataset.
When discriminating between succinylated and non-succinylated sequences, it is possible to generate a predictive model whose prediction accuracy is over-estimated due to an over-fitting problem. To avoid presenting an over-estimated performance, this work compiled a dataset for independent testing. These independent testing instances, which are not present in the training dataset, were used to measure the real ability of the proposed model. The independent testing dataset comprised a total of 218 positive and 2621 negative instances. The CNN model trained using the PSSM and CKSAAP(top400) attributes yielded a promising performance with a sensitivity of 84.40%, specificity of 86.99%, accuracy of 86.79%, and MCC value of 0.489. Additionally, to judge the practicality of the proposed model, a comparison between our model and six existing prediction tools was performed using the testing dataset. As displayed in Table 4, our proposed model achieved the highest MCC value, reaching 0.489. In this comparison, SuccinSite 2.0 provided the best predictive accuracy (88.83%), but its specificity (91.22%) was much higher than its sensitivity (60.09%), and its overall performance did not surpass our method in terms of MCC value. Interestingly, as presented in Supplementary Fig. S5, most of the existing prediction tools provide much better specificity than sensitivity, possibly because their models were generated using unbalanced positive and negative datasets. In an overall evaluation, the testing results indicate that the proposed method provides a more reliable and stable prediction capability, in terms of balanced sensitivity and specificity, than other existing prediction tools.

Implementation of web-based prediction tool.
To facilitate the functional analyses of protein succinylation, the proposed method has been implemented as a web-based tool, named CNN-SuccSite, for classifying succinylated and non-succinylated sites. After protein sequences are submitted in FASTA format, CNN-SuccSite returns the prediction results, including succinylated sites, their flanking amino acids, and the corresponding substrate motif signatures. A case study of succinylation site prediction on mouse Glutathione S-transferase P 1 (Gstp1) was used to demonstrate the effectiveness of CNN-SuccSite. Gstp1 contains six verified succinylation sites at Lys-82, Lys-103, Lys-116, Lys-121, Lys-128, and Lys-191 67. As presented in Fig. 6, CNN-SuccSite achieved an accurate prediction at five of the six validated succinylation sites, according to the corresponding motif signatures.

Conclusion
Due to the abundance of experimentally verified succinylation data obtained from public resources, we were motivated to develop a new method to predict protein succinylation sites based on a deep learning strategy. A systematic investigation of various attributes in the neighborhood of substrate sites was performed on large-scale succinyl-proteome data. In accordance with the results of ten-fold cross-validation, the CNN model trained with the hybrid of PSSM and CKSAAP(top400) attributes outperformed models trained with other attributes. This investigation also demonstrated that the CNN model could provide a better performance than three popular shallow machine learning methods, including DT, SVM, and RF. Moreover, independent testing was performed, and the results demonstrated that the selected CNN model could outperform other existing prediction tools; on the independent testing dataset, the CNN model trained with the hybrid of PSSM and CKSAAP(top400) attributes yielded a promising performance. We believe that our proposed approach will help facilitate the determination of succinylated lysine residues on proteins. In the future, physicochemical properties, such as solvent accessibility 68, hydrophobicity 69, and side-chain orientation 70, can be considered for obtaining better predictive performance. Additionally, the tertiary structures of succinylated proteins can be used to extract more useful information for the characterization of succinylated substrate sites. Stand-alone software will also be developed to provide a practical means of determining succinylated targets from large-scale proteome data.

Table 3. Comparison of ten-fold cross-validation between the deep learning method and other machine learning methods.

Table 4. Performance comparison between our method and six existing available prediction tools based on the independent testing dataset.