A hybrid feature extraction scheme for efficient malonylation site prediction

Lysine malonylation is one of the most important post-translational modifications (PTMs) and affects cellular function. Predicting malonylation sites in proteins can help unfold the mechanisms of cellular functionality. Experimental methods are one possible prediction approach, but they are typically costly and time-consuming to implement. Recently, machine-learning methods have been proposed to tackle this problem; such approaches have been shown to reduce costs and time complexity and to increase accuracy. However, these approaches also have specific shortcomings, including inappropriate feature extraction from protein sequences, high-dimensional features, and inefficient underlying classifiers. A machine learning-based method is proposed in this paper to cope with these problems. In the proposed approach, seven different features are extracted. The extracted features are then combined, ranked based on the Fisher's score (F-score), and the most efficient ones are selected. Afterward, malonylation sites are predicted using various classifiers. Simulation results show that the proposed method has acceptable performance compared with some state-of-the-art approaches. In addition, the XGBoost classifier, trained on extracted features such as TFCRF, has a higher prediction rate than the other methods. The code is publicly available at: https://github.com/jimy2020/Malonylation-site-prediction

Post-translational modification (PTM) is one of the fundamental mechanisms for regulating many biological processes. To date, more than 620 types of PTMs have been discovered, involving the addition of a wide range of moieties, from small chemical groups to entire small proteins. Malonylation is a recently identified PTM wherein positively charged lysine amino acids of a protein are chemically modified by the addition of a negatively charged malonyl group; it plays a crucial role in various cellular operations and biological processes and in regulating the dynamicity of a cell [1][2][3][4] . In 2011, lysine malonylation substrates were identified through proteomic analysis, demonstrating their prominent effects on eukaryote and prokaryote cells 1 . Proteins continuously interact, and incorrect identification of a PTM may result in disease. Therefore, rigorous and precise scrutiny of PTMs is needed, through which mechanisms and conditions of daily life, including cancer, diabetes, and autoimmune disorders, could be identified [5][6][7] . Regarding the crucial importance of malonylation, precise identification of protein malonylation sites is the primary concern, leading to useful biomedical information and in-depth perception of molecular functions. Thus far, many computational and experimental methods have been proposed for detecting malonylation sites 8 . However, experimental methods suffer from temporal and financial limitations, and their implementations are cumbersome. Hence, an efficient computational method is required to identify malonylation sites accurately. Some recent works have employed machine learning and deep learning methods to predict malonylation sites 9 . The main contributions of such methods include feature extraction and selection for efficient classification, or model representations such as hybrid or deep learning models.
In 10 , the "Mal-Lys" method is presented to predict K-mal sites. In this approach, residue sequence order information, position-specific amino acid propensity, and physicochemical properties are extracted as features. Then, the significant features are identified by the minimum redundancy maximum relevance (mRMR) approach. Eventually, the existence of a malonylation site is predicted via a support vector machine (SVM). Wang et al. 11 proposed a novel method for malonylation site recognition based on unique sequences, evolutionary profiles of sequences, and amino-acid attributes. In 12 , sequence orders, gene ontologies, and their composition have been used as features, and an SVM is used for classification. The results have shown that feature combination yields more efficient results. The "SPRINT-Mal" method 13 has also been proposed for malonylation site prediction.
The main contributions of the present work are as follows:
- The term frequency and category relevancy factor (TFCRF) method for weighting features is investigated. Some weighting schemes inspired by document analysis have already been used for malonylation site prediction; however, to the best of our knowledge, TFCRF has not been explored yet. In this method, the distribution of features within the various classes is considered along with their distribution over the entire sequences of all classes. The results show the efficiency of TFCRF.
- The proposed feature combination scheme provides feature-level diversity, improving amino-acid sequence classification. That is, each combined feature contributes a specific piece of information: the TFCRF feature includes binary classification distribution information, the position-specific scoring matrix (PSSM) contains genomic sequence information, and the other features capture frequency information. This strategy has seldom been investigated in related works thus far.
- Relevant features are selected and redundant ones are omitted, which has rarely been considered in previous works. For this purpose, the best feature combination is selected based on Fisher's score.
The remaining sections of the paper are as follows. Section "Feature extraction" describes various feature extraction schemes for malonylation site identification. Section "The proposed method" elaborates the five stages of the proposed method, including feature extraction, preprocessing, feature selection, classification, and model assessment. Section "Experimental results" describes the experimental results for the proposed approach, and the outcomes are compared with several other common methods. Finally, Section "Conclusion" concludes the paper.

Feature extraction
One of the most important phases in malonylation site prediction is feature extraction. A primary approach is to extract various pre-known features from protein sequences and then devise a classification process. A secondary approach is to design an end-to-end deep neural network model, through which significant features can be extracted systematically and classification conducted on the basis of such features. No end-to-end model has been proposed for the secondary approach thus far. In most of the presented works, the features are extracted using known feature extraction methods, and then classical machine learning or deep learning models are incorporated for classification. Typically, end-to-end models are not recommended here due to the lack of sufficient data for training the many parameters of deep neural networks. So, we opt to extract significant pre-known features from the protein sequences in the proposed method. The sequential nominal character information can be converted to a numerical vector using several feature extraction algorithms, and extracting efficient features will enhance the performance of the classification. To extract features from protein sequences, the following algorithms are incorporated: the enhanced amino acid composition (EAAC) 33 , the enhanced grouped amino acid composition (EGAAC) 33 , dipeptide deviation from expected mean (DDE) 34 , PKA 35 , term frequency-inverse document frequency (TF-IDF) 36 , TF-CRF 37 , and the position-specific scoring matrix (PSSM) 38 . These methods are elaborated in the following subsections.
Enhanced amino acid composition (EAAC). This method is presented by Chen et al. 33 . In this algorithm, sequential protein information is extracted, and accordingly, amino-acid frequency information is calculated as 33 :

E(m, n) = H(m, n) / H(n), m ∈ {A, C, D, …, Y}, n ∈ {W1, W2, …, WL} (1)

where H(m, n) is the number of amino acids of type m in the n'th window, and H(n) is the length of the n'th window.
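As an illustration, the windowed frequency computation of (1) can be sketched in Python. This is a minimal sketch: the alphabet ordering, the function name, and the default window length are illustrative assumptions, not taken from the paper's code.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabetical ordering

def eaac(sequence, window=5):
    """Sliding-window amino-acid frequencies (illustrative EAAC sketch).

    For each window position n, the feature for amino-acid type m is
    H(m, n) / H(n): the count of m in the window over the window length.
    """
    features = []
    for start in range(len(sequence) - window + 1):
        win = sequence[start:start + window]
        counts = Counter(win)
        features.append([counts.get(aa, 0) / len(win) for aa in AMINO_ACIDS])
    return features

# Example: a short fragment containing lysine (K)
vec = eaac("AKKLG", window=5)
```

Each window thus contributes a 20-dimensional frequency vector, and the vectors for all window positions are concatenated to form the final feature.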
Enhanced grouped amino acid composition (EGAAC). In this method, protein sequences are converted to numerical feature vectors based on their attributes. It is a compelling feature extraction algorithm in bioinformatics research fields such as malonylation site prediction.
EGAAC is computed based on amino-acid categorization. In 39 , amino acids are categorized based on five physicochemical characteristics: aliphatic (GAVLMI amino acids), aromatic (FYW amino acids), positively charged (KRH amino acids), negatively charged (DE amino acids), and neutral or uncharged (STCPNQ amino acids). Accordingly, EGAAC is calculated based on the following equation:

G(g, n) = H(g, n) / H(n), g ∈ {g1, g2, g3, g4, g5}, n ∈ {W1, W2, …, WL} (2)

where g is one of the five categories, H(g, n) is the number of amino acids of group g in the n'th window, and H(n) is the length of the n'th window 33 . A window size of length five is considered in this paper.
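The grouped computation of (2) can be sketched similarly. The group membership strings follow the five categories listed above; the function name and window handling are illustrative assumptions.

```python
from collections import Counter

GROUPS = {
    "g1": "GAVLMI",   # aliphatic
    "g2": "FYW",      # aromatic
    "g3": "KRH",      # positively charged
    "g4": "DE",       # negatively charged
    "g5": "STCPNQ",   # neutral / uncharged
}

def egaac(sequence, window=5):
    """Grouped amino-acid composition per sliding window: H(g, n) / H(n)."""
    features = []
    for start in range(len(sequence) - window + 1):
        win = sequence[start:start + window]
        counts = Counter(win)
        features.append({
            g: sum(counts.get(aa, 0) for aa in members) / len(win)
            for g, members in GROUPS.items()
        })
    return features

# Example: a charged fragment; g3 (K, R, H) and g4 (D, E) dominate
feats = egaac("KKDEA", window=5)
```

Because every residue belongs to exactly one group, the five group frequencies of each window sum to one.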
Dipeptide deviation from expected mean. Dipeptide deviation from the expected mean (DDE) is proposed and developed in 34 , wherein feature extraction based on amino-acid combinations is studied to discriminate a cell's epitopes and non-epitopes. For this purpose, the dipeptide composition (DC) of a protein sequence is primarily calculated as:

DC(m, n) = H_mn / (H − 1)

where H_mn is the number of mn amino-acid pairs, and H is the size of the protein sequence. Next, a protein's theoretical mean (TM) and theoretical variance (TV) are computed as:

TM(m, n) = (C_m / C_H) × (C_n / C_H)

TV(m, n) = TM(m, n) × (1 − TM(m, n)) / (H − 1)

where C_m and C_n are the numbers of codons encoding the first and the second amino acids, respectively, and C_H is the total number of codons. Finally, DDE is calculated based on TV, TM, and DC as:

DDE(m, n) = (DC(m, n) − TM(m, n)) / √TV(m, n)

PKA. This feature is the negative logarithm of the dissociation constant for every group in the molecule 35 .
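The three DDE equations above can be sketched as follows. The codon counts come from the standard genetic code (61 sense codons); the function name and the dictionary layout are illustrative assumptions.

```python
import math
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Number of sense codons encoding each amino acid (standard genetic code)
CODONS = {"A": 4, "C": 2, "D": 2, "E": 2, "F": 2, "G": 4, "H": 2, "I": 3,
          "K": 2, "L": 6, "M": 1, "N": 2, "P": 4, "Q": 2, "R": 6, "S": 6,
          "T": 4, "V": 4, "W": 1, "Y": 2}
C_H = 61  # total number of sense codons

def dde(sequence):
    """DDE sketch: (DC - TM) / sqrt(TV) for each of the 400 dipeptides."""
    H = len(sequence)
    pairs = [sequence[i:i + 2] for i in range(H - 1)]
    out = {}
    for m, n in product(AMINO_ACIDS, repeat=2):
        dc = pairs.count(m + n) / (H - 1)          # observed dipeptide fraction
        tm = (CODONS[m] / C_H) * (CODONS[n] / C_H)  # codon-based expectation
        tv = tm * (1 - tm) / (H - 1)                # theoretical variance
        out[m + n] = (dc - tm) / math.sqrt(tv)
    return out

d = dde("KKKK")  # toy sequence: every dipeptide is KK
```

Dipeptides that occur more often than the codon-based expectation receive positive scores; absent dipeptides receive negative ones.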
Term frequency: inverse document frequency. TF-IDF feature extraction is composed of two terms, TF and IDF, which stand for term frequency and inverse document frequency, respectively. Both terms are calculated separately and multiplied to yield the TF-IDF coefficient 36 . Each term is defined as follows. TF(t, d) : the number of occurrences of amino-acid t in a protein sequence, divided by the size of the protein, namely d. IDF(t) : the logarithm of the total number of proteins (namely |D| ) divided by the number of sequences that include amino-acid t (namely DF(t) ). It is calculated as:

IDF(t) = log(|D| / DF(t))

Having calculated TF and IDF, TF-IDF is calculated as:

TF-IDF(t, d) = TF(t, d) × IDF(t)

Term frequency and category relevancy factor (TF-CRF). In this method, two factors, namely positiveRF (positive relation frequency) and negativeRF (negative relation frequency), are defined as follows 37 : PositiveRF. This factor is the ratio of the number of amino acids in a protein sequence c_i having a common characteristic t_k to the total number of amino acids in the protein sequence. It is calculated as:

PositiveRF(t_k, c_i) = D(t_k, c_i) / D(c_i)

NegativeRF. This factor is the ratio of the total number of amino acids in protein sequences other than c_i having a common characteristic t_k to the total number of amino acids in protein sequences other than c_i. It is calculated as:

NegativeRF(t_k, c_i) = Σ_{j≠i} D(t_k, c_j) / Σ_{j≠i} D(c_j)

where D(c_j) is the number of amino acids in protein sequence c_j , and D(t_k, c_j) is the number of amino acids in protein c_j with common characteristic t_k .
The category relevancy factor value (crfValue) is defined as follows, considering the equations mentioned above:

crfValue(t_k, c_i) = PositiveRF(t_k, c_i) / NegativeRF(t_k, c_i)

The relevancy factor of each category has a direct relation with positiveRF and an inverse relation with negativeRF. Accordingly, the proposed weighting for feature t_k in protein sequence d_i is:

W(t_k, d_i) = TF(t_k, d_i) × crfValue(t_k, c_{d_i})

where c_{d_i} is the category of protein sequence d_i . Normalization is used to mitigate the effect of sequence length on classification performance; it confines the weights to the range (0, 1) . The final equation of TFCRF will be:

W_ki = W(t_k, d_i) / √(Σ_k W(t_k, d_i)²)

Accordingly, the content of each protein sequence is represented by a feature vector d_i = (W_1i, W_2i, …, W_ki) , where k is the total number of selected features, and W_ki is the weight of feature (i.e., amino acid) t_k in sequence d_i . W_ki indicates to what extent feature t_k captures the concept of protein sequence d_i .
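The TFCRF weighting described above can be sketched as follows. This is a hedged sketch: the crfValue form (positiveRF over negativeRF) is an assumption consistent with the stated direct/inverse relations, and the L2 normalization mirrors the length-mitigation step; names are illustrative.

```python
import math

def tfcrf_weights(tf_row, pos_rf, neg_rf, eps=1e-9):
    """Illustrative TFCRF sketch for one sequence.

    w_k = TF(t_k, d_i) * crfValue(t_k, c_{d_i}), with crfValue assumed to be
    positiveRF / negativeRF, then L2-normalised so that sequence length does
    not dominate the weights.
    """
    raw = [tf * (p / (n + eps)) for tf, p, n in zip(tf_row, pos_rf, neg_rf)]
    norm = math.sqrt(sum(w * w for w in raw)) or 1.0
    return [w / norm for w in raw]

# Two features with equal class relevance but different term frequencies
w = tfcrf_weights([1.0, 2.0], [0.5, 0.5], [0.5, 0.5])
```

After normalization the weight vector has unit length, so only the relative importance of features within a sequence remains.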
Most class-based weighting methods, such as IDF, have been used for information retrieval (IR) and document analysis purposes. These methods have not been applied in protein sequence classification. Hence, some aspects of IR and document analysis, also associated with protein sequence classification, have been neglected. The weighting method of TFCRF contains such elements, as stated in the following.
Consider a set of protein sequences that belong to a number of classes, with a specific number of instances. Figure 1 depicts various distributions of a feature, namely x , in 4 hypothetical states regarding a class, namely c i . In this figure, a and b are the numbers of sequences in class c i that include and exclude feature x , respectively; also, c and d denote the number of sequences in all classes other than c i that include and exclude feature x , respectively. The frequency of feature x is taken constant in all states.
In IDF-based schemes, the weight of every feature is inversely related to the number of sequences including that feature. In the above instance, the weight of feature x in class c_i can be calculated via (7) as:

w_x = log(N / (a + c))

where N is the total number of sequences. Let w_x^s denote the weight of feature x in state s of Fig. 1. Since the number of sequences including x (i.e., a + c ) is identical in all states, the relation between the weights of x in the various states will be:

w_x^1 = w_x^2 = w_x^3 = w_x^4

As can be seen, the weight of feature x does not change across the states; yet the status of feature x within class c_i clearly differs from state to state, and this fact is overlooked in weighting the feature. Furthermore, in IDF-based approaches, the more sequences include a specific feature, the less discriminative the feature is assumed to be; hence, it is assigned a lower weight. Although this is an accurate hypothesis in IR, it needs to be reformed for the purpose of protein classification. As evident from Fig. 1, despite a significant number of sequences including x , if most of those sequences belong to the same class c_i (states 3 and 4 in Fig. 1), feature x is not only efficient, but it must also be considered significant for discriminating class c_i from the others and assigned a large weight. In addition, a lower weight should be assigned to x in class c_i if a great number of sequences of classes other than c_i include feature x (state 2 in Fig. 1). The crfValue introduced in TFCRF delivers a solution to the abovementioned problem. That is, the weight of every feature in each sequence has a direct relation with the number of sequences belonging to the class of that sequence and an inverse relation with the number of sequences belonging to the other classes. In the presented example of Fig. 1, the weight of feature x in class c_i via (11) is proportional to:

(a / (a + b)) / (c / (c + d))

As a result, the weights of feature x now differ across the four states of Fig. 1: the weight increases when x is concentrated in class c_i (states 3 and 4) and decreases when x is concentrated in the other classes (state 2). It can be seen that in this method, the effect of the classes in which the feature appears is taken into account. It should also be noted that crfValue is not independent of the number of sequences in each class, which drastically increases the performance of sequence classifiers.
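The contrast between the two weighting schemes can be demonstrated numerically. The counts below are hypothetical (not from Fig. 1 of the paper), chosen so that a + c, the number of sequences containing x, is the same in every state; the crfValue form used here (positiveRF over negativeRF) is an assumption consistent with the stated direct/inverse relations.

```python
import math

N = 100  # total number of sequences (hypothetical)
# (a, b, c, d): in-class with/without x, out-of-class with/without x.
# a + c = 25 is held constant, mimicking a fixed overall frequency of x.
states = {
    1: (5, 20, 20, 55),   # x spread evenly
    2: (2, 23, 23, 52),   # x mostly outside class c_i
    3: (20, 5, 5, 70),    # x mostly inside class c_i
    4: (24, 1, 1, 74),    # x almost exclusively inside c_i
}

# IDF depends only on how many sequences contain x -> identical in all states
idf = {s: math.log(N / (a + c)) for s, (a, b, c, d) in states.items()}

# Assumed crfValue: positiveRF / negativeRF -> tracks class concentration
crf = {s: (a / (a + b)) / (c / (c + d)) for s, (a, b, c, d) in states.items()}
```

Running this shows the IDF weight frozen across all four states while the crfValue rises sharply as x concentrates inside class c_i, which is exactly the behavior the text argues a protein-sequence weighting should have.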
PSSM. The position-specific scoring matrix (PSSM) is a scoring matrix used in protein BLAST searches, in which a score is assigned to each amino acid separately, based on its position in the sequences of a number of proteins 41 . This matrix can be represented as an L × 20 matrix, where L is the protein sequence length and 20 is the number of possible amino acids. Each element of the PSSM is calculated as:

S(i, j) = log(M(i, j) / b_j)

where M(i, j) is the probability of amino-acid j occurring at position i , and b_j is the background model for amino-acid j (e.g., b_j = 0.05 by postulating a uniform distribution over amino acids). PSSM scores are positive or negative values. Positive values show that the amino acid occurs at that position more often than expected by chance, while negative values show that it occurs less often than anticipated. PSSM includes the locational and evolutionary information of protein sequences.
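The log-odds element computation above can be sketched as follows; the function name and the handling of zero probabilities are illustrative assumptions.

```python
import math

def pssm_scores(position_probs, background=0.05):
    """Log-odds PSSM sketch: S(i, j) = log(M(i, j) / b_j).

    Positive scores mean amino-acid j is over-represented at position i
    relative to the background model; negative scores mean the opposite.
    """
    return [[math.log(p / background) if p > 0 else float("-inf")
             for p in row] for row in position_probs]

# One position: over-represented, background-level, and depleted residues
scores = pssm_scores([[0.2, 0.05, 0.025]])
```

With a uniform background of 0.05, a residue probability of 0.2 yields a positive score, 0.05 yields zero, and 0.025 yields a negative score, matching the sign interpretation above.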

The proposed method
This section proposes a novel model for predicting malonylation sites based on feature extraction and machine learning algorithms. The overall schema of the proposed method is depicted in Fig. 2. It comprises five major stages: dataset selection, feature extraction, feature normalization, feature selection, and classification. Each stage is elaborated in the following.
Stage 1: dataset selection. Three datasets, namely Escherichia coli, Mus musculus, and Homo sapiens 40 , have been used for training and testing the proposed method. Each dataset is randomly divided into train and test sets. For efficient analysis, a tenfold cross-validation strategy is conducted. At each iteration, one fold is used as a test set, and the remaining nine folds are used for training the model. Model parameters are tuned based on the training sets. The ultimate result is the average over the 10 iterations.
Stage 2: feature extraction. At this stage, the feature extraction methods, including EAAC, EGAAC, TF-IDF, PSSM, TF-CRF, and PKA, are applied as follows:
a. EAAC and EGAAC: in EAAC, amino-acid frequencies are calculated, and in EGAAC, the protein sequences are converted to numerical vectors based on their characteristics. The resulting feature vectors are of lengths 20 and 45, respectively.
b. TF-IDF: it is used for calculating the weighted frequency of amino acids, depicting an amino acid's significance by weighting its frequency in a sequence against its frequency across the whole dataset. The resulting feature vector is of length 20.
c. TF-CRF: it is used for more precise weighting via two factors, i.e., positiveRF and negativeRF. The resulting feature vector is of length 20.
d. PSSM: a score is assigned to a selected amino acid based on its location in a protein sequence. The resulting feature vector is of length 400.
e. PKA: the negative logarithm of the dissociation constant for each group in a molecule.
The values pertaining to each amino acid are taken into account, and the result is a single numerical feature.
Stage 3: feature normalization. The extracted features have different ranges; accordingly, the data should be normalized to improve efficiency. In the present work, Z-score normalization is used for this purpose.
In fact, Z-score is a normalization strategy that mitigates the effect of outlier data and features. The normalization equation is as follows:

z = (x − µ) / σ
where µ and σ are the mean and standard deviation of feature x . If a value equals the mean, it is normalized to zero. If it is less or greater than the mean, it is normalized to a negative or positive value, respectively. The magnitude of this negative/positive value is determined by the standard deviation. The standard deviation of an abnormal feature is a large number, so its normalized values shrink toward zero.
Stage 4: feature selection. Combining the extracted features yields a high-dimensional feature vector, which results in model overfitting. Therefore, it is needed to preserve only the relevant features. The Fisher's score (F-score) method, a filter-based approach, is applied to identify relevant features. The F-score criterion for the i'th feature is calculated as:

F(i) = Σ_{k=1}^{m} n_k (x̄_i^k − x̄_i)² / Σ_{k=1}^{m} Σ_{j=1}^{n_k} (x_{j,i}^k − x̄_i^k)²   (19)

where x̄_i^k and x̄_i are the means of the i'th feature in class k and in the whole dataset, respectively, x_{j,i}^k is the i'th feature value of instance j in class k , n_k is the number of instances in class k , and m is the total number of classes. A number of highly-ranked features are selected for classification in the next stage.
The key idea of the Fisher score is to find a subset of features such that, in the data space spanned by the selected features, the inter-class distances of data points are maximized while the intra-class distances are minimized. Since this is a combinatorial optimization problem, it is reduced to computing a score for each feature independently via the scoring function of (19); then, a number of highly-ranked features are selected. In (19), the numerator and denominator represent the inter-class and intra-class distances with regard to feature x_i , respectively. Although some informative dependencies between features are ignored, this method reduces the time complexity of feature selection to a linear order.
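The per-feature scoring function can be sketched as follows; the function name is an illustrative assumption, and the formula follows the between-class over within-class scatter ratio described above.

```python
def fisher_score(feature_values, labels):
    """F-score for a single feature: between-class scatter over
    within-class scatter, computed from raw values and class labels."""
    classes = sorted(set(labels))
    overall = sum(feature_values) / len(feature_values)
    num = den = 0.0
    for k in classes:
        vals = [v for v, y in zip(feature_values, labels) if y == k]
        mean_k = sum(vals) / len(vals)
        num += len(vals) * (mean_k - overall) ** 2          # inter-class term
        den += sum((v - mean_k) ** 2 for v in vals)         # intra-class term
    return num / den if den else float("inf")

# A well-separated feature versus a noisy one on the same labels
good = fisher_score([0.0, 0.1, 1.0, 1.1], [0, 0, 1, 1])
noisy = fisher_score([0.0, 1.0, 0.1, 1.1], [0, 0, 1, 1])
```

Features are scored independently, ranked, and the top-ranked subset is passed to the classifiers, which is what keeps the selection cost linear in the number of features.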

Stage 5: model assessment. A tenfold cross-validation strategy is conducted to assess the prediction performance of the classification model. The classifiers include XGBoost, SVM, RF, and DNN. Various measures, including AUC, ACC, Sn, Sp, and MCC, have been used for performance assessment.
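The tenfold split underlying stages 1 and 5 can be sketched as follows; the helper name and the seeding are illustrative assumptions, not taken from the paper's code.

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Simple tenfold split: shuffle sample indices, then carve them
    into k nearly equal, disjoint folds. Each fold serves once as the
    test set while the remaining k-1 folds form the training set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(100, k=10)
```

The reported figures are then the averages of the per-fold metrics over the 10 iterations.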

Experimental results
The datasets. An experimentally confirmed dataset is used for the simulations 40 . The dataset includes 1746 malonylation sites of 595 proteins in "E. coli", 3435 malonylation sites of 1174 proteins in "M. musculus", and 4579 malonylation sites of 1660 proteins in "H. sapiens" 40 . The length of the amino-acid sequences is reduced to 25 residues, centered at lysine (K). Table 1 elaborates the characteristics of the dataset.

Model assessment.
A tenfold cross-validation strategy is conducted to tune the models' parameters based on the training dataset, and the independent set is used for testing the model. The efficiency measures sensitivity (Sn), specificity (Sp), accuracy (ACC), and Matthews correlation coefficient (MCC) have been used to assess the underlying models 42 . These measures are calculated as follows:

Sn = TP / (TP + FN)
Sp = TN / (TN + FP)
ACC = (TP + TN) / (TP + TN + FP + FN)
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
where TP, TN, FP, and FN denote the number of true positives, true negatives, false positives, and false negatives, respectively.
Sequence analysis. The datasets of "H. sapiens," "E. coli," and "M. musculus" have been used to discriminate malonylation and non-malonylation sites. The statistical differences between protein sequences of malonylation and non-malonylation sites in these datasets are depicted in Fig. 3 28 . This figure represents the amino-acid distribution of a protein sequence in the dataset. As shown, lysine is located at the center, and the significantly enriched/depleted surrounding residues are described in the range −12 to +12. The diagram depicts a significant difference in amino-acid frequencies between protein sequences of malonylation and non-malonylation sites in various sequence fragments. Compared to the central lysine, an arbitrary amino acid is studied in two sections, i.e., enriched and depleted. It is observed that, in the enriched section, the frequency of amino acids is higher around the central lysine than in the other fragments; the more distant from the central lysine, the lower the observed frequency. Moreover, the exclusive enriched/depleted amino acids around the central lysine unfold the importance of feature selection based on ordinal protein sequences. Accordingly, the importance of a feature extraction scheme based on the combination of multiple sequential features comes to light for predicting malonylation sites more efficiently.
Feature extraction analysis. As described earlier, different features are extracted from the protein sequences in order to identify malonylation sites precisely. In this study, seven feature extraction schemes were applied to the protein sequences.
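The four assessment measures defined in the previous subsection (Sn, Sp, ACC, and MCC) can be computed directly from the confusion counts; a minimal sketch using the standard definitions (the function name is an illustrative assumption):

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Sn, Sp, ACC and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                       # sensitivity (recall)
    sp = tn / (tn + fp)                       # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)     # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, acc, mcc

# Perfect prediction on a balanced toy set
sn, sp, acc, mcc = confusion_metrics(tp=50, tn=50, fp=0, fn=0)
```

These are the quantities averaged over the tenfold iterations in the experiments that follow.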
A random forest classifier was trained on each feature scheme (EAAC, EGAAC, PKA, DDE, TF-IDF, TF-CRF, and PSSM) through a tenfold cross-validation strategy to assess the attributes of each method. The results are depicted in Fig. 4 for the three datasets. It is observed that TF-CRF is more discriminative than the others, with higher accuracy on all of the datasets. Moreover, EAAC, EGAAC, and PKA have promising and comparable results. Based on these results, combinations of features were exploited, and the RF classifier was trained and tested on each combination. In order to obtain the best features, the combinations have been compared with each other. In this phase, the features are selected and combined randomly, with features having higher independent prediction rates given higher selection priority. At this stage, combinations of 2 to 5 features have been primarily assessed and compared with each other.
In this paper, a number of classification methods, including XGBoost, SVM, RF, and DNN, have been used. It should be noted that other classifiers, including k-nearest neighbors (KNN) and naïve Bayes classifiers, have also been assessed empirically; however, they were not reported due to their low performance. In order to assess various classifiers, they have been compared in terms of various metrics, including accuracy, error rate, etc. The results are reported in the following.
Parameter tuning is performed based on a series of trials. A penalty factor of 2 and an RBF kernel are used in SVM classification. The number of random trees in the RF classifier has been 100, with the Gini split criterion. An exponential cost function is used in XGBoost, and the number of estimators and the learning rate have been 80 and 0.1, respectively. Also, the DNN is modeled by a 4-layered structure with a learning rate of 0.08.
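With the stated hyperparameters, the classifier setup might look as follows. This is a sketch under assumptions: the paper's own code is not shown here, the object names are the usual scikit-learn and xgboost APIs, and the exponential XGBoost objective and the 4-layer DNN would require additional custom configuration not reproduced in this fragment.

```python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Hyperparameter values taken from the text; everything else is default.
svm = SVC(C=2, kernel="rbf", probability=True)            # penalty factor 2, RBF kernel
rf = RandomForestClassifier(n_estimators=100,             # 100 random trees
                            criterion="gini")             # Gini split criterion
xgb = XGBClassifier(n_estimators=80, learning_rate=0.1)   # 80 estimators, lr 0.1
# DNN: 4 layers, learning rate 0.08 -- framework-specific, omitted here.
```

Each model would then be fitted on the selected-feature training folds and scored with the measures defined earlier.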
Moreover, as shown in Table 3, the feature selection method has increased the performance of the various classifiers. Indeed, highly discriminative features have been selected via the F-score method, and redundant ones have been eliminated. This has improved the performance measures of all of the approaches. Regarding the different dimensionalities of the datasets, a variety of features have been selected based on a number of trials. Apparently, no unique combination globally outperforms the others on all of the datasets. On H. sapiens and M. musculus, the second combination has better performance, whilst the third is the best for E. coli. Regarding the number of training samples and the structural differences between protein sequences across the datasets, the extracted features have different discrimination performances for each dataset. By eliminating redundant and uncorrelated features at the feature selection phase, the second combination outperforms the others on all of the datasets. As depicted in Fig. 4, TFCRF has shown the best performance on all of the datasets. In this scheme, features are weighted by considering their distribution within classes in addition to their distribution across sequences. Also, the weighting has not been independent of the number of sequences in each class. These factors have increased the classification performance based on TFCRF. In comparison with other feature weighting schemes, this method can drastically increase classification performance.
For a deeper analysis of the various feature combinations, the ROC diagram on the training dataset is sketched in Fig. 5. The ROC curve is depicted for the third combination, selecting 80% of the best features in the datasets M. musculus, E. coli, and H. sapiens. As evident in the ROC curves of the SVM, XGBoost, RF, and DNN classifiers, the area under the curve for XGBoost is considerably greater than that of the other methods, indicating its potent generalization and high performance for predicting malonylation and non-malonylation sites of lysine residues.
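The area under a ROC curve such as those in Fig. 5 is typically obtained by trapezoidal integration over the (FPR, TPR) points; a minimal sketch on hypothetical points (the function name and the toy curve are illustrative assumptions):

```python
def auc_trapezoid(points):
    """Area under a ROC curve given (fpr, tpr) points sorted by fpr,
    using the trapezoidal rule between consecutive points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# A hypothetical classifier better than chance (diagonal AUC = 0.5)
auc = auc_trapezoid([(0.0, 0.0), (0.2, 0.7), (1.0, 1.0)])
```

An AUC near 1 corresponds to the strongly bowed curves reported for XGBoost, while an AUC of 0.5 corresponds to random prediction.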
The values of AUPR and AUROC for the various classifiers on the three datasets are tabulated in Table 4. As can be seen, XGBoost outperforms the other methods. To study the significance of the results, the p-values of AUPR (namely P-AUPR) and AUROC (namely P-AUROC) for the various methods and datasets are also given in Table 4. As can be seen, the prediction rate of each method is significantly higher than that of random prediction. In addition, the XGBoost classifier outperforms the others, having a lower p-value.
Error analysis is carried out to depict model robustness and stability. An error bar conveys the estimated error or uncertainty of a measurement, providing a deeper understanding of it. Typically, error bars denote the standard deviation, standard error, confidence intervals, or minimum/maximum values in a dataset. The length of an error bar helps to picture the uncertainty associated with a data point: a short error bar shows that the values are compact, whilst a long error bar indicates sparsity of the data values. A comparison is carried out between DNN, RF, XGBoost, and SVM. The accuracies of the algorithms under a tenfold cross-validation strategy are depicted in Fig. 6 for the underlying datasets. As evident from Fig. 6, XGBoost has outperformed the others, and DNN depicts the highest error regarding the lengths of the bars. The shorter error bars in Fig. 6 indicate a higher accuracy of the corresponding algorithm and a lower variance of the model accuracy. According to this diagram, it can be concluded that the results of the iterations of the tenfold cross-validation have been close for XGBoost, leading to errors approximately equal to zero. Therefore, this model has high generalization performance.
However, the reverse has taken place for DNN, addressing that the results of the iterations in tenfold cross-validation are not close, leading to a higher variance in the accuracy, and hence, a lower generalization performance.
Evaluation through comparison with other methods. For further analysis, the proposed method is compared with various prediction methods on the datasets E. coli, H. sapiens, and M. musculus in terms of the ACC, Sn, Sp, and MCC measures. The results are reported in Table 5.
Table 3. Classification performance with the combination of features when F-score is applied for feature selection. Significant values are in bold.
As shown, the proposed method has favorable performance. Since the extracted features are based on TFCRF in the proposed scheme, the discrimination performance is higher (as discussed in Sections "Feature extraction" to "Term frequency and category relevancy factor (TF-CRF)"); thus, a higher recognition rate is achieved. In addition, dimension reduction through selecting highly relevant features has increased the performance of the proposed method, since model overfitting is potentially mitigated.

Conclusion
In this paper, a machine learning-based method has been proposed for malonylation site prediction. Since the input features are crucial in machine-learning models, several features, including a novel one based on TF-CRF, have been extracted from protein sequences. Next, the features are combined. Since feature combination leads to high-dimensional data and, in turn, model overfitting, the most efficient and discriminating features have been selected via a feature selection method. The results show that XGBoost outperforms the other classifiers based on the extracted and selected features.

Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.