Introduction

Post-translational modification (PTM) is one of the fundamental mechanisms regulating many biological processes. To date, more than 620 types of PTMs have been discovered, ranging from the attachment of small chemical groups to that of a small protein. Malonylation is a recently identified PTM, wherein positively charged lysine residues of a protein are chemically modified by the addition of a negatively charged malonyl group; it plays a crucial role in various cellular operations, biological processes, and the regulation of cellular dynamics1,2,3,4. In 2011, lysine malonylation substrates were identified through proteomic analysis, demonstrating their prominent effects on eukaryotic and prokaryotic cells1. Proteins interact continuously, and aberrant PTMs may result in disease. Therefore, rigorous and precise scrutiny of PTMs is needed, through which mechanisms and conditions such as cancer, diabetes, and autoimmune disorders could be identified5,6,7. Given the crucial importance of malonylation, precise identification of protein malonylation sites is a primary concern, leading to useful biomedical information and in-depth insight into molecular function. Thus far, many computational and experimental methods have been proposed for detecting malonylation sites8. However, experimental methods suffer from temporal and financial limitations, and their implementation is cumbersome. Hence, an efficient computational method is required to identify malonylation sites accurately. Some recent works have employed machine learning and deep learning methods to predict malonylation sites9. The main contributions of such methods include feature extraction and selection for efficient classification, or model design such as hybrid or deep learning models.

In10, the “Mal-Lys” method is presented to predict K-mal sites. In this approach, residue sequence order information, position-specific amino acid propensity, and physicochemical properties are extracted as features. Then, the significant features are identified by the “minimum redundancy maximum relevance” (mRMR) approach. Eventually, the existence of a malonylation site is predicted via a support vector machine (SVM). Wang et al.11 proposed a novel method for malonylation site recognition based on unique sequences, evolutionary profiles of sequences, and amino-acid attributes. In12, sequence orders, gene ontologies, and their composition have been used as features, and an SVM is used for classification. The results showed that combining features yields better performance. In the “SPRINT-Mal” method13, sequence-based and structural features are extracted from the protein sequences. It is the first online prediction scheme to consider the structural attributes of proteins. The prediction is also carried out by an SVM.

In14, eleven types of features are extracted from protein sequences. Owing to the high dimensionality of the feature vectors, the features are further ranked by their gain ratio, and the significant features are selected. Then, several classifiers are employed, including a decision tree, a support vector machine, K-nearest neighbors, logistic regression, and a light gradient boosting machine. In15, features capturing neighboring amino-acid interactions are extracted using a B-peptide-based scheme. Then, light gradient boosting classification is employed to identify the malonylation sites.

In16, pseudo-amino acid composition features have been used to train an SVM classifier to identify malonylation sites. In17, a novel approach, called CKSAAP_FormSite, is proposed. In this method, an efficient feature extraction scheme based on the composition of k-spaced amino acid pairs is used for encoding malonylation sites. Then, malonylation sites are detected using an SVM. In18, a three-phase approach is presented: in the first stage, features are extracted based on sequence orders; then, the data of the two classes are balanced using random sampling; eventually, malonylation sites are predicted by a random forest classifier.

In19, a machine learning-based scheme is proposed for predicting malonylation sites. In this approach, physicochemical attributes along with sequential, structural, and functional information of proteins are used as features. Then, mRMR and symmetrical uncertainty methods are used for efficient feature selection; the classification model is an SVM. Feature combination is considered in20. In this scheme, one-hot coding, physicochemical attributes, and the composition of k-spaced amino acid pairs are considered for feature extraction. Then, principal component analysis (PCA) is used to derive efficient features, and an SVM is used to predict malonylation sites. In21, the predicted secondary structure of amino acids is used to extract two types of structural features from neighboring amino acids in protein sequences. The results show that the proposed method has promising performance.

Recently, deep learning-based approaches have gained ground for predicting malonylation sites. However, these methods are not end-to-end and require features to be extracted from the input data; the extracted features are then fed into the deep networks. Moreover, a great deal of training data is required to tune the parameters of the deep networks, while only a limited amount of data is currently available. In22, a hybrid model combining a convolutional neural network (CNN) with physicochemical attributes, evolutionary information, and sequential features is used to identify protein malonylation sites in mammals.

In23, a deep learning (DL) model based on long short-term memory (LSTM) together with word embedding is proposed for malonylation site prediction. The proposed method outperforms traditional approaches based on various hand-crafted features by applying LSTM-based DL classification to one-hot vectors. This method is sensitive to the size of the training set; however, a combination with traditional machine learning may overcome this weakness. In24, conditional generative adversarial networks (CGAN) have been used to identify seven different types of malonylation sites. First, the features are extracted via eight sequential and four structural feature extraction schemes. Then, the features are filtered using Pearson correlation, resulting in 1479 features. Afterward, the instances of the two classes are balanced by a CGAN and a conditional Wasserstein generative adversarial network (CWGAN). A random forest classifier is employed to predict malonylation sites.

In25, a multi-layer perceptron (MLP) is presented. In this approach, six different feature types are extracted from protein sequences, and an MLP is employed for malonylation site prediction. A DL-based method is presented in26 to increase the prediction rate. For this purpose, features such as position-specific amino acid composition, the composition of k-spaced amino acid pairs, and the position-specific scoring matrix are extracted from protein sequences. Then, maximal dependence decomposition is employed to extract efficient features. Eventually, a multi-layered DL network carries out the classification. Transfer learning approaches have been employed in27 to achieve prediction at large scale. In this work, a recurrent neural network-based deep learning model is first trained and then fine-tuned on propionylation data. The trained model is used for feature extraction: it is fed a protein sequence and yields the corresponding feature vector as output. An SVM is used for the final classification. In28, five different feature types are extracted from protein sequences, resulting in a feature vector of length 1431. The features are fed into a CNN, and the classification is carried out in the last, fully connected layers.

In29, DeepPPSite is presented for phosphorylation site prediction based on LSTM neural networks. In this method, various features, including PSSM, IPC, and EGBW, are extracted, and the prediction is then carried out via an LSTM. In30, site prediction is conducted using one-hot-encoded features and CNN classifiers: the features are extracted via one-hot encoding and then fed into a one-dimensional CNN classifier. In31, various feature sequences are used for malonylation site prediction. The prediction is carried out by DNNs, and a method called NearMiss-2 is used to cope with imbalanced data. In32, eight different feature extraction schemes and three structural features have been studied. In this approach, various features are combined, and the combined features perform better than single, one-dimensional features.

The primary focus of the present work is on delivering a novel feature extraction strategy to predict malonylation sites efficiently. For this purpose, various features are first extracted from protein sequences. The primary features are combined; each combination is assessed and weighted, and the best one is selected. Features are selected based on the Fisher score (F-score) to retain efficient features and avoid model over-fitting. Eventually, the classification is carried out via various classifiers, including random forest (RF), extreme gradient boosting (XGBoost), SVM, and DNN. Overall, a five-stage approach is proposed in the present work: feature extraction is carried out in the first stage; preprocessing of the extracted features is conducted in the second stage; the third stage is dedicated to selecting features from various combinations; classification to predict malonylation sites is performed in the fourth stage; and model assessment is carried out in the fifth stage. Specific contributions and novelties of this paper can be summarized as follows:

  • The term frequency and category relevancy factor (TFCRF) method for weighting features is investigated. Some weighting schemes inspired by document analysis have already been used for malonylation site prediction; however, to the best of our knowledge, TFCRF has not been explored yet. In this method, the distribution of features within each class is considered along with their distribution across the sequences of all classes. The results show the efficiency of TFCRF.

  • The proposed feature combination scheme provides feature-level diversity, improving amino-acid sequence classification. That is, each combined feature contributes a specific piece of information: the TFCRF feature carries binary class-distribution information, the position-specific scoring matrix (PSSM) contains evolutionary sequence information, and the other features capture frequency information. This strategy has seldom been investigated in related works thus far.

  • Selecting relevant features and omitting redundant ones is another novelty of the proposed method, which has rarely been considered in previous works. For this purpose, the best feature combination is selected based on the Fisher score.

The remaining sections of the paper are as follows. Section “Feature extraction” describes various feature extraction schemes for malonylation site identification. Section “The proposed method” elaborates the five stages of the proposed method, including feature extraction, preprocessing, feature selection, classification, and model assessment. Section “Experimental results” describes the experimental results for the proposed approach, and the outcomes are compared with several other common methods. Finally, Section “Conclusion” concludes the paper.

Feature extraction

One of the most important phases in malonylation site prediction is feature extraction. A first approach is to extract various pre-known features from protein sequences and then devise a classification process. A second approach is to design an end-to-end deep neural network model, through which significant features are extracted systematically and the classification is conducted on the basis of such features. No end-to-end model has been proposed thus far. In most of the presented works, the features are extracted using known feature extraction methods, and then classical machine learning or deep learning models are employed for classification. Typically, end-to-end models are not recommended because the available data are insufficient for training the many parameters of deep neural networks. We therefore opt to extract significant pre-known features from protein sequences in the proposed method. The nominal character sequences can be converted to numerical vectors using several feature extraction algorithms, and extracting efficient features enhances classification performance. To extract features from protein sequences, the following algorithms are employed: the enhanced amino acid composition (EAAC)33, the enhanced grouped amino acid composition (EGAAC)33, dipeptide deviation from expected mean (DDE)34, PKA35, term frequency-inverse document frequency (TF-IDF)36, term frequency and category relevancy factor (TF-CRF)37, and the position-specific scoring matrix (PSSM)38. These methods are elaborated in the following subsections.

Enhanced amino acid composition (EAAC)

This method was presented by Chen et al.33. In this algorithm, sequential protein information is extracted, and the amino-acid frequency information is calculated as33:

$$g\left(m,n\right)=\frac{H\left(m,n\right)}{H\left(n\right)}, \quad m\in \left\{A,C,D,\cdots ,Y\right\},\; n\in \left\{W_1,W_2,\cdots ,W_L\right\}$$
(1)

where \(H(m,n)\) is the number of amino acids of type \(m\) in window \(n\), and \(H(n)\) is the length of the \(n\)’th window.
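To make the computation concrete, the following minimal Python sketch implements Eq. (1) with a sliding window of size five (the window size used later in this paper); the function name and the example sequence are illustrative, not taken from any library.

```python
# A minimal sketch of EAAC (Eq. 1); each window of length 5 contributes
# 20 frequency values, one per amino-acid type.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def eaac(sequence: str, window: int = 5) -> list[float]:
    """g(m, n) = H(m, n) / H(n) for every window n and amino-acid type m."""
    features = []
    for start in range(len(sequence) - window + 1):
        w = sequence[start:start + window]            # the n'th window
        for aa in AMINO_ACIDS:                        # amino-acid type m
            features.append(w.count(aa) / len(w))     # H(m, n) / H(n)
    return features

print(len(eaac("MKVLAAGKK")))  # (9 - 5 + 1) windows x 20 types = 100 features
```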

Enhanced grouped amino acid composition (EGAAC)

In this method, protein sequences are converted to numerical feature vectors based on the physicochemical attributes of their residues. It is a compelling feature extraction algorithm in bioinformatics research fields such as malonylation site prediction.

EGAAC is computed based on amino-acid categorization. In39, amino acids are categorized according to five physicochemical characteristics: aliphatic (the GAVLMI amino acids), aromatic (the FYW amino acids), positively charged (the KRH amino acids), negatively charged (the DE amino acids), and neutral or uncharged (the STCPNQ amino acids). Accordingly, EGAAC is calculated based on the following equation:

$$G\left(g,n\right)=\frac{H\left(g,n\right)}{H\left(n\right)}, \quad g\in \left\{g_1,g_2,g_3,g_4,g_5\right\},\; n\in \left\{W_1,W_2,\cdots ,W_L\right\}$$
(2)

where \(g\) is one of the five categories, \(H(g,n)\) is the number of amino acids of group \(g\) in window \(n\), and \(H(n)\) is the length of the \(n\)’th window33. A window size of five is used in this paper.
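A minimal sketch of Eq. (2) in the same style follows, using the five physicochemical groups listed above; names are illustrative, not from a library.

```python
# EGAAC (Eq. 2): per-window frequency of each physicochemical group.
GROUPS = [
    set("GAVLMI"),   # g1: aliphatic
    set("FYW"),      # g2: aromatic
    set("KRH"),      # g3: positively charged
    set("DE"),       # g4: negatively charged
    set("STCPNQ"),   # g5: neutral / uncharged
]

def egaac(sequence: str, window: int = 5) -> list[float]:
    """G(g, n) = H(g, n) / H(n) for every window n and group g."""
    features = []
    for start in range(len(sequence) - window + 1):
        w = sequence[start:start + window]
        for group in GROUPS:
            features.append(sum(c in group for c in w) / len(w))
    return features
```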

Dipeptide deviation from expected mean

Dipeptide deviation from the expected mean (DDE) was proposed in34, where feature extraction based on amino-acid composition is studied to discriminate B-cell epitopes from non-epitopes. For this purpose, the dipeptide composition (DC) of a protein sequence is first calculated as:

$$DC\left(m,n\right)= \frac{{H}_{mn}}{H-1} , m,n\in \left\{A,C,D,\cdots ,Y\right\}$$
(3)

where \({H}_{mn}\) is the number of occurrences of the dipeptide \(mn\), and \(H\) is the length of the protein sequence. Next, a protein’s theoretical mean (TM) and theoretical variance (TV) are computed as:

$$TM\left(m,n\right)= \frac{{C}_{m}}{{C}_{H}}\times \frac{{C}_{n}}{{C}_{H}}$$
(4)
$$TV\left(m,n\right)= \frac{TM\left(m,n\right)\left(1-TM\left(m,n\right)\right)}{H-1}$$
(5)

where \({C}_{m}\) and \({C}_{n}\) are the numbers of codons encoding the first and the second amino acids, respectively, and \({C}_{H}\) is the total number of codons. Finally, DDE is calculated based on TV, TM, and DC as:

$$DDE\left(m,n\right)= \frac{DC\left(m,n\right)-TM\left(m,n\right)}{\sqrt{TV\left(m,n\right)}}$$
(6)
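The following sketch ties Eqs. (3) through (6) together; the codon counts are those of the standard genetic code (61 sense codons), and the function name is illustrative.

```python
import math
from itertools import product

# Number of codons encoding each amino acid (standard genetic code).
CODONS = {"A": 4, "C": 2, "D": 2, "E": 2, "F": 2, "G": 4, "H": 2, "I": 3,
          "K": 2, "L": 6, "M": 1, "N": 2, "P": 4, "Q": 2, "R": 6, "S": 6,
          "T": 4, "V": 4, "W": 1, "Y": 2}
C_H = sum(CODONS.values())  # 61 sense codons in total

def dde(sequence: str) -> list[float]:
    """DDE(m, n) = (DC - TM) / sqrt(TV) for all 400 dipeptides (Eqs. 3-6)."""
    H = len(sequence)
    features = []
    for m, n in product(CODONS, repeat=2):
        count = sum(sequence[i:i + 2] == m + n for i in range(H - 1))
        dc = count / (H - 1)                           # Eq. (3)
        tm = (CODONS[m] / C_H) * (CODONS[n] / C_H)     # Eq. (4)
        tv = tm * (1 - tm) / (H - 1)                   # Eq. (5)
        features.append((dc - tm) / math.sqrt(tv))     # Eq. (6)
    return features
```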

PKA

This feature is the negative logarithm of the acid dissociation constant for each ionizable group in the molecule35.

Term frequency: inverse document frequency

TF-IDF feature extraction is composed of two terms, TF and IDF, which stand for term frequency and inverse document frequency, respectively. The two terms are calculated separately and multiplied to yield the TF-IDF coefficient36. Each term is defined as follows:

\(TF(t,d)\): the number of occurrences of amino acid \(t\) in a protein sequence \(d\), divided by the length of the sequence.

\(IDF(t)\): the logarithm of the total number of proteins (namely \(\left|D\right|\)) divided by the number of sequences that include amino acid \(t\) (namely \(DF(t)\)). It is calculated as:

$$IDF\left(t\right)= \mathrm{log}\left(\frac{\left|D\right|}{DF\left(t\right)}\right)$$
(7)

Having calculated TF and IDF, TF-IDF is calculated as:

$$TF-IDF\left(t,d\right)=TF\left(t,d\right)\times IDF\left(t\right)$$
(8)
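A minimal sketch of Eqs. (7) and (8) over a set of sequences follows; the guard against \(DF(t)=0\) is an added assumption, not spelled out in the text.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def tf_idf(sequences: list[str]) -> list[list[float]]:
    """TF-IDF weight per amino acid for each sequence (Eqs. 7-8)."""
    n_docs = len(sequences)
    idf = {t: math.log(n_docs / sum(t in d for d in sequences))
           for t in ALPHABET if any(t in d for d in sequences)}
    return [[d.count(t) / len(d) * idf.get(t, 0.0) for t in ALPHABET]
            for d in sequences]

print(tf_idf(["MKKAC", "MKACD"]))  # one 20-dimensional vector per sequence
```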

Term frequency and category relevancy factor (TF-CRF)

In this method, two factors, namely positiveRF (positive relation frequency) and negativeRF (negative relation frequency), are defined as follows37:

PositiveRF

This factor is the ratio of the number of sequences in class \({c}_{j}\) that contain feature \({t}_{k}\) to the total number of sequences in class \({c}_{j}\). It is calculated as:

$$PositiveRF\left({t}_{k},{c}_{j}\right)=\frac{\left|D\left({t}_{k},{c}_{j}\right)\right|}{\left|D\left({c}_{j}\right)\right|}$$
(9)

NegativeRF

This factor is the ratio of the number of sequences in all classes other than \({c}_{j}\) that contain feature \({t}_{k}\) to the total number of sequences in those classes. It is calculated as:

$$NegativeRF\left({t}_{k},{c}_{j}\right)= \frac{\sum_{m=1, m\ne j}^{\left|c\right|}\left|D\left({t}_{k},{c}_{m}\right)\right|}{\sum_{m=1, m\ne j}^{\left|c\right|}\left|D\left({c}_{m}\right)\right|}$$
(10)

where \(\left|D({c}_{j})\right|\) is the number of sequences in class \({c}_{j}\), and \(\left|D({t}_{k},{c}_{j})\right|\) is the number of sequences in class \({c}_{j}\) that contain feature \({t}_{k}\).

Category relevancy factor value (crfValue) is defined as follows, considering the equations mentioned above:

$$crfValue({t}_{k},{c}_{j})=\frac{PositiveRF\left({t}_{k},{c}_{j}\right)}{NegativeRF\left({t}_{k},{c}_{j}\right)}$$
(11)

The relevance factor of each category is directly related to positiveRF and inversely related to negativeRF. Accordingly, the proposed weight for feature \({t}_{k}\) in protein sequence \(d_{i}\) is:

$$w_{ki} = \log \left( tf\left( t_{k},d_{i} \right) \times crfValue\left( t_{k},c_{d_{i}} \right) \right) = \log \left( tf\left( t_{k},d_{i} \right) \times \frac{PositiveRF\left( t_{k},c_{d_{i}} \right)}{NegativeRF\left( t_{k},c_{d_{i}} \right)} \right)$$
(12)

where \({c}_{{d}_{i}}\) is the category of protein sequence \({d}_{i}\). Normalization is used to mitigate the effect of sequence length on classification performance; it confines the weights to the range \((0,1)\). The final TFCRF equation is:

$$W_{ki} = TFCRF\left( t_{k},d_{i} \right) = \frac{\log \left(tf\left( t_{k},d_{i} \right) \times crfValue\left( t_{k},c_{d_{i}} \right)\right)}{\sqrt{\sum_{k} \left(\log \left(tf\left( t_{k},d_{i} \right) \times crfValue\left( t_{k},c_{d_{i}} \right)\right)\right)^{2}}}$$
(13)

Accordingly, the content of each protein sequence is represented by a feature vector \({d}_{i}=({W}_{1i},{W}_{2i},\dots ,{W}_{ki})\), where \(k\) is the total number of selected features, and \({W}_{ki}\) is the weight of feature (i.e., amino acid) \({t}_{k}\) in sequence \({d}_{i}\). \({W}_{ki}\) indicates to what extent feature \({t}_{k}\) captures the content of protein sequence \({d}_{i}\).
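A minimal sketch of Eqs. (9) through (13) follows, assuming at least two classes; the small epsilon guard keeping the ratios and the logarithm defined when a count is zero is an added assumption, not part of the original formulation.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def tfcrf(sequences: list[str], labels: list[int]) -> list[list[float]]:
    """TFCRF weights (Eqs. 9-13)."""
    classes = sorted(set(labels))
    eps = 1e-9  # added guard so the logarithm and ratios stay defined

    # |D(c)|: sequences per class; |D(t, c)|: sequences in c containing t
    n_class = {c: sum(l == c for l in labels) for c in classes}
    n_with = {(t, c): sum(t in d for d, l in zip(sequences, labels) if l == c)
              for t in ALPHABET for c in classes}

    def crf_value(t: str, c: int) -> float:
        pos = n_with[(t, c)] / n_class[c]                              # Eq. (9)
        neg = (sum(n_with[(t, m)] for m in classes if m != c)
               / sum(n_class[m] for m in classes if m != c))           # Eq. (10)
        return pos / (neg + eps)                                       # Eq. (11)

    vectors = []
    for d, c in zip(sequences, labels):
        raw = [math.log(eps + d.count(t) / len(d) * crf_value(t, c))   # Eq. (12)
               for t in ALPHABET]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])                        # Eq. (13)
    return vectors
```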

Most class-based weighting methods, such as IDF, were designed for information retrieval (IR) and document analysis purposes, not for protein sequence classification. Hence, some aspects of IR and document analysis that are also relevant to protein sequence classification have been neglected. The TFCRF weighting method accounts for such aspects, as stated in the following.

Consider a set of protein sequences belonging to a number of classes, each with a specific number of instances. Figure 1 depicts the distribution of a feature, namely \(x\), in 4 hypothetical states with respect to a class, namely \({c}_{i}\). In this figure, \(b\) and \(a\) are the numbers of sequences in class \({c}_{i}\) that include and exclude feature \(x\), respectively; likewise, \(c\) and \(d\) denote the numbers of sequences in all classes other than \({c}_{i}\) that include and exclude feature \(x\), respectively. The frequency of feature \(x\) is held constant in all states.

Figure 1

Distributions of feature \(x\) across classes in 4 hypothetical states.

In IDF-based schemes, the weight of a feature is inversely related to the number of sequences that include the feature. In the above example, the weight of feature \(x\) in class \({c}_{i}\) is calculated via (7) as:

$$idf\left(x\right)=\mathrm{log}\frac{N}{b+c}=\mathrm{log}\frac{a+b+c+d}{b+c}$$
(14)

where \(N\) is the total number of sequences. Let \({w}_{x}^{s}\) denote the weight of feature \(x\) in state \(s\) of Fig. 1. Then, the relation between the weights of \(x\) in the various states is:

$${w}_{x}^{1}={w}_{x}^{2}={w}_{x}^{3}={w}_{x}^{4}$$

As can be seen, the weight of feature \(x\) does not change across the states, because the number of sequences including it (i.e., \(b+c\)) is identical; yet the status of feature \(x\) within class \({c}_{i}\) clearly differs between the states, and this fact is overlooked in weighting the feature. Furthermore, in IDF-based approaches, the more sequences include a specific feature, the less discriminative the feature is considered, and hence it is assigned a lower weight. Although this is an accurate hypothesis in IR, it needs to be revised for protein classification. As evident from Fig. 1, even if a large number of sequences include \(x\), when most of those sequences belong to the same class \({c}_{i}\) (states 3 and 4 in Fig. 1), feature \(x\) is not only useful but must be deemed significant for discriminating class \({c}_{i}\) from the others and assigned a large weight. Conversely, a lower weight should be assigned to \(x\) in class \({c}_{i}\) if a great number of sequences from classes other than \({c}_{i}\) include feature \(x\) (state 2 in Fig. 1).

The crfValue introduced in TFCRF delivers a solution to the abovementioned problem: the weight of a feature in a sequence is directly related to the number of sequences of that sequence’s class that contain the feature, and inversely related to the number of sequences of the other classes that contain it. In the example of Fig. 1, the weight of feature \(x\) in class \({c}_{i}\) via (11) equals:

$$crfValue(x,{c}_{i})=\frac{\frac{b}{b+a}}{\frac{c}{c+d}}$$
(15)

As a result, the relation between the weights of feature \(x\) in Fig. 1 is:

$${w}_{x}^{2}<{w}_{x}^{1}<{w}_{x}^{3}<{w}_{x}^{4}$$

It can be seen that this method takes into account the classes in which a feature occurs. Note also that crfValue is not independent of the number of sequences in each class, which substantially increases the performance of sequence classifiers.

PSSM

The position-specific scoring matrix (PSSM) is a scoring matrix used in protein BLAST searches, in which a score is assigned to each amino acid based on its position in the aligned sequences of a number of proteins41. The matrix can be written as:

$$PSSM=\left[\begin{array}{ccc}{P}_{1,1}& \cdots & {P}_{1,20}\\ \vdots & \ddots & \vdots \\ {P}_{L,1}& \cdots & {P}_{L,20}\end{array}\right]$$
(16)

where \(L\) is the protein sequence length and the 20 columns correspond to the possible amino acids. Each element of the PSSM is calculated as:

$${P}_{i,j}={\mathrm{log}}_{2}\left(\frac{{M}_{i,j}}{{b}_{j}}\right)$$
(17)

where \({M}_{i,j}\) is the probability of amino acid \(j\) appearing at position \(i\), and \({b}_{j}\) is the background model for amino acid \(j\) (e.g., \({b}_{j}=0.05\) under a uniform distribution over the amino acids). PSSM scores can be positive or negative: positive values indicate that the amino acid occurs at that position more often than expected by chance, while negative values indicate that it occurs less often than expected. The PSSM captures locational and evolutionary information of protein sequences.
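A minimal sketch of Eq. (17) from a set of equal-length aligned sequences follows; in practice \(M_{i,j}\) comes from tools such as PSI-BLAST, and the Laplace pseudocount here is an added assumption that keeps the logarithm defined when a count is zero.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def pssm(aligned: list[str], background: float = 0.05) -> list[list[float]]:
    """P_{i,j} = log2(M_{i,j} / b_j) (Eq. 17), assuming a uniform background."""
    L, n = len(aligned[0]), len(aligned)
    matrix = []
    for i in range(L):
        column = [seq[i] for seq in aligned]
        # Laplace pseudocount keeps log2 defined for unobserved amino acids
        matrix.append([
            math.log2(((column.count(aa) + 1) / (n + len(ALPHABET))) / background)
            for aa in ALPHABET
        ])
    return matrix
```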

The proposed method

This section proposes a novel model for predicting malonylation sites based on feature extraction and machine learning algorithms. The overall schema of the proposed method is depicted in Fig. 2. It comprises five major stages: dataset selection, feature extraction, preprocessing (normalization), feature selection, and classification with model assessment. Each stage is elaborated in the following.

Figure 2

The overall block diagram of the proposed method.

Stage 1: dataset selection

Three datasets, namely Escherichia coli, Mus musculus, and Homo sapiens40, have been used for training and testing the proposed method. Each dataset is randomly divided into training and test sets. For reliable analysis, a tenfold cross-validation strategy is conducted: at each iteration, one fold is used as the test set, and the remaining nine folds are used for training the model. Model parameters are tuned on the training sets. The final result is the average over the 10 iterations.
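A minimal sketch of this protocol is given below; the synthetic arrays are placeholders standing in for the real feature vectors and labels, and stratified folds are an assumption, not stated in the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 45))       # placeholder: 200 samples, 45 features
y = rng.integers(0, 2, size=200)     # malonylation vs. non-malonylation

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(), X, y, cv=cv)
print(scores.mean())                 # the average over the 10 folds
```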

Stage 2: feature extraction

At this stage, feature extraction methods including EAAC, EGAAC, TF-IDF, PSSM, and TF-CRF have been applied as follows:

  a. EAAC and EGAAC: in EAAC, amino-acid frequencies are calculated, and in EGAAC, the protein sequences are converted to numerical vectors based on their physicochemical characteristics. The resulting feature vectors are of lengths 20 and 45, respectively.

  b. TF-IDF: used for calculating the weighted frequency of amino acids. This method aims to capture an amino acid’s significance by weighting its frequency in a sequence against its frequency across all sequences. The resulting feature vector is of length 20.

  c. TF-CRF: used for more precise weighting via two factors, i.e., positiveRF and negativeRF. The resulting feature vector is of length 20.

  d. PSSM: a score is assigned to each amino acid based on its position in the protein sequence. The resulting feature vector is of length 400.

  e. PKA: the negative logarithm of the dissociation constant for each ionizable group in a molecule; the values pertaining to each amino acid are taken into account. The result is a single numerical feature.

Stage 3: preprocessing

The features extracted from protein sequences span various ranges; in the present work, the raw feature values range from 0 to 0.03 in some cases and from 0 to 200 in others. Such differences in feature values can diminish the effect of some important features, and features with widely varying ranges deteriorate the efficacy of the underlying learning models. Accordingly, the data should be normalized. In the present work, Z-score normalization is used for this purpose.

In fact, Z-score normalization standardizes features and mitigates the effect of outlying values. The normalization equation is as follows:

$$z=\frac{x-\mu }{\sigma }$$
(18)

where \(\mu\) and \(\sigma\) are the mean and standard deviation of feature \(x\). A value equal to the mean is normalized to zero; values below or above the mean are normalized to negative or positive values, respectively, with magnitudes determined by the standard deviation. A feature with abnormally dispersed values has a large standard deviation, so its normalized values shrink toward zero.
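A minimal sketch of Eq. (18) applied column-wise to a feature matrix follows; the guard for constant features is an added assumption.

```python
import numpy as np

def z_score(X: np.ndarray) -> np.ndarray:
    """Column-wise Z-score normalization (Eq. 18)."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # guard for constant features
    return (X - mu) / sigma

X = np.array([[0.01, 150.0], [0.02, 180.0], [0.03, 120.0]])
print(z_score(X))  # each column now has zero mean and unit variance
```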

Stage 4: feature selection

The extracted features are used for malonylation site prediction. However, not all of the features are efficient: some may be irrelevant and some redundant, and such features result in model overfitting. Therefore, only the relevant features should be preserved. The Fisher score (F-score) method, a filter-based approach, is applied to identify the relevant features. The F-score criterion for the \(i\)’th feature is calculated as:

$$F-Score\left(i\right)= \frac{\sum_{k=1}^{m}{{n}^{k}\left({\overline{x} }_{i}^{k}- {\overline{x} }_{i}\right)}^{2}}{\sum_{k=1}^{m}\frac{1}{{n}^{k}-1}\sum_{j=1}^{{n}^{k}}{\left({x}_{j,i}^{k}- {\overline{x} }_{i}^{k}\right)}^{2}}$$
(19)

where \({\overline{x} }_{i}^{k}\) and \({\overline{x} }_{i}\) are the means of the \(i\)’th feature in class \(k\) and in the whole dataset, respectively, \({x}_{j,i}^{k}\) is the \(i\)’th feature value of instance \(j\) in class \(k\), \({n}^{k}\) is the number of instances in class \(k\), and \(m\) is the total number of classes. A number of highly ranked features are selected for classification in the next stage.

The key idea of the Fisher score is to find a subset of features such that, in the data space spanned by the selected features, the inter-class distances of data points are maximized while the intra-class distances are minimized. Since this is a combinatorial optimization problem, it is reduced to computing a score for each feature independently via the scoring function of (19); then, a number of highly ranked features are selected. In (19), the numerator and denominator represent the inter-class and intra-class distances with regard to feature \({x}_{i}\), respectively. Although some informative dependencies between features are ignored, this method reduces the time complexity of feature selection to linear order.
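A minimal sketch of Eq. (19) and the subsequent ranking follows, assuming each class has at least two instances and no feature is constant within every class; names are illustrative.

```python
import numpy as np

def fisher_scores(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Per-feature Fisher score (Eq. 19)."""
    overall = X.mean(axis=0)
    numer = np.zeros(X.shape[1])
    denom = np.zeros(X.shape[1])
    for k in np.unique(y):
        Xk = X[y == k]
        numer += len(Xk) * (Xk.mean(axis=0) - overall) ** 2  # inter-class
        denom += Xk.var(axis=0, ddof=1)                      # intra-class
    return numer / denom

def select_top(X: np.ndarray, y: np.ndarray, fraction: float = 0.8):
    """Indices of the highest-ranked features, e.g. the top 80%."""
    scores = fisher_scores(X, y)
    k = int(fraction * X.shape[1])
    return np.argsort(scores)[::-1][:k]
```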

Stage 5: model assessment

A tenfold cross-validation strategy is conducted to assess the prediction performance of the classification model. The classifiers include XGBoost, SVM, RF, and DNN. Various measures, including AUC, ACC, Sn, Sp, and MCC, have been used for performance assessment.

Experimental results

The datasets

An experimentally confirmed dataset is used for the simulations40. The dataset includes 1746 malonylation sites of 595 proteins in “E. coli”, 3435 malonylation sites of 1174 proteins in “M. musculus”, and 4579 malonylation sites of 1660 proteins in “H. sapiens”40. Each amino-acid sequence is truncated to a window of length 25 centered at lysine (K). Table 1 details the characteristics of the dataset.

Table 1 The number of malonylation and non-malonylation samples in the dataset.

Model assessment

A tenfold cross-validation strategy is conducted to tune the models’ parameters on the training dataset, and an independent set is used for testing the model. The efficiency measures sensitivity (Sn), specificity (Sp), accuracy (ACC), and the Matthews correlation coefficient (MCC) have been used to assess the underlying models42. These measures are calculated as follows:

$$Sn=\frac{TP}{TP+FN}$$
(20)
$$Sp=\frac{TN}{TN+FP}$$
(21)
$$ACC=\frac{TP+TN}{TP+TN+FP+FN}$$
(22)
$$MCC=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FN)(TP+FP)(TN+FP)(TN+FN)}}$$
(23)

where TP, TN, FP, and FN denote the number of true positives, true negatives, false positives, and false negatives, respectively.
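The four measures of Eqs. (20) through (23) can be computed directly from the confusion-matrix counts, as in the following sketch; the zero-denominator guard for MCC is an added assumption.

```python
import math

def metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Sn, Sp, ACC, and MCC from confusion-matrix counts (Eqs. 20-23)."""
    mcc_den = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return {"Sn": tp / (tp + fn),
            "Sp": tn / (tn + fp),
            "ACC": (tp + tn) / (tp + tn + fp + fn),
            "MCC": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0}

print(metrics(tp=90, tn=85, fp=15, fn=10))
```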

Sequence analysis

The datasets of “H. sapiens,” “E. coli,” and “M. musculus” have been used to discriminate malonylation from non-malonylation sites. The statistical differences between protein sequences of malonylation and non-malonylation sites in these datasets are depicted in Fig. 3, following28. The figure represents the amino-acid distribution of protein sequences in each dataset. As shown, lysine is located at the center, and the significantly enriched/depleted surrounding residues are described in the range −12 to +12. The diagram reveals significant differences in amino-acid frequencies between protein sequences of malonylation and non-malonylation sites across sequence fragments. Relative to the central lysine, each amino acid is examined in two categories, i.e., enriched and depleted. In the enriched category, amino-acid frequencies are higher around the central lysine than in the other fragments; the farther from the central lysine, the lower the observed frequency. Moreover, the distinctive enriched/depleted amino acids around the central lysine underline the importance of feature selection based on ordinal protein sequences. Accordingly, a feature extraction scheme combining multiple sequential features becomes important for predicting malonylation sites more efficiently.

Figure 3

The distribution of amino acids around the central lysine in the (A) E. coli, (B) H. sapiens, and (C) M. musculus datasets.

Feature extraction analysis

As described earlier, different features are extracted from protein sequences in order to identify malonylation sites precisely. In this study, seven feature extraction schemes were applied to the protein sequences. A random forest classifier was trained on each individual feature scheme (EAAC, EGAAC, PKA, DDE, TF-IDF, TF-CRF, and PSSM) through a tenfold cross-validation strategy to assess the merits of each method.

The results are depicted in Fig. 4 for the three datasets. It is observed that TF-CRF is more discriminative than the others, with higher accuracy on all of the datasets. Moreover, EAAC, EGAAC, and PKA have promising and comparable results. Based on these results, combinations of features were explored, and the RF classifier was trained and tested on each combination. To obtain the best features, the combinations were compared with each other. In this phase, the features to combine are sampled randomly, with features of higher individual prediction rates given higher selection priority. Combinations of 2 to 5 features were assessed and compared.

Figure 4

Classifiers’ performance comparison, based on singular features.

Three combinations outperformed the others: (1) TF-CRF, EGAAC, and TF-IDF, with a vector of 228 features; (2) EAAC, PKA, PSSM, and TF-CRF, with a vector of 494 features; and (3) EAAC, PKA, PSSM, TF-CRF, and EGAAC, with a vector of 599 features. The results of feeding these feature combinations into various classification models, with the ensuing performance measures, are given in Table 2.

Table 2 The performance of classifiers with various feature combinations.

In this paper, a number of classification methods, including XGBoost, SVM, RF, and DNN, have been used. It should be noted that other classifiers, including k-nearest neighbors (KNN) and naïve Bayes, were also assessed empirically; however, their results are not reported due to low performance. The classifiers have been compared in terms of various metrics, including accuracy and error rate, and the results are reported in the following.

Parameter tuning is performed through a series of trials. A penalty factor of 2 along with an RBF kernel is used for the SVM classifier. The number of random trees in the RF classifier is 100, with the Gini split criterion. An exponential cost function is used in XGBoost, with 80 estimators and a learning rate of 0.1. The DNN is modeled with a 4-layer structure and a learning rate of 0.08.
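The reported settings translate roughly into the following sketch; unreported options are left at library defaults, the DNN is approximated by a 4-layer MLP stand-in with assumed layer sizes, and the exponential cost function has no direct XGBClassifier flag mapped here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

svm = SVC(C=2, kernel="rbf")                   # penalty factor 2, RBF kernel
rf = RandomForestClassifier(n_estimators=100,  # 100 random trees
                            criterion="gini")  # Gini split criterion
xgb = XGBClassifier(n_estimators=80,           # 80 estimators
                    learning_rate=0.1)         # exponential cost not mapped
dnn = MLPClassifier(hidden_layer_sizes=(128, 64, 32, 16),  # assumed sizes
                    learning_rate_init=0.08)   # 4-layer stand-in for the DNN
```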

Moreover, as shown in Table 3, the feature selection method has increased the performance of all classifiers. Indeed, the highly discriminative features have been selected via the F-score method, and the redundant ones have been eliminated, improving the performance measures of all approaches. Given the different dimensionalities of the datasets, the number of selected features was determined through a number of trials. Evidently, no single combination outperforms the others on all of the datasets: before feature selection, the second combination performs better on H. sapiens and M. musculus, whilst the third is best for E. coli. Owing to the different numbers of training samples and the structural differences between protein sequences across the datasets, the extracted features have different discriminative power on each dataset. After eliminating the redundant and uncorrelated features in the feature selection phase, the second combination outperforms the others on all of the datasets.

Table 3 Classification performance with the combination of features when F-score is applied for feature selection.

As depicted in Fig. 4, TF-CRF shows the best performance on all of the datasets. In this scheme, features are weighted by considering their distribution within classes in addition to their distribution across sequences, and the weighting is not independent of the number of sequences in each class. These properties increase the classification performance based on TF-CRF; compared with other feature weighting schemes, this method can drastically increase classification performance.

For a deeper analysis of the feature combinations, the ROC diagrams on the training dataset are plotted in Fig. 5. The ROC curves are depicted for the third combination, with the top 80% of features selected, on the M. musculus, E. coli, and H. sapiens datasets. As evident from the ROC curves of the SVM, XGBoost, RF, and DNN classifiers, the area under the curve for XGBoost is considerably greater than that of the other methods, indicating its strong generalization and high performance for predicting lysine malonylation and non-malonylation sites.

Figure 5

The ROC curves for the proposed method. Diagrams (A), (B), and (C) pertain to the M. musculus, E. coli, and H. sapiens datasets, respectively.

The values of AUPR and AUROC for the various classifiers on the three datasets are tabulated in Table 4. As can be seen, XGBoost outperforms the other methods. To study the significance of the results, the p-values of AUPR (P-AUPR) and AUROC (P-AUROC) for the various methods and datasets are also given in Table 4. As can be seen, the prediction rate of each method is significantly higher than that of random prediction. In addition, the XGBoost classifier outperforms the others, having lower p-values.

Table 4 The values of AUROC, AUPR and their P-values for various classifiers and datasets.

Error analysis is carried out to examine model robustness and stability. Error bars convey estimated errors or uncertainty, providing a deeper understanding of the measurements. Typically, error bars denote the standard deviation, standard error, confidence interval, or minimum/maximum values in a dataset. The length of an error bar pictures the uncertainty associated with a data point: a short error bar shows that the values are tightly clustered around the mean, whilst a long error bar indicates sparse, widely spread values. A comparison is carried out between DNN, RF, XGBoost, and SVM. The accuracies of the algorithms under tenfold cross-validation are plotted in Fig. 6 for the underlying datasets. As evident from Fig. 6, XGBoost has outperformed the others, and DNN shows the largest error judging by the lengths of the bars. Shorter error bars in Fig. 6 indicate higher accuracy and lower variance of the model accuracy. It can be concluded that the results of the tenfold cross-validation iterations have been close for XGBoost, with errors approximately equal to zero; therefore, this model has high generalization performance. The reverse holds for DNN: the results of the cross-validation iterations are not close, leading to higher variance in accuracy and, hence, lower generalization performance.

Figure 6

Studying classification models based on error bars for E. coli, M. musculus, and H. sapiens.

Evaluation through comparison with other methods

For further analysis, the proposed method is compared with various prediction methods on the E. coli, H. sapiens, and M. musculus datasets in terms of the ACC, SN, SP, and MCC measures. The results are given in Table 5. As shown, the proposed method has outperformed MaloPred11, kmal-sp14, DeepMal28, and RF-MaloSite40, with higher ACC, SN, SP, and MCC on all of the datasets. The 97.21% accuracy of the proposed method for E. coli is 12.71%, 17.41%, and 4.2% greater than that of kmal-sp, MaloPred, and DeepMal, respectively. The 95.22% ACC of the proposed method for H. sapiens is likewise 4.3% to 20.22% greater than that of the other prediction models, and the MCC and AUC measures are high for this dataset too. The 94.31% accuracy of the proposed method for M. musculus is greater than that of the other prediction approaches, and its 92.17% MCC outperforms the others on this dataset, considerably improving the results of malonylation site prediction.

Table 5 A comparison between the proposed method and DeepMal, kmal-sp, MaloPred, and RF-MaloSite.

Since the extracted features in the proposed scheme are based on TFCRF, the discrimination performance is higher (as discussed in Sections “Feature extraction” to “Term frequency and category relevancy factor (TF-CRF)”), and thus a higher recognition rate is achieved. In addition, dimension reduction through selecting highly relevant features has increased the performance of the proposed method, since model overfitting is mitigated.

Conclusion

In this paper, a machine learning-based method has been proposed for malonylation site prediction. Since the input features are crucial in machine-learning models, several features, including a novel one based on TF-CRF, have been extracted from protein sequences. Next, the features are combined. Since feature combination leads to high-dimensional data and, in turn, model overfitting, the most efficient and discriminative features have been chosen via a feature selection method. The results show that XGBoost outperforms the other classifiers on the extracted and selected features.