A Novel Modeling in Mathematical Biology for Classification of Signal Peptides

The molecular structure of macromolecules in living cells is ambiguous unless we classify them in a scientific manner. Signal peptides are of vital importance in determining the behavior of newly formed proteins towards their destined path in cellular and extracellular location in both eukaryotes and prokaryotes. In the present research work, a novel method is offered to foreknow the behavior of signal peptides and determine their cleavage site. The proposed model employs neural networks using isolated sets of prokaryote and eukaryote primary sequences. Protein sequences are classified as secretory or non-secretory in order to investigate secretory proteins and their signal peptides. In comparison with the previous prediction tools, the proposed algorithm is more rigorous, well-organized, significantly appropriate and highly accurate for the examination of signal peptides even in extensive collection of protein sequences.

The classification of proteins bears vast interest to researchers across the world. For better understanding of the molecular structure of living cells, it is important to classify and categorize their macromolecules like proteins 1 in terms of their attributes. To examine the behavior of newly synthesized proteins towards cellular and extracellular positions in cells (for both eukaryotes and prokaryotes), signal peptide works like a "ZIP code" 2,3 . The study of signal can help to propose new medications for genetic remedial treatment which has become difficult for pharmaceutical chemists to develop more accurately 4,5 . A newly translated secretory protein starts to move from rough endoplasmic reticulum to Golgi transport vesicles and then Golgi cisternae leading to secretory transport vesicles. Consequently, these are secreted to the cell exterior surface. The secretory proteins enter the exterior surface of a cell via protein conducting channels (see Fig. 1). The signal peptides translocate the newly created proteins in the secretory pathway 6,7 .
It is worth mentioning that the signal peptide of newly synthesized protein can diverge under abnormal circumstances from the exact path. This deviation results the protein ending up in an inappropriate cellular location and hence causes severe diseases. Hereafter, an accurate model for prediction of signal peptide is of much more crucial importance. Predictive results are momentous to examine cell functions and analyze its prospective and genetic purposes 8 . For the prediction of arbitrary signal sequences, a capable novel modeling with a consolidated approach is to be devised to provide more assiduous results. Everyday, a huge number of protein sequences are collected and entering into databanks. Indeed, it is extremely desirable to develop a robust, reliable and excellently accurate computational method for the prediction of signal sequences. A number of researchers presented several models in this respect [9][10][11][12][13][14][15][16] . A huge amount of literature is found pertaining to such prediction models 3,4,[17][18][19][20] . PrediSi 21 associated the prediction techniques for secretory and non-secretory protein and proposed position weight matrix (PWM) approach for identification of cleavage site. SignalP 22 developed a method that uses a hidden Markov model (HMM) while Signal-CF 8 suggested a coupling fusion predictor that contrived through comprising the subsite coupling effects on protein sequences. Gunnar Von Heijne developed a prototype model depicting the nearby amino acids signal sequences 23 . Lal et al. devised a scheme for identifying proteins containing signal peptides and assigned a label SP to a protein after knowing that a protein contains a signal 24 . Extreme learning machine (ELM) and improved ELM were also devised to categorize the avalanche of protein

Results
In order to validate the current model, a comparative analysis with existing models 8,14,21,22 is established. It is found that the techniques suggested in PrediSi 21 , Neural Network (NN) 14 and hidden Markov model (HMM) 22 were less efficient than the Signal-CF 8 . Although the PredSi denied schemes such as NN and HMM 14,22 . Moreover, Signal-CF is a predictor established using pseudo amino acid composition, scaled window and subsite coupling effect, and is an improvement over the PredSi and SignalP methods. Also, few disadvantages of the previous predictors were observed. For example, the feature vector in Chou's scheme was dependent on a the value of a variable λ.
As for the current framework, the occurrence of each residue with a special weight factor has been incorporated for all protein sequences. This yielded a comprehensive description for any arbitrary sequence. A better accuracy rate has been achieved as compared to previous models. Self-consistency test, 10-fold cross validation and jackknife testing were conducted to ensure the accuracy of the proposed model. These tests are elaborated in Tables 1 and 2. Table 1 provides accuracy outcomes for the prediction of secretory protein of eukaryotic, Gram-positive and Gram-negative datasets whereas Table 2 represents the same statistical analysis for non-secretory proteins.
In previous works, most of the methodologies were evaluated by self-consistency, cross validation or/and jackknife testing. The proposed model is validated by all the three testing procedures simultaneously. Firstly, self-consistency test is performed. This is applied to the trained neural network predictor. Datasets for eukaryotic, Gram-positive and Gram-negative are separated into subsets of positives and negatives with comparable random Figure 1. Protein secretion: Ribosomes deposits the protein in endoplasmic reticulum (ER), protein exits ER and enters Golgi apparatus for processing and later it exits Golgi and enters into the cell exterior. sizes. After the completion of the training process, the test is executed. In further steps, the other two validation techniques, namely cross validation and jackknife are also evaluated. The validation process is demonstrated in the following flow chart (see Fig. 2). Secretory and non-secretory datasets are divided into ten equal units for 10-fold cross validation. Evaluation is performed based on performance in ten dissimilar and disjoint subsets of test data and is translated into ten distinctive confusion matrices together with the ROC (Receiver operating characteristic) graphs. The overall prediction accuracy is achieved by averaging all the outcomes. The prediction accuracies are estimated for each subset separately. This procedure is necessary for cross validation and jackknife testing as well. The ROC is a graphical interpretation for the performance of a classifier which illustrates the false positive rate (FPR) against the true positive rate (TPR). The area under the curve signifies the accuracy of the system. ROC graph yielded as a result of self consistency test for the three organisms dataset in given below (see Fig. 3).

Self-consistency Test
A comparative analysis of the proposed model with existing techniques is elaborated in Table 3. It is found that the previous accuracy rates varied from 72% to 87% whereas the accuracy of the proposed model for Self-consistency test lies between 95% to 97%.
In order to assess the accuracy of the prediction model four distinct metrics were utilized. The expected success rate and performance of the proposed model is predicted by these statistical measures. To understand these statistical metrics in easiest way, the prediction scales are expanded as discussed in 46,47 . The true prediction rates for secretory ϒ + and non-secretory ϒ − for three describes organisms categories are given by   where  + and  − + represents the total number of the predicted secretory proteins and incorrectly identified non-secretory protein sequences. Similarly  − and  + − denotes the total number of anticipated non-secretory peptides and falsely recognized secretory protein chains. In general the prediction rate is defined by From the equations (1) (4) which is frequently found in literatures for observing the performance superiority of a predictor. Particularly, its advantages have been analyzed and endorsed by a series of significant studies published very recently [48][49][50][51][52][53] .  By substituting (5) into (4) along and utilizing (3), we have It is worth mentioning that from (6) that if  = − + 0 then Sn = 1 signifying that no secretory protein sequences is incorrectly predicted as non-secretory protein sequence. In other case when   = − + + implies that Sn = 0 represents that all the secretory protein sequences were incorrectly predicted as non-secretory protein chains. Similarly if  = + − 0 yields Sp = 1 then it signifies that no non-secretory sequence was incorrectly predicted similarly if   = + − − gives Sp = 0 then it shows that all the non-secretory sequences were falsely predicted as secretory sequences. On the other hand the prediction accuracy metric = ϒ = Acc 1 when there is no incorrectly predicted sequences for secretory ϒ + as well as for non-secretory proteins ϒ − i.e., indicates that all the secretory ϒ + and non-secretory ϒ − sequences were falsely predicted hence yielding an overall accuracy = ϒ = Acc 0. Additionally, Matthew correlation coefficient (MCC) is frequently used to assess the performance of binary classifications. MCC is designed in such a way that the disparity in the size of positive or negative samples in the comprehensive dataset does not bias the overall outcome.
The statistical analysis using independent set testing for computing these metrics mentioned above is given in Table 4 with a noteworthy MCC value for the three segregated Eukrayotic, Gram-positive and Gram-negative datasets is 0.81, 0.71 and 0.90 respectively.

Discussion
Undoubtedly, the examination and analysis of patterns and sequence is quite convoluted when there are avalanche of protein sequences with diverse lengths. The statistical formulation of these sequences along with construction of robust data set to produce assiduous results was a challenging task. These cumbersome tasks have been rectified by the proposed technique. A performance comparison over the existing and previous predictors 8,14,21,22 has been analyzed and a worth seeing prediction accuracy using the proposed technique was observed. The idea was to recognize the query proteins as secretory or non-secretory and hence identify the presence of signal peptides, further in the next step the cleavage site for the signal peptide is identified based on "(−3, −1) -Rule", as described by Von Heijni 10 . It is obseved that signal peptides are more important than protein synthesization. The insufficiency of protein in distribution or contribution is cause of health hazards. So, the protein synthesization is not enough to perform cell functions properly but the compartmentalization of proteins to their relevent loci is of vital importance. Involvement of deviated protein with beneficial cells function is responsible for severe health diseases including Neurodegenerative disorder and cells death 54,55 . Hence, the computational capability to classify a protein as secretory or a non-secretory protein with a high level of accuracy bears great significance. In this process of anticipation, a large-scaled benchmark data set had been selected and incorporated into the prediction model. Even for huge data set, signal sequences along with their cleavage site was predicted at the cost of minor computational overhead with a high accuracy. The proposed algorithm can also be applied in future works to solve several other problems such as identification of Post Translational Modification (PTM) sites 56

Methods
Material. Since user-friendly and publicly accessible web-servers represent the future direction for developing practically more useful models 61-63 , we shall make efforts in our future work to provide a web-server for the method presented in this paper. A current updated edition of the well-known database uniprot wasn employed to obtain an inclusive benchmark dataset. The proposed model was trained over the assimilated dataset and subsequently validated in the following steps: • The eukaryotic or eukaryota, Gram-positive and Gram-negative are the organisms which have been incorporated for the desired prediction via proposed model. A benchmark data set related to these organisms was composed. A query was operated for the extraction of non-secretory protein dataset in the entire database. Through this inquiry, those protein sequences were selected which were marked in OC (Organism Classification) field as eukaryotic or eukaryota, Gram-positive or Gram-negative. Amongst these extracted entities, the eukaryotic entries annotated with subcellular locations as the nucleus or cytoplasm were selected while Gram-positive and Gram-negative entries annotated with subcellular location as cytoplasm were selected as a non-secretory protein sequences shown below (see Fig. 4). The vague extraction glossed with keywords like "by similarity" and "fragment" were discarded. • In taxonomy field, the keyword eukaryota was combed to obtain a training dataset for secretory protein sequences. The proteins annotated as signal peptides were only selected excluding those containing ambiguous keywords like potential, probable, fragment and by similarity in FT field. Out of those selected protein sequences, in which the preamble signal peptide was surely contained, the first 100 residues were extracted. The Gram-positive and Gram-negative organisms data was also extracted in a similar fashion. • The defined steps were adopted strictly to collect a first-rate benchmark data set of signal peptides for both secretory and non-secretory proteins in the three organisms namely eukaryotic, Gram-positive and Gram-negative. • Table 5, shows that 28,220 eukaryotic, 934 Gram-positive and 1046 Gram-negative samples were found in all 30,200 secretory proteins. After deleting similar entities, these are reduced to 25327 unique values. Similarly for all non-secretory proteins, 5595, 908 and 441 were found in eukaryotic, Gram-positive and Gram-negative respectively and 6,944 conjointly. After deleting repeated items, these are reduced to 6426 unique values. The benchmark dataset extracted for the proposed model is more diverse and comprehensive than the benchmark data set employed by existing predictors. Within the training dataset the secretory samples are considered as positive and non-secretory samples as negative as shown in flowchart given below (see Fig. 5). where ξ 1 , is the first amino acid residue, ξ 2 is the second amino acid residue and so on up till the last residue ξ N within the polypeptide chain S. Also, N represents the length of protein sequence (7). In order to predict the signal peptide in a convenient manner a computational algorithm has been proposed. This algorithm preserves the sequence order effect and is carried out by assimilating the whole sequence data together with the occurrence of each amino acid residue Λâ of type ≤ ≤â a : 1 20 (any one of residue among twenty amino acid residues). The whole structure is based on expressions (8) to (11). Expression (8) is associated with the number of occurrence Λâ of residue â and with the possible number of correlated factors ϑ of â with itself such that a a a . Subsequently, expression (9) represents the mean factors M 0 , M i and M j which are linked with the difference factors of â at their corresponding positions and is appended with constraint (10). Moreover, M i varies according to the number of difference factors and each difference factor associates an exclusive mean term. The difference is labeled as (q − p) i , where p and q are respective positions of â in the polypeptide chain given as ξ ξ ξ = =p q a . Also subscript i represents the number of the possible difference factors between similar residues excluding first and last time occurrence of residue, â depends upon the n number of â residues in sequence. Likewise M j is fitted to factor (N − r), where r is the position of the n th occurrence of the residue â in sequence (7) such that ξ ξ =r a and 1 ≤ p < q < r ≤ N represents the â residue at their corresponding positions.     Combining expression (8) and (9) and using constraint (10) yields the template for manipulating feature component related to â, given in (11) and (12). whereas M i and M j are count factors of residue â with other nineteen residues occurring before and after â respectively and can be defined in terms of X and Y, as elaborated in equations (13) and (14). where f k , 1 < k < 20 represents the frequency of pair function ϑ related to residue â with other nineteen amino acid residues and in case of non-occurrence, it is denoted by f 0 . Consider ρ l,m , it describes pair function ϑ for all amino acid residues with each other and while another pair function ϑ(ξ l ,ξ m ) in term of ρ l,m is defined as ϑ(ξ l ,ξ m ) = ρ l,m ; l = m = 1, 2, 3 … 20, shown in matrix (15). Subsequently, expression (16) is the complete representation for all possible pair factors regarding X and Y followed by (17), When pair ϑ(ξ l ,ξ m ) exists then ρ l,m is established as 1 otherwise its assigned a zero entry. Furthermore, (16) admits to (15) with entries ρ l,m , ρ l,l and ρ m,l indicating lower triangular matrix for X. Consequently, diagonal entries represents the combination among similar residues and upper triangular matrix for Y.
l m l m , Extension of (11) gives the idea for manipulating feature components for all twenty amino acid residues in matrix form, as given in (18). Using equation (14) in (12) contributes as a component of feature vector corresponding to residue â as given in equations (19) and (20).  To understand the structural scheme of proposed model consider g th term of sequence given in equation (7), say, ξ g , which reflects the first alphabetical letter of amino acid residues say ' A' . Notice its occurrences as well as corresponding positions in the sequence. ξ g makes pair with its contiguous residues before and after the g th residue in the terms ϑ(ξ k ,ξ g ) and ϑ(ξ g ,ξ k ) represented by green and blue curved lines and pairs ξ g by itself represented by red loops (see Fig. 6). This process will be continued until next ξ h occurs at h th position such that ξ g = ξ h = A. Similarly same steps will be applied for ξ j . The feature component corresponding to residue "A" is interpreted in equation (21). where k = 1, 2, 3 …, 20 represents the ordinal values of twenty amino acid residues in alphabetical order and for more simplification assume that ξ 1 , ξ 2 , ξ 3 , …, ξ 20 represents 20 amino acids in alphabetical order labeled as: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y also ξ 21 onwards the 20 residues cyclically repeats themselves then let Π 1 , Π 2 , Π 3 , …, Π 20 be their corresponding feature components. The set of twenty feature components is given in equation (22).    The set of above twenty feature components are based on three properties of amino acids, namely hydrophobicity, hydrophilicity and side chain mass of amino acids. Each property associates a set of sixty components so collectively this admits 180 components manipulated by using equations (23) to (25), where s represents the number of attributes for amino acid residues in succinct representation, for s = 1, 2, 3 it corresponds to hydrophobicity, hydrophilicity and side chain mass of amino acids respectively.   where ∆ ⁎ 1 , ∆ ⁎ 2 , ∆ ⁎ 3 represents the normalized hydrophobicity, hydrophilicity and side-chain mass respectively. The values used in (23) to (25) are normalized with (26), and standardized with preferred range such that (−R, R), where R is the number in which â amino acids are being standardized. The hydrophobicity values are taken from Tanford C. 64 , hydrophilicity assesses are assumed from Hopp T. P., Woods K. R. 65 and side-chain mass values are generally available in any biochemistry text book.
Feature set is characterized with a vector having two hundred and twenty (220) elements, where the first sixty components based on the hydrophobic nature of amino acids, second sixty represents their hydrophilic nature, next sixty components contain information regarding side chain mass of 20 amino acids and the last forty elements reflect the positions as well as composition of individual amino acids residues. This novel predictor establishes outstanding outcomes towards recognition of roll down protein sequences. These extracted vectors obtained for training data are further used to train Neural Networks (NN) based classifier. Neural networks are one of the most powerful techniques used to solve decision problems. They work by receiving labelled inputs and hence gain experience which help them to develop an opinion regarding arbitrary input for test purposes. After training process is completed the network seemingly behaves in a way that makes it capable to classify each given input within an acceptable degree of accuracy. During the learning process the network adjusts its weights such that the error is minimized which essentially translates into improved learning and increased accuracy 17 . A multilayer neural network was used to tackle this problem (see Fig. 7). The feature vector constructed for the prediction of signal peptides consists of 220 coefficients. Its connectionist architecture consists of 220 input layer neurons, 50 hidden layer neurons and two output neurons that discern among secretory and non-secretory poly-peptide chains. The training of the multilayered neural network is performed using back propagation method. In order to reduce the error and increase the prediction accuracy gradient descent technique was used along with an adaptive learning rate. Respective output units are ultimately brought together through individual units of input, output representation 19 .
Signal peptidase perform a momentous role in order to cleave signal sequence and the mature peptide from the nascent protein. Signal peptide is found in the vicinity of N-terminus site of protein sequence. Customarily, signal consists of 3-60 amino acid residues 8 . Translocon duct allows the passing of signal sequence transversely (see Fig. 8). Peptidases firstly confronts a nascent protein within endoplasmic reticulum (ER) 66 . Signal peptidase is also encountered in prokaryotes 67 When protein builds appurtenance for mitochondria and chloroplasts.
A signal cleavage site for signal peptide is revealed by dividing concatenation into 3 parts, N-terminal part is basic positively charged and is labeled as the n-region, the central hydrophobic region is labeled as h-region and the C-terminal part describes the more polar site of the sequence and is represented as a c-region. Signal is encountered through the n-region to h-region by occupying positions −3 and −1. The application of "(−3, −1) -rule" proves very productive in identifying signal cleavage site directly. As discussed previously the dataset is divided in two categories, eukaryotes and prokaryotes (Gram-positive and Gram-negative) to predict the signal sequence and its cleavage site in both categories. Signal is mostly encountered embedded within the concatenation segregating the signal sequence and mature protein chain. The cleavage site of a secretory protein is determined by following these steps: First count the amino acid residues at each position in the sequence, formally, P(â, i), where P is the count factor for the occurrence of residue of type â at position i. Subsequently build Weight-matrices Q(â, i) by dividing all counts by their diverse expected abundance in proteins, primarily, P a ( ) . Taking the natural logarithms of these entities for all sequences arrayed from their accepted sites of removed peptide chain between positions −1 and +1, t follows in equation (27 Figure 8. Signal peptidase is an enzyme that removes the signal of translocated primary proteins from the membrane to exhibit its mature form when they are substituted from a cytoplasmic position of synthesis to extracytoplasmic regions. Ultimately, these cleaved signal peptides are directed towards secretory tract. Eukaryotic and prokaryotic Signal concatenation splitter follows the "(−3, −1) -rule" 10,23 . Statistical analysis shows that any residue out of Ala, Ser, Gly, Cys, Thr is placed at −1 location respect to the cleavage site while −3 site is occupied by Asp, Glu, Lys, Arg, Asn, Gln, (but not Phe, His, Tyr, Try), furthermore, there is no Pro residue between −3 to +1. Similarly, for Prokaryotic proteins same rule applies with a different set of residues. The −1 location is occupied by any of Ala, Gly, Ser, Thr whereas −3 is occupied by any of Ala, Gly, Leu, Ser, Thr, Val, also, −7 and −8 is mostly occupied by Leu or any other hydrophobic residue other than Val, Phe. In addition, it was also submitted that Proline (Pro) necessarily will be missed in the region −3 over +1.
A scanning algorithm was developed to search the cleavage site pattern. It transcribed a weight matrix onto the polypeptide sequence. The residues bearing significance in identifying the cleavage site were assigned higher non-zero values while others were substituted by a zero value, hence the primary sequence was transformed into a vector. The algorithm worked by identifying a spike among the neighboring elements of the vector 10,23 . Systematic drawing reflects the overall procedure involved in predicting a signal peptide and determining its cleavage site (see Fig. 9).