Identification of stress response proteins through fusion of machine learning models and statistical paradigms

Proteins are a vital component of cells that perform physiological functions to ensure smooth operations of bodily functions. Identification of a protein's function involves a detailed understanding of the structure of proteins. Stress proteins are essential mediators of several responses to cellular stress and are categorized based on their structural characteristics. These proteins are found to be conserved across many eukaryotic and prokaryotic linkages and demonstrate varied crucial functional activities inside a cell. The in-vivo, ex vivo, and in-vitro identification of stress proteins are a time-consuming and costly task. This study is aimed at the identification of stress protein sequences with the aid of mathematical modelling and machine learning methods to supplement the aforementioned wet lab methods. The model developed using Random Forest showed remarkable results with 91.1% accuracy while models based on neural network and support vector machine showed 87.7% and 47.0% accuracy, respectively. Based on evaluation results it was concluded that random-forest based classifier surpassed all other predictors and is suitable for use in practical applications for the identification of stress proteins. Live web server is available at http://biopred.org/stressprotiens, while the webserver code available is at https://github.com/abdullah5naveed/SRP_WebServer.git


Materials and methods
This section explains the proposed methodology based on the described stepwise approach. Each step of the described model is illustrated in Fig. 1. Firstly, a robust diverse, and homology restricted dataset is accumulated. In the next step, a comprehensive feature extraction methodology is formulated that ensures that all the crucial obscure features for deciphering the attributes of each protein have been extracted. Subsequently, based on the obtained feature vectors machine learning models are trained. Consequently, the precision of each model is calibrated to identify the most accurate model. Ultimately, the most assiduous model is integrated into a web server for public use.
Benchmark dataset collection. The protein sequences of the benchmark dataset were meticulously extracted from the well-known UniProt database for proteins 13 . Reviewed data of stress protein sequence of different cell lines (organisms) were retrieved from the UniProt dataset using the UniProt keyword "Stress Response" labelled as KW-0346 in the database, where the query resulted in 7092 reviewed positive protein sequences. Again, using the inverse query, 7500 reviewed negative protein sequences were retrieved. The data contained sequences of proteins in FASTA format, where each sequence was comprised of amino acids letters. Furthermore, all the sequences were of non-uniform length. However, this was the raw data, which might have been containing homologous sequences. Thus, to cater for this, redundancy from the dataset was removed using the CD-HIT suite to an acceptable level. A cutoff value was set at 0.7 to form homologous clusters with similarities greater than or equal to 70%, as reported by 14 . Ultimately, 6140 clusters were formed out of the positive samples while 7163 were formed for negative samples. A representative protein sequence was taken from each cluster. The ratio of positive and negative proteins sequence is illustrated in Fig. 2. Sample encoding. Machine-learning algorithms learn based on numeric representations of data, instead of raw sequences, as expounded in 15,16 ]. Thus, to represent sequences as vectors, the pseudo amino acid com-  www.nature.com/scientificreports/ position (PseAAC) was proposed 17 . The idea of PseAAC is popular in bioinformatics research 18,19 and has been used in numerous bio-medicine and medication improvement studies 20,21 as well as other disciplines of computational proteomics. An extensive rundown of references is provided in a survey paper 22 . Since it has been generally and progressively utilized, many profound open-access projects 23,24 were developed to create different methods of feature extractions using PseAAC. Enlivened by the achievements of utilizing PseAAC for feature extraction, multiple predictors were proposed by researchers 25,26 . Specific components for protein groups have been utilized for enabling vector encoding of samples to be processed through machine learning algorithms. These vector encoding techniques are used in various genomic research contributions as illuminated in 13 including, a robust web server called Pse-In-One 27 . Both protein/peptide and Protein groups can be used to make a perfect fixed scale feature vector as 8 The parts Ψj (j = 1, 2, ⋯, Ω) of the sequence are considered as a method of incorporating the properties of the Protein's sequence. We encoded the PseAAC values using the simplest encoding as shown in Table 1.
For feature extraction, we used the structure of measurable statistical moments to capture the characteristics and measurements as discussed in the following sections.
Statistical moments. Statistical moments were used in this study for feature extraction. Arbitrary statistical moments explain different aspects of the dataset defined moments defining functions and the distribution polynomial. Some specific moments were incorporated such as "Raw, Central and Hahn moments" 28,29 . The raw moment exhibits location and scale variance are used for mean calculation as well as determination of dataset asymmetry based upon probability distribution. Central moments are location invariant because centric calculations are performed. They two provide information regarding mean, variance, and distribution of data along with the mean 30,31 . Hahn moments are used to calculate the variation of size and position based on "Hahn Polynomials". All these moments provide significant information about the sequence order and composition of data 32 . The benefit of selecting these measurable moments is the availability of the sensitive hidden patterns of peptide sequence uncovered by these moments 33 . Some of these moments especially the Hahn moments require a two-dimensional square matrix as input. For this purpose, a one-dimensional protein sequence is mapped onto a two-dimensional array. A square matrix, L′, is formed which can be expressed as where each c ij is an amino acid residue. The square root of the length of each sequence is computed and ceiled, and a square matrix of obtained value is formed. Each element of that square matrix is filled with residues of the respective sequence, sequentially. Moments up to the degree of 3 were computed and using the components of L′ the raw moments are determined as where (l + n) represents the degree of moments, while up to 3-degree moments are G 00, G 10 , G 20 , G 30 , G 01 , G 11 , G 21 , G 02 , G 12 , and G 03 . Further, the central moments are determined as Furthermore, discrete Hahn moments effectively enrol for an even-dimensional information connection. Discrete Hahn moments require square cross-segment as information. The computing of Hahn moments does not change any meaning of the data, thus, due to this orthogonality, the Hahn moments are reversible and the sequence could be regenerated by using the inverse function of Hahn moments. These moments comprise sequence composition and relative positioning of amino acid residues. Hahn moments were computed through where the definition of the operators used is discussed in detail by Akmal et al. 34 . The orthogonal normalized Hahn moments for 2D data are further computed as www.nature.com/scientificreports/ Thus, for computing all the three types of moments, their respective equations were used. Each type of moment yielded 10 moments of order 3, thus, a total of 30 moments were computed for each sample.
Computing position relative incidence matrix (PRIM) by original and reverse sequence. The relative positioning of amino acid residues in a polypeptide chain plays a pivotal role in determining the biological function and physiochemical characteristics of that peptide. This pivotal model is based on the distinctive arrangement of proteins and the relative positions of the amino acids of the residue chain.
The relative positioning statistics of a total number of residues are used to assemble a 20 × 20 position relative incidence Matrix (PRIM), while the resultant matrix is used to extract features. The PRIM matrix is shown as where each element M i→j is the sum of all positions of jth amino acid, relative to the first occurrence of ith amino acid. Through this 20 × 20 matrix (as there are 20 amino acids), a total of 400 coefficients are produced by the PRIM. For reverse position relative incidence matrix (RPRIM), the same process is used on the reverse proteomic sequence and RPRIM is shown as To reduce the dimensionality of both the 20 × 20 matrix and to extracting more significant and meaningful information from PRIM and RPRIM, again 30 moments were computed for both matrices.
Determination of frequency vector. PRIM and RPRIM mainly provide information regarding the relative positioning of the residues in the amino acid sequences, however, amino acid frequencies in a sequence also play an important role. To elucidate the compositional confirmation of a primary sequence, another vector is formulated namely the frequency vector. The vector is of length 20, where each index i represents the ith amino acid from A to Y and each coefficient in this vector is used to measure the frequency occurrence of the corresponding amino acid. The frequency vector is represented as where each τ i, characterizes the frequency occurrence of that respective ith amino acid.
Computing accumulative absolute position incidence vector (AAPIV) by original and reverse sequence. The frequency vector is only used for extraction of the information of the composition of amino acids, whereas, the PRIM and RPRIM only provide information of relative amino acid positioning. To encode the absolute position of amino acids in a sequence, the Accumulative Absolute Position Incidence Vector (AAPIV), is used. It provides an estimate of the absolute positioning of residues. It computes the ordinal value of each residue and accumulates this ordinal value into a 20-length vector at the respective coefficient where each index represents the respective amino acid from A to Y. Thus, an arbitrary ith element of AAPIV is calculated as where p k represents the ordinal value of an arbitrary occurrence of ith amino acid. Similarly, Reverse Accumulative Absolute Position Incidence Vector (RAAPIV) is computed based on the same mechanism but with reversed sequence. More obscure features are unravelled by its enumeration. Generic representation of AAPIV and RAAPIV could be seen as Model training and optimization. This study is focused on a specific type of protein and pertaining to the stress response. Three different classification algorithms were analyzed for the prediction of stress response proteins. A feature vector was assimilated using the raw, central, and Hahn moments of the two-dimensional Random FOREST. Firstly, the Random Forest (RF) was used which is a well-known algorithm used for regression and classification problems. While training, it is operated by generating a forest of decision trees using a feature matrix and outputs the predicted class, that is the mode of the classes predicted by all trees in that forest. Random forest is non-parametric, have higher classification accuracy, and is capable of determining the set of coefficients that are most crucial for predicting the class with maximum accuracy 35 . Feature vectors as input matrix and their corresponding class labels as expected output matrix are congregated to train the random forest predictor. The architecture of Random Forest is shown in Fig. 3. Each tree uses a subset of the vector to classify the input where n is the number of decision trees. A voting algorithm finally decides the actual predicted class based on the majority votes.
In the present study, RF implementation was used from Scikit-Learn, with default parameters, while n_estimators used were 50.
Artificial neural network. An artificial neural network is a connectionist network of neurons. An input neuron receives the transformed input while each subsequent neuron receives the output yielded by all the former neurons. The output of each neuron is calculated as the activated consequence of the weighted sum of the inputs to that neuron, as shown in Fig. 4. The feature vectors formed are clamped to the neural network input layer 36 . An optimized number of hidden layer neurons are used. During each epoch, the backpropagation along with gradi-  www.nature.com/scientificreports/ ent descent technique is used to find the most optimal neuron weights. The gradient descent method makes use of the gradients of the cost function to take a step towards the optimal solution with respect to a parameter θ as Here, M(θ) is the objective function while θ ∈ R d , a is the learning rate and the gradient of the objective function is given as ∇ θ M(θ) . The learning rate is considered to be problem-specific and its value usually relies on the cost function. It governs the step size of gradient descent at each iteration. The learning rate differential is usually a constant but it can be variable in which case it is adaptively set to find the most optimal point 37 . To find the most optimal point, the gradient is calculated for consecutive points while the weights are readjusted and the learning rate is fine-tuned 38 . When the algorithm is fully trained, it can be used to predict outcomes for unknown data.
In the present study, fully connected NN was used from Keras, with dense layers. 1 hidden layer was employed with 50 neurons, while the size of the input layer was 150 (equal to FV length). Output neurons were 2 for binary classification based on one-hot encoding. For the hidden layer, ReLU was used as an activation function while for the output layer, Sigmoid was used. The learning rate was set as 0.001.

Support vector machine.
Lastly, a support vector machine (SVM) was used, which is also a supervised learning model, known for classification and regression tasks. SVMs has been abundantly and successfully deployed to solve numerous classification problem 39 . SVMs works on the principle of finding a hyperplane that could separate the classes of a dataset, as shown in Fig. 5.
The hyperplane is adjusted with the help of support vectors so that the distance between the hyperplane and the nearest training data points is maximal. This would provide a large margin and hence improve the accuracy by reducing the generalization error.
For the present study, Linear SVM implementation from Scikit-Learn was used, which is based on libsvm, while default parameters were employed.

Results and discussion
Once a model is trained it is important to establish its suitability for the research community. Trained models are rigorously subjected to testing and validation using well-known validation methods. These methods help to identify the most robust, assiduous, and accurate model based on specific metrics. Performance estimations. The objective evaluation of a predictor is crucial to determine the performance of that predictor, however, choosing the metrics for evaluation, as well as, the method of evaluation, is a critical step. Here, in this study, four well known and well-reported metrics are used for performance evaluation which is Mathew's Correlation coefficient (MCC), Specificity of the model (S p ), Sensitivity of the model (S n ), and finally, the Accuracy (Acc) 40 , as described in 14,[41][42][43][44][45][46][47][48] . Here, the MCC is computed to reflect the stability of the model as it opts for all elements of the confusion matrix. Furthermore, the S n is computed to measure the ability of the model for predicting positive samples, while S p was computed for describing the ability of the model to identify negative samples 49 . In 50 , the performance estimation metrics are described as www.nature.com/scientificreports/ where TP represents the number of stress response proteins that were predicted truly as stress proteins (True Positive). FN represents the total number of stress response protein samples that were predicted falsely as the non-stress response protein (False negative). Also, TN is the total number of non-stress response proteins that were predicted truly as non-stress (True Negatives), and FP is the total number of non-stress response proteins that were falsely predicted as stress response proteins. However, these metrics are generically appropriate for binary class data. Other variants of these metrics are also proposed for multi-class data 51,52 . After training of the proposed classifiers i.e., Random Forest Classifier, Artificial Neural Network, and SVM these predictors were thoroughly tested using rigorous validation and testing techniques i.e., self-consistency testing, jack-knife testing, independent set testing, and k-Fold Cross-Validation with k = 5 and k = 10.
Self-consistency test. The self-consistency test was conducted to determine the training accuracy of all classifiers using the same dataset for training and testing. The results of this test for all algorithms are shown in Table 2, which depict the overall accuracy, specificity, sensitivity, and stability of the predictive model. While the ROC curve of these predictors is Fig. 6.
The area under the curve for the Random Forest classifier is maximal indicating that it exhibits the best performance, while the accuracy of ANN and SVM-based classifiers is 87% and 80.2%, respectively. It indicates that the random forest classifier can predict the protein class using the formulated features quite accurately as compared to the other two opted classifiers i.e., ANN and SVM.
jack-knife testing. Jack-knife validation test bears peculiar significance as it always returns a unique output for a dataset. By using a jack-knife, intentional problems with the data such as subsampling and imbalance can be ignored. This test rigorously tests each sample as unknown data while the model is trained on the rest of the samples. At each iteration, that specific test sample is totally unseen for data, because, the model is re-initialized and re-trained. Thus, this testing is more robust than the independent dataset testing, for being testing model n-times on an un-seen sample. A jack-knife test was performed for RF, ANN, and SVM classifiers, the results are shown in Table 3. It displays the overall accuracy, specificity, sensitivity, and stability of the predictive model. The ROC curves of all three predictors as well as comparison thereof is shown in Fig. 7.
Again, the performance of the RF classifier outperforms the rest of the classifiers.
Independent set testing. Before deploying a model for the real world, it is necessary to evaluate the performance of the model for unseen data. Independent set testing works on this principle and is performed by splitting the whole dataset into two splits. One partition is used for training, while the other is for testing. The classifiers are  www.nature.com/scientificreports/ trained using 70% of the dataset while it's tested on the remaining 30% dataset. Table 4 displays the overall accuracy, specificity, sensitivity, and stability of the random forest, artificial neural network, and SVM classifiers. The ROC curves of all three predictors as well as comparison thereof is shown in Fig. 8. Apparently, from the ROC curve, it is established that SVM performs unremarkably. Still, random forest exhibits the best performance while ANN shows a performance slightly inferior to RF.

Validation through 5-fold and tenfold cross-validation.
The cross-validation test divides the dataset into k disjoint sets, where k is a predefined value chosen at the beginning of testing, and kept constant. Usually, k is kept 5 or 10 but, in this section, k is set to 5 and 10 to achieve fivefold and tenfold cross-validation. At each iteration,    www.nature.com/scientificreports/ that specific test set is totally unseen for data, as at each iteration, the model is re-initialized and re-trained. Thus, this testing is more robust than the independent dataset testing, for being testing model k-times on un-seen data. This test helps to determine the possibility of encumbrances like underfitting and overfitting. For fivefold, the test is iterated 5 times, in each iteration, one set (kth set) is treated as a testing set while the rest of the four sets are accumulatively used as the training set. The overall accuracy of the model is computed by taking an average of scores achieved in each iteration which is reported as the result of the cross-validation test.
The 5-Fold Cross-Validation test reveals 95.9% overall accuracy using a random-forest classifier, 87% accuracy is brought by ANN, and 79.4% accuracy is obtained by SVM as shown in Table 5. The ROC also depicts the same in Fig. 9.
Furthermore, the boxplot comparison of 5-Fold cross-validation of ANN, SVM and, RF is shown in Figs. 10, 11 and 12.
The boxplot shows that the peak accuracy of the RF classifier reaches 0.975 while the peak accuracies of NN and SVM are 0.91 and 0.84, respectively. Subsequently, the minimal accuracy exhibited by RF is 0.94 whereas the minimal accuracy of NN and SVM are 0.82 and 0.76, respectively. This clearly shows that RF performance is far superior to SVM and also performs NN. Random Forest classifier exhibits a stable accuracy rate over numerous partitions with a mean accuracy rate of 0.959.
For tenfold, the benchmark dataset is divided into 10 folds using k = 10, while at each iteration, 9 folds are selected for training the model while the remaining onefold is used for testing. Hence, all the partitions of the    Table 6 shows the performance of all the classifiers which is also reflected by the ROC curve in Fig. 13.
Discussion. This study proposed a method based on three classifiers using Random Forest, Support Vector Machine, and Artificial Neural Network. The stress response proteins benchmark dataset was encoded with statistical features to enable better representation of the benchmark dataset which is why the dataset was clean from redundancy and the applied classifier was able to use these representations for developing better predictor models for stress response proteins. The highest performance, among the proposed three classifiers, was observed for the random forest, as shown in Tables 2, 3, 4, 5 and 6. The significance of the Random Forest classifier was ascertained by the achievement of best accuracy metrics as compared to other classifiers. As opposed to ANN and SVM, Random Forest gave the most assiduous outcomes. For each model, five distinct tests were performed and a variance in the performance of each classification model is apparent from the results of this test. Further, the suitability of the proposed classification model is discussed based on its performance. The independent dataset testing, fivefold, and tenfold validation were performed only with the goal of performing a robust and exhaustive validation of the model, while jackknife testing helped to perform such validation which enhances the transparency of the method. The score generated by jackknife testing is always the same, whenever the method is reproduced, as this testing covers all samples of data for   www.nature.com/scientificreports/ testing, individually. For independent dataset testing, testing was performed using 30% of data, while in fivefold and tenfold, testing was performed using random folds 5 and 10 times, respectively. The accuracies reported for k-fold are the average of k-experiments. Thus, in the case of SVM, the accuracy of independent dataset testing is quite low as compared to fivefold and tenfold. One of the main reasons could be a random sampling of the fold chosen for independent dataset testing of SVM. However, overall, it was observed that the random forest algorithm gave the most accurate and efficient results in the prediction of stress response proteins. Artificial Neural Network was observed to be passable but inferior as compared to the random forest while Support Vector Machine algorithm showed the poorest performance among all three classifiers. The performance of RF was observed consistently to be best while SVM was consistently observed poor. Usually, the performance of different classifiers varies from data to data and one cannot compare it specifically that one algorithm is good and the other is bad. However, in the case of the present study, it could be inferred that RF is basically the combination of multiple individual decision trees to act as an ensemble, employing multiple learners to solve the same problem, whereas, the SVM classifies binary data based on a hyperplane [53][54][55][56][57] . Thus, the ensemble of multiple learners performed well in our case of classifying stress response proteins.
The results depict that the proposed stress response protein predictor can turn into a convenient high throughput apparatus for researchers exploring stress proteins. It may serve as the primary predictor to foresee protein stress reaction ( Supplementary Information 1).
Webserver. The final phase is the implementation of the webserver. As discussed in 58,59 , the webserver enables the research community to use the latest advances. The openly available webservers show the practical usage which must be accurate and useful for prediction. For the development of a web server, the python Flask 1.0.2 framework was used to deploy models for the Stress response protein prediction. The models were implemented using Scikit-Learn, wtform 2.2.1, NumPy 1.16.3, TensorFlow 2.0.0, and Keras 2.2.4 libraries, and these libraries are used for the backend of a Webserver. Figure 14 shows the Homepage, Fig. 15 shows the introduction page, Fig. 16 shows the Prediction Server page, Fig. 17 shows the sample sequence data page, and Fig. 18 shows the results page. A live webserver is available at http:// biopr ed. org/ stres sprot iens.

Conclusion
In this study, Stress Response Protein prediction models based on RF, ANN and SVM are proposed. We used statistical measures to extract features from the benchmark dataset. Out of the three proposed models, the random forest classifier delivered the most elevated outcome. There are two aspects to the diligent outcome. Firstly, the experimental results strengthen the notion that the feature extraction model is forthcoming and yielding. Obscure patterns encompassed within the setup of amino acid residues within the polypeptide chain are uncovered by the proposed feature extraction technique. Secondly, out of all the classifiers discussed the most suitable classifier for deciphering each class based on the proposed feature set in random forest. Though ANN and SVM stimulate responses largely better than pure luck it is observed that random forest somehow exhibits the best capability for deciphering among classes. Rigorous testing techniques like cross-validation and jackknife testing ensure that the results are realistic and not an outcome of overfitting. Finally, the best performing model i.e., random forest is used to deploy as a web service to make it available for public use. www.nature.com/scientificreports/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.