SCORPION is a stacking-based ensemble learning framework for accurate prediction of phage virion proteins

Fast and accurate identification of phage virion proteins (PVPs) would greatly facilitate antibacterial drug discovery and development. Although several machine learning (ML)-based methods have been proposed for the in silico identification of PVPs, these methods have certain limitations. Therefore, in this study, we propose a new computational approach, termed SCORPION (StaCking-based Predictor fOR Phage VIrion PrOteiNs), to accurately identify PVPs using only protein primary sequences. Specifically, we explored 13 different feature descriptors covering different aspects (i.e., compositional information, composition-transition-distribution information, position-specific information and physicochemical properties) with 10 popular ML algorithms to construct a pool of optimal baseline models. These optimal baseline models were then used to generate probabilistic features (PFs), which were considered as a new feature vector. Finally, we utilized a two-step feature selection strategy to determine the optimal PF feature vector and used this feature vector to develop a stacked model (SCORPION). Both tenfold cross-validation and independent test results indicate that SCORPION achieves better predictive performance than its constituent baseline models and existing methods. We anticipate that SCORPION will serve as a useful tool for the cost-effective and large-scale screening of new PVPs. The source code and datasets for this work are available for download from the GitHub repository (https://github.com/saeed344/SCORPION).

standard approaches for PVP identification, they are difficult to employ for the analysis of PVPs at large scale as they are laborious and costly. Thus, researchers have invested considerable effort in developing computational models that predict PVPs directly from sequence information as a useful alternative.
To date, a variety of machine learning (ML)-based methods, including iVIREONS 8 19 and VirionFinder 20, have been developed and proposed for PVP identification. Table 1 provides a summary of these ML-based methods along with their employed ML algorithms, feature descriptors and evaluation strategies. In 2013, Seguritan et al. developed the first PVP predictor, called iVIREONS 8, based on an ANN algorithm trained with AAC and PIP to predict viral structural proteins. Shortly afterward, Feng et al. created a high-quality dataset consisting of 99 PVPs and 208 non-PVPs, and also developed an NB-based predictor 9 using AAC and DPC. Most recently, Han et al. developed an ensemble-based model named iPVP-MCV 19 by combining three types of PSSM descriptors (i.e. PSSM-AAC, PSSM-composition and DP-PSSM). Until now, iPVP-MCV has represented the state-of-the-art predictor for PVP identification. More detailed information on all of the existing methods is summarized in an article by Kabir et al. 21.
Although the above-mentioned methods do facilitate the prediction of PVPs, some issues still need to be addressed. First, the training dataset used by several existing methods for PVP identification was relatively small. This is an important consideration, as several previous studies have demonstrated that training with a large dataset is crucial for building a comprehensive predictive model 18,22–24. Second, almost all of the existing methods were developed by employing a single ML method to train the model. Therefore, their performance might not be optimal in some cases, whereas ensemble models are capable of providing greatly improved performance compared to baseline models 22,24–27. Finally, the prediction performance of these existing methods is still not satisfactory for many real therapeutic applications.
To address these limitations, we present a novel approach, termed SCORPION (StaCking-based Predictor fOR Phage VIrion PrOteiNs), to improve the accurate prediction of PVPs. The overall procedure for the development of SCORPION is illustrated in Fig. 1. Notably, SCORPION employs 13 different sequence-based feature descriptors from multiple perspectives (i.e., compositional information, composition-transition-distribution information, position-specific information and physicochemical properties) to extract the key patterns of PVPs. These feature descriptors were used to train a total of 130 baseline models by using 10 popular ML algorithms. Probabilistic features (PFs) were then generated by using these 130 baseline models and considered as a new feature vector. To improve the predictive performance, a two-step feature selection strategy was applied to identify m out of 130 PFs. Finally, the optimal PF feature vector was used to develop an effective stacked model (SCORPION) by using the stacked ensemble learning strategy. Our comparative results based on cross-validation and independent tests indicate that SCORPION outperformed its baseline models. Moreover, SCORPION achieved better performance than several existing methods for PVP prediction in terms of ACC (0.873), Sp (0.905), MCC (0.748) and AUC (0.891) on the independent dataset.
Table 1. Characteristics of the existing methods for PVP prediction.
Method                    Year  ML algorithm  Feature descriptors                Model type  Evaluation strategy
PVPred 10                 2014  SVM           GGAP                               Single      LOOCV, IND
Zhang et al.'s method 17  2015  SVM           CTD, bi-profile Bayes, PAAC, PSSM  Ensemble    10CV, IND
PVP-SVM 11                2018  SVM           AAC, ATC, CTD, DPC, PCP            Single      10CV, IND
PhagePred 12              2018  NB            GGAP                               Single      10CV,
ANN artificial neural network, CNN convolutional neural network, LR logistic regression, NB naive Bayes, RF random forest, SCM scoring card matrix, SVM support vector machine, AAC amino acid composition, AACPCP amino acid composition and physicochemical properties, AKSNG adaptive k-skip-n-gram algorithm, APAAC amphiphilic pseudo amino acid composition, ATC atomic composition, Bi-PSSM bigram position-specific scoring matrix, CTD composition, transition and distribution, DPC dipeptide composition, PSSM_DP position-specific scoring matrix based on dipeptides, GGAP g-gap dipeptide composition, GGAPTree g-gap feature tree, PAAC pseudo amino acid composition, PCP physicochemical properties, PF probabilistic features, PIP protein isoelectric points, PSSM position-specific scoring matrix, PSSM_AAC position-specific scoring matrix based on amino acid composition, PSSM_COM position-specific scoring matrix based on composition, PSSM Profiles position-specific scoring matrix based on profiles, SAAC split amino acid composition, Seq-Str sequence-structure, 10CV tenfold cross-validation, IND independent test, LOOCV leave-one-out cross-validation.

Materials and methods
Overall framework of SCORPION. As shown in Fig. 1, the framework comprises four major steps: dataset construction, baseline model construction, new feature representation and stacked model development. First, the benchmark dataset derived from Charoenkwan et al. 18 was used to train and optimize the baseline models and SCORPION. Second, 13 different feature descriptors were individually fed to 10 different ML algorithms to build the 130 baseline models using tenfold cross-validation. In addition, we comprehensively compared the 13 feature descriptors to determine which of them are beneficial to PVP identification. Third, we constructed several stacked models by using different sets of feature vectors. Fourth, the optimal PF vector was determined and fed to the RF algorithm in order to construct the final stacked model (SCORPION) by using the stacked ensemble learning strategy. Finally, we compared the predictive performance of SCORPION against its constituent baseline models and existing methods.
Dataset collection. As described in an article by Kabir et al. 21, three well-known benchmark datasets (i.e. Feng2013 9, Manavalan2018 11 and Charoenkwan2020_2.0 18) have been established for developing existing PVP predictors. In this study, we utilized the Charoenkwan2020_2.0 dataset established by Charoenkwan et al. 18 as the benchmark dataset to assess the performance of SCORPION, for two main reasons. First, the Charoenkwan2020_2.0 dataset contains a larger number of PVPs and non-PVPs than the other datasets. Specifically, it combines the Feng2013 9 and Manavalan2018 11 datasets with novel PVPs and non-PVPs obtained from the UniProt database (release 2019_11) 28. Second, a lower CD-HIT threshold of 0.4 was used to exclude more redundant sequences from the Charoenkwan2020_2.0 dataset. As a result, the Charoenkwan2020_2.0 dataset consists of 313 PVPs and 313 non-PVPs, of which the training and independent datasets (PVPs, non-PVPs) comprise (250, 250) and (63, 63), respectively. All datasets used in this study are available at https://github.com/saeed344/SCORPION.
Feature encodings. In this study, we used 13 different sequence-based feature descriptors contain-  Table 2. Here, the iFeature Python package 29 was utilized to calculate all 13 feature descriptors.
Stacking ensemble learning framework of SCORPION. In this study, the stacked ensemble learning strategy was utilized to develop SCORPION for improving the prediction of PVPs. Unlike other ensemble learning strategies, this strategy enables an automatic integration of different ML classifiers in order to construct a single robust prediction model 23. The stacked strategy has been shown to achieve better performance than its constituent baseline models 23,24,27,30,31. The stacking strategy consists of two main steps, and the corresponding models at each step are referred to as baseline and meta models, respectively. In the first step, the PVPs and non-PVPs in the training dataset were encoded by using 13 different feature encoding schemes from four different perspectives, namely AAC, AAI, APAAC, CTDC, CTDD, CTDT, DDE, DPC, EAAC, PAAC, PSSM_AAC, PSSM_DP and PSSM_COM, with corresponding dimensions of 20, 11, 22, 39, 39, 195, 400, 400, 20, 21, 20, 400 and 400, respectively 32–35. Herein, we used the default iFeature parameter settings 29 to generate the APAAC and PAAC descriptors. Then, each feature descriptor was individually employed to train 10 different ML algorithms: k-nearest neighbor (KNN), RF, SVM, decision tree (DT), extremely randomized trees (ET), logistic regression (LR), multi-layer perceptron (MLP), naive Bayes (NB), partial least squares regression (PLS) and extreme gradient boosting (XGB). To enhance the predictive performance, all ML classifiers were trained and optimized using the scikit-learn package in Python (version 0.22) 36.
Specifically, the optimal parameters of the ET, LR, MLP, RF, SVM and XGB classifiers were determined under the tenfold cross-validation procedure on the training dataset, where the search ranges are shown in Supplementary Table S1. The remaining ML classifiers were constructed by using their default parameters. In total, we obtained 130 baseline models (10 MLs × 13 encodings).
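As an illustration of how such a pool of baseline models can be organised, the following minimal sketch loops over descriptor/classifier pairs and scores each pair with tenfold cross-validation. It is not the authors' code and rests on several assumptions: the sequences are synthetic, the hand-written `aac` and `dpc` encoders stand in for the iFeature implementations, only two descriptors and three scikit-learn classifiers with near-default parameters are used, and the helper name `random_seq` and the composition bias of the positive class are hypothetical.

```python
import random
from itertools import product

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(seq):
    """Amino acid composition: 20 relative residue frequencies."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dpc(seq):
    """Dipeptide composition: 400 relative frequencies of adjacent residue pairs."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return [pairs.count(a + b) / len(pairs)
            for a, b in product(AMINO_ACIDS, repeat=2)]


# Synthetic stand-in data (assumption): "positive" sequences are enriched in A, K and E.
rng = random.Random(0)


def random_seq(weights, length=120):
    return "".join(rng.choices(AMINO_ACIDS, weights=weights, k=length))


pos_weights = [2 if a in "AKE" else 1 for a in AMINO_ACIDS]
neg_weights = [1] * len(AMINO_ACIDS)
seqs = [random_seq(pos_weights) for _ in range(40)] + \
       [random_seq(neg_weights) for _ in range(40)]
y = np.array([1] * 40 + [0] * 40)

encoders = {"AAC": aac, "DPC": dpc}
classifiers = {
    "KNN": KNeighborsClassifier(),
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

# One baseline model per (classifier, descriptor) pair, scored by tenfold CV accuracy.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
baseline_acc = {}
for enc_name, encode in encoders.items():
    X = np.array([encode(s) for s in seqs])
    for clf_name, clf in classifiers.items():
        scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
        baseline_acc[f"{clf_name}-{enc_name}"] = scores.mean()

for name, acc in sorted(baseline_acc.items()):
    print(f"{name}: {acc:.3f}")
```

With 13 descriptors and 10 classifiers, the same double loop would yield the 130 baseline models described above; a hyperparameter search over the ranges in Supplementary Table S1 would replace the fixed classifier settings used here.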
In the second step, each baseline model provided three types of features: PF, class feature (CF) and the combination of PF and CF (PCF). The PF is the predicted probability score of being a PVP, which lies in the range 0-1. For the CF, the protein sequence P is labeled as 1 if its predicted probability score is greater than 0.5; otherwise, it is labeled as 0. As a result, the protein sequence P was represented by 130-D, 130-D and 260-D feature vectors for PF, CF and PCF, respectively. The PF, CF and PCF were considered as new feature vectors. The RF algorithm was employed as the meta model (called mRF) to train the stacked model. As a result, we obtained three different stacked models based on the three new feature vectors (i.e. PF, CF and PCF). To improve the discriminative ability of the new feature vectors, we used a two-step feature selection strategy to optimize the PF, CF and PCF feature vectors. In the first step, we used the XGB classifier, which is widely used in feature importance analysis 23,37, to rank the features in PF, CF and PCF according to their importance scores, so that higher-ranked features in the resulting list are the more important ones. In the second step, we constructed n different feature subsets containing the top-ranked features, ranging from the top 5 to the top 100 features with an interval of 5. Then, we inputted all feature subsets into mRF models and optimized the mRF models' parameters using the tenfold cross-validation scheme. The feature subset achieving the highest Matthews correlation coefficient (MCC) was considered as the optimal feature subset. The implementation of these classifiers in the two-step feature selection strategy is the same as that used in our previous studies 18,31,38–41.
Performance evaluation strategies. In order to examine the performance of our proposed predictor, we used five common statistical metrics, namely ACC, MCC, sensitivity (Sn), specificity (Sp) and the area under the receiver operating characteristic curve (AUC) 24,42. ACC is defined as follows:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively.
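For reference, the sketch below computes ACC together with the standard confusion-matrix definitions of Sn, Sp and MCC. The function name `confusion_metrics` and the example counts are illustrative assumptions and are not results from this study; AUC is omitted because it is computed from ranked probability scores rather than from the confusion matrix.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Return ACC, Sn, Sp and MCC from confusion-matrix counts.

    Assumes both classes are present, i.e. tp + fn > 0 and tn + fp > 0.
    """
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)  # sensitivity: fraction of positives recovered
    sp = tn / (tn + fp)  # specificity: fraction of negatives recovered
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "Sn": sn, "Sp": sp, "MCC": mcc}


# Hypothetical counts for a balanced test set of 63 positives and 63 negatives.
metrics = confusion_metrics(tp=50, tn=55, fp=8, fn=13)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```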

Results and discussion
Performance evaluation between different classifiers and feature encodings. In this section,
Performance evaluation of different stacked models. As mentioned in the "Materials and methods" section, we designed and developed three different stacked models based on three types of new feature representations consisting of PF (130D), CF (130D) and PCF (260D). Specifically, these three new feature representations were inputted to the RF algorithm to develop three different mRF models. The performance comparison results amongst the three mRF models are provided in Tables 3 and 4. The PF and PCF feature vectors clearly achieved better performance than the CF feature vector in terms of all performance metrics on both the tenfold cross-validation and independent tests. To further improve the discriminative ability of our new features, we utilized the two-step feature selection scheme to optimize the PF, CF and PCF feature vectors. Herein, the feature selection scheme identified 50, 5 and 5 informative PFs, CFs and PCFs, respectively, for generating three optimal feature sets. Tables 3 and 4 show that the three optimal feature sets attained similar performance on the tenfold cross-validation test. On the independent test, the optimal PF feature vector outperformed the other feature sets in terms of four out of five performance metrics (i.e. ACC, Sp, MCC and AUC). In particular, the ACC, Sp, MCC and AUC of the optimal PF feature vector were 0.881, 0.952, 0.770 and 0.922, respectively (  Fig. 3, we observe that SCORPION performed better than the model without the optimal PF feature vector in terms of all five performance metrics on both the training and independent datasets. Impressively, the ACC, Sn, Sp, MCC and AUC of SCORPION were 10.40%, 7.55%, 8.54%, 20.78% and 4.61% higher, respectively, than those of the model without the optimal PF feature vector on the independent dataset.
Table 3. Cross-validation results for different feature representations using class and probabilistic information.
After that, we compared the optimal PF feature vector with the 13 different feature descriptors. As can be seen from Supplementary Tables S5 and S6, amongst the 13 feature descriptors, the five best-performing descriptors in terms of cross-validation MCC were PSSM_COM, PSSM_AAC, AAC, PSSM_DP and EAAC. Here, we built RF classifiers with the five best-performing descriptors and evaluated the RF classifiers' performance on the tenfold cross-validation and independent tests. The performance comparison results between the optimal PF feature vector and these five best-performing descriptors are depicted in Fig. 4. Meanwhile, Supplementary Table S5 shows that the highest cross-validation ACC and MCC of 0.868 and 0.743, respectively, were achieved by using the optimal PF feature vector, while PSSM_COM achieved the second highest cross-validation ACC and MCC of 0.814 and 0.633, respectively. On the independent test, the optimal PF feature vector significantly outperformed the second-best descriptor in terms of four out of five performance metrics (i.e. ACC, Sp, MCC and AUC). Specifically, the optimal PF feature vector's ACC, Sp, MCC and AUC were 12.70%, 25.40%, 25.87% and 12.22% higher, respectively, than those of the second-best descriptor. In addition, we compared the distribution of the feature space of the optimal PF feature vector and the five best-performing descriptors on the training dataset by using t-distributed stochastic neighbor embedding (t-SNE) as implemented in scikit-learn (version 0.22) 44,45. Figure 5 shows six t-SNE plots representing the distributions of positive (red spots) and negative (green spots) samples in a 2D feature space. A clear separation between red and green spots was achieved in the feature space of the optimal PF feature vector.
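A two-dimensional embedding of this kind can be produced with scikit-learn's `TSNE`. The sketch below is only an illustration under stated assumptions: the feature matrix is synthetic (two Gaussian clusters standing in for the PF vectors of positive and negative samples), and the perplexity, initialisation and random seed are arbitrary choices rather than the settings used in this study.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for a PF feature matrix (assumption): 50 features per sample,
# with positives centred near 0.7 and negatives near 0.3.
rng = np.random.default_rng(0)
n_per_class, n_features = 50, 50
X_pos = rng.normal(0.7, 0.15, size=(n_per_class, n_features))
X_neg = rng.normal(0.3, 0.15, size=(n_per_class, n_features))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * n_per_class + [0] * n_per_class)

# Project the samples into a 2D space.
embedding = TSNE(n_components=2, perplexity=30, init="pca",
                 random_state=0).fit_transform(X)

# Class centroids in the embedded space give a crude view of the separation.
centroid_pos = embedding[y == 1].mean(axis=0)
centroid_neg = embedding[y == 0].mean(axis=0)
print("centroid distance:", np.linalg.norm(centroid_pos - centroid_neg))
```

The two columns of `embedding` can then be drawn as a scatter plot coloured by `y` to obtain a panel comparable to those in Fig. 5.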
Finally, we compared the predictive performance of SCORPION against its constituent baseline models. Figure 2 shows that MLP-PSSM_COM achieved the highest cross-validation ACC and MCC among the baseline models. As can be seen from Fig. 6, SCORPION outperformed MLP-PSSM_COM in terms of all performance metrics on both the training and independent datasets. Remarkably, SCORPION's ACC, Sp, MCC and AUC were 10.32%, 19.05%, 21.40% and 6.35% higher, respectively, than those of MLP-PSSM_COM. This confirms that the optimal PF feature vector derived from the integration of various ML classifiers is beneficial for PVP identification and can improve the model's predictive performance.
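To make the stacking procedure evaluated in this section concrete, the following minimal sketch builds PF, CF and PCF vectors from a small pool of baseline models and then applies a two-step feature selection with an RF meta model. It is a sketch under stated assumptions, not the authors' implementation: the data come from `make_classification`, four column blocks play the role of feature descriptors, three classifiers replace the ten used in the study, scikit-learn's `GradientBoostingClassifier` is swapped in for XGB as the feature ranker, the PFs are taken as out-of-fold probabilities, and the subset sizes (2 to 12 in steps of 2) are scaled down from the top 5 to top 100 range.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import (StratifiedKFold, cross_val_predict,
                                     cross_val_score)
from sklearn.naive_bayes import GaussianNB

# Synthetic data; each block of columns mimics one feature descriptor (assumption).
X, y = make_classification(n_samples=200, n_features=40, n_informative=10,
                           random_state=0)
blocks = np.array_split(np.arange(X.shape[1]), 4)
learners = {
    "LR": LogisticRegression(max_iter=1000),
    "NB": GaussianNB(),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Step 1: every (descriptor, learner) baseline model contributes one PF column,
# here the out-of-fold predicted probability of the positive class.
pf_columns = []
for cols in blocks:
    for clf in learners.values():
        proba = cross_val_predict(clf, X[:, cols], y, cv=cv,
                                  method="predict_proba")
        pf_columns.append(proba[:, 1])
PF = np.column_stack(pf_columns)   # probabilistic features
CF = (PF > 0.5).astype(int)        # class features: 1 if probability > 0.5
PCF = np.hstack([PF, CF])          # combination of PF and CF

# Step 2a: rank the PFs by importance (gradient boosting stands in for XGB).
ranker = GradientBoostingClassifier(random_state=0).fit(PF, y)
order = np.argsort(ranker.feature_importances_)[::-1]

# Step 2b: evaluate top-k subsets with an RF meta model and keep the best MCC.
mcc_scorer = make_scorer(matthews_corrcoef)
results = {}
for k in range(2, PF.shape[1] + 1, 2):
    meta = RandomForestClassifier(n_estimators=100, random_state=0)
    results[k] = cross_val_score(meta, PF[:, order[:k]], y, cv=cv,
                                 scoring=mcc_scorer).mean()
best_k = max(results, key=results.get)
print(f"best subset size: {best_k}, cross-validation MCC: {results[best_k]:.3f}")
```

The same loop can be run on `CF` or `PCF` in place of `PF` to compare the three feature representations.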

Model interpretation.
In this section, we utilized the SHAP approach to analyze feature importance for SCORPION and three selected baseline models (i.e. RF-AAC, XGB-DPC and LR-XGB) in order to provide a better understanding of how these models generate their prediction outcomes. The impact of each feature on these models' prediction outcomes is illustrated in Fig. 7. To be specific, Fig
In the case of the tenfold cross-validation results, SCORPION and iPVP-MCV achieved better performance than Meta-iPVP in terms of all performance metrics (Table 5). In addition, SCORPION secured the best predictive performance on the independent dataset, while iPVP-MCV attained the second-best performance. Specifically, SCORPION significantly outperformed the compared existing methods in terms of ACC, Sp and MCC, while iPVP-MCV achieved the best Sn (Table 6). SCORPION's ACC, Sp and MCC were 4.80%, 17.44% and 9.88% higher, respectively, than those of iPVP-MCV. Altogether, our comparative results indicate that our predictor attained the best predictive performance for PVP identification as compared to the existing methods. The significant improvement of our predictor SCORPION can be attributed to three major reasons. First, our predictor was trained and optimized using an up-to-date dataset established by Charoenkwan et al. 18 containing a larger number of PVPs and non-PVPs than other datasets. Second, our predictor combined various sequence-based feature descriptors from different perspectives consisting of compositional information, composition-transition-distribution information, position-specific information and physicochemical properties. Third, the two-step feature selection scheme was utilized to identify the most informative features that can help to precisely discriminate PVPs from non-PVPs.

Conclusions
In this study, we introduced SCORPION, a novel stacked machine learning-based approach for accurate identification of PVPs. Specifically, SCORPION employed 13 different feature encoding schemes (categorized into four main groups) to encode PVP and non-PVP sequences and used 10 popular ML algorithms to build a pool of baseline models. These baseline models were then used to generate the PF feature vector, which was considered as a new feature representation. Finally, the PF feature vector was optimized by using a two-step feature selection strategy, and the resulting optimal feature vector was used to develop the stacked model (SCORPION).
Extensive benchmarking experiments show that SCORPION was effective and outperformed its constituent baseline models. In addition, when compared with five well-known existing methods (i.e. PVPred, PVP-SVM, PVPred-SCM, Meta-iPVP and iPVP-MCV) on the independent dataset, SCORPION achieved superior predictive performance for PVP identification in terms of ACC (0.873), Sp (0.905), MCC (0.748) and AUC (0.891), thereby highlighting its effectiveness and generalizability. We anticipate that SCORPION will be a valuable tool for facilitating antibacterial drug discovery and development.

Data availability
All the data used in this study are available at https://github.com/saeed344/SCORPION.