Knowledge-transfer learning for prediction of matrix metalloprotease substrate-cleavage sites

Matrix Metalloproteases (MMPs) are an important family of proteases that play crucial roles in key cellular and disease processes. Therefore, MMPs constitute important targets for drug design, development and delivery. Advanced proteomic technologies have identified type-specific target substrates; however, the complete repertoire of MMP substrates remains uncharacterized. Indeed, computational prediction of substrate-cleavage sites associated with MMPs is a challenging problem. This holds especially true when considering MMPs with few experimentally verified cleavage sites, such as for MMP-2, -3, -7, and -8. To fill this gap, we propose a new knowledge-transfer computational framework which effectively utilizes the hidden shared knowledge from some MMP types to enhance predictions of other, distinct target substrate-cleavage sites. Our computational framework uses support vector machines combined with transfer machine learning and feature selection. To demonstrate the value of the model, we extracted a variety of substrate sequence-derived features and compared the performance of our method using both 5-fold cross-validation and independent tests. The results show that our transfer-learning-based method provides a robust performance, which is at least comparable to traditional feature-selection methods for prediction of MMP-2, -3, -7, -8, -9 and -12 substrate-cleavage sites on independent tests. The results also demonstrate that our proposed computational framework provides a useful alternative for the characterization of sequence-level determinants of MMP-substrate specificity.

feature-selection method. Our results suggest that the transfer learning method can provide a robust performance, at least comparable to the latter method, indicating that the cross-domain knowledge transfer is a promising method for dealing with substrate-cleavage site prediction of MMPs with limited substrate data. In summary, our proposed transfer-learning-based method is a useful and complementary approach to existing studies of protease substrate cleavage site prediction. Our method proves most useful in cases where cleavage site data is limited. All source codes of our proposed method, as well as the benchmark datasets used in this study, are freely available and part of the supplementary material at http://lightning.med.monash.edu/tl/.

Analysis of sequence-level determinants of MMP-substrate specificity.
Our first goal was to better understand the sequence-level determinants associated with MMP substrate-cleavage sites and to explore the efficiency of transfer-learning techniques in this line of research. By using curated MMP-substrate datasets, we analyzed the occurrences of amino acid residues surrounding the P8-P8′ sites. To identify common patterns among different MMP substrate-cleavage sites from both source and target domains, we subsequently rendered sequence-logo representations using the pLogo program 78 . As shown in Fig. 1, even though the residue distribution clearly varied among different MMP substrate-cleavage sites, they still exhibited similar sequence patterns. Remarkably, glycine was significantly overrepresented at the P7, P4, P1, P3′, and P6′ positions, proline was overrepresented at the P3 and P5′ positions, and leucine was overrepresented at the P1′ position (p < 0.05; Fig. 1), which we note is consistent with previous findings 79 .
We also observed significant overrepresentation of acidic residues surrounding the substrate-cleavage sites of multiple MMPs, including those at positions P7, P4, P3, P1, P3′, and P6′. While a predictive sequence motif was not readily apparent, we found that MMP substrate-cleavage sites were preferentially located in regions characterized by depletion of arginine residues in the N-terminal region and lysine and histidine in both the N- and C-terminal regions, with enrichment of acidic residues to a lesser extent (Fig. 1). Notably, while other machine-learning algorithms require additional training data, the substrate-cleavage site patterns shared by the six MMPs enabled us to use the transfer-learning framework to train the cleavage site prediction models. We would like to note that the heterogeneity of sequence patterns presented in Fig. 1 is not uncommon; in fact, heterogeneity has been reported in other studies in the literature. Here, the sequence logo plots were generated using the state-of-the-art sequence logo software program pLogo 78 . In contrast to traditional logo software, which essentially relies upon residue frequencies to graphically scale character heights, pLogo generates probabilistic sequence logos whose characters are scaled relative to their statistical significance.
Figure 1. Sequence-logo representations of the occurrences of amino acid residues surrounding the substrate-cleavage sites (from P8 to P8′ positions) of MMP-2, -3, -7, -8, -9, and -12. Sequence logos were generated using the pLogo program 78 . The red horizontal lines on the pLogo plots denote the statistical significance threshold (p = 0.05).
The overall framework. In the experiments, each of the six datasets (namely MMP-2, -3, -7, -8, -9 and -12) was used as the test data in the target domain. To build the model based on knowledge-transfer learning, the data of the other five MMPs were used as the knowledge data in the source domain. An illustration of the proposed transfer-learning framework for the substrate-cleavage site predictions of MMP-2, -3, -7, -8, -9 and -12 is provided in Fig. 2. There are four major stages in the development of this framework: data preprocessing, feature encoding, model construction, and performance evaluation. During the data preprocessing stage, the CD-HIT 80 algorithm was used to cluster homologous sequences with ≥70% sequence identity in the datasets and to reduce sequence redundancy, which can potentially lead to biased model training. Positive and negative samples were extracted with a ratio of 1:3 from the substrate datasets of MMP-2, -3, -7, -8, -9 and -12 using a sliding window of 16 amino acids (P8-P8′). Subsequently, eight feature-encoding schemes were used to generate the input feature sets used for model training. Models were then constructed for MMP-2, -3, -7, -8, -9 and -12, respectively, using the proposed transfer-learning method to predict their substrate-cleavage sites. To benchmark the performance of the transfer learning method, we built a baseline model after merging the substrate data of all six MMPs. This baseline did not discriminate between knowledge extracted from the source domain and knowledge extracted from the target domain; in other words, the baseline method did not benefit from a knowledge-transfer procedure. Detailed descriptions of feature encoding, feature selection, transfer learning-based model construction and parameterization, and performance evaluation can be found in the "Materials and methods" section.
Predictive performance of transfer learning-based models for substrate-cleavage site prediction of MMPs from the source domain. As aforementioned, the effective application of transfer learning requires high-fidelity transfer of the knowledge. In our case, this implies that the extracted common knowledge of MMPs from the source domain must be sufficient for the prediction of substrate-cleavage sites of MMPs in the target domain. To examine this, the predictive performances of SVR models of MMP-2, -3, -7, -8, -9, and -12, which were trained using the extracted common knowledge of MMPs in the source domain, were examined for their ability to predict substrate-cleavage sites of respective MMPs using five-fold cross-validation tests.
Eight different encoding schemes were used to generate a variety of features that describe the knowledge of substrate cleavage sites of the MMPs (see the "Sequence-encoding schemes" section for details). These sequence encoding schemes have proved useful for the prediction of protease substrate cleavage sites and other post-translational modification sites [50][51][52]81 . Overall, a total of 4461 initial features (see Table 1 for a full list) were extracted for encoding the positive and negative cleavage sites of the six MMPs (please refer to the Materials and Methods section for details of positive and negative sites). As some of the extracted initial features are noisy and irrelevant for the prediction, we subsequently applied the minimum redundancy maximum relevance (mRMR) algorithm 82 to select the top ranked features. In this work, 50 top ranked features were selected to encode substrate cleavage sites of MMPs from the source domain. Such features represent the extracted knowledge of substrate-cleavage sites of MMPs from the source domain, which are to be transferred to other MMPs in the target domain. Across the eight feature-encoding schemes (Binary, PSSM, AAindex, BLOSUM, CKSAAP, DISO, CHR, and AAC; Table 1), after the mRMR feature selection, the majority of features were selected from the Binary-encoding scheme, followed by the PSSM scheme. Both sequence encoding schemes have been found to be particularly useful for the prediction of protease cleavage sites in previous studies [50][51][52] . Our results here confirm that they indeed provide an enriched set of selected features, highlighting the usefulness of Binary features and evolutionary information in the form of PSSM for the prediction of substrate cleavage sites [50][51][52] and also other types of protein post-translational modification sites 76,77,83 .
To demonstrate the predictive performance of SVR models trained using the extracted common knowledge and evaluate their fidelity in retaining such knowledge, receiver operating characteristic (ROC) curves were derived and AUC values were calculated (Fig. 3). In particular, SVR models trained using the extracted common knowledge achieved AUC values ranging from 0.856 to 0.937 for substrate-cleavage site prediction of the six MMPs. These results suggest that the top 50 features extracted from the source domain may well capture regularities that hold across multiple MMPs and hence such common features can be useful for the cleavage site prediction of other MMPs following knowledge transfer learning.
Performance comparison of transfer-learning and baseline methods. In this section, we discuss and compare the performances of the transfer-learning and the baseline methods for the prediction of substrate cleavage sites of MMP-2, -3, -7, -8, -9, and -12, based on five-fold cross-validation tests.
For the transfer-learning method, the final features used for training the predictive models of each MMP included both common knowledge extracted from the source domain and other novel features extracted from the target domain. The latter features were selected using the mRMR algorithm 82 . The number of extracted features for each sequence encoding scheme varied for the six MMPs. While a significantly lower number of features from the target domain were extracted and chosen for MMP-7, four sequence encoding schemes had relatively larger numbers of selected features: Binary, CKSAAP, PSSM, and BLOSUM. On the other hand, the baseline method used the top 100 features selected by the mRMR algorithm to build the prediction model. The predictive performances of the transfer-learning and baseline methods were evaluated based on five-fold cross-validation tests for all six MMPs (Fig. 4 and Table 2). As can be seen, the transfer-learning method achieved slightly lower AUC scores for the MMP-3 and MMP-12 substrate-cleavage site predictions than the baseline method. In contrast, for substrate-cleavage site predictions of MMP-2, -7, -8, and -9, the transfer-learning method achieved AUC values of 0.914, 0.938, 0.903, and 0.860, respectively, compared with 0.910, 0.910, 0.879, and 0.858, respectively, for the baseline method. In terms of accuracy, the transfer learning method achieved a tangible performance improvement, i.e. an increase in accuracy of nearly 5% for MMP-7 and 3% for MMP-2. In terms of sensitivity, the performance improvement is clearer. For example, the transfer learning method achieved an increase in sensitivity of 10.4%, 11.4% and 6.5% for MMP-2, -7 and -9, respectively (Table 2).
Altogether, these results suggest that through effective integration of the common knowledge shared by multiple proteases, extracted from the source domain, the transfer-learning method indeed shows great promise for providing a superior or, at the very least, competitive predictive performance when compared to that of the baseline method. Lastly, we performed an independent test, and the prediction performances of the proposed transfer learning methods and baseline method are shown in Fig. 5. We note the transfer learning method achieved an AUC value of 0.734, which is competitive and comparable to that of the baseline method (AUC = 0.738).
In order to better understand why the transfer learning method performed worse than the baseline method when predicting MMP-3 and MMP-12 substrate cleavage sites, we calculated the MMPs' cleavage entropies 84  It is therefore conceivable that the knowledge transferred from other proteases to aid the learning of MMP-3 substrate cleavage sites effectively reduces the predictive performance. Taking all of these results together, it seems that transferred knowledge is useful in most cases but ineffective for predicting cleavage sites of MMPs at both extreme ends of the spectrum (high vs. non-specific).

Discussion
In this work, we present a new transfer-learning framework for the prediction of MMP substrate cleavage sites and validate its usefulness by applying this framework to learn the knowledge from the source domain (MMP-9 and MMP-12) to improve the prediction of cleavage sites of other MMPs (MMP-2, -3, -7, and -8) in the target domain. Benchmarking experiments indicate that this framework is robust and particularly attractive when predicting cleavage sites of MMPs with limited training data. Overall, our results indicate that this new framework provides a useful alternative for the characterization of sequence-level determinants of MMP-substrate specificity.
When using machine-learning techniques to extract knowledge of value from datasets, a typical assumption (and also a prerequisite) is that there should be a sufficient amount of well-annotated training data to enable the construction of a robust and reliable prediction model. However, in many bioinformatics applications related to biological data mining, this assumption often fails due to intrinsic limitations of the underlying experimental techniques and the limited amount of experimental data generated through such techniques. In this study, we focused on providing efficient solutions for the accurate prediction of substrate-cleavage sites of MMP-2, -3, -7, -8, -9, and -12, for which a limited number of experimentally validated cleavage sites have been previously reported. We developed two models, which were based on a transfer learning method and a baseline method in combination with conventional feature selection, respectively, and compared their predictive performances for each of the six MMPs using benchmark datasets.
Here, we have provided a useful, complementary and alternative approach to predict substrate-cleavage sites of MMPs, based on knowledge transfer learning. Our work adds value to and complements existing approaches, in particular when dealing with insufficient data and/or datasets with limited sizes, where other methods recurrently fail. In the future, we expect sufficiently large heterogeneous cleavage site data to become available through distinct experimental approaches (e.g. cleavage site data, high-throughput proteomic approaches vs. low-throughput gel-based assays at the full-length substrate sequence level for each individual MMP) and it will be worth investigating their potential impact on the prediction performance of cleavage sites.
For added clarity, we would also like to emphasize several major differences between this work and our previous PROSPER work 50 : (i) The datasets used are different: the current work used substrate cleavage datasets of the MMP proteases, while PROSPER represents a generic approach that can be used to predict potential cleavage sites of 23 protease types; … based on transferring useful knowledge learned from limited datasets, which complements existing studies, while the primary aim of the PROSPER work was to provide a publicly available bioinformatics tool for computational prediction of multiple proteases.
Table 2. Predictive performance of the transfer-learning method and baseline method evaluated based on the five-fold cross validation tests. *Model: TL, Transfer-learning method; BL, Baseline method.
Altogether, although both PROSPER and our transfer learning-based approach used support vector regression algorithms as the primary resource to train the prediction models for protease substrate cleavage sites, there exist major differences between the PROSPER approach and the new transfer learning-based approach presented in this work.
Our results presented herein demonstrated an MMP-specific predictive performance. Moreover, for MMP-2, -7, -8, and -9 substrate-cleavage site prediction, the transfer learning method outperformed the baseline method, highlighting the contribution of common knowledge extracted from the MMPs in the source domain to the cleavage site prediction of each individual MMP. We anticipate this proposed transfer-learning-based framework will greatly facilitate the prediction of substrate-cleavage sites and further our understanding of the substrate specificity of MMPs. More generally, it provides a useful and complementary strategy to approach tasks associated with biological predictions using a limited supply of training samples. Lastly, while we recommend the proposed models be refined when more experimentally validated substrate-cleavage sites become available, this study provides a valuable method for the accurate prediction of substrate-cleavage sites. We expect that our findings and the proposed strategies will be inspirational and valuable for a number of biotechnology and biomedical applications where extraction of domain-common and specific knowledge is often required.

Materials and Methods
Non-redundant datasets. In this study, all experimentally verified substrates and their substrate-cleavage site annotations were extracted from the MEROPS database, which is a comprehensive, integrated knowledgebase for proteases, substrates, and inhibitors 85 . To avoid potential over-fitting, we performed sequence-homology reduction in the extracted substrate datasets using the CD-HIT 80 program with a 70% sequence-identity threshold. To ensure proper machine-learning-based model training and performance assessment, we only considered MMPs that had at least 50 experimentally validated substrate-cleavage sites at the time of this study. The above filters resulted in a final set of six MMPs, 210 substrate sequences, and 942 cleavage sites. When the substrate dataset of an MMP was used as the dataset in the target domain, the remaining five MMP datasets were used as the dataset in the source domain. Table 3 provides a statistical summary of the MMP-specific substrate datasets used in this study.
To extract the positive (i.e. cleavage sites) and negative (i.e. non-cleavage sites) peptide sequences, we further truncated the substrate sequences using a local sliding window, 16 residues in length, where the cleavage site was symmetrically flanked by eight upstream and eight downstream residues. Previous studies have shown the presence of important residue positions that might be involved in the substrate recognition of MMPs, for example, P4 or P3 to P2′ or P3′ positions for substrate recognition and P5-P4′ positions from protein structure point of view for substrate binding 50,86 . In the current work, we employed a uniform window size of 16 amino acids (i.e. P8-P8′) to include extended neighbouring sequence environments that potentially could have an influence on the substrate determination. The number of negative samples was much larger than that of positive samples, which could lead to biased model training in favor of negative samples. To address this data imbalance, we adopted a re-sampling strategy using a ratio of 1:3 between the positive and negative samples as previously suggested 50 .
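The window extraction and 1:3 re-sampling described above can be sketched as follows. This is a minimal illustration, not the original implementation: the function names, the 0-based position convention, the random seed, and the choice to skip windows that would run past a terminus are all assumptions made here for the example.

```python
import random


def extract_windows(sequence, cleavage_positions, flank=8):
    """Split a substrate into 16-residue windows (P8-P8').

    A position p means the scissile bond lies between sequence[p - 1] (P1)
    and sequence[p] (P1'), 0-based. Windows that would extend past either
    terminus are skipped (an assumption; the text does not specify this).
    """
    cleaved = set(cleavage_positions)
    positives, negatives = [], []
    for p in range(flank, len(sequence) - flank + 1):
        window = sequence[p - flank:p + flank]
        (positives if p in cleaved else negatives).append(window)
    return positives, negatives


def resample(positives, negatives, ratio=3, seed=0):
    """Randomly keep at most `ratio` negatives per positive (1:3 by default)."""
    rng = random.Random(seed)
    n_neg = min(len(negatives), ratio * len(positives))
    return positives, rng.sample(negatives, n_neg)


# Toy substrate of 40 residues with two hypothetical cleavage positions.
toy_sequence = "ACDEFGHIKLMNPQRSTVWY" * 2
pos, neg = extract_windows(toy_sequence, cleavage_positions=[10, 20])
pos, neg_sampled = resample(pos, neg)
```

In the study, candidate negatives were additionally filtered by predicted solvent accessibility; that step is not modelled in this sketch.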
Additionally, reliable non-cleavage sites should be solvent inaccessible, given that residues located in the core of the protein structure would likely be inaccessible to proteases 87 . Therefore, to facilitate the selection of reliable negative samples, we used the NetsurfP 88 software, which allowed us to predict the solvent accessibility of the P1 residues in substrate proteins. Solvent-inaccessible P1 residues (shown as 'b' in the output of NetsurfP) that were not annotated as cleavage sites were preferentially selected as reliable negative samples.
To evaluate the performance of the proposed transfer learning approach, we further constructed a dataset for independent testing. We first attempted to split our data into a training and a testing dataset. However, the resulting testing dataset was too small to provide any meaningful independent test results. To address this issue for MMP-2 substrates, we constructed an independent test dataset using the latest version of MEROPS (Release 11.0), which had recently been substantially extended with experimental substrate cleavage data for MMP-2. After removing the overlapping and homologous sequences (by clustering sequences using CD-HIT at the 70% sequence-identity threshold) from the training dataset, we obtained 714 sequences with 1,433 cleavage sites as the independent test set for MMP-2.
Table 3. Statistical summary of MMP-specific substrate datasets used in this study.
Sequence-encoding schemes. Sequence-encoding schemes play an important role in determining the predictive performance of machine-learning-based models. Here, we used eight different sequence-encoding schemes for training SVR models based on a combination of various types of features. The sequence-encoding schemes and their corresponding feature dimensions are shown in Table 1. The window size of the sample, L (or the length of the segment) in this work was 16, and the total dimension for AAindex, AAPair, Binary, BLOSUM, CHARGE-Hyd, CKSAAP, DISOPRED, and PSSM was 4461, meaning that for each sample there is a 4461-dimensional input for the SVR. Detailed descriptions of these encoding schemes can be found in previous work 50,81 . Here, we applied a computational tool to convert the amino acid segments to numerical vectors, including over ten different kinds of encoding schemes, as proposed by Chen et al. 81 . The detailed information and dimensionality of each encoding scheme is described below.
AAindex. AAindex 89 (v9.1) is a database containing 544 indices of amino acid physicochemical properties, such as alpha-CH chemical shifts and hydrophobicity index. Previous studies demonstrated that 64 of 544 indices are informative and beneficial for predictive tasks in a number of computational bioinformatics studies 90 . Therefore, we chose these 64 high-quality indices for use in our study. As a result, the AAindex-derived features were encoded as an L × 64 = 16 × 64 = 1024-dimensional real-valued vector, where L is the length of the segment.
AAPair. AAPair, also called Amino Acid Composition (AAC), has been widely used in a variety of protein-sequence analyses and predictions, including substrate-cleavage site prediction 50 and ubiquitination-site prediction 76,81 . In this study, it was used to calculate the frequencies of amino acids surrounding the cleavage site. Therefore, each segment in our datasets was encoded as a 20-dimensional vector.
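As an illustration of the AAPair (amino acid composition) encoding, the following sketch counts the 20 standard residues in a segment and normalises by segment length. The normalisation by length and the handling of non-standard residues (ignored here) are assumptions; the text only states that residue frequencies were computed.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, alphabetical by one-letter code


def aac_encode(segment):
    """Return a 20-dimensional vector of residue frequencies for one segment."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in segment:
        if residue in counts:  # non-standard residues are ignored (assumption)
            counts[residue] += 1
    return [counts[aa] / len(segment) for aa in AMINO_ACIDS]


# Hypothetical 16-residue segment used only for illustration.
aac_vector = aac_encode("GGGGPPPPLLLLAAAA")
```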
Binary representation of amino acids in the segments. The amino acids flanking cleavage and non-cleavage sites were accounted for by using the binary sequence-encoding scheme as previously described [50][51][52] . Each amino acid residue was transformed into a 20-dimensional binary vector, alphabetically sorted, and was represented by a combination of ones and zeros, e.g., alanine (10000000000000000000), cysteine (01000000000000000000), etc. Apart from the 20 standard amino acids, we used '00000000000000000000' to represent unnatural amino acids when necessary. Each segment in our datasets containing L = 16 amino acids was encoded as an L × 21 = 16 × 21 = 336-dimensional binary vector.
BLOSUM62 matrix. The BLOSUM62 matrix was used to extract primary sequence information. A vector of L × 21 elements was used to represent each segment in our datasets, where L is the length of the segment and 21 represents the 20 standard amino acids and an additional one representing non-conserved amino acid residues. Therefore, the BLOSUM-derived features for a segment of length L = 16 comprised a 16 × 21 = 336 dimensional vector 76 .

CHARGE-Hyd.
We also used CHARGE-Hyd 76,91,92 to calculate the charge and hydrophobicity of each amino acid segment contained in our datasets. Extracted information for each 16-residue segment included the mean net charge, the aromatic content, and the charge:hydrophobicity ratio, with each feature consisting of a three-dimensional vector. The resulting dimensionality of the CHARGE-Hyd-based features for each segment in our dataset was 3 × 3 = 9 76 .
CKSAAP. Composition of k-space Amino Acid Pair (CKSAAP) 81 encoding was used to calculate the amino acid pairwise frequencies for segments contained in our datasets. When k = 0, this indicates that there are 400 amino acid pairs (i.e., AA, AC, AD, …, YY) and that the encoded vector can be defined as:
$$\left(\frac{N_{AA}}{N_{Total}}, \frac{N_{AC}}{N_{Total}}, \frac{N_{AD}}{N_{Total}}, \ldots, \frac{N_{YY}}{N_{Total}}\right)_{400}$$
where each numerator is the number of occurrences of the corresponding pair in the fragment and N_Total is the total number of k-spaced pairs in the fragment. The values of N_Total for a 16 amino acid fragment were 15, 14, 13, 12, 11, and 10 for spaces k = 0, 1, 2, 3, 4, and 5, respectively. When a fragment is located at the N- or C-terminus, the value of N_Total was adjusted accordingly. In our study, the CKSAAP-encoding scheme was used with k = 0, 1, 2, 3, 4, and 5. Accordingly, the dimensionality of the resulting feature vector was 6 × 400 = 2400 81,93 .
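A minimal sketch of the CKSAAP encoding for k = 0 to 5 is given below. It assumes a full-length fragment of standard residues (longer than k + 1 residues for every k used); pairs containing non-standard residues are simply not counted, which is an assumption of this example rather than a documented behaviour of the original tool.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # AA, AC, ..., YY


def cksaap_encode(segment, k_values=range(6)):
    """Return the concatenated k-spaced pair frequencies (400 values per k)."""
    features = []
    for k in k_values:
        n_total = len(segment) - k - 1  # 15, 14, ..., 10 for a 16-residue segment
        counts = dict.fromkeys(PAIRS, 0)
        for i in range(n_total):
            pair = segment[i] + segment[i + k + 1]  # residues separated by k positions
            if pair in counts:
                counts[pair] += 1
        features.extend(counts[p] / n_total for p in PAIRS)
    return features


# Hypothetical segment: all glycine, so only the "GG" pair occurs.
cksaap_vector = cksaap_encode("G" * 16)
```

For this toy segment, the "GG" entry of each of the six 400-value blocks equals 1 and every other entry is 0.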
DISOPRED. Previous studies indicated that incorporation of natively disordered regions [50][51][52] can be useful for protease substrate-cleavage site prediction. Therefore, we accounted for native-disorder features using DISOPRED 94 to predict the native-disorder profiles of substrate sequences. DISOPRED outputs the predicted probability for each residue being disordered (denoted by '*') or ordered (denoted by '.'). Native-disorder features were encoded as an L × 1 = 16 × 1 = 16-dimensional vector based on the probabilities associated with the corresponding residues.
Position-specific scoring matrix (PSSM). PSSM represents the occurrence probability for each type of amino acid at each corresponding position. PSSM profiles are widely used in many biological data analyses as a primary sequence-derived feature. In our study, PSSM profiles were generated for each sequence in the dataset using PSI-BLAST 95 against the UniRef90 protein database to yield an essential sequence-derived input feature.
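As a quick consistency check on the feature dimensions quoted in this section, the per-scheme sizes can be tallied as below. The PSSM dimension is not stated explicitly in the text; the value 16 × 20 = 320 used here is an assumption inferred from the reported total of 4461 features.

```python
WINDOW = 16  # L, residues per segment (P8-P8')

dimensions = {
    "AAindex": WINDOW * 64,   # 64 selected indices per residue
    "AAPair": 20,             # one frequency per standard amino acid
    "Binary": WINDOW * 21,
    "BLOSUM": WINDOW * 21,
    "CHARGE-Hyd": 3 * 3,
    "CKSAAP": 6 * 400,        # k = 0..5, 400 pairs each
    "DISOPRED": WINDOW * 1,
    "PSSM": WINDOW * 20,      # assumption: inferred from the 4461 total
}
total_dimension = sum(dimensions.values())
```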
Feature selection. Given that the initial features extracted from multiple sources are heterogeneous, we performed feature selection to remove any noisy and/or misleading features by using the mRMR 96 algorithm, which ranks the importance of all features. mRMR is able to rank features based on their relevance according to the response variables (labels) and redundancy among the features. Therefore, optimal candidate features can be identified and selected after performing mRMR calculations, thereby enhancing predictive performance.
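The relevance/redundancy trade-off behind mRMR can be illustrated with a small greedy sketch. This is not the mRMR program used in the study: it assumes discrete feature values, uses the difference form of the criterion (relevance minus mean redundancy), and all names are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def mrmr_rank(columns, labels, top_k):
    """Greedily rank features by relevance to labels minus mean redundancy.

    `columns` maps a feature name to its list of discrete values.
    """
    relevance = {name: mutual_information(col, labels) for name, col in columns.items()}
    selected, remaining = [], list(columns)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(columns[name], columns[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < top_k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: "a_dup" duplicates "a", so it is penalised for redundancy.
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_columns = {
    "a": [0, 0, 0, 1, 1, 1, 1, 1],
    "a_dup": [0, 0, 0, 1, 1, 1, 1, 1],
    "b": [0, 0, 0, 0, 0, 1, 1, 1],
}
toy_ranking = mrmr_rank(toy_columns, toy_labels, top_k=3)
```

In the toy ranking, the duplicated feature is placed last even though it is as relevant as the others, because it adds no information beyond the feature it copies.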
Model training and parameterization. Support vector machine (SVM) is a supervised machine learning technique that has been widely applied to solve a variety of classification problems. In practice, SVM has two modes: the classification mode and the regression mode 97 . In this study, we used the regression mode, i.e. SVR, to train models for the prediction of MMP substrate-cleavage sites. In particular, for SVR, the real-valued prediction output value associated with each sample (either positive or negative) can be readily transformed to a classification outcome by applying a prediction cutoff value. The probability score generated by the SVR model can serve as a useful confidence metric for each predicted sample. Owing to these advantages, SVR has been used by several protease cleavage site prediction studies [50][51][52] and was chosen as the baseline algorithm for our transfer learning-based approach, as the choice of the baseline algorithm for transfer learning is not the focus or goal of this study. Here, we used the LibSVM 97 package in regression mode to output a quantitative score for each residue from the substrate sequences. The SVR models were trained using the radial basis function kernel. There are two important parameters, c and γ, that require optimization: c represents the cost factor controlling the trade-off between maximizing the margin and minimizing the error rate, and γ regulates model generalization.
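Two ingredients of this step can be sketched without any external library: the radial basis function kernel, K(x, z) = exp(−γ‖x − z‖²), and the conversion of real-valued SVR outputs into class labels with a cutoff. The cutoff of 0.5 and the function names are hypothetical; the study does not state the cutoff here, and the actual models were trained with LibSVM rather than with this code.

```python
from math import exp


def rbf_kernel(x, z, gamma):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def apply_cutoff(scores, cutoff=0.5):
    """Turn real-valued SVR outputs into 1 (cleavage) / 0 (non-cleavage) labels.

    The default cutoff is a placeholder, not a value taken from the study.
    """
    return [1 if s >= cutoff else 0 for s in scores]


predicted = apply_cutoff([0.2, 0.5, 0.9])
```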
In order to optimize these parameters during model training, the grid-search strategy and the GFO 98 algorithm were used, as well as five-fold cross-validation on the training dataset, to fully optimize the model performance.
In addition, we adopted the following strategies to avoid potential overfitting problems: (i) At each cross-validation step, addition of each of the features to train the model was achieved by using four folds of the dataset, validating the performance of the trained model on the singled-out fold of the dataset. In our effort to effectively minimise the potential overfitting due to biased selection of features, the training and test datasets were kept distinct (i.e. completely separated) for each round of model validation. (ii) During the SVR model training process, we used the model training strategies (grid search and cross-validation) recommended by the LibSVM package to optimize the relevant parameters (namely, c and γ), which allowed us to effectively prevent the overfitting risk by rationally separating the data samples.
Identifying common knowledge from the source domain using the mRMR algorithm. Common knowledge (features) shared by the MMPs in the source domain was identified by feature selection using the mRMR algorithm 82 . We used mRMR to rank and select the top 50 features for all the MMPs in the source domain (5 proteases). The corresponding features were extracted from both source domain samples and target domain samples. The data was treated as the basic information for the model construction.
Target-domain modeling based on transferred common knowledge. The initial feature set for MMP-2, -3, -7, -8, -9, and -12 substrate-cleavage site prediction (target domain) was composed of all common features (knowledge) identified from the source domain. This feature set was used to perform feature selection for each MMP (i.e., MMP-2, -3, -7, -8, -9, and -12) in the target domain. Candidate features of the MMPs in the target domain were identified by first using the mRMR algorithm to generate and rank the top 100 features for MMP-2, -3, -7, -8, -9, and -12 in the target domain. We then combined the common knowledge with each of the top-100 features listed from the target domain to generate a feature list without overlaps. The common features were located at the beginning of each feature list. A feature-selection calculation based on the six feature lists was then performed to determine the optimal feature subsets. During each step of this process, one feature was added and an SVR model was constructed before using the AUC value to evaluate model performance. Subsequent model building incorporated five-fold cross-validation and performance evaluation. After all models were completed, we chose the models for each protease of the target domain exhibiting the highest AUC value and compared the performance of the transfer-learning method with that of the feature-selection method by analyzing ROC curves and AUC values associated with five-fold cross-validation and use of the independent test dataset. The pseudo code describing this process is shown in Algorithm 1 below, which is composed of two sections, including common knowledge from the source domain and model training in the target domain.
Algorithm 1: Framework of knowledge transfer-based model training.
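The target-domain procedure described in the text (merge the common features with the target-domain ranking, then add one feature at a time and keep the best-scoring subset) can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of Algorithm 1: `evaluate` is a hypothetical callable standing in for training an SVR model and returning its five-fold cross-validated AUC, and the toy scoring function exists only to make the example runnable.

```python
def merge_feature_lists(common, target_ranked):
    """Place common (source-domain) features first, then non-overlapping target features."""
    seen = set(common)
    merged = list(common)
    for feature in target_ranked:
        if feature not in seen:
            merged.append(feature)
            seen.add(feature)
    return merged


def forward_select(feature_list, evaluate):
    """Grow the feature set one feature at a time; keep the best-scoring prefix.

    `evaluate(subset)` is a hypothetical scoring function, e.g. the
    cross-validated AUC of a model trained on `subset`.
    """
    best_score, best_subset = float("-inf"), []
    for n in range(1, len(feature_list) + 1):
        subset = feature_list[:n]
        score = evaluate(subset)
        if score > best_score:
            best_score, best_subset = score, subset
    return best_subset, best_score


# Toy run: two informative features, with a small penalty per feature used.
informative = {"c1", "t2"}


def toy_evaluate(subset):
    return len(informative & set(subset)) - 0.1 * len(subset)


merged = merge_feature_lists(["c1", "c2"], ["t1", "c2", "t2"])
best_subset, best_score = forward_select(merged, toy_evaluate)
```

Keeping the common features at the head of the list means every candidate subset contains at least part of the transferred knowledge before any target-specific feature is considered.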
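Performance evaluation. To evaluate the predictive performance of the transfer learning model versus the baseline model, six performance measures were used, including sensitivity, specificity, accuracy, F-score, MCC, and AUC. These measures are defined as follows:
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
$$\text{Specificity} = \frac{TN}{TN + FP}$$
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{F-score} = \frac{2 \times TP}{2 \times TP + FP + FN}$$
$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. AUC is the area under the ROC curve.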
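The six measures can be computed as in the sketch below, assuming their standard definitions. AUC is computed here with the rank-based formulation (the fraction of positive-negative pairs in which the positive receives the higher score, ties counting one half), which equals the area under the ROC curve. The counts and scores in the example are invented for illustration only.

```python
from math import sqrt


def confusion_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, F-score and MCC from confusion counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f_score": 2 * tp / (2 * tp + fp + fn),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def auc(scores, labels):
    """Rank-based AUC: probability that a positive outscores a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented example values, not results from the study.
example_metrics = confusion_metrics(tp=8, tn=27, fp=3, fn=2)
example_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```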