PMLPR: A novel method for predicting subcellular localization based on recommender systems

The protein subcellular localization problem is important because a protein's function depends on the part of the cell in which it acts. Moreover, prediction of subcellular locations helps to identify potential molecular targets for drugs and plays an important role in genome annotation. Most existing prediction methods assign only one location to each protein. However, since some proteins move between different subcellular locations, they can have multiple locations. In recent years, some multiple-location predictors have been introduced, but they are not accurate enough and there is much room for improvement. In this paper, we introduce a method, PMLPR, to predict locations for a protein. PMLPR predicts a list of locations for each protein based on recommender systems, and it can properly handle the multiple-location prediction problem. To evaluate the performance of PMLPR, we considered six datasets: RAT, FLY, HUMAN, Du et al., DBMLoc and Höglund. The performance of this algorithm is compared with six state-of-the-art algorithms: YLoc, WOLF-PSORT, prediction channel, MDLoc, Du et al. and MultiLoc2-HighRes. The results indicate that our proposed method is significantly superior on RAT and FLY proteins and performs comparably on HUMAN proteins. Moreover, on the datasets introduced by Du et al., DBMLoc and Höglund, PMLPR has comparable results. As a case study, we applied the algorithms to 8 proteins which are important in cancer research. The results of the comparison with other methods indicate the efficiency of PMLPR.


Methods
In this section, we present the PMLPR algorithm for the protein localization problem. PMLPR is based on NBI 28 , one of the existing methods for recommender systems. In the first part of the PMLPR algorithm, the NBI method is used. Then, by applying interaction scores between proteins, PMLPR predicts a list of locations for a protein. We first introduce the NBI method and then explain our approach in detail.
NBI. Recommender systems consist of two sets, users and objects. Each user collects a number of objects. The purpose of such systems is to analyze this information and offer new objects to each user. One well-known recommender system is the NBI algorithm introduced by Zhou et al. 28 . NBI is a network-based method which constructs a bipartite network of users and objects. The algorithm then performs a resource-allocation process in two steps: first from objects to users, and second from users back to objects. The amount of resource after the two steps is used to predict new objects for users. NBI and its variations have been used in different research areas, for example recommending new movies, music and Internet bookmarks to users 28 and predicting new drug targets 38 .
Let $P = \{p_1, \dots, p_n\}$ be the set of all proteins and $L = \{l_1, \dots, l_m\}$ be the set of all locations. The PMLPR algorithm comprises four steps as follows:
Step 1. A bipartite graph $G(P, L, E)$ is constructed where, for $p_i \in P$ and $l_j \in L$, the edge $e = (p_i, l_j)$ belongs to $E$ if $p_i$ has already collected $l_j$. In other words, protein $p_i$ belongs to the location $l_j$.
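Step 2. In this step, the personal recommender matrix $R = [r_{ij}]$ with $n$ rows and $m$ columns is calculated as in the NBI method. To obtain $R$, let $A = [a_{ij}]_{n \times m}$ be the adjacency matrix of $G$, where $a_{ij} = 1$ if $p_i$ and $l_j$ are neighbors and $a_{ij} = 0$ otherwise. Define $W = [w_{ij}]_{m \times m}$ as follows:
$$w_{ij} = \frac{1}{d(l_j)} \sum_{t=1}^{n} \frac{a_{ti}\, a_{tj}}{d(p_t)}$$
In this formula, $d(l_j)$ and $d(p_t)$ are the degrees of vertices $l_j$ and $p_t$ in $G$, respectively. To obtain the $k$th row of $R$, the vector $(a_{kj})_{j=1}^{m}$ is defined as the initial resource vector. The $k$th row of $R$ is calculated by
$$r_{ki} = \sum_{j=1}^{m} w_{ij}\, a_{kj}, \qquad i = 1, \dots, m.$$
Step 3. Let $s_{pp_i}$ denote the interaction score between protein $p$ and $p_i$. This score is obtained from the STRING database. The interaction scores are combined with $R$ into a vector $Pred(p)$, whose $i$th entry denotes the predicted score of location $l_i$ for protein $p$.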
Step 4. In this step, a set of locations is predicted for protein $p$. To do this, we divide all the scores by the highest score in $Pred(p)$ and sort them in descending order. We denote these sorted results by $Pred'(p)$, which shows the probability of each location for protein $p$. According to a probability threshold, a set of sorted locations can be assigned to protein $p$. A visualization of these four steps is shown in Fig. 1 18,19 .
Only the reviewed and manually annotated information, known as the Swiss-Prot dataset, is considered. The RAT, FLY and HUMAN datasets contain 7928, 2850 and 20203 proteins, respectively. CD-HIT 40 is used to reduce the redundancy of the protein datasets: proteins with 35% similarity and above are eliminated. After applying CD-HIT, the numbers of proteins in RAT, FLY and HUMAN are 5301, 2474 and 13250, respectively. The protein-location dataset is then updated, and the PMLPR results on this dataset are calculated.
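Steps 1 to 4 of PMLPR, described earlier in this section, can be illustrated with a minimal pure-Python sketch. It is not the authors' C++/R implementation. The NBI part follows the standard resource-allocation formula of Zhou et al.; the way interaction scores are combined with $R$ in Step 3 (a score-weighted sum of the rows of $R$) is an assumption made only for illustration, and the function names `nbi_matrix` and `predict_locations` as well as the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def nbi_matrix(A: Sequence[Sequence[int]]) -> Matrix:
    """Steps 1-2: compute R (n x m) from the protein-location adjacency matrix A."""
    n, m = len(A), len(A[0])
    deg_p = [sum(row) for row in A]                              # d(p_t)
    deg_l = [sum(A[t][j] for t in range(n)) for j in range(m)]   # d(l_j)
    # w_ij = (1 / d(l_j)) * sum_t a_ti * a_tj / d(p_t)
    W = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if deg_l[j]:
                total = sum(A[t][i] * A[t][j] / deg_p[t]
                            for t in range(n) if deg_p[t])
                W[i][j] = total / deg_l[j]
    # r_ki = sum_j w_ij * a_kj (resource of protein k after the two NBI steps)
    return [[sum(W[i][j] * A[k][j] for j in range(m)) for i in range(m)]
            for k in range(n)]


def predict_locations(scores: Sequence[float], R: Matrix,
                      locations: Sequence[str],
                      threshold: float = 0.3) -> List[Tuple[str, float]]:
    """Steps 3-4 for one query protein.

    ASSUMPTION: Pred_i(p) = sum_j s_{p p_j} * r_ji, where scores[j] is the
    interaction score between the query protein and training protein p_j.
    """
    pred = [sum(s * R[j][i] for j, s in enumerate(scores))
            for i in range(len(locations))]
    top = max(pred)
    if top <= 0:
        return []  # no interaction evidence, so nothing is predicted
    # Step 4: divide by the highest score, sort descending, apply the threshold.
    ranked = sorted(zip(locations, (x / top for x in pred)),
                    key=lambda pair: -pair[1])
    return [(loc, prob) for loc, prob in ranked if prob >= threshold]


# Toy data: three training proteins and three locations (hypothetical).
LOCATIONS = ["cytoplasm", "nucleus", "membrane"]
A = [[1, 1, 0],   # p_1: cytoplasm, nucleus
     [1, 0, 0],   # p_2: cytoplasm
     [0, 0, 1]]   # p_3: membrane
R = nbi_matrix(A)
print(predict_locations([0.9, 0.0, 0.3], R, LOCATIONS, threshold=0.3))
```

Lowering `threshold` widens the predicted list, which is the same trade-off the probability threshold controls in Step 4.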
To compare PMLPR with other cutting-edge prediction tools, three other datasets have been used. The first one was introduced by Du et al. In this dataset, all the HUMAN proteins were obtained from the BioGRID dataset and mapped to 18036 proteins in the UniProt dataset.
The two other benchmark datasets are DBMLoc and Höglund. DBMLoc contains 10470 entries annotated with multiple subcellular localizations, all of which are cross-referenced to GO annotations and Swiss-Prot 39 . DBMLoc contains 6 subcellular localizations: Cytoplasm, Mitochondrion, Nucleus, Plasma Membrane, Secreted and ER. Höglund contains 5959 protein entries and 11 subcellular localizations: Chloroplast, Cytoplasmic, ER, Extracellular, Golgi, Lysosomal, Mitochondrial, Nuclear, Peroxisomal, Plasma-membrane and Vacuolar. In Höglund, BLASTClust has been used to cluster the sequences using a 30% threshold for pairwise sequence identity in animal and fungal proteins and a 40% threshold in plant proteins 37 . To compare our results with Du et al., eleven subcellular locations have been considered: Cell membrane, Cytoplasm, ER, Extracellular region, Golgi Apparatus, Mitochondrion, Nucleus, Peroxisome, Lysosome, Endosome and Microsome. For a protein, if a subcellular location has been marked as "Probable", "By Similarity" or "Potential", that subcellular location has been discarded.

Evaluation Method.
To assess the performance of PMLPR against other algorithms, four different measurements are employed.
Measure 1. The first measurement consists of Recall, Precision, OrderedPrecision, MP, F-measure, $F_{ordered}$-measure and $F_{MP}$-measure. In these definitions, $|D|$ denotes the number of proteins. For a protein $p$, let $l(p)$ and $l'(p)$ be the set of locations in which protein $p$ is localized according to the dataset and the ordered set of locations that a prediction algorithm predicts for protein $p$, respectively. In this evaluation, we do not consider the order of the locations predicted for each protein; the performance of an algorithm is evaluated globally, regardless of that order. For example, if the ordered set (nucleus, cytoplasm) is predicted for protein $p$, Precision makes no distinction between (nucleus, cytoplasm) and (cytoplasm, nucleus). However, the algorithm suggests with more reliability that protein $p$ is located in the nucleus (the first prediction) than in the cytoplasm (the second prediction). To take this difference into account, we introduce an extra measurement, Ordered Precision, which is defined on the intersection of $l(p)$ and $l'(p)$ taken as an ordered set. Since Precision and Ordered Precision reflect the size of the prediction and the order of the prediction, respectively, we introduce
$$MP = \frac{Precision + OrderedPrecision}{2} \qquad (9)$$
which is the mean of the two measurements Precision and Ordered Precision. Finally, the $F_{MP}$-measure is defined on the basis of MP.
Measure 2. The second measurement was introduced by Simha et al. 36 . For each location $c$, $Pre_c$ and $Rec_c$ give the Precision and Recall of an algorithm for that location. Moreover, Simha et al. considered the $F_1$-score$_c$, the harmonic mean of Precision and Recall for each location $c$:
$$F_1\text{-score}_c = \frac{2 \cdot Pre_c \cdot Rec_c}{Pre_c + Rec_c}$$
Furthermore, the average $F_1$-score over all locations is calculated.
Measure 3. The third measurement was introduced by Du et al. 25 . They introduced 5 statistical measures: Recall (AIM), Precision (CVR), ACC′, ATR and AFR. The first two statistical measures, Recall and Precision, are introduced in Measure 1.
ACC′, ATR and AFR are the accuracy, absolute true-rate and absolute false-rate, respectively. In their formulation, $|C|$ is the number of subcellular locations.
Measure 4. The fourth measurement is ACC (accuracy), which is slightly different from ACC′.
In k-fold cross-validation, the number of possible selections to divide a benchmark into k classes is immense 41 . Hence, selecting one of these divisions cannot give a fair demonstration of the performance of the algorithm.
The jackknife method considers each protein as a test case: each protein in turn moves from the training dataset to the test dataset. Moreover, this method is more efficient in memory usage. For these reasons, the jackknife method does not have the problems mentioned above and fits our problem well. Thus, in this paper the jackknife method is mainly used, because it represents the performance of the algorithms impartially. In addition, we applied k-fold cross-validation for further confirmation. To evaluate the accuracy of the algorithm, a list of locations is predicted for each test protein according to the training dataset.
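The jackknife protocol can be sketched in a few lines of Python. The set-based Precision and Recall below (the overlap between the true and predicted location sets, averaged over the $|D|$ proteins) are an assumption about how the order-independent part of Measure 1 is computed, and `most_frequent_location` is a deliberately trivial, hypothetical stand-in for a real predictor such as PMLPR.

```python
from collections import Counter
from typing import Callable, Dict, List, Set, Tuple

Annotations = Dict[str, Set[str]]
Predictor = Callable[[str, Annotations], List[str]]


def jackknife(truth: Annotations, predictor: Predictor) -> Dict[str, List[str]]:
    """Predict each protein from a training set that contains all other proteins."""
    predicted = {}
    for protein in truth:
        training = {q: locs for q, locs in truth.items() if q != protein}
        predicted[protein] = predictor(protein, training)
    return predicted


def precision_recall(truth: Annotations,
                     predicted: Dict[str, List[str]]) -> Tuple[float, float]:
    """Order-independent Precision and Recall, averaged over all proteins.

    ASSUMPTION: per protein, Precision = |l(p) & l'(p)| / |l'(p)| and
    Recall = |l(p) & l'(p)| / |l(p)|.
    """
    precision = recall = 0.0
    for protein, true_locs in truth.items():
        pred_locs = set(predicted.get(protein, []))
        hits = len(true_locs & pred_locs)
        if pred_locs:
            precision += hits / len(pred_locs)
        if true_locs:
            recall += hits / len(true_locs)
    return precision / len(truth), recall / len(truth)


def most_frequent_location(protein: str, training: Annotations) -> List[str]:
    """Toy predictor: always return the most frequent training location."""
    counts = Counter(loc for locs in training.values() for loc in locs)
    if not counts:
        return []
    # Ties are broken alphabetically so that the result is deterministic.
    return [min(counts, key=lambda loc: (-counts[loc], loc))]


truth = {"a": {"nucleus"}, "b": {"nucleus", "cytoplasm"}, "c": {"nucleus"}}
predictions = jackknife(truth, most_frequent_location)
print(precision_recall(truth, predictions))
```

Because every protein is held out exactly once, the result does not depend on a random split, which is the property the text relies on when preferring the jackknife test.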
In the PMLPR algorithm, we introduce a reliability threshold for each prediction. According to this threshold, a set of sorted locations is assigned to each protein, and predictions with a low reliability score are excluded. Users can change this reliability threshold in the online version of PMLPR. For example, if a reliability threshold of 80% is used for the sample protein P35213, PMLPR's sorted result is $l'(p)$ = (cytoplasm, membrane), and if a reliability threshold of 30% is used, the sorted list for this protein is $l'(p)$ = (cytoplasm, membrane, nucleus). In this study, to compare the results of our algorithm with the other state-of-the-art methods, we use a reliability threshold of 30%.
Jackknife Test. Table 1 compares the results of the PMLPR algorithm with those of WP (WOLF-PSORT) and PC (prediction channel of compartment) on the three species RAT, FLY and HUMAN.
Measure 1 (Recall, Precision, OrderedPrecision, MP, F-measure, $F_{ordered}$-measure and $F_{MP}$-measure) is used to compare the performance of the algorithms in Table 1. This table shows that on RAT and FLY proteins, PMLPR improved the results in all tests, by at least 0.1 on RAT and at least 0.3 on FLY. For instance, PMLPR improved the $F_{ordered}$-measure and F-measure on RAT proteins by 0.1 and 0.18 with respect to WP, which has the best results among the other methods. On the FLY dataset, PMLPR shows a noticeable improvement in all tests; for example, it improved the $F_{ordered}$-measure by 0.31. On the HUMAN dataset, however, Table 1 shows comparable results: PMLPR achieves the best $F_{ordered}$-measure, while PC shows the highest F-measure and $F_{MP}$-measure. To sum up, Table 1 shows that in most cases the Recall, Precision, OrderedPrecision, F-measure, $F_{ordered}$-measure and $F_{MP}$-measure values are increased by the PMLPR algorithm with respect to the other algorithms, which indicates the efficiency of our method.
The other comparison used to evaluate the performance of PMLPR is the one introduced by Simha et al. 36 , defined in section 3, Measure 2. Table 2 shows the result of this comparison ($F_1$-score$_c$) between different algorithms, for each of the 9 locations on RAT, FLY and HUMAN proteins.
As can be seen from Table 2, PMLPR has the best performance on RAT and FLY proteins, and on HUMAN the results are competitive: WP has the best performance on five of the locations and PMLPR on four. Based on the results of Table 2, PMLPR has either the best performance or a score close to the best on every location. Table 3 shows the $F_1$-score, i.e. the average of $F_1$-score$_c$ over all 9 locations. This table shows that PMLPR has the best overall performance on RAT and FLY and competitive results on HUMAN.
Overall, these tests demonstrate the efficiency of the PMLPR method. PMLPR shows a significant improvement on the RAT and FLY datasets. Furthermore, on the HUMAN dataset, PMLPR has almost the same performance as the other reported state-of-the-art methods.
Since Du et al. did not provide their software, we were unable to obtain their results for individual proteins in order to compute Measure 1 and Measure 2. To compare our method with theirs, we applied the same evaluation test that they performed, so that their reported results could be used in our comparison. The results are shown in Table 4. Since we used a threshold of 0.3 in this test, PMLPR has a wider range of predictions. Consequently, this causes a higher Recall and Absolute False-Rate (AFR) and a lower Precision, Accuracy (ACC′) and Absolute

Cross-validation test on DBMLoc and Höglund datasets.
To further evaluate PMLPR on other species, two well-established datasets, DBMLoc and Höglund, have been used. A 5-fold cross-validation test similar to the one performed by Zhou et al. in their publication has been used. This 5-fold cross-validation test has been repeated thirty times, and the average outcome is presented in Table 5. The ACC used in this evaluation is introduced in Measure 4. With these multi-species datasets, we faced the problem of building the similarity vector between proteins. Clearly, there can be no protein-protein interaction between two proteins from two different species. DBMLoc and Höglund contain proteins from different species, and for some species these two datasets have very few proteins. As mentioned in step 3 in section 2.2, we used the protein-protein interaction dataset STRING to build the similarity vector between proteins. For these datasets, the similarity vector built from STRING was therefore too sparse to be sufficient. To overcome this problem, we decided to use the sequence similarity of these proteins: a Smith-Waterman 42 sequence alignment between proteins has been applied to obtain the protein-protein similarity for these two datasets.
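As an illustration of the sequence-based similarity used in place of STRING scores, the following is a minimal Smith-Waterman local alignment score in Python. The scoring parameters (match, mismatch and a linear gap penalty) are assumptions chosen for brevity; the text does not state which substitution matrix or gap model was used, and protein alignments in practice typically rely on a substitution matrix such as BLOSUM62 with affine gaps.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b."""
    cols = len(b) + 1
    prev = [0] * cols          # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two short, made-up peptide fragments sharing the local segment "KTAYI".
print(smith_waterman_score("MKTAYIAK", "KTAYI"))
```

The pairwise scores obtained this way can play the role of the interaction scores $s_{pp_i}$ when interaction data are too sparse.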
Cross-validation test on RAT, FLY and HUMAN datasets.
We performed a 10-fold cross-validation test on the PMLPR results. Since the implementations of the other existing methods are not available, we were unable to change their training data in order to compare the methods by 10-fold cross-validation. Besides, as the authors do not provide all the details of their implementations in their papers, re-implementing these methods may lead to unreliable results. Hence, we performed 10-fold cross-validation on the PMLPR results thirty times. The average outcome of this test shows a negligible difference between the results of the jackknife and cross-validation tests. Table 6 and Table 7 display the average results of the 10-fold cross-validation test on RAT, FLY and HUMAN proteins. As can be seen from these two tables, the results of the 10-fold cross-validation test are similar to those of the jackknife test. Therefore, we can consider jackknife a reliable evaluation method for this problem.
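The repeated cross-validation described above can be sketched as a generator of train/test splits. This is a generic illustration using only the Python standard library; the function name `repeated_kfold` and the fixed `seed` are hypothetical choices, and the averaging of the evaluation measures over all splits is left to the caller.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def repeated_kfold(items: Sequence[str], k: int = 10, repeats: int = 30,
                   seed: int = 0) -> Iterator[Tuple[int, List[str], List[str]]]:
    """Yield (repeat index, training items, test items) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(repeats):
        shuffled = list(items)
        rng.shuffle(shuffled)                       # a new random division each repeat
        folds = [shuffled[i::k] for i in range(k)]  # k nearly equal-sized folds
        for i, test in enumerate(folds):
            train = [x for j, fold in enumerate(folds) if j != i for x in fold]
            yield repeat, train, test


proteins = [f"protein_{i}" for i in range(25)]
splits = list(repeated_kfold(proteins, k=5, repeats=3))
print(len(splits))
```

Repeating the division with different shuffles reduces the dependence on any single choice of folds.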
Specific proteins. Table 8 shows 8 proteins with their subcellular locations and Gene Ontology information. These proteins are believed to be important in different cancers 43-49 . We selected these proteins in order to have a transparent comparison between PMLPR and the 4 other methods.

Discussion
We presented an efficient protein localization method using personal recommender systems and protein-protein interactions. Using such an approach for the protein localization problem is the main contribution of this paper. The results demonstrate the utility of recommender systems and protein-protein interactions in the prediction process. PMLPR not only improves the results, but is also fast. The algorithm is implemented in C++ and R.
To the best of our knowledge, there is no available subcellular localization prediction software that uses protein-protein interactions, especially for HUMAN proteins. The PMLPR software is available online and is usable by biologists and other scientists.

Future Works
NBI is one of the basic recommender systems; there are more complex recommender systems, such as content-based methods 30 , collaborative filtering 50 and matrix factorization 51 . These methods can be applied to this problem, and they may improve the prediction results.
In recent methods such as MDLoc, the interdependency of the locations has been taken into account, because some locations interact strongly with each other and many proteins travel between these locations constantly. These interdependencies can be used in future studies of this problem. Moreover, a fusion of our method with the best existing methods may improve the results.
Table 9. Results of each method for the 8 selected proteins (Nuc = Nucleus, Cyt = Cytoplasm/Cytosol, Mem = Membrane, Mit = Mitochondrion, ER = Endoplasmic Reticulum, ExR = Extracellular Region, Per = Peroxisome, GA = Golgi apparatus).