A Highly Efficient Approach to Protein Interactome Mapping Based on Collaborative Filtering Framework

The comprehensive mapping of protein-protein interactions (PPIs) is highly desirable for gaining deep insight into both fundamental cell biology processes and the pathology of diseases. Carefully designed small-scale experiments are highly accurate, but they are expensive and too inefficient to identify interactomes in large numbers. High-throughput screening techniques enable efficient identification of PPIs; yet the desire to extract further useful knowledge from these data leads to the problem of binary interactome mapping. Network topology-based approaches prove to be highly efficient in addressing this problem; however, their performance deteriorates significantly on sparse putative PPI networks. Motivated by the success of collaborative filtering (CF)-based approaches to personalized recommendation on large, sparse rating matrices, this work aims at implementing a highly efficient CF-based approach to binary interactome mapping. To achieve this, we first propose a CF framework for the problem. Under this framework, we model the given data as an interactome weight matrix, from which the feature vectors of the involved proteins are extracted. With these vectors, we design the rescaled cosine coefficient to model the inter-neighborhood similarity among the involved proteins and use it to carry out the mapping process. Experimental results on three large, sparse datasets demonstrate that the proposed approach significantly outperforms several sophisticated topology-based approaches.
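To make the idea of neighborhood similarity on an interactome weight matrix concrete, the following is a minimal sketch in Python. It is not the rescaled cosine coefficient defined in the paper, whose exact form is not given in this section. It shows plain cosine similarity between sparse feature vectors, plus a common CF-style shrinkage by the number of shared neighbors. The function names, the dict-based sparse representation, and the shrinkage formula `support / (support + d)` are all assumptions made for illustration.

```python
import math


def feature_vectors(weighted_edges):
    """Build sparse rows of a symmetric weight matrix from (a, b, w) triples."""
    rows = {}
    for a, b, w in weighted_edges:
        rows.setdefault(a, {})[b] = w
        rows.setdefault(b, {})[a] = w
    return rows


def cosine(u, v):
    """Cosine similarity between two sparse vectors (dict: neighbor -> weight)."""
    shared = set(u) & set(v)
    dot = sum(u[k] * v[k] for k in shared)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def support_shrunk_cosine(u, v, d=1.0):
    """Hypothetical rescaling: damp similarities backed by few shared neighbors.

    This shrinkage is an illustrative assumption, not the paper's formula.
    With d = 0 it reduces to the plain cosine similarity.
    """
    support = len(set(u) & set(v))
    if support == 0:
        return 0.0
    return cosine(u, v) * support / (support + d)
```

In this sketch a larger `d` penalizes pairs whose similarity rests on only a few common neighbors, which is one plausible way a support-dependent rescaling can help on sparse networks.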


Supplementary Information: Hyper-parameter sensitivity tests for RCF on experimental datasets
Note that RCF employs two hyper-parameters, i.e., C Y and d, to control the rescaling process. Because both parameters affect the final output, it is necessary to conduct parameter-sensitivity tests with respect to them. We first fix C Y = 0 and test the performance of RCF with d ranging over [0, 100] to check the effect of the rescaling coefficients that rely on I-Sim supports. We then fix d = 1 and test RCF with C Y ranging over [0, 600] to see its effect. During these tests, the performance of RCF in HTS-PPI assessment and prediction on both datasets is recorded.
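The one-parameter-at-a-time protocol described above can be sketched as follows. This is an illustrative sketch only: `evaluate` is a hypothetical callable standing in for a full RCF run plus scoring, and the keyword names `d` and `c_y` are assumptions. The default fixed values (d = 1, C Y = 0) follow the text.

```python
def one_at_a_time_sweep(evaluate, d_values, cy_values, d_fixed=1, cy_fixed=0):
    """Vary d with C Y fixed, then vary C Y with d fixed.

    `evaluate(d=..., c_y=...)` is a hypothetical stand-in for running RCF
    with the given hyper-parameters and returning a performance measure.
    """
    # First pass: C Y held at cy_fixed, d swept over d_values.
    by_d = {d: evaluate(d=d, c_y=cy_fixed) for d in d_values}
    # Second pass: d held at d_fixed, C Y swept over cy_values.
    by_cy = {c: evaluate(d=d_fixed, c_y=c) for c in cy_values}
    return by_d, by_cy
```

A call such as `one_at_a_time_sweep(run_rcf, range(0, 101, 10), range(0, 601, 50))` would cover the two ranges named in the text, where `run_rcf` and the grid steps are placeholders chosen for this example.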
Assessment. Figures A1-A2 depict the performance of RCF in assessing the reliability of HTS-PPIs on both datasets as d varies. As these figures show, d has a critical impact on the performance of RCF in HTS-PPI assessment. On D1, with d=10, about 77.0% of the top 50% of HTS-PPIs selected by RCF are supported by functional homogeneity, and about 78.4% of them are supported by cellular co-localization; with d=0, however, these two ratios drop to about 70.6% and 73.3%, respectively, as shown in Figures A1(a) and A1(c). Similar behavior can be observed in the test on D2, as shown in Figures A2(a) and A2(c).
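The ratios quoted above are of the form "share of the top-ranked HTS-PPIs that have annotation support". A minimal sketch of that computation is given below. The function name, the `(pair, score)` input layout, and the use of a set of supported pairs are assumptions for illustration; how support by functional homogeneity or co-localization is determined is outside this sketch.

```python
def top_fraction_support(scored_pairs, supported, fraction=0.5):
    """Share of the top-scoring pairs that appear in `supported`.

    scored_pairs: list of (pair, score) tuples, where pair is hashable.
    supported:    set of pairs considered supported by an annotation source.
    fraction:     portion of the ranking to keep (0.5 keeps the top 50%).
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    k = int(len(ranked) * fraction)
    if k == 0:
        return 0.0
    hits = sum(1 for pair, _ in ranked[:k] if pair in supported)
    return hits / k
```

Storing each pair as a `frozenset` of two protein identifiers would keep the lookup independent of the order in which the two proteins are listed.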
During our tests, RCF tends to obtain stronger correlation with function homogeneity and localization coherence as d increases at the very beginning; once d reaches a certain threshold, the outputs of RCF stabilize. This can be seen more clearly from the average values of the correlation with function homogeneity and localization coherence by RCF, as depicted in (b) and (d) of Figures A1-A2.
Figures A3-A4 depict the performance of RCF in the task of HTS-PPI assessment on both datasets as C Y changes. From these results, we see that the performance of RCF also depends on the value of C Y . As shown in Figures A4(a) and A4(c), by setting C Y =600 on D2, about 87.9% of the top 50% of HTS-PPIs selected by RCF are supported by functional similarity, and about 85.8% of them are supported by localization coherence. With C Y =0, however, these two ratios drop to about 82.4% and 81.9%, respectively. Similar behavior can be observed in the tests on D1, as shown in Figure A3.
Meanwhile, the performance of RCF also stabilizes when C Y gets large enough. On both datasets, the correlation with functional homogeneity and localization coherence of RCF stabilizes with C Y bigger than 150, as shown in (b) and (d) of Figures A3-A4.
Figure A3. Performance of RCF in assessment with d=1 and C Y changing on D1.
Figure A4. Performance of RCF in assessment with d=1 and C Y changing on D2.
Figure panels: (a) Functional homogeneity; (b) Average F-homogeneity; (c) Localization coherence; (d) Average L-coherence.
To summarize, in the task of assessment, the performance of RCF depends on both d and C Y . In general, when they are large enough, RCF performs well in assessing the reliability of HTS-PPIs.
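The notion that performance stabilizes once a hyper-parameter is large enough can be made operational in several ways. The sketch below shows one possible criterion, chosen here as an assumption and not taken from the paper: report the smallest tested value from which every later measurement stays within a tolerance `tol` of the final one. The function name and the default tolerance are hypothetical.

```python
def stabilization_point(values, metrics, tol=0.005):
    """Smallest tested value after which all metrics stay within tol of the last.

    values:  hyper-parameter settings in increasing order.
    metrics: performance measured at each setting, in the same order.
    Returns None when no measurements are given.
    """
    if not metrics:
        return None
    final = metrics[-1]
    for i, (value, _) in enumerate(zip(values, metrics)):
        # Check the whole tail, so an early plateau followed by a jump is rejected.
        if all(abs(m - final) <= tol for m in metrics[i:]):
            return value
    return None


# Toy numbers for illustration only, not results from the paper.
toy_values = [0, 1, 2, 3, 4]
toy_metrics = [0.10, 0.30, 0.50, 0.50, 0.50]
threshold = stabilization_point(toy_values, toy_metrics)
```

Checking the full tail of the curve, and not only consecutive differences, avoids reporting a threshold on a curve that is temporarily flat but changes again later.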
Prediction. Figures A5-A6 depict the performance of RCF when predicting missing interactomes as d changes. From these two figures, we see that d plays an important role in the prediction performance of RCF. Generally, its prediction accuracy increases as d increases. As shown in Figures A5(a) and A5(c), with d=100 on D1, about 42.5% of the 20,000 interactomes predicted by RCF are supported by functional similarity, and about 64.4% of them are supported by localization coherence; with d=0, however, these two ratios drop to about 22.5% and 39.6%, respectively. Similar behavior can also be found in the experiments on D2, as shown in Figures A6(a) and A6(c). Generally, with d large enough, the prediction accuracy of RCF tends to stabilize, as shown in (b)