Using contrast patterns between true complexes and random subgraphs in PPI networks to predict unknown protein complexes

Most protein complex detection methods use unsupervised techniques to cluster densely connected nodes in a protein-protein interaction (PPI) network, even though many true complexes are not dense subgraphs. Supervised methods have been proposed recently, but they do not explain why a group of proteins is predicted to be a complex, and they have not investigated how to detect new complexes of one species by training the model on the PPI data of another species. We propose a novel supervised method to address these issues. The key idea is to discover emerging patterns (EPs), a type of contrast pattern, which can clearly distinguish true complexes from random subgraphs in a PPI network. An integrative score of EPs is defined to measure how likely a subgraph of proteins is to form a complex. New complexes can thus grow from our seed proteins by iteratively updating this score. The performance of our method is tested on eight benchmark PPI datasets and compared with seven unsupervised methods, two supervised methods and one semi-supervised method under five standards for assessing the quality of the predicted complexes. The results show that in most cases our method achieved better performance, sometimes significantly.


Important properties of subgraphs

Node size
This group includes one feature which represents the number of nodes in the subgraph. The feature is denoted as nodeSize.

Graph density
This group includes one feature which represents the density of the subgraph. The feature is denoted as graphDensity.

Degree statistics
This group includes four features: mean degree, degree variance, degree median and degree maximum. They are denoted as meanDegree, varDegree, medianDegree and maxDegree, respectively.

Degree correlation statistics
This group includes three features: mean degree correlation, degree correlation variance and degree correlation maximum. They are denoted as meanDegreeCorrelation, varDegreeCorrelation and maxDegreeCorrelation, respectively.

Clustering coefficient statistics
This group includes three features: mean clustering coefficient, clustering coefficient variance and clustering coefficient maximum. They are denoted as meanClusteringCoeff, varClusteringCoeff and maxClusteringCoeff, respectively.

Topological coefficient
This group includes three features: mean topological coefficient, topological coefficient variance and topological coefficient maximum. They are denoted as meanTopologicCoeff, varTopologicCoeff and maxTopologicCoeff, respectively.

First Eigenvalues
This group includes three features representing the three largest singular values of the candidate subgraph's adjacency matrix. They are denoted as eigenValue_1, eigenValue_2 and eigenValue_3, respectively.

Protein weight/size statistics
This group includes four features representing average and maximum protein length and average and maximum protein weight. They are denoted as aveLength, maxLength, aveWeight and maxWeight, respectively.
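Several of the topological feature groups above can be illustrated with a short sketch. The code below is only an assumption-laden illustration, not the authors' implementation: the function name `subgraph_features`, the edge-list input format, the use of population variance, and the convention that a node with fewer than two neighbours has clustering coefficient 0 are all choices made here for the example. It covers nodeSize, graphDensity, the degree statistics and the clustering coefficient statistics; the remaining groups are left out.

```python
from itertools import combinations
from statistics import mean, median, pvariance


def subgraph_features(nodes, edges):
    """Return a few topological features of the subgraph induced by `nodes`.

    `nodes` is a non-empty iterable of protein identifiers and `edges` is an
    iterable of (u, v) pairs from the whole network (hypothetical format).
    """
    nodes = set(nodes)
    # Adjacency sets restricted to the induced subgraph.
    adj = {v: set() for v in nodes}
    for u, v in edges:
        if u in nodes and v in nodes and u != v:
            adj[u].add(v)
            adj[v].add(u)

    n = len(nodes)
    m = sum(len(nb) for nb in adj.values()) // 2  # each edge is counted twice
    degrees = [len(adj[v]) for v in nodes]

    def clustering(v):
        # Fraction of neighbour pairs of v that are themselves connected.
        k = len(adj[v])
        if k < 2:
            return 0.0  # assumed convention for low-degree nodes
        links = sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
        return 2.0 * links / (k * (k - 1))

    coeffs = [clustering(v) for v in nodes]
    return {
        "nodeSize": n,
        "graphDensity": 2.0 * m / (n * (n - 1)) if n > 1 else 0.0,
        "meanDegree": mean(degrees),
        "varDegree": pvariance(degrees),  # population variance (assumption)
        "medianDegree": median(degrees),
        "maxDegree": max(degrees),
        "meanClusteringCoeff": mean(coeffs),
        "varClusteringCoeff": pvariance(coeffs),
        "maxClusteringCoeff": max(coeffs),
    }
```

For a triangle with one pendant node, for example, this sketch gives nodeSize 4, graphDensity 4/6 and maxDegree 3.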

Nondense complexes
Three complexes are selected from the MIPS complex catalogue database. Supplementary Table 4 shows the density of these three complex subgraphs in the experimental PPI networks. These three complexes do not appear in the Collins and Gavin PPI networks. An asterisk in the table indicates that the corresponding complex does not appear in the corresponding PPI network.

The size distribution of the non-complex subgraphs
Subgraphs generated by randomly selecting nodes in a given PPI network satisfy the following conditions.
(1) Subgraphs are produced by a random generator, and we ensure that no generated subgraph is a true complex.
(2) The size range of the generated subgraphs is the same as that of the known complexes appearing in the PPI network. The sizes of the true complexes in MIPS, SGD and TAP06 each follow a power-law distribution, and the generated subgraphs follow the same power-law distribution.
(3) A generated subgraph may or may not be connected; some proteins in a random subgraph may not be directly connected to the rest of the subgraph.
A connected subgraph may be an unknown complex, even if it is connected only in a linear shape (see Supplementary Figure 1(c)). If an unknown complex is selected as a random subgraph, it is treated as a negative example, although ideally it should be treated as a positive instance.
In other words, an unknown complex subgraph is noise. However, NEPs are a special kind of EP that can handle noise in the dataset. Thus, we do not consider whether a random subgraph is connected. The size distributions of the true complexes and of these random subgraphs in the experimental PPI networks are shown in Supplementary Figure 2.
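The sampling conditions above can be sketched as follows. This is a minimal illustration under stated assumptions, not the generator used in the experiments: the function name `sample_negative_subgraphs` and its parameters are hypothetical, and instead of fitting a power law the sketch draws each size from the empirical sizes of the known complexes, which keeps the same size range and approximately the same size distribution. Connectivity is deliberately not checked, in line with condition (3).

```python
import random


def sample_negative_subgraphs(proteins, true_complexes, n_samples, seed=0,
                              max_attempts=100000):
    """Draw random protein sets that are not known complexes.

    `proteins` is the node set of the PPI network and `true_complexes` is a
    list of protein sets (both hypothetical input formats).
    """
    rng = random.Random(seed)
    proteins = sorted(proteins)  # fixed order so a given seed is reproducible
    known = {frozenset(c) for c in true_complexes}
    # Empirical sizes of known complexes, capped at the network size.
    sizes = [len(c) for c in true_complexes if 0 < len(c) <= len(proteins)]
    if not sizes:
        raise ValueError("no usable complex sizes to sample from")

    negatives = []
    attempts = 0
    while len(negatives) < n_samples and attempts < max_attempts:
        attempts += 1
        size = rng.choice(sizes)  # condition (2): same size distribution
        candidate = frozenset(rng.sample(proteins, size))
        if candidate in known:    # condition (1): never a known true complex
            continue
        negatives.append(candidate)
    return negatives
```

A sampled set can still coincide with an unknown complex, which is exactly the label noise discussed above.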

Support threshold parameters setting for mining NEPs
Mining NEPs of complexes and non-complexes requires two support thresholds, each greater than 0. Different threshold values generate different numbers of NEPs. The NEPs generated for the given support thresholds should have properties such as discriminative power and simplicity [2][3].
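To make the role of the two support thresholds concrete, the following sketch checks only the support condition for a single candidate pattern: its support must be at least a minimum threshold in the target class and at most a maximum threshold in the other class. It is an illustration under assumptions, not the mining algorithm itself: the names `support`, `passes_support_thresholds`, `min_target_support` and `max_background_support` are hypothetical, instances are modelled as sets of discretized feature items, and further requirements that the cited NEP definitions may impose (such as minimality of the pattern) are not checked.

```python
def support(pattern, instances):
    """Fraction of instances that contain every item of `pattern`."""
    pattern = frozenset(pattern)
    if not instances:
        return 0.0
    return sum(1 for inst in instances if pattern <= inst) / len(instances)


def passes_support_thresholds(pattern, target, background,
                              min_target_support, max_background_support):
    """True if `pattern` is frequent in `target` and rare in `background`."""
    return (support(pattern, target) >= min_target_support
            and support(pattern, background) <= max_background_support)


# Toy instances: each set holds hypothetical discretized feature items.
complexes = [
    {"graphDensity=high", "meanDegree=high"},
    {"graphDensity=high", "meanDegree=high", "nodeSize=small"},
    {"graphDensity=high"},
]
non_complexes = [
    {"nodeSize=small"},
    {"meanDegree=high", "nodeSize=small"},
    {"graphDensity=low", "nodeSize=small"},
    {"graphDensity=low"},
]
```

With these toy instances, the pattern {graphDensity=high, meanDegree=high} has support 2/3 among the complexes and 0 among the non-complexes, so it passes for any minimum threshold up to 2/3.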
Given a PPI network, the instances constructed from true complexes in the PPI network are regarded as the positive class, while the instances constructed from non-complexes (random subgraphs) in the PPI network are regarded as the negative class. In our experiments, the minimum support threshold in the complexes and the maximum support threshold in the non-complexes for mining NEPs of complexes; the minimum

Parameter settings of ClusterEPs and other tested algorithms
When searching for complexes with ClusterEPs, the merging threshold was set to 0.9 and the cluster size threshold was set to 100. We did not tune these parameters to any particular dataset in our experiments.
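The role of the size threshold can be pictured with a simplified greedy growth loop in which a cluster expands from a seed protein for as long as a score improves and the size cap has not been reached. This is a sketch under strong assumptions and not the ClusterEPs search procedure: the function `grow_cluster`, the adjacency-dictionary input, the add-only strategy, and the toy score (internal edges divided by cluster size) are all hypothetical stand-ins, and the merging step governed by the 0.9 threshold is not modelled.

```python
def grow_cluster(seed, neighbors, score, max_size=100):
    """Greedily add the neighbouring protein that most improves `score`."""
    cluster = {seed}
    current = score(cluster)
    while len(cluster) < max_size:
        # Proteins adjacent to the cluster but not yet in it.
        frontier = set()
        for v in cluster:
            frontier |= neighbors.get(v, set())
        frontier -= cluster

        best, best_score = None, current
        for v in sorted(frontier):  # sorted for deterministic tie-breaking
            s = score(cluster | {v})
            if s > best_score:
                best, best_score = v, s
        if best is None:            # no neighbour improves the score
            break
        cluster.add(best)
        current = best_score
    return cluster


# Toy network: a triangle A-B-C with a pendant protein D attached to C.
toy_neighbors = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}


def toy_score(cluster):
    # Placeholder score: internal edges per protein (not the EP-based score).
    internal = sum(len(toy_neighbors[v] & cluster) for v in cluster) // 2
    return internal / len(cluster)
```

Starting from A, the toy example stops at the triangle, because adding D does not raise the placeholder score; with `max_size=2` growth stops earlier because of the size cap.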
Except for ClusterONE, the parameters of the other six tested algorithms were preliminarily optimized for each specific dataset in order to obtain the best possible results.
These optimized parameters were obtained in [4] by trying all possible combinations.
In our experiments, we used the optimized parameter values reported in [4] for each algorithm on each dataset.

Mined NEPs in each dataset
The NEPs mined in each PPI network are listed in Supplementary Tables 11-16.

Results on five yeast PPI Datasets
Our experiments were conducted on two different personal computers (PCs). The specifications of these two PCs are as follows: (1) PC1: 4 Intel(R) Core(TM) i5-3230 CPU, 2.6GHz each and 8GB of RAM. Operating system: Windows 8.
We ran ClusterEPs 10 times on each personal computer for each dataset. The results obtained using MIPS and SGD on the five datasets are shown in Supplementary Tables 17-26. The first column of each

We use a supervised learning method to search for complexes in ClusterEPs. In our method, the positive dataset was constructed from true complexes in , and the negative dataset was constructed by randomly selecting non-complex subgraphs from . Since ⊆ , a random subgraph would be likely selected from . For a subgraph s ∈ , if s is selected to construct a negative instance, then we treat s as a non-complex even though s is a true complex. That is, s is regarded as noise (a false negative). Although NEPs account for potential noise, a large amount of noise may affect the performance of ClusterEPs.
As more and more complexes are confirmed in the future, the number of complexes in will become larger and the number of complexes in will become smaller. This means that the negative datasets would contain less and less noise, and the performance of ClusterEPs would improve.

Figure 5). This means that it has a higher boundary weight. Owing to these facts, ClusterONE was not able to recover this complex exactly; instead, it added 10 nearby proteins, almost doubling the true size of the complex.

The DASH complex was thoroughly examined in [4]. Among the other seven methods, only ClusterONE is able to detect this complex completely and correctly [4].
ClusterEPs is also able to detect this complex completely and correctly. The result is shown in Supplementary Figure 6.
The RSC and the SWI/SNF complexes, as a particular pair of overlapping complexes, were also investigated in [4]. Both ClusterEPs and ClusterONE obtained predictions closest to the original RSC and SWI/SNF complexes. The results are shown in Supplementary Figure 7. The other six methods were not able to achieve good results [4].
(6) organism/annotation: Saccharomyces cerevisiae is selected for the three yeast complexes, and Homo sapiens is selected for the two human complexes.