CDAP: An Online Package for Evaluation of Complex Detection Methods

Methods for detecting protein complexes from protein-protein interaction networks are of the most critical computational approaches. Numerous methods have been proposed in this area. Therefore, it is necessary to evaluate them. Various metrics have been proposed in order to compare these methods. Nevertheless, it is essential to define new metrics that evaluate methods both qualitatively and quantitatively. In addition, there is no tool for the comprehensive comparison of such methods. In this paper, a new criterion is introduced that can fully evaluate protein complex detection algorithms. We introduce CDAP (Complex Detection Analyzer Package); an online package for comparing protein complex detection methods. CDAP can quickly rank the performance of methods based on previously defined as well as newly introduced criteria in various settings (4 PPI datasets and 3 gold standards). It has the capability of integrating various methods and apply several filterings on the results. CDAP can be easily extended to include new datasets, gold standards, and methods. Furthermore, the user can compare the results of a custom method with the results of existing methods. Thus, the authors of future papers can use CDAP for comparing their method with the previous ones. A case study is done on YGR198W, a well-known protein, and the detected clusters are compared to the known complexes of this protein.

comparison of the number of internal interactions with external interactions, but also the total weight of internal interactions should be higher than the total weight of external interactions. Many algorithms have been proposed to detect protein complexes using this idea, among which the most important ones are the followings: • CFinder 9 forms clusters by a greedy method for finding maximal cliques with at least k vertices.
• ClusterONE 10 uses an iterative greedy method to detect protein complexes. In each iteration, it selects a vertex as a core and extends it through the neighboring vertices with the aim of increasing the density of the growing cluster. • RNSC 11 uses the idea of partitioning the network into subgraphs with the aim of maximizing a cost function.
The obtained subgraphs are introduced as candidates for protein complexes. • IMHRC 12 removes a part of network hubs from the network. Then, by using a greedy evolutionary method such as ClusterONE, computes the initial clusters. Then, it returns some of the removed hubs to the network and applies a filtering step at the end. • MCL 4 utilizes random walk theory and Markov chains rule by iterating two stages of inflation and expansion, which are implemented using matrix multiplication and summation. • RRW 13 forms each cluster by selecting its core and adding the closest and most probable vertices via the random walk. The decisions for adding vertices to clusters are taken based on the maximization of flow in clusters. • ProRank 14 ranks the proteins based on their importance. The essential proteins have high interaction and evolutionary similarity with the others. After ranking proteins, the complexes are detected using the spoke model. • ProRank+ 15 has the similar steps to ProRank. But it has several variations. ProRank requires a similarity matrix as an input which indicated the similarities between proteins, while ProRank+ do not need such input. Furthermore, ProRank+ can detect complexes with overlap and has some post-processing steps for refining complexes. • PEWCC 16 assesses the reliability of the interaction data, then predicts protein complexes based on the concept of weighted clustering coefficient.
Since there are numerous methods for detecting protein complexes, it is essential to evaluate and compare their results. The inherent properties of complexes, such as the overlapping feature, have led to defining various evaluation metrics for comparing and ranking these algorithms. Although different evaluation criteria have been introduced in recent years, none of them are able to evaluate the results of the algorithms comprehensively and cover their shortcomings. Therefore, introducing a new criterion for analysis methods both quantitatively and qualitatively is requisite. Furthermore, an efficient online package for an extensive comparison of protein complex prediction algorithms was not available.
In this paper, we present a survey of previously defined criteria and scrutinize their drawbacks. Also, we define a new evaluation metric for ranking protein complex detection algorithms.
We introduce "CDAP" an online package for analysis and comparison of such algorithms. "CDAP" ranks algorithms on various PPI datasets (Collins, Gavin, Krogan-core and Krogan-extended) based on three well-known gold standard (MIPS, SGD and CYC2008). It is able to rank methods according to each of the evaluation metrics. The functionality of the package can be easily extended to incorporate new datasets, gold standards and compare novel methods with previously developed methods. Moreover, it visualizes the performance of each method based on evaluation metrics, which can help the user to get a better visual intuition of the performance of methods. Additionally, it has several custom options for integrating and filtering the results of methods.

Application
The CDAP server can be accessed from http://www.eslahchilab.ir/softwares/cdap. In order to compare the results of protein complex detection algorithms, the user should set up the desired setting in 5 steps.
www.nature.com/scientificreports www.nature.com/scientificreports/ presentation of results. Detected clusters by the selected algorithm are downloadable from the output section of the server. The computed criteria for each of the algorithms are tabulated and also displayed via multi plots. For each criterion, the comparisons are presented in a tab that contains a multiplot and a table. The multiplot shows the computed criteria of all selected algorithms indexed by various thresholds used in the computation of criteria. The areas under these curves are projected in a table. An example of outputs is shown in Fig. 2. In case of selecting (SN, PPV and ACC) or (AUMF) sets of criteria, no plot is shown, since setting a threshold for computing these criteria is meaningless. Thus, their absolute values are reported in a table.
custom options. In addition to the listed algorithms, the user may submit the detected clusters obtained by any method. It should be noted that when the results of a (or several) new method(s) are uploaded, the selected dataset should be the one that is used the process of obtaining the results. Moreover, this server has the ability to run the mentioned algorithms on other PPIN datasets. So the user can upload custom datasets. It is also possible  www.nature.com/scientificreports www.nature.com/scientificreports/ for the user to upload new gold standards for conducting comparison based on different complexes. The valid formats for uploading custom algorithms, PPINs and gold standards are described in Supplementary Table 2. Additionally, one can extend the server with a new criterion, protein complex detection algorithm, PPIN or gold standard, in such a way that future users can make use of them, by sending them to the contact mail to be adjusted and added to the server. The guidelines are represented in Supplementary Table 3. It is also possible to set a threshold θ for computing criteria based on it. To be more accurate, a cluster will be considered to be similar to a gold standard complex, if their overlap is greater than this threshold.
Moreover, the package has some option for filtering the detected clusters. The user can specify the minimum and maximum size of detected protein clusters. On the top of that, a very practical option of this package is that user can enter the STRING ID of a protein in order to receive all clusters detected by each method that contains the specified protein, an example of such applicable option is presented in the following. Another type of filtering can be done based on reliability, such that the results of methods will be filtered with the clusters that are detected by at least β methods.
integration. The package has the capability to integrate the methods' results. It can yield the intersection or union of the clusters detected by various methods. To do this, it constructs a k-partite graph (k is the number of selected methods). Each part contains the nodes that are representatives of its clusters. There will be an edge between two nodes of two different parts if their corresponding clusters have more than φ% overlap. Then, all maximal cliques with size greater than ψ are considered and the clusters are used for intersection or union. So, the integration of results will contain the clusters with high reliability that are detected (with some alterations) by most methods. φ and ψ can be specified by user, otherwise their default values are considered which are 0.5 and k/2, respectively. Once the union or intersection is selected, the computed criteria for the integrated results are presented in the Results tab as shown in Fig. 2 and its file can be downloaded.

An example of integration.
Integrating the results of several methods may yield better results. As an example, we selected all methods and integrate them with different values of hyperparameters φ and ψ. The best results were obtained in two cases: ProRank is excluded from this analysis, because it uses extra biological information beside topology, while other methods are use only topological data. Thus, integrating ProRank with topological-based method is not suitable. It should be mentioned that when φ = 1, union and intersection yield the same results. The criteria value for Integration11, Integration12 and other methods are presented in Table 1. It can be seen that integration can improve the results of almost all methods. Integration11 improves all methods in terms of MMR, Sn, ACC, Recall, Recall + , and Precision N . Moreover, Integration12 improves all methods in terms of MMR, AUMF, Recall, and Recall N . An example of filtering by reliability. As an example of the package capability of filtering method results by reliability, we applied this filtering on ClusterOne method with β = 3. The criteria values are presented in Table 2. It can been seen that this filtering improved the values of Precision, F -measure, Precision + , F -measure + , and Recall N . It should be noted that the filtering may decrease the quality of performance in terms of some criteria, and it highly depends on the value of hyperparameters.
Complexes of YGR198W: An example of filtering by protein ID. YGR198W/YPP1 is an essential protein in Saccharomyces Cerevisiae. It also performs fundamental functions in Human, Mouse and other species. It is a Cargo-transport protein involved in endocytosis. Furthermore, it plays role in the assembly and recruitment of multiple copies of the kinase into phosphoinositide kinase (PIK) patches at the plasma membrane [17][18][19][20][21] . www.nature.com/scientificreports www.nature.com/scientificreports/ Knowing the complexes that included this protein, helps us to have better insight into its functionality. Four complexes have been reported for this protein in Yeast Resource Center (YRC) and Database of Interacting Proteins (DIP) which are listed in Supplementary Table 4. CDAP can help us to retrieve all detected clusters including this protein quickly using its filtering by protein ID option. It should be noted that this protein is included just in Krogan-core and Krogan-extended PPI datasets. Using CDAP, we executed all methods on these two datasets based on both gold standards.
Among all methods, just Cfinder, RNSC, and MCL returned clusters that contain YGR198W. Each of these methods on each setting found at most one cluster containing YGR198W, but with different size. As it is shown in Table 3, eight clusters were detected when YGR198W was provided as a query, 3 of which share common proteins with reported complexes in YRC; 2 of these clusters are detected by MCL and another cluster is detected by RNSC. It should be noted that both clusters detected by MCL are verified with the same complex, i.e. there is a complex reported in YRC that all its proteins are detected by MCL, but in the second cluster, MCL identified more proteins as the constituent. Five other clusters have not been documented in earlier studies as known complexes of YGR198W. Such clusters may give clue to discovering other complexes of this protein in further studies. The detected clusters are presented in Supplementary Table 5.

Discussion
In our survey of previously defined criteria, it has alluded that some of the criteria values are dependent to threshold θ. The dependency of criteria to the value of θ has several drawbacks.
• By changing the values of threshold θ, the meaning of "true" and "false" for detected clusters is altered.
• Setting an appropriate value for θ is crucial, challenging and hard.
• For a specified value of θ, the amounts of the difference of edge weights to θ are not considered. If two methods have the same bipartite graph that all weights are greater than θ and in the second graph, all edge weights are duplicated. However these two methods do not have equal performance, the calculated values of criteria for these two methods are equal; since in the calculation of threshold-based criteria, the quantities of weights are ignored.
There is a critical point that should be noticed in utilizing threshold dependent criteria that is by changing the value of the threshold, a ranking of method performances will be altered. Therefore, for a specified evaluation metric and the specified value of the threshold, method A may be better than method B; while with a different value of the threshold and the same criterion, method B is better than method A.
In previous papers such as 9,12 evaluations and validations are based on a specific value of the threshold, which is not an appropriate assessment. For example, RNSC is reported in 12 as a method that does not perform well, while this conclusion is based on a specific value of the threshold. Using CDAP, one can realize that RNSC is of the best methods based on the area under curves of the criteria that are listed in Table 4. To overcome this problem, we have used the area under the curve for each threshold dependent criterion. In fact, in CDAP, when the user selects a criterion, the value shown in the output table is the area under the curve of the values of this criterion indexed by values of the threshold.
In addition to the definition of AUMF which is not dependent on the threshold, we revise previous criteria by considering the area under curve concept. For instance, for F -measure criterion, the area under the curve of the graph which plots the F -measure vs. threshold on x-and y-axes is calculated. It is worth mentioning that CDAP can compute the value of threshold dependent criteria for a particular threshold in case of user demand. Furthermore, CDAP computes two new values, namely "AUPR" and "AUPR + ". "AUPR" is the area under curve of Precision vs. Recall and "AUPR + " is the area under curve of Precision + vs. Recall + .    Table 3. Number of verified proteins in each detected cluster. To clarify the notation, x of y means that the method finds a cluster containing YGR198W with y proteins and there exists a known complex that shares x protein in common with the detected cluster.
www.nature.com/scientificreports www.nature.com/scientificreports/ conclusion Methods for detecting protein complexes from protein-protein interaction networks (PPIs) are one of the most critical computational methods which have been widely used recently. Several methods have been proposed for the discovery of protein-protein interactions in recent years, namely Cfinder, MCL, RNSC, RRW, ClusterOne, IMHRC, ProRank, ProRank+, and PEWCC. Evaluating these methods is of high importance. The inherent properties of complexes, such as the overlapping feature, have led to defining various evaluation metrics and criteria for comparing and ranking these algorithms. Many evaluation metrics have been proposed previously such as ACC, PPV, SN, Precision, Recall, F -measure, Precision N , Recall N , and MMR, each of which has some drawbacks and cannot fully reflect the quality of an algorithm. We analyzed these evaluation metrics in this paper and expressed their flaws by presenting some examples. It is reasonable that a protein complex detecting algorithm performs well whenever it returns the clusters the same as the complexes in the gold standard. So, the evaluation metrics should be able to measure the quality and quantity of one-to-one and spanning relationship between the estimated clusters and complexes in the gold standard. In this paper, we introduced a new evaluation metric MMR + Fmeasure + . The newly introduced metric can better express the quality and quantity of the one-to-one and spanning relationship between the clusters and complexes. Nevertheless, its value depends on the value of the threshold. Thus, we defined a new criterion that is invariant with respect to values of threshold, which is the Area Under (MMR + Fmeasure + , θ) curve (AUMF). This criterion has the feature of assessing methods both qualitatively and quantitatively in addition to covering the drawbacks of previous criteria.
We have developed CDAP, an online package for the comparison and visualization of the performance of protein complex detection methods. It is available at http://www.eslahchilab.ir/softwares/cdap. CDAP lets the user select the PPI dataset, gold standard, protein complex detection method and set of criteria for evaluation.
Moreover, we revised previously defined criteria that are dependent on the threshold of θ. In CDAP the calculated value for such criteria is the area under curves of criteria indexed by values of θ.
CDAP also has the ability to compare the results of any custom method when its outputs are uploaded in a valid format on the website. Furthermore, one can upload new datasets or gold standards for training and evaluating algorithms. We tried to include AP 22 and CMC 7 methods in our package, but there was some problem in the compatibility of platforms that we are going to remove them. We are trying our best to add these methods and other newly proposed methods to our package.
One of the most applicable queries that can be done by CDAP is the option of filtering detected clusters by setting minimum or maximum size of their proteins or by specifying a protein as a query, which in this case the CDAP find all clusters detected by methods that contain the specified protein. Another type of filtering is based on the reliability such that filtered results contain only the clusters that are detected by several methods. A useful capability of the package is the integration of methods' results via intersection and union.
Due to the rapid trend of producing and analyzing data, proposing novel algorithms and defining new criteria, anyone can extend CDAP simply by sending new datasets, gold standards, methods or criteria to the contact mail. The guidelines for sending files is described in Supplementary Table 3.
To sum up, CDAP can be a very useful package to compare methods exhaustively and comprehensively and can facilitate the validation process for authors of future papers and help them in comparison of their proposed method to the previous ones.

Methods
Datasets. The datasets of PPINs used in this package are: Collins 5 , Gavin 1 , Krogan-core 23 and Kroganextended 23 . These datasets were used to learn methods in the package. All settings and parameters in every dataset were set based on what the original papers have proposed. We removed self-interactions and isolated proteins from all datasets. The weights of PPIs in the Gavin dataset are Socio-affinity indices, which show the log-odds of number of times that pairs of proteins are observed together as preys, or bait and a prey 24   www.nature.com/scientificreports www.nature.com/scientificreports/ • Gene Ontology-based protein complex annotations from SGD (Saccharomyces Genome Database) 25 . SGD includes Gene Ontology (GO) annotations which provide useful biological information for producing reference complexes. • The catalog of protein complexes from MIPS 26 (the Munich Information Center for Protein Sequences). The MIPS catalog has a hierarchical structure so the complexes may be composed of several subcomplexes 10 . • The benchmark protein dataset CYC2008 that contains manually curated heteromeric protein complexes 27 . previously introduced evaluation metrics. A common approach for comparing and evaluating protein complex detection algorithms is to match their outputs with reference complexes in gold standard sets. We call the true complexes of the gold standard as "reference complex" and detected complexes by methods as "detected clusters". A remarkable point is the existing overlap among protein groups. These overlaps between reference complexes, as well as the existence of similar overlap between the protein clusters detected by the above algorithms, have led to different comparison and evaluation methods. It should be noted that a reference protein complex may correspond to several clusters detected by the algorithms. Conversely, a detected cluster may correspond to more than one reference protein complex. It is worth mentioning that the relation between detected clusters and gold standard complexes is often not complete and is sometimes partial. Thus, some reference complexes may have no relation with any detected clusters and vice versa. These connections can be represented by a bipartite graph, in which the circle nodes (P i ) in the first part are representatives of gold standard complexes and the square nodes (C j ) in the second part are the representatives of detected clusters. Figure 3 shows a schematic illustration of such relations. The number on each node (n i or m j ) shows the number of proteins in that node (cluster or complex). The weight t ij on edge (P i , C j ) reports the number of common proteins between complex P i and cluster C j .
Undoubtedly, these relations must be measured quantitatively and qualitatively. In other words, not only the number of unique matches among the members of the two groups should be taken into account, but also the quality of each matching is very important and should be considered. Three common categories of criteria are introduced in articles for such evaluation. We survey evaluation metrics in two parts: qualitative metrics and quantitative metrics. It is evident that the values of SN and PPV is in range [0,1]. The expected value of PPV is low; since its numerator is the sum of m elements and its denominator is the sum of nm elements. All criteria of this category focus on the quality of detected clusters. In other words, they measure the number of proteins that reference complexes and detected clusters share in common. According to the definitions, SN represents the ratio of proteins in reference complexes that are corresponded to the detected clusters. Ignorance of the impacts of giant components and redundant groups in detected clusters is one of the shortcomings of this benchmark. Suppose that there is a giant component among the detected clusters that contains all the proteins in the reference complexes. It is obvious that the value of SN criterion will be maximized in that case. Figure 4 shows this issue. Additionally, if there are a lot of similar groups between the detected clusters, the value of this criterion is almost equal to value of SN after removing redundant clusters (removing all similar groups except one of them). Although, PPV tends to 1 by repeating similar clusters, while the results of algorithms is not great. This issue can be seen in Figs 5 and 6. www.nature.com/scientificreports www.nature.com/scientificreports/ PPV is defined in such a way that the accumulation of proteins in a group (the existence of a giant component) leads to reduction of PPV by incresing the denominator. Notice to PPV values in Fig. 4. Unfortunately, PPV ignores false negatives, i.e. if there are so many complexes that are not matched by clusters well, this issue do not decrease PPV value, but it declines SN value. Figures 7 and 8 shows this problem. Therefore, both of these criteria has some drawbacks. So the accuracy criterion (ACC) is used to balance the values of the two criteria of SN and PPV.
However, ACC cannot be considered as a flawless benchmark for evaluating and comparing protein complex detection algorithms. As mentioned above, in case of adding redundant clusters to the results, which leads to low performance of algorithm, SN does not change and PPV increases, so ACC ascends. Furthermore, consider a perfect complex detection algorithm that its outputs are exactly the same as the reference complexes in a gold standard. We expect that the criteria values to be the maximum. It is evident that the detected clusters may have some overlaps. For such results, SN value will be maximized, while this is not the case for PPV. Thus, ACC value of this perfect method is not 1. In other words, due to overlaps between the reference complexes, the value of ⁎ t i , which is defined as follows, will be greater than n i .  One can conclude from the equivalence of detected clusters and reference complexes that: In this case, PPV and SN are: In fact, due to the overlap between the reference protein complexes, there exist some proteins that belong to more than one protein complexes. Therefore, the numerator of PPV formula is almost always smaller than its denominator, so the PPV value will never be maximized. This issue is illustrated in Fig. 9. As a result, since we know that the existence of overlap is an intrinsic property of the protein groups, PPV, SN, and ACC criteria are not appropriate metrics. Another important problem is the ignorance of false positives by this category of criteria. This means that detection of any number of incorrect protein clusters have no effect on the values of the criteria in this www.nature.com/scientificreports www.nature.com/scientificreports/ category. Thus, methods that detect a huge number of clusters (containing a lot of incorrect clusters) often acquire high values of ACC. Figure 10 shows this drawback. In sum, false positives does not affect PPV and SN. And False negatives does not affect PPV, but they decline SN. Therefore, it is essential to analyze methods with another criterion.
• Pre N , Rec N , F -measure N Zaki et al. 14 defined a set of criteria which is almost analogous to the previous set (PPV, Sn, ACC):  In this definition, NA(P i , C j ), which is known as neighborhood affinity score, shows how the reference complex P i has been able to match the detected cluster C j 3 . By applying a weighted bipartite graph matching algorithm on this graph, a one-to-one mapping between the first and second sections is obtained. The MMR value is calculated by normalizing the total weight of the maximal matching edges on the number of reference complexes, so its value is in range [0, 1]. The defect of this criteria is that it ignores false positives; i.e. the detection of many spurious clusters do not affect the value of MMR. Because adding similar clusters to a set of detected clusters do not change the maximum matching. Thus, both the numerator and denominator do not change. Dependency of this criterion to the value of θ is its another disadvantage.

Quantitative metrics • Precision, Recall and F -measure
This category of criteria includes F -measure, Precision and Recall. F -measure is the harmonic mean of Precision and Recall criteria. These criteria are defined as follows: where  and  are representatives of the sets of reference complexes and detected clusters, respectively. Zaki et al. used slightly different versions of Precision and Recall as follows 14 : Both of these category of criteria attempt to examine the results of protein complex detection algorithms quantitatively. In fact, in these set of criteria, the main attention is to the number of detected clusters and less attention is paid to the internal quality of these clusters. Thus, the matching size more than a specified threshold cannot influence the quality measured by these criteria. This category has several drawbacks: • The existence of exorbitant overlapping complexes does not have a significant effect on quality reduction. In other words, the low values for θ threshold increase the error rate when the size of the groups get bigger. For example, if θ = . 0 25, for confirmation of a 10-member cluster, it is sufficient that half of its proteins exist in another 10-member gold standard complex. Thereby, the mentioned algorithm can detect similar groups with the mentioned core, all of which have 5 members in common with the gold standard, or detect some redundant groups. Thus, all redundant groups are validated. Although Recall does not change but Precision increases which lead to an increase in F -measure. The reason for this drawbacks is that several clusters can be validated by one complex • The values of the mentioned criteria is dependent on the value of θ.
In order to address this issue, we introduce a new set of criteria.
A new set of evaluation metrics. As mentioned in the previous section, each set of criteria has some drawbacks. One way to have a more expressive set of criteria is to gather various criteria. In order to cover MMR (2019) 9:12751 | https://doi.org/10.1038/s41598-019-49225-7 www.nature.com/scientificreports www.nature.com/scientificreports/ shortcomings, ACC and Recall have been used in 2 . As explained above, ACC has a fundamental defect due to the presence of overlap between groups. In addition, ACC measures the groups qualitatively similar to MMR. Thus, there is no need for ACC criterion while using MMR. Moreover, usage of Recall criterion solely and without Precision criterion does not make sense. Therefore, the use of F -measure instead of the Recall criterion is more logical. One can add quantitative evaluation to the main evaluation process by using F -measure; however, all the deficiencies have not been resolved yet due to the mentioned defects for the second category. We used MMR criterion alongside the modified F -measure to have a more accurate evaluation system. The modified F -measure, Precision, and Recall are represented by F -measure + , Recall + , and Precision + , respectively.  In this definition, θ Match( , , ) P C contains the set of edges obtained by applying maximum non weighted bipartite graph matching algorithm on the bipartite graph that consists of edges between reference complexes and detected clusters that has the affinity score greater than θ. Since both MMR and F -measure + are in range [0, 1], we can consider the sum of these two criteria as the benchmark for ranking protein complex detection methods.
Obviously, the best algorithm for identifying protein complexes is an algorithm that has a one-to-one and spanning relationship between the detected clusters and the complexes within the gold standard. Therefore, this metric is suitable for evaluation, which can measure the quantity as well as the quality of this relation, while the old metrics do not have this property. The criterion introduced in this section explicitly examines the quality and quantity of such a relationship. It can be inferred that this criterion takes its maximum value 2 when the algorithm returns the clusters the same as the gold standard complexes. The values of this criterion for all methods applied on Collins dataset and compared with Mips gold standard is shown in Fig. 11 (ProRank is not considered in this comparison since it uses additional biological data).
This criterion still depends on the value of θ. Thus, for having a metric that is invariant with respect to θ, we consider the Area Under (MMR + Fmeasure + , θ) curve (AUMF) as a criterion for evaluating the results of the algorithms. The values of AUMF for 8 methods with default parameter values on each pair of dataset and gold standard are reported in Supplementary Table 1.

Data Availability
The package is implemented in Java and from the "CDAP" website (http://www.eslahchilab.ir/softwares/cdap) Figure 11. The values of the MMR + Fmeasure + for all values of θ.