Nature Biotechnology
22, 98 - 103 (2003)
Published online: 7 December 2003; | doi:10.1038/nbt921
Unraveling protein interaction networks with near-optimal efficiency
Michael Lappe1,3 & Liisa Holm2
1 EMBL Outstation, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK. 2 Institute of Biotechnology, PO Box 56 (Viikinkaari 5), FI-00014 University of Helsinki, Finland. 3 Present address: University of Cambridge, Department of Biochemistry, Tennis Court Road, Cambridge CB2 1GA, UK.
Correspondence should be addressed to Michael Lappe (lappe@ebi.ac.uk).
The functional characterization of genes and their gene products is the main challenge of the genomic era. Examining interaction information for every gene product is a direct way to assemble the jigsaw puzzle of proteins into a functional map. Here we demonstrate a method in which the information gained from pull-down experiments, in which single proteins act as baits to detect interactions with other proteins, is maximized by using a network-based strategy to select the baits. Because of the scale-free distribution of protein interaction networks, we were able to obtain fast coverage by focusing on highly connected nodes (hubs) first. Unfortunately, locating hubs requires prior global information about the network one is trying to unravel. Here, we present an optimized 'pay-as-you-go' strategy that identifies highly connected nodes using only local information that is collected as successive pull-down experiments are performed. Using this strategy, we estimate that 90% of the human interactome can be covered by 10,000 pull-down experiments, with 50% of the interactions confirmed by reciprocal pull-down experiments.
High-throughput pull-down experiments (such as yeast two-hybrid and tandem-affinity purification with subsequent mass-spectrometry (TAP-MS)) make it possible to determine the neighborhood of a given 'bait' protein within a protein-protein interaction network within a reasonable margin of error. Yet, despite some significant advances, a vast majority of protein-protein interactions either remain to be experimentally determined or have not yet been made available in public databases. How can coverage of protein interaction space be extended through a minimal number of pull-down experiments? To simulate the discovery of an unknown interaction network, we used real interaction data sets from yeast, and explored them by virtual pull-down experiments. Although no complete data set of interactions is available for a single organism yet, all observations indicate that protein interaction networks are scale-free (see Supplementary Fig. 1 online). Because randomly selected subsets of edges from a scale-free network also follow a power-law distribution, and all interaction data sets available represent different subsets of the overall interactome, we concluded that interaction space as a whole has the same distribution shape as any major subset. The interaction networks of higher organisms, such as humans, can thus be assumed to be scale free, and simulation results based on incomplete interactomes should provide a reasonable approximation of the complete interaction network.
We define protein interaction space as the set of all specific interactions among proteins in a cell. We model interaction networks as graphs where all proteins are nodes linked by edges representing the protein interactions. Thus, coverage of interaction space translates to edge coverage in graphs (Fig. 1). Here, edge coverage means selecting a subset of nodes such that every edge is incident to at least one node in this subset (in graph-theoretic terms, a vertex cover). There are many different solutions for finding a set of nodes covering all edges (interactions) within the same graph (interaction network). In biological terms, a pull-down experiment reveals the adjacent edges (interacting proteins) of one node (the bait). Consequently, a minimum edge-covering set would allow us to map the interactome with the minimal experimental effort (minimal number of baits). The bad news here is that the problem of finding the minimum edge-covering set has been shown to be NP-complete, and hence no efficient exact algorithm is known for it even on a graph of known topology1,2. In the biological setting, matters are complicated further, because the topology of the interactome graph is initially unknown.
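The covering condition is easy to check mechanically, even though finding the smallest covering set is hard. The following is a minimal sketch on a hypothetical toy network (not the graph of Fig. 1, whose edge list is not given in the text); the function names `is_covering_set` and `minimum_covering_set` are our own, and the exhaustive search is only meant to illustrate why the exact problem does not scale.

```python
from itertools import combinations

def is_covering_set(edges, baits):
    """True if every edge has at least one endpoint in `baits`."""
    baits = set(baits)
    return all(a in baits or b in baits for a, b in edges)

def minimum_covering_set(nodes, edges):
    """Exhaustive search over subsets by increasing size.

    Exponential in the number of nodes, so usable on toy graphs only.
    """
    for size in range(len(nodes) + 1):
        for subset in combinations(nodes, size):
            if is_covering_set(edges, subset):
                return set(subset)

# Hypothetical toy network: hub 'c' with three partners, plus a chain d-e-f.
EDGES = [("a", "c"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
NODES = sorted({n for edge in EDGES for n in edge})

print(is_covering_set(EDGES, {"c", "e"}))   # every edge touches c or e
print(is_covering_set(EDGES, {"a", "b"}))   # edge c-d is left uncovered
print(minimum_covering_set(NODES, EDGES))
```

On this toy graph two baits suffice, and both of them are the best-connected nodes, which is the intuition the strategies below build on.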
Figure 1. Edge covering and the graph-weight of a node ordering.
(a) Two edge-covering sets (C1, C2) for the same graph. The graph G = (V, E) consists of nine nodes (V = {a, b, c, d, e, f, g, h, i}) linked by ten undirected edges (E). The edge-covering sets are indicated by the cone-shaded nodes. Both sets C1 (left) with six nodes (a, b, d, e, f, h) and C2 (right) with three nodes (c, f, h) satisfy the condition that every edge is adjacent to a node in C1 or C2, respectively. (b) In the adjacency matrix representation of the graph, an entry m_ij is marked (in red) to denote an interaction between proteins i and j. The matrix M_C2 shows that C2 = (c, f, h) is a covering set, because the respective rows and columns (gray) of this set cover all interactions. The matrix M_ord illustrates how the weight of a graph is calculated from any given ordering of the nodes. In this example, the virtual pull-down experiments are conducted such that 1st: c, 2nd: f, 3rd: h, 4th: a, and so on are used as bait. The ordering of the nodes is marked on the diagonal of M_ord. The upper right triangle of M_ord denotes at what time-step an interaction is seen (for the first time), while the lower left triangle of the matrix records at what time-step the interaction is confirmed (caught as prey the second time). (c) A performance plot is generated by plotting the number of edges seen and confirmed by time-step x. The more edges that are seen and confirmed earlier, the lower the sum of the weight of all entries in M_ord becomes. Hence, this graph-weight (the sum over all entries in M_ord) can be used as an indicator for the overall efficiency of an ordering in revealing the topology of the network.
Here we exploit the fact that a subset of highly connected nodes covers a larger portion of the network than another subset of the same size consisting of nodes that are less connected. This allows our strategy to tackle a computationally hard problem effectively by computationally relatively simple means. We assume cost and time to be in a constant proportional relationship to the number of pull-down experiments performed, which is equivalent to the number of proteins used as bait. We are well aware that this simplification leaves out some experimental detail, but it does lead to a concise model of the overall process of information gain in proteomics. The time-course efficiency of covering the interactome depends only on the order in which the baits are used. The performance of different strategies in seeing and confirming interactions was measured on four different data sets (see Methods for a more detailed description of the strategies and data sets used). There are two ways to define edge coverage: (i) an interaction has been seen or detected once by using one of its interactors as bait, or (ii) an interaction has been confirmed or detected in two directions after using both interactors as bait. This last point directly addresses the presence of a significant number of false-positive and false-negative interactions in current experimental data sets, so that an interaction is usually not regarded as confirmed unless it has been detected at least twice in independent experiments.
We implemented a number of strategies to achieve edge coverage. A baseline for the performance of all other strategies is defined by a random ordering of baits. For a random strategy one needs no prior information about the proteins, as baits are selected at random from the set of proteins that have not yet been used as bait. An upper limit on the performance of any strategy was determined by greedy strategies, which assume complete information about the network topology and choose, at each time step t, the bait that will yield the highest number of novel edges either seen or confirmed. When optimizing for detecting interactions, the greedy strategy already identifies all interactions with as few as 30% of nodes used as bait and is significantly better than random in confirming interactions (Fig. 2 and Supplementary Fig. 3 online). Following the insight that the highly connected nodes (hubs) are the most useful baits, the prior information requirements can be reduced to information only about the degree of each node (that is, the total number of interactors of each protein, regardless of who its partners are). The degree-guided strategy follows a predetermined ordering of the baits where the highly interacting proteins (hubs) are analyzed first. This method is slightly inferior to greedy-seen or greedy-confirmed methods in seeing or confirming interactions, respectively. Still, the degree-guided strategy yields an efficient approximation to the minimum covering set. In terms of time-course efficiency, it represents a good compromise in the apparent trade-off between seeing and confirming interactions, as it maximizes the sum of the integrals of both performance curves (seen + confirmed).
Figure 2. Network coverage achieved by different strategies.
(a) The performance curves measured in the DIP data set of the following different strategies are plotted: greedy seen, greedy confirmed, degree guided, pay-as-you-go and random. The average of ten independent simulations is shown for every strategy, together with the maximum and minimum deviation indicated by the error bars. For all five strategies their performance in seeing (red curves) and confirming (blue curves) interactions is shown (see Supplementary Fig. 3 online for the corresponding figures for the data sets CZ, CORE and MDS). An upper limit is formed by the greedy-seen strategy in seeing interactions, which uses the information on the global network topology. The degree-guided strategy roughly realizes the so-called '80/20' rule by covering about 80% of the interaction network using less than 20% of the proteins as bait. The pay-as-you-go strategy consistently performs as well as the greedy-confirmed strategy in confirming and seeing interactions, even though the pay-as-you-go strategy has no prior information about the network's topology. It is noteworthy that the pay-as-you-go strategy slightly outperforms the degree-guided strategy in confirming interactions. The greedy-seen strategy performs less well in terms of confirmed edges. The dent in the performance curve (at about 40% of nodes used as bait) coincides with the point when all edges have been seen by this strategy and there are only interactions left to be confirmed (see Methods for details). In all data sets, the observed performance of the random strategy in confirming interactions roughly follows a quadratic increase (y = x^2), where x is the fraction of the proteome used as bait and y the fraction of interactions confirmed up to this point. Using the performance of the random strategy as a baseline, the relative performance advantage of the pay-as-you-go strategy over the random strategy is measured as the ratio of the integrals of their respective performance curves. (b) Plot of the relative performance in seeing and confirming a certain fraction of the edges of the networks. The pay-as-you-go strategy reaches 50% coverage in confirming interactions at least twice and up to 4 times faster than random, whereas in the initial stages it is between 6 and 13 times faster. Up to 80% coverage is reached by the pay-as-you-go strategy about 1.8 times faster on the CZ data, 2.5 to 3 times faster on the DIP and CORE data, and 4 times faster on the MDS network. (c) Plotting the relative speed over the fraction of nodes used as bait demonstrates the huge initial coverage provided by our strategy. The initial coverage and performance advantage of the pay-as-you-go strategy is even more pronounced in confirming interactions. On all data sets, about ten times more interactions are confirmed by the pay-as-you-go strategy at 20% of proteins used as bait than by the random strategy.
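The quadratic shape of the random baseline noted in the legend of Figure 2 is what uniform bait selection would predict. As a short back-of-envelope derivation (our own sketch, not taken from the paper, assuming baits are drawn uniformly without replacement, no experimental error and no self-interactions), let n be the number of proteins, m the number of baits used so far and x = m/n. An edge is confirmed once both of its endpoints have been used as bait, and seen once at least one of them has:

```latex
% n = number of proteins, m = baits used so far, x = m/n (symbols are ours)
\begin{align*}
P(\text{edge confirmed}) &= \frac{m(m-1)}{n(n-1)} \;\approx\; x^{2}, \\
P(\text{edge seen})      &= 1 - \frac{(n-m)(n-m-1)}{n(n-1)} \;\approx\; 1 - (1-x)^{2} \;=\; 2x - x^{2}.
\end{align*}
```

Under these assumptions the expected curves do not depend on the degree distribution at all, which is why a random ordering is a natural baseline against which to measure strategies that exploit the network structure.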
At this point, the conclusion is that achieving completeness in functional genomics seems to hinge on establishing a reliable ab initio estimate of the degree of all proteins. However, to date no such sufficiently reliable measure has been available3,4,5. Below, we present a useful strategy that is entirely different from the ones discussed above in that it requires no prior information whatsoever about the network topology. All decisions about bait selection are made on the go as the graph is gradually revealed by successive experiments. Our pay-as-you-go strategy is based on the following observation: given the scale-free degree distribution, it is unlikely that an arbitrary node will be detected as prey several times, unless it is highly connected. Here, we choose as the bait for the next experiment the protein that has been seen as prey most often so far. The scale-free and small-world properties of protein interaction networks ensure a short path between any two nodes and account for the quick convergence towards the hubs in the network, independent from the starting point. The pay-as-you-go strategy performs significantly better than random in seeing interactions, but is inferior to a degree-guided strategy (Fig. 2). Apart from efficiently covering interaction space without prior knowledge, the real strength of the method lies in its ability to generate confirmed interaction information. The performance in confirming interactions is almost identical to the greedy-confirmed and even slightly outperforms the degree-guided strategy (Fig. 2).
The pay-as-you-go strategy uses local information to select highly connected nodes as bait from the right-hand side of the degree distribution. Even on a random network, the pay-as-you-go strategy works more efficiently than a random ordering of the baits in confirming interactions (data not shown), as it tends to use above-average connected nodes first. A scale-free distribution boosts the performance of the method because, owing to its broad tail, the right-hand side of the distribution is much more pronounced than in a random distribution.
The method is resilient to the removal of hubs from the data (that is, truncation of the tail). It has been argued, at least for the yeast two-hybrid data, that a large fraction of the interactions adjacent to the hubs are due to systematic errors such as 'sticky' proteins or auto-activation and hence do not represent physiologically relevant data6,7. For example, JSN1 (YJR091C) has a total of 282 interactions reported in DIP, whereas the more reliable CORE subset contains only 6 of these interactions. The results labeled CORE (Supplementary Fig. 3 online) were obtained by simulations on the full DIP data set, but only counting the coverage of edges within the more reliable CORE subset. The pay-as-you-go strategy yields performance rates on this subset similar to that of the overall DIP network. We conclude that the pay-as-you-go strategy does not have a tendency to pick out less reliable information from the DIP network.
The strategy is also robust towards errors of different experimental techniques. The relative performance advantage over a random strategy stays level over wide areas of noise (Fig. 3). This opens the possibility of repeating experiments using the highly interacting proteins identified in the initial exploratory stages of the pay-as-you-go strategy in order to further enhance the reliability of the resulting data. The performance of the pay-as-you-go strategy seems to improve with the density of the network used. One reason for the weaker performance of the pay-as-you-go strategy on the CZ data set may be that this data set, albeit quite reliable, is less dense and falls into several disjoint components. This could be addressed by starting parallel efforts in key pathways simultaneously. At the same time, the current data sets probably represent an underestimate of the density of the true interactome8. In practice, not a single experiment but a whole batch of experiments is set up at a time. Hence, we also ran pay-as-you-go simulations in batch mode. In batch mode, a list of up to 200 nodes was used as baits in virtual pull-down experiments before the ordering of prospective baits was updated and the next batch of baits was selected. The overall performance of the pay-as-you-go strategy remained at the same level (see Supplementary Fig. 2 online).
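Batch mode only changes when the ranking of prospective baits is refreshed. As a rough illustration (the function `next_batch`, its arguments and the tie-breaking by protein name are our assumptions, not details given in the text), a batch could be drawn from the current prey counts like this:

```python
import heapq

def next_batch(indegree, used, batch_size=200):
    """Pick up to `batch_size` unused proteins, most often seen as prey first.

    `indegree` maps protein -> number of times detected as prey so far.
    Ties are broken by protein name, which is an arbitrary choice made
    here for determinism only.
    """
    candidates = [(count, protein) for protein, count in indegree.items()
                  if protein not in used]
    return [protein for count, protein in heapq.nlargest(batch_size, candidates)]

# Hypothetical prey counts after a few experiments; B was already used as bait.
indegree = {"A": 3, "B": 5, "C": 1, "D": 5}
print(next_batch(indegree, used={"B"}, batch_size=2))
```

All baits in the returned batch would then be run before the counts are updated and the next batch is selected.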
Like whole-genome shotgun sequencing9 or structural genomics10, the pay-as-you-go strategy provides a conceptual framework for tackling the whole interactome. The maximum impact of the pay-as-you-go strategy is realized in the initial explorative stage of covering the interactome. It is crucial in public large-scale proteomics efforts to coordinate the community effort from the start. We propose that a publicly funded central body (the HUman Proteome Organization, HUPO; http://www.hupo.org/) distribute the workload effectively to a number of associated research labs. A list of the most promising targets with the highest projected return of interaction information can be computed using the methods presented here and the data collected in public repositories, such as BIND11, DIP12 or IntAct13. Such a prioritized and continuously updated list is available from the IntAct website (see Supplementary Materials online). The pay-as-you-go strategy could detect up to 90% of the human interactome with less than a third of the proteome used as bait (Fig. 2). Thus, about 10,000 TAP-MS experiments should suffice to provide a scaffold for functional prediction for the rest of the human proteome. At the same time, our strategy would yield over 50% of the interactome as confirmed interactions. A random, uncoordinated strategy would require 2 to 4 times the experimental effort to yield the same coverage in terms of interactions seen and confirmed. As a consequence, a ten-year effort could be finished in three years with the same throughput of pull-down experiments performed per unit time.
To our knowledge, the algorithmic strategy proposed here is the first one to actively exploit the scale-free distribution in order to tackle a biologically relevant problem. Other methods use the small-world property14 or the topology15 of interaction networks for subsequent validation of experimental interaction information. Novel methods for accurate prediction of protein function16 and structure17 based on interaction networks are available, highlighting the need to obtain reliable interaction information for the whole proteome in an efficient manner. At the same time, the pay-as-you-go strategy could be applied to any network with a scale-free distribution. Therefore, the implications go beyond proteomics and functional genomics. For example, protein structures are found to form scale-free networks18, which suggests that such an algorithm could help to improve or speed up calculations for the prediction and comparison of protein structures.
Methods
Virtual pull-downs. To model the information gain from an individual experiment (tandem affinity or yeast two-hybrid), we introduce the procedure virtualPullDown(p, t). First, protein p is marked as bait at time-point t. Subsequently, all direct neighbors of p (in an experimentally determined interaction graph) are marked as prey if they have not previously been seen as prey from another bait. Similarly, all the edges adjacent to p are marked as "seen at time-point t" if they have not been seen before and "confirmed at time-point t" otherwise (Fig. 1). The latter corresponds to what has been termed 'reverse tagging' in TAP/MS experiments. As every undirected edge links exactly two nodes, we can model the complete process of covering an interactome with these additional attributes (that is, seen and confirmed).
We use the time-weight as an objective function to distinguish faster from slower strategies. Given an ordering of all n proteins (nodes) from 1 to n, we can run successive virtual pull-down experiments, one protein per time-point t, and measure how quickly a given fraction of the edges has been detected or confirmed. The time-weight will be lower when more edges are covered earlier in the process (Fig. 1). Hence, this indicates how quickly the interactions of the network are discovered by a given strategy. Now we are able to compare different orderings in terms of the time-weight of a graph by taking the sum of all time-points across all edges. This model allows us to answer different questions, for example:
- Given that we have funding (or time) for x number of experiments, how large is the fraction of the interaction network that we can expect to detect?
- How many experiments are needed to confirm a certain fraction (e.g., 50%) of the interactome?
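To make the time-weight concrete, here is a minimal sketch in Python under simplifying assumptions of our own: no false positives or negatives, no self-interactions, and every protein appears exactly once in the ordering. The function names and the toy graph are hypothetical. The constant contribution of the diagonal of the matrix in Fig. 1 is left out, since it is the same for every ordering.

```python
import math

def edge_times(edges, ordering):
    """Map each undirected edge to its (seen, confirmed) time-points (1-based)."""
    rank = {protein: t for t, protein in enumerate(ordering, start=1)}
    return {(a, b): (min(rank[a], rank[b]), max(rank[a], rank[b]))
            for a, b in edges}

def time_weight(edges, ordering):
    """Sum of all seen and confirmed time-points; lower means faster coverage."""
    return sum(seen + conf for seen, conf in edge_times(edges, ordering).values())

def fraction_seen(edges, ordering, x):
    """Fraction of edges seen after the first x experiments."""
    times = edge_times(edges, ordering)
    return sum(seen <= x for seen, _ in times.values()) / len(times)

def experiments_to_confirm(edges, ordering, fraction):
    """Smallest number of experiments that confirms at least `fraction` of edges."""
    confirmed = sorted(conf for _, conf in edge_times(edges, ordering).values())
    needed = max(1, math.ceil(fraction * len(confirmed)))
    return confirmed[needed - 1]

# Hypothetical toy network with hub 'c'.
EDGES = [("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
HUB_FIRST = ["c", "d", "a", "b", "e"]
HUB_LAST = ["a", "b", "e", "d", "c"]

print(time_weight(EDGES, HUB_FIRST), time_weight(EDGES, HUB_LAST))
print(fraction_seen(EDGES, HUB_FIRST, 1))
print(experiments_to_confirm(EDGES, HUB_FIRST, 0.5))
```

The two helper functions at the end correspond to the two example questions: coverage for a fixed budget of experiments, and the budget needed for a target level of confirmation.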
Procedure of a virtualPullDown(p, t). The procedure virtualPullDown models the information gain from a pull-down experiment with protein p as bait at time-point t. All nodes in the graph representing the protein-interaction network have the two additional attributes 'bait' and 'prey', containing the time-points when this protein was used as bait and when it was first detected as prey. All edges (a, b) of this graph have the additional attributes 'seen' and 'confirmed' to denote when this interaction has been detected (first hit) and confirmed (second hit). All these attributes are initially set to zero.
VirtualPullDown(Protein p, TimePoint t)
- p.bait = t
- N = getNeighbours(p, fp, fn)
- FOR ALL (n ∈ N)
-   IF (n.prey = 0) n.prey = t
-   IF ((p, n).seen = 0) (p, n).seen = t
-   ELSE (p, n).conf = t
- NEXT n
getNeighbours(Protein p, FalsePositiveRate fp, FalseNegativeRate fn)
- N = {n: (n, p) ∈ E}
- FOR ALL (n ∈ N)
-   IF (nextRandomNumber() < fp)
-     n = selectRandomNode()
-   IF (nextRandomNumber() < fn)
-     n.delete()
- NEXT n
- return N
The function getNeighbours() delivers the set of all nodes adjacent to the bait protein p. Any rate of false-positive (FP) or false-negative (FN) data is accounted for by the getNeighbours() function. First, FP% of the neighboring set is replaced by randomly selected nodes, then FN% of the nodes in the set are deleted at random. The result containing the generated errors is delivered back to the virtualPullDown() function, which then operates on this erroneous set of neighboring proteins.
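For readers who want to experiment with the procedure, the following is one possible runnable rendering in Python. It is a sketch rather than the authors' implementation: the graph is assumed to be a dictionary mapping each protein to the set of its neighbours, the attributes are kept in plain dictionaries (a missing key plays the role of the initial zero), and excluding the bait from its own prey set is our addition.

```python
import random

def get_neighbours(graph, p, fp=0.0, fn=0.0, rng=random):
    """Neighbours of bait p with simulated false positives and negatives."""
    observed = set()
    for n in graph[p]:
        if rng.random() < fp:        # false positive: swap in a random node
            n = rng.choice(list(graph))
        if rng.random() < fn:        # false negative: this prey is missed
            continue
        if n != p:                   # our addition: a bait is never its own prey
            observed.add(n)
    return observed

def new_state():
    """Empty bookkeeping; a missing key means 'not yet' (zero in the pseudocode)."""
    return {"bait": {}, "prey": {}, "seen": {}, "conf": {}}

def virtual_pull_down(graph, state, p, t, fp=0.0, fn=0.0, rng=random):
    """Record the outcome of using protein p as bait at time-point t."""
    state["bait"][p] = t
    for n in get_neighbours(graph, p, fp, fn, rng):
        state["prey"].setdefault(n, t)     # keep only the first prey time-point
        edge = frozenset((p, n))           # undirected edge
        if edge not in state["seen"]:
            state["seen"][edge] = t        # first hit
        else:
            state["conf"][edge] = t        # second hit, from the other side

# Hypothetical toy network with hub 'c'.
GRAPH = {"a": {"c"}, "b": {"c"}, "c": {"a", "b", "d"}, "d": {"c"}}
state = new_state()
virtual_pull_down(GRAPH, state, "c", 1)
virtual_pull_down(GRAPH, state, "a", 2)
print(len(state["seen"]), state["conf"])
```

With both error rates left at zero, the first experiment sees all three edges of the hub and the second confirms the edge between a and c.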
Implementation of the strategies. Although the virtualPullDown procedure is identical for all the strategies we analyzed, the strategies differ in the way in which they determine the next bait protein from the remaining set of nodes that have not yet been used as bait.
- Random. A protein is picked at random if it has not been used as a bait before.
- Greedy seen/confirmed. Given the overall topology of the network, for every node the number of adjacent edges not seen yet (k1) and the number of edges seen (k2) can be determined at any stage of coverage. By selecting the node with the highest k1, this strategy maximizes the number of edges seen, whereas selecting the node with the highest k2 maximizes the number of edges confirmed. In both cases, the process starts with the most highly connected node. In greedy seen, the strategy runs out of edges to detect at about 30 to 40% of nodes used as bait and then resorts to optimizing k2.
- Degree guided. All proteins are sorted according to their number of interactions in descending order. Subsequently, the nodes are used as bait in the resulting order.
- Pay-as-you-go. The next bait is determined by dynamically estimating the degree based on what is known about the network from experiments carried out so far. Details follow.
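The three strategies that do not learn from experiments can be sketched in a few lines of Python. This is an illustrative sketch under our own assumptions: the graph is a dictionary from protein to neighbour set without self-interactions, the function names are hypothetical, and in the greedy variants the second quantity is used as a tie-breaker throughout, which is a slight simplification of the description above.

```python
import random

def random_order(graph, rng=random):
    """Baseline: baits in uniformly random order."""
    order = list(graph)
    rng.shuffle(order)
    return order

def degree_guided_order(graph):
    """Baits sorted by (known) degree, hubs first."""
    return sorted(graph, key=lambda p: len(graph[p]), reverse=True)

def greedy_order(graph, mode="seen"):
    """Greedy ordering using the full topology.

    For an unused protein p, k1 counts adjacent edges not yet seen (the
    partner has not been a bait) and k2 counts edges seen but not confirmed.
    """
    used, order = set(), []

    def key(p):
        k2 = sum(n in used for n in graph[p])
        k1 = len(graph[p]) - k2
        return (k1, k2) if mode == "seen" else (k2, k1)

    while len(order) < len(graph):
        best = max((p for p in graph if p not in used), key=key)
        used.add(best)
        order.append(best)
    return order

# Hypothetical toy network with hub 'c'.
GRAPH = {"a": {"c"}, "b": {"c"}, "c": {"a", "b", "d"}, "d": {"c", "e"}, "e": {"d"}}
print(degree_guided_order(GRAPH))
print(greedy_order(GRAPH, "seen"), greedy_order(GRAPH, "confirmed"))
```

On a graph this small all informed orderings coincide; differences between them only become visible on larger networks.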
'Pay-as-you-go' strategy. The graph-theoretic principles of the pay-as-you-go strategy are based on the observation that during the entire screening process we can distinguish the following three subsets of proteins (nodes):
- P: the proteins that have been used as bait. For this subset we know the identity of their prey, and hence their degree, from experiments.
- Q: the proteins that have been detected in experiments as prey. Although the exact number of interactions of these proteins is unknown, we know how often each protein has occurred as prey so far.
- R: the rest of the proteome, about which we do not know anything yet.
In the beginning, there is no information about the network available. Hence P and Q are empty and all proteins are contained in R. The procedure is started by a few randomly selected baits (or baits also selected on the basis of external information) to seed the network. As more and more interactions are revealed in successive experiments, P and Q contain ever-larger fractions of the network. The algorithm finishes when both Q and R have been exhausted after every protein has been used as bait and P covers the whole network.
The bait-prey relationships resulting from the experiments can be modeled as directed edges leading from P into Q. So the information derived so far is stored as a directed graph. Although there are no edges leading into R, for all the prey proteins contained in Q we can determine the number of times they have been detected as prey from previously used bait proteins in P.
The next bait is determined by selecting the node with the maximum indegree (number of times this protein was detected as prey) and, among those, the maximum k (see below). Every time a pull-down is performed we get to know the number of neighbors (degree) of the bait protein. This degree is 'distributed' over all its prey proteins by adding 1/degree to the respective k of every prey, while the indegree of every prey protein is increased by 1. Because the number of times a protein has been seen as prey (indegree) is an integer number, at any given stage there can be a number of proteins in Q with the same maximum indegree. To break the tie among all the candidates in Q with the same maximum indegree, the one with the highest value of k, which is initially set to 0 for all nodes, is chosen. If no prey protein with an above-average indegree is found, the strategy resorts to choosing the next bait at random. This condition helps to seed the whole process in the initial stage and also applies toward reaching full coverage in the final phase.
The rationale behind this comes from the observation that hubs are less likely to be linked to other hubs. This is an additional local measure we developed so that prey proteins that interact with less connected proteins get higher k values than prey proteins that are connected to hubs. So k is an empirical measure that guides the strategy towards choosing baits that are linked to proteins not highly connected, and hence these baits are more likely to be hubs. It is noteworthy that k as a single indicator of degree without the indegree also performs better than random. However, the combination of indegree with k as described above is the most efficient local strategy we found so far.
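The selection rule can be sketched as follows. This is our reading of the description, not the authors' code: `score` stands for the accumulated k value of a prey, the graph is a dictionary from protein to neighbour set, experimental error is ignored, and "above-average indegree" is interpreted as strictly above the mean indegree of all proteins not yet used as bait, which is an assumption on our part.

```python
import random
from collections import defaultdict

def pay_as_you_go_order(graph, seed, rng=random):
    """Order baits using only information gathered from earlier pull-downs."""
    indegree = defaultdict(int)    # times each protein was detected as prey
    score = defaultdict(float)     # accumulated 1/degree shares from its baits
    used, order = set(), []
    bait = seed
    while True:
        used.add(bait)
        order.append(bait)
        prey = graph[bait]         # outcome of the virtual pull-down
        for n in prey:
            indegree[n] += 1
            score[n] += 1.0 / len(prey)
        remaining = [p for p in graph if p not in used]
        if not remaining:
            return order
        # Highest indegree wins; the accumulated score breaks ties.
        best = max(remaining, key=lambda p: (indegree[p], score[p]))
        mean_indegree = sum(indegree[p] for p in remaining) / len(remaining)
        if indegree[best] > mean_indegree:
            bait = best
        else:                      # nothing stands out: fall back to random
            bait = rng.choice(remaining)

# Hypothetical toy network; 'h' is the hub.
GRAPH = {"a": {"h"}, "b": {"h", "c"}, "c": {"b"}, "h": {"a", "b", "d"}, "d": {"h"}}
print(pay_as_you_go_order(GRAPH, seed="a"))
```

Starting from the peripheral protein a, the hub h is reached at the second step, which illustrates the quick convergence towards highly connected nodes described in the main text; the last two baits are chosen at random because their indegrees are equal.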
Data sets. The most comprehensive data on protein-protein interactions are available for Saccharomyces cerevisiae (baker's yeast), and for simulation purposes, we focus in this analysis on this model organism only. In detail, we analyzed four data sets:
- DIP12 (Database of Interacting Proteins) is a resource that has gathered information by manual annotation and also includes the high-throughput yeast two-hybrid data sets19,20. Excluding self-interactions, this set contains 14,844 interactions for 4,711 proteins in yeast.
- CORE21 is a subset of DIP which contains validated interactions. Hence the interactions contained in this set are more reliable. Consequently, this data set in itself is smaller and less dense; it contains only 4,357 interactions on 2,129 yeast proteins.
- CZ22 is a data set of 2,743 interactions determined for 1,297 proteins using tandem affinity purification and subsequent characterization by mass spectrometry. This data set, mainly targeted at the yeast orthologs of human genes, is the least dense and most fractionated of the networks analyzed.
- MDS23 is a data set that was created following a similar approach but with bait proteins overexpressed. This may explain why it is denser than the CZ network; it contains 8,040 interactions for 1,695 proteins.
For the simulations on the CORE data set, the actual simulation was done within the DIP network. The number of edges seen and confirmed was then determined only within the CORE subset of validated interactions. This analysis addresses the concern that by targeting highly connected proteins first, the resulting information would be less reliable, which is not the case (see Supplementary Fig. 3 online).
For the sake of simplicity, we have assumed symmetry by using an undirected graph for modeling interaction space. This means that an interaction is equally likely to be detected from either of the two interacting proteins in a virtualPullDown. This has been shown to be not necessarily true for both TAP/MS and Y2H data6,7. Our analysis methods could adjust for this by using a directed, weighted graph instead, where the weights on the edges represent the probabilities of detecting a protein along this relationship.
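As a rough illustration of what such an extension might look like (this variant is not part of the simulations reported here; the function name, the data layout and the assumption that edges are detected independently of each other are ours), a directed pull-down could draw each prey according to its edge weight:

```python
import random

def directed_pull_down(detect_prob, bait, rng=random):
    """Prey detected from `bait` in a directed, weighted interaction graph.

    `detect_prob` maps (bait, prey) pairs to the probability of detecting
    the prey when the bait is used; each edge is sampled independently.
    """
    return {prey for (b, prey), prob in detect_prob.items()
            if b == bait and rng.random() < prob}

# Hypothetical weights: b is always found from a, but a is never found from b.
DETECT = {("a", "b"): 1.0, ("b", "a"): 0.0, ("a", "c"): 1.0}
print(directed_pull_down(DETECT, "a"))
print(directed_pull_down(DETECT, "b"))
```

In this toy setting the interaction between a and b can be seen but never confirmed, which is the kind of asymmetry the undirected model cannot represent.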
Note: Supplementary information is available on the Nature Biotechnology website.
Received 9 June 2003; Accepted 10 October 2003; Published online: 7 December 2003.
REFERENCES
1. Karp, R.M. Reducibility among combinatorial problems. in Complexity of Computer Computations (eds. Miller, R.E. & Thatcher, J.W.) 85–103 (Plenum Press, New York, 1972).
2. Cormen, T.H., Leiserson, C.E. & Rivest, R.L. NP-Completeness. in Introduction to Algorithms 916–946 (MIT Press, Cambridge, Massachusetts, USA, 1999).
3. Rung, J., Schlitt, T., Brazma, A., Freivalds, K. & Vilo, J. Building and analysing genome-wide gene disruption networks. Bioinformatics 18 Suppl 2, S202–210 (2002).
4. Mrowka, R., Liebermeister, W. & Holste, D. Does mapping reveal correlation between gene expression and protein-protein interaction? Nat. Genet. 33, 15–16; author reply 16–17 (2003).
5. Ge, H., Liu, Z., Church, G.M. & Vidal, M. Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat. Genet. 29, 482–486 (2001).
6. Maslov, S. & Sneppen, K. Protein interaction networks beyond artifacts. FEBS Lett. 530, 255–256 (2002).
7. Aloy, P. & Russell, R.B. Potential artefacts in protein-interaction networks. FEBS Lett. 530, 253–254 (2002).
8. Sali, A., Glaeser, R., Earnest, T. & Baumeister, W. From words to literature in structural proteomics. Nature 422, 216–225 (2003).
9. Weber, J.L. & Myers, E.W. Human whole-genome shotgun sequencing. Genome Res. 7, 401–409 (1997).
10. Vitkup, D., Melamud, E., Moult, J. & Sander, C. Completeness in structural genomics. Nat. Struct. Biol. 8, 559–566 (2001).
11. Bader, G.D. & Hogue, C.W. BIND - a data specification for storing and describing biomolecular interactions, molecular complexes and pathways. Bioinformatics 16, 465–477 (2000).
12. Xenarios, I. et al. DIP: the database of interacting proteins. Nucleic Acids Res. 28, 289–291 (2000).
13. Orchard, S. et al. Progress in establishing common standards for exchanging proteomics data: the second meeting of the HUPO Proteomics Standards Initiative. Comparative and Functional Genomics 4, 203–206 (2003).
14. Goldberg, D.S. & Roth, F.P. Assessing experimentally derived interactions in a small world. Proc. Natl. Acad. Sci. USA 100, 4372–4376 (2003).
15. Saito, R., Suzuki, H. & Hayashizaki, Y. Construction of reliable protein-protein interaction networks with a new interaction generality measure. Bioinformatics 19, 756–763 (2003).
16. Vasquez, A., Flammini, A., Maritan, A. & Vespignani, A. Global protein function prediction from protein-protein interaction networks. Nature Biotechnology 21, 697–700 (2003).
17. Lappe, M., Park, J., Niggemann, O. & Holm, L. Generating protein interaction maps from incomplete data: application to fold assignment. Bioinformatics 17 (Suppl 1), S149–156 (2001).
18. Vendruscolo, M., Dokholyan, N.V., Paci, E. & Karplus, M. Small-world view of the amino acids that play a key role in protein folding. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 65:061910, published online 25 June 2002. DOI: 10.1103/PhysRevE.65.061910
19. Ito, T. et al. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. USA 98, 4569–4574 (2001).
20. Uetz, P. et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403, 623–627 (2000).
21. Deane, C.M., Salwinski, L., Xenarios, I. & Eisenberg, D. Protein interactions: two methods for assessment of the reliability of high throughput observations. Mol. Cell Proteomics 1, 349–356 (2002).
22. Gavin, A.C. et al. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415, 141–147 (2002).
23. Ho, Y. et al. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415, 180–183 (2002).
Acknowledgments
We thank P. Akan, R. Apweiler, M. Ashburner, D. Bolser, D. Bray, G. Cesareni, A. Griffiths, A. Heger, H. Hermjacob, F. Hollfelder, V. Kunin, M. Louis, L. Montecchi-Palazzi, C. Ouzounis, K. Paszkiewicz, T. Schlitt, J. Schulz, E. Ukkonen, M. Vendruscolo, M. Vingron, C. Webber, N. Wyatt, I. Xenarios, Cellzome AG and the IntAct team at the European Bioinformatics Institute for providing resources, valuable feedback and insightful discussions. M.L. is supported by the EMBL International PhD program and Biotechnology and Biological Sciences Research Council (BBSRC) grant 8/C19399.
Competing interests statement:
The authors declare that they have no competing financial interests.