Extreme multifunctional proteins identified from a human protein interaction network

Moonlighting proteins are a subclass of multifunctional proteins whose functions are unrelated. Although they may play important roles in cells, there has been no large-scale method to identify them, nor any effort to characterize them as a group. Here, we propose the first method for the identification of ‘extreme multifunctional’ proteins from an interactome, as a first step towards characterizing moonlighting proteins. By combining network topological information with protein annotations, we identify 430 extreme multifunctional proteins (3% of the human interactome). We show that the candidates form a distinct sub-group of proteins, characterized by specific features that together form a signature of extreme multifunctionality. Overall, extreme multifunctional proteins are enriched in linear motifs and less intrinsically disordered than network hubs. We also provide MoonDB, a database containing information on all the candidates identified in the analysis and a set of manually curated human moonlighting proteins.

Each of the figures below shows our candidates compared to all nodes (Network), network hubs (nodes whose degree is at least twice the network average), nodes that belong to multiple clusters but are not candidates (Multi NC), all nodes that belong to multiple clusters (Multi), nodes that belong to one cluster alone (Mono) and all non-candidate nodes (NC). The width of the boxes is proportional to the number of nodes in each group. The numbers in parentheses under each group are the Wilcoxon p-values. Red dots indicate the mean values and yellow dots the values of known moonlighting proteins. Outliers are not shown.

Figure 6: Stretches of consecutive disordered residues of different lengths

Supplementary
The number of stretches of consecutive disordered amino acids, as predicted by DISOPRED, was measured using sliding windows of different lengths. The results are shown in the figure below. Note that the candidates consistently have fewer stretches of any given length than hubs.
Supplementary Figure 6: Stretches of consecutive disordered residues as predicted by DISOPRED. Panel b shows the same data as panel a restricted to three groups, for clarity.
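As an illustration of the measurement described above, the following minimal Python sketch counts disordered stretches in a single protein. It assumes the predictor output has already been converted to a per-residue boolean mask (True for disordered), and it reads "stretches of a given length" as maximal runs of at least that many consecutive disordered residues. Both the input format and this reading are assumptions made for the example, and the function names are hypothetical; this is not the code used in the analysis.

```python
from itertools import groupby


def disordered_runs(mask):
    """Return the lengths of all maximal runs of True in a per-residue mask."""
    return [sum(1 for _ in group) for is_disordered, group in groupby(mask) if is_disordered]


def count_stretches(mask, min_lengths=(5, 10, 20, 30)):
    """For each window length, count the runs that are at least that long."""
    runs = disordered_runs(mask)
    return {length: sum(1 for run in runs if run >= length) for length in min_lengths}


# Toy mask: 'D' marks a residue predicted as disordered (hypothetical encoding).
toy = ".." + "D" * 5 + "..." + "D" * 2 + "." + "D" * 11
mask = [residue == "D" for residue in toy]
print(count_stretches(mask, min_lengths=(2, 5, 10)))  # {2: 3, 5: 2, 10: 1}
```

Counting maximal runs rather than every fully disordered window position means that a long disordered region contributes one stretch per window length, not one per offset.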

Protein disorder as predicted by different software
Each of the figures below shows protein disorder as calculated by different predictors for our candidates compared to all nodes (Network), network hubs (nodes whose degree is at least twice the network average), nodes that belong to multiple clusters but are not candidates (Multi NC), all nodes that belong to multiple clusters (Multi), nodes that belong to one cluster alone (Mono) and all non-candidate nodes (NC).

The coordinates of all known Eukaryotic Linear Motifs (ELMs) in our network's proteins were downloaded from the ELM database (http://elm.eu.org, Van Roey et al., 2013). For each protein, the number of ELMs lying completely within disordered regions was counted and divided by the total number of disordered residues in that protein, to correct for differences in protein and disordered-region length. The results are shown below.
Supplementary Figure 17: The number of ELMs per disordered residue for each group.
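The normalization described above can be sketched as follows. The example assumes that ELMs and disordered regions are both given as inclusive, 1-based (start, end) coordinate pairs and that disordered regions do not overlap. The function name, the input format and the choice to return None for proteins without any disordered residue are assumptions made for illustration only.

```python
def elms_per_disordered_residue(elms, disordered_regions):
    """Number of ELMs fully inside a disordered region, per disordered residue.

    elms, disordered_regions: lists of inclusive (start, end) pairs.
    Returns None when the protein has no disordered residues (assumed convention).
    """
    n_disordered = sum(end - start + 1 for start, end in disordered_regions)
    if n_disordered == 0:
        return None
    n_inside = sum(
        1
        for motif_start, motif_end in elms
        # A motif counts only if both of its ends fall within one region.
        if any(
            region_start <= motif_start and motif_end <= region_end
            for region_start, region_end in disordered_regions
        )
    )
    return n_inside / n_disordered


# Toy protein: two of the three motifs lie entirely inside a disordered region,
# and there are 10 + 8 = 18 disordered residues in total.
density = elms_per_disordered_residue(
    elms=[(3, 6), (9, 14), (20, 23)],
    disordered_regions=[(1, 10), (18, 25)],
)
print(density)
```

A motif that only partially overlaps a disordered region, such as (9, 14) in the toy example, is not counted, which matches the "completely within" criterion.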

Results on unbiased network
As a control against bias introduced by extensively studied proteins, the analyses on the human interactome were repeated on a smaller human PPI network (15617 interactions between 4494 proteins), built exclusively from large-scale yeast two-hybrid data (CCSB network; Yu, 2011). We include the results of these analyses here.
Each of the figures below shows our candidates compared to all nodes (Network), network hubs (nodes whose degree is at least twice the network average), nodes that belong to multiple clusters but are not candidates (Multi NC), all nodes that belong to multiple clusters (Multi), nodes that belong to one cluster alone (Mono) and all non-candidate nodes (NC).

Supplementary Note
On the choice of similarity measure
In the work presented here, we have chosen not to use semantic similarity measures (SSMs) to assess the similarity of GO terms. Instead, we have developed two new metrics based on GO term co-occurrence, as explained in the Methods section. Given the popularity and widespread use of SSMs, an explanation of this choice is warranted.
We chose not to use semantic similarity measures first because they are particularly ill-suited for the detection of moonlighting or extreme multifunctional proteins. Semantic similarity measures rely on the structure of the Gene Ontology (GO) Directed Acyclic Graph (DAG). They consider two terms to be similar either by measuring their distance as the number of edges connecting them, or by evaluating their information content. These methods can therefore identify semantically similar GO terms, that is, cases where the terms are linked in the structure of the DAG. A trivial example would be "cell death" and "cytolysis", since the latter is a child term of the former. However, the terms "response to tumor necrosis factor" (GO:0034612) and "positive regulation of apoptotic process" (GO:0043065) share no parent terms apart from the root of the ontology and are therefore semantically different. Yet, from a biological perspective, they are clearly connected, since TNF is a well-known inducer of apoptosis. The terms can therefore be considered functionally similar, since they are different descriptions of the same or tightly linked biological processes. Because our metrics are based on term co-occurrence, they are independent of the structure of the DAG and can therefore identify such non-semantic, yet functional, similarity.
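To make the idea of a DAG-independent, co-occurrence-based score concrete, the sketch below estimates how surprising it is that two terms annotate the same proteins. It is not the PrOnto implementation: the hypergeometric model, the function names and the toy annotation table are all assumptions chosen for illustration. A low probability means that the two terms co-occur more often than expected by chance, which is consistent with reading one minus the probability as a similarity.

```python
from math import comb


def cooccurrence_counts(annotations, term_a, term_b):
    """Count proteins annotated to term_a, to term_b, and to both.

    annotations: dict mapping a protein identifier to a set of GO term identifiers.
    """
    n_a = sum(1 for terms in annotations.values() if term_a in terms)
    n_b = sum(1 for terms in annotations.values() if term_b in terms)
    n_both = sum(1 for terms in annotations.values() if term_a in terms and term_b in terms)
    return n_a, n_b, n_both


def cooccurrence_probability(n_total, n_a, n_b, n_both):
    """Hypergeometric upper tail: P(at least n_both shared proteins by chance)."""
    favourable = sum(
        comb(n_a, k) * comb(n_total - n_a, n_b - k)
        for k in range(n_both, min(n_a, n_b) + 1)
    )
    return favourable / comb(n_total, n_b)


# Toy annotation table (hypothetical proteins; the GO identifiers are those cited in the text).
annotations = {
    "P1": {"GO:0034612", "GO:0043065"},
    "P2": {"GO:0034612", "GO:0043065"},
    "P3": {"GO:0034612", "GO:0043065"},
    "P4": set(), "P5": set(), "P6": set(), "P7": set(),
    "P8": set(), "P9": set(), "P10": set(),
}
n_a, n_b, n_both = cooccurrence_counts(annotations, "GO:0034612", "GO:0043065")
p = cooccurrence_probability(len(annotations), n_a, n_b, n_both)
print(n_a, n_b, n_both, p)
```

Because the score depends only on which proteins carry which annotations in a given species, two terms with no shared parent in the DAG can still be scored as functionally related.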
Whether SSMs are the right tool for the job depends entirely on the biological question being asked. SSMs are the tool of choice when, for example, reducing a large list of GO term annotations to a more concise one. On the other hand, as evidenced by their name, semantic similarity measures are geared towards identifying similar terms. In the work presented here, we needed a method for identifying dissimilar pairs. SSMs are not suited for this task since the absence of similarity is not the same as the presence of dissimilarity. In other words, while two terms may not be semantically similar, that does not necessarily make them dissimilar.
To illustrate this particular point, which is crucial for the rationale of our approach, we picked 1000 random pairs of Biological Process GO terms (since the BP ontology is used in our work), computed their PrOnto probabilities on the one hand and their SSM values on the other, and plotted the two against each other (see Figure 1).
In the plots below, each point represents a randomly chosen GO term pair; one axis shows one minus the probability returned by PrOnto and the other the semantic similarity value. Because we are plotting 1-PrOnto, high values mean high similarity. Two points are immediately obvious: i) there is no significant correlation between the two measures and ii) all the SSM methods consider only a small minority of term pairs to be similar, whereas the vast majority cluster at 0-0.1. Therefore, using SSM approaches to classify dissimilar functions, as opposed to the similar functions they were developed to detect, would result in most term pairs being considered dissimilar.
In addition, we wanted a way to assess term similarity that was i) based on real biological data as opposed to the mathematical abstraction of the DAG and ii) species-specific. Since SSMs are based on the structure of the ontology, they do not take into account the different frequencies of GO terms in different species.
Finally, we also wanted a method that took into account the frequency of interaction between proteins annotated to each of a given pair of terms. Since one measure of the similarity of two biological processes is how much cross talk there is between them, the PrOnto interaction probabilities give yet another dimension that the SSMs cannot offer.