MTGO: PPI Network Analysis Via Topological and Functional Module Identification

Protein-protein interaction (PPI) networks are viable tools to understand cell functions, disease machinery, and drug design/repositioning. Interpreting a PPI network, however, is a particularly challenging task because of network complexity. Several algorithms have been proposed for automatic PPI interpretation, at first by solely considering the network topology, and later by integrating Gene Ontology (GO) terms as node similarity attributes. Here we present MTGO (Module detection via Topological information and GO knowledge), a novel functional module identification approach. MTGO reveals the biomolecular machinery underpinning PPI networks by leveraging both biological knowledge and topological properties. In particular, it directly exploits GO terms during the module assembling process, and labels each module with its best-fit GO term, easing its functional interpretation. MTGO shows largely better results than other state-of-the-art algorithms (including recent GO-based ones) when searching for small or sparse functional modules, while providing comparable or better results in all other cases. MTGO correctly identifies molecular complexes and literature-consistent processes in an experimentally derived PPI network of myocardial infarction. A software version of MTGO is freely available for non-commercial purposes at https://gitlab.com/d1vella/MTGO.

1 Details of the MTGO algorithm

1.1 MTGO local metrics
The MTGO algorithm is based on two main functions: Modularity Variation $MV$, for the topological aspects, and Selection $\gamma$, for the biological aspects. These two functions are applied locally to each single module during the building process. In contrast, their global counterparts, Modularity (topological) and QGO (biological), are applied to the whole network (see Materials and Methods in the main text). $\gamma$ assigns to a single topological module $c_h$ a pair $t_B = (l_B, \delta_B)$, where $l_B$ is the GO term best describing the biological meaning of $c_h$, and $\delta_B$ is both (i) the list of network nodes associated with $l_B$ and (ii) the functional module linked to $c_h$. As a consequence, nodes are added or removed so that the topological module fits its assigned functional module. The main constraint on module modification is the $MV$ function, which preserves the topological nature of the module: a node can be added only if adding it improves the module topology.

Selection function
The $\gamma$ function is defined in Eq. (S1), where $c_h^k$ is the $h$-th topological module of iteration $k$; $\delta_{p,h}^k$ is the $p$-th element of $\Delta_h^k$; and $\Delta_h^k$ is the subset of the set $\Delta$ for the nodes of $c_h^k$. The aim of the Selection function $\gamma$ is to choose a GO term (represented by $\delta_{B,h}^k$) as the model that drives the building process of a topological module.
The GO term assigned to a topological module should ensure a good fit, providing both high overlap and high specificity. In general, a GO term with a high degree of overlap is likely to have low specificity for the topological module, and vice versa. Indeed, choosing a GO term that covers all the nodes of the topological module can lead to selecting a low-specificity GO term, i.e. one that contains many nodes not belonging to the topological module (nodes included in other topological modules of the network). This problem is depicted in Supplementary Figure 1. Selection $\gamma$ is therefore designed to find a trade-off between overlap and specificity. Supplementary Figure 2 shows the behavior of the Selection function.
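The tension between overlap and specificity can be illustrated with a minimal Python sketch. It is not Eq. (S1): the function names, the toy node sets, the term labels and the product used as a combined score are all assumptions chosen for illustration only.

```python
def overlap(module, delta):
    """Fraction of the module's nodes that are annotated with the GO term."""
    return len(module & delta) / len(module)


def specificity(module, delta):
    """Fraction of the GO term's nodes that fall inside the module."""
    return len(module & delta) / len(delta)


def pick_term(module, deltas):
    """Pick the GO term with the best combined score.

    The product overlap * specificity is a stand-in score (an assumption),
    not the Selection function of Eq. (S1).
    """
    return max(
        deltas,
        key=lambda go: overlap(module, deltas[go]) * specificity(module, deltas[go]),
    )


# Toy data mirroring the two cases of Supplementary Figure 1, plus a balanced term.
module = set(range(10))  # topological module with 10 nodes
deltas = {
    "term_broad": set(range(9)) | set(range(100, 130)),  # high overlap, low specificity
    "term_narrow": {0, 1, 2},                            # high specificity, low overlap
    "term_balanced": set(range(8)) | {200},              # reasonable on both
}
best = pick_term(module, deltas)
```

In this toy case the broad term covers nine of the ten module nodes but tags thirty nodes elsewhere, the narrow term lies entirely inside the module but covers only three nodes, and the balanced term is preferred by the stand-in score.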

MV function
We define the $MV$ function as

$$MV(c_h^k, v_i) = q(c_h^k + v_i) - q(c_h^k - v_i) \qquad \text{(S3)}$$

where $q$ (S4) represents the contribution of a single topological module to the global function Modularity $Q$ (S5); $c_h^k + v_i$ indicates the topological module including $v_i$, while $c_h^k - v_i$ indicates the topological module without $v_i$. With $q$ defined as the modularity contribution of $c_h^k$, $MV$ calculates the variation of $q$ due to adding the node $v_i$.
$C^k$ is the $k$-th partition and $H_k$ is its total number of topological modules; $e_h^k$ is the total number of edges in the $h$-th topological module; $d_h^k$ is the sum of the node degrees of the $h$-th topological module. Modularity $Q$ is the sum of the $q(c_h^k)$ over the modules $c_h^k$ of a partition $C^k$, i.e. $Q(C^k) = \sum_{h=1}^{H_k} q(c_h^k)$.

Supplementary Figure 1: Two cases of fitting of $\delta$s to a topological module $c_h^k$. In the first case, $\delta_{1,h}^k$ has a good overlap with the topological module, involving almost all of its nodes (green nodes), but it also tags many other nodes (yellow nodes) outside the topological module. Thus, $\delta_{1,h}^k$ is not very specific to this topological module. In the second case, $\delta_{2,h}^k$ is very specific to the topological module, since almost all $\delta_{2,h}^k$ nodes are included in it, but it has a low overlap (green nodes).
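The Modularity Variation (S3) can be sketched in a few lines of Python. Since Eq. (S4) is not reproduced here, the sketch assumes the standard Newman form of the per-module contribution, q = e/m - (d/2m)^2, with m the total number of edges in the network; the adjacency-dictionary representation and the function names are likewise illustrative assumptions.

```python
def q(module, adj):
    """Modularity contribution of one module (assumed Newman form)."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2  # total edges in the network
    if not module or m == 0:
        return 0.0
    # e: edges with both endpoints inside the module (each is seen twice).
    e = sum(1 for u in module for w in adj[u] if w in module) / 2
    # d: sum of the degrees of the module's nodes.
    d = sum(len(adj[u]) for u in module)
    return e / m - (d / (2 * m)) ** 2


def mv(module, v, adj):
    """Variation of q between the module with v and the module without v."""
    return q(module | {v}, adj) - q(module - {v}, adj)


# Toy graph: two triangles joined by the edge 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
left, right = {0, 1}, {3, 4, 5}
gain_left = mv(left, 2, adj)    # node 2 completes the left triangle
gain_right = mv(right, 2, adj)  # node 2 has a single edge into the right triangle
```

On this toy graph the variation is positive for the left module and negative for the right one, so node 2 would only be accepted by the module whose topology it improves.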

Details of the Iteration phase
Supplementary Figure 2: The Selection function performs a trade-off when associating a $\delta_p$ with a topological module, balancing the $\delta_p$ nodes internal to the topological module against the external ones (Supplementary Figure 1). The plot shows the behavior of the Selection function when the size of the topological module is 10 nodes and the size of the $\delta_p$ ranges from 1 to 100. Each curve shows the function value for a different value of the intersection between the topological module and $\delta_p$, i.e. how many nodes are internal to the module. The intersection values are shown on the right. The Selection value decreases both as the $\delta_p$ size shrinks and as the intersection value increases.

MTGO follows an iterative process. At the $k$-th iteration, a pair $(C^k, \Phi^k)$ is computed: $C^k$ by re-assigning the nodes of the previous partition $C^{k-1}$, and $\Phi^k$ by selecting the elements of $T$ that best describe $C^k$. Each partition is made of topological modules $c_h^k$, with $h$ the index of the single topological module and $1 \le h \le H_k$ (the total number of modules $H_k$ varies at each iteration). Ideally, MTGO aims to assign nodes such that topological modules coincide with functional modules. In detail, the iteration phase consists of two main sub-processes.
Topological modules are processed in random order at each iteration $k$. Each topological module with size < minSize (see main text Section 1.5 for a parameter description) is discarded and its nodes are added to the Temporary Node List (TNL), which acts as a temporary repository for discarded nodes. Discarding small topological modules is how MTGO decreases $H$, i.e. the number of modules, between two consecutive iterations. For each remaining topological module $c_h^k$, four steps are executed. In the first two, the Selection function $\gamma$ associates a $\delta_{B,h}^k$ with $c_h^k$, and the nodes are divided into three sets, $V_a$, $V_B$ and $V_c$, as follows: $V_a$ are the nodes shared by $\delta_{B,h}^k$ and $c_h^k$; $V_B$ are the nodes belonging to $c_h^k$ but not to $\delta_{B,h}^k$; $V_c$ are the nodes belonging to $\delta_{B,h}^k$ but not to $c_h^k$. Note that $V_c$ nodes belong to other topological modules of the partition. At the end of this stage, each $c_h^k$ has its associated $\delta_{B,h}^k$, generating $V_a$, $V_B$ and $V_c$.
1.3 The set $V_B$ is moved to the TNL. $V_B$ nodes are excluded from $\delta_{B,h}^k$, and are therefore not significantly related to the biological meaning assigned to $c_h^k$.
1.4 A single node $v_i \in V_c$ belongs either to another topological module of the current partition $C^k$ or to the TNL (e.g. it was assigned to the TNL while processing a previous pair $c_m^k$ and $\delta_{B,m}^k$). $v_i$ is moved to the topological module $c_h^k$ if either (i) it belongs to the TNL, or (ii) its Modularity Variation (S3) favors $c_h^k$ over $c_m^k$, where $c_m^k$ is the other topological module of $C^k$ containing it. In other words, $v_i$ is added to $c_h^k$ if the topological quality is increased by adding it. Figure 5 in the main text graphically describes this process.
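The node sets used in these steps map directly onto set operations. The following Python sketch shows the split into the three sets and the move of the second set to the TNL (Step 1.3); the protein labels and the function name are hypothetical, and the MV test of Step 1.4 is deliberately left out.

```python
def split_nodes(module, delta_best):
    """Split nodes into the three sets used while reshaping a module."""
    v_a = module & delta_best  # shared by the module and the GO term
    v_b = module - delta_best  # in the module, not annotated with the GO term
    v_c = delta_best - module  # annotated with the GO term, outside the module
    return v_a, v_b, v_c


module = {"P1", "P2", "P3", "P4"}      # hypothetical topological module
delta_best = {"P2", "P3", "P4", "P7"}  # hypothetical nodes of the selected GO term
v_a, v_b, v_c = split_nodes(module, delta_best)

tnl = set()
tnl |= v_b     # Step 1.3: these nodes leave the module and wait in the TNL
module -= v_b  # the module keeps only nodes annotated with its GO term
```

Here P1 is moved to the TNL, while P7 becomes a candidate for Step 1.4.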
Once these four steps have been executed for all the topological modules of $C^k$, MTGO performs another size check. Topological modules with size < minSize are discarded, and their nodes are added to the TNL. At this point, all nodes have been assigned either to a topological module or to the TNL. Finally, the TNL must be emptied.
In this step the nodes of the TNL are re-assigned to the topological modules:
2.1 All the TNL nodes with at least one associated $\delta_p$ (the set $N_{GO}$) are used to create a new topological module $c_{TNL}$ (Figure 6 in the main text).
2.2 Each node $v_i$ without any associated $\delta_p$ is assigned to the existing topological module that maximizes $MV(c_h^k, v_i)$ (S3) (Figure 6 in the main text).
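The size check and the emptying of the TNL can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, the `score` callable stands in for MV (S3) and is supplied by the caller, the newly created module is treated as a candidate in Step 2.2, and at least one module is assumed to exist when unannotated nodes are re-assigned.

```python
def size_check(modules, tnl, min_size):
    """Discard modules smaller than min_size; their nodes go to the TNL."""
    kept = []
    for module in modules:
        if len(module) < min_size:
            tnl.update(module)
        else:
            kept.append(module)
    return kept


def empty_tnl(modules, tnl, annotated, score):
    """Re-assign every TNL node to a module, then clear the TNL.

    annotated: nodes with at least one associated GO term.
    score(module, node): stand-in for MV (S3), an assumption of this sketch.
    """
    n_go = tnl & annotated
    if n_go:
        modules.append(set(n_go))         # Step 2.1: new module from annotated nodes
    for node in sorted(tnl - annotated):  # Step 2.2: best-scoring module wins
        target = max(modules, key=lambda mod: score(mod, node))
        target.add(node)
    tnl.clear()
    return modules


# Toy network: a triangle a-b-c, an edge d-e, and y attached to a.
adj = {
    "a": {"b", "c", "y"}, "b": {"a", "c"}, "c": {"a", "b"},
    "d": {"e"}, "e": {"d"}, "y": {"a"},
}


def score(module, node):
    """Toy score: number of neighbours the node has inside the module."""
    return len(adj[node] & module)


tnl = {"y"}
modules = size_check([{"a", "b", "c"}, {"d", "e"}], tnl, min_size=3)
modules = empty_tnl(modules, tnl, annotated={"a", "b", "c", "d", "e"}, score=score)
```

In this toy run the two-node module is discarded by the size check, its annotated nodes are regrouped into a new module, and the unannotated node y joins the triangle it is connected to.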

Parameters
The parameter minSize ranges from 2 to 15 and sets the minimum number of nodes in a module. The parameter maxSize ranges from 30 to 300 and sets the maximum number of nodes in a module. Tuning these parameters adjusts the final output to the needs of a specific study. For example, if the user is interested in identifying a specific process or small protein complexes, minSize and maxSize should be set to small values (e.g. 2 and 5). To identify the general, high-level biological processes involved in the network, on the other hand, the user should set minSize and maxSize to high values (e.g. 10 and 200). Supplementary Figure 4 shows an example of how different results can be obtained from the same network by changing the parameters.

Density optimization result
MTGO also provides an alternative output $R_D$ based on density optimization, to be used for poorly GO-enriched networks (see MTGO User Manual), for example when the GO term list is very poor and the percentage of nodes covered by the GO terms is under 50%. The solution $R_D = (C_D, \Phi_D)$ is the one with the maximum mean density over topological modules. For this solution, the set $\Phi_D$ is poor in terms of biological quality, while the set $C_D$ preserves its quality in terms of topological properties.
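Choosing a partition by mean module density can be sketched as below. The text does not give the density formula, so the sketch assumes the usual density of an undirected subgraph, 2e / (n(n - 1)); the function names and the toy graph are hypothetical.

```python
def density(module, adj):
    """Edge density of the subgraph induced by a module (assumed definition)."""
    n = len(module)
    if n < 2:
        return 0.0
    # Edges with both endpoints inside the module (each is seen twice).
    e = sum(1 for u in module for w in adj[u] if w in module) / 2
    return 2 * e / (n * (n - 1))


def mean_density(partition, adj):
    """Mean density over the topological modules of one partition."""
    return sum(density(mod, adj) for mod in partition) / len(partition)


def pick_by_density(partitions, adj):
    """Return the partition with the highest mean module density."""
    return max(partitions, key=lambda part: mean_density(part, adj))


# Toy graph: two triangles joined by the edge 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
partitions = [
    [{0, 1, 2}, {3, 4, 5}],  # two complete triangles
    [{0, 1}, {2, 3, 4, 5}],  # node 2 moved to the other side
]
best = pick_by_density(partitions, adj)
```

The first candidate partition consists of two fully connected modules and is therefore the one selected.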

Evaluation Formulas
We evaluated MTGO and the other algorithms according to Recall, Accuracy [1] and Maximum Matching Ratio (MMR). Recall is the fraction of true complexes matched by at least one predicted complex, where a match is established through the affinity score (AS) [2] between a Target Complex (TC, module) and a Predicted Complex (PC, module). For this work, we set the AS threshold to 0.5. Recall is calculated from $N_{APC}$, the set of PC with AS > 0.5. Accuracy is defined in terms of Sensitivity and Positive Predictive Value (PPV), where $i$ indicates the $i$-th Target Complex (TC) and $j$ the $j$-th Predicted Complex (PC) [2].
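The Recall computation can be illustrated with a short Python sketch. The AS equation is not reproduced in this text, so the sketch assumes the commonly used overlap form, |TC ∩ PC|^2 / (|TC| |PC|); the function names and the toy complexes are hypothetical, and Recall follows the verbal definition above (target complexes matched by at least one predicted complex).

```python
def affinity_score(tc, pc):
    """Overlap score between a target and a predicted complex (assumed form)."""
    shared = len(tc & pc)
    return shared ** 2 / (len(tc) * len(pc))


def recall(targets, predicted, threshold=0.5):
    """Fraction of target complexes matched by at least one predicted complex."""
    matched = sum(
        1
        for tc in targets
        if any(affinity_score(tc, pc) > threshold for pc in predicted)
    )
    return matched / len(targets)


targets = [{"A", "B", "C", "D"}, {"E", "F", "G"}]    # hypothetical target complexes
predicted = [{"A", "B", "C"}, {"G", "H", "I", "J"}]  # hypothetical predicted complexes
first_score = affinity_score(targets[0], predicted[0])  # 3**2 / (4 * 3)
value = recall(targets, predicted)
```

With the threshold of 0.5, only the first target complex is matched in this toy example, so half of the targets are recovered.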
MMR was introduced as a specific measure for module identification algorithms [3]. It is based on the maximal one-to-one mapping between PC and TC. It was proposed to overcome an Accuracy-related issue specific to module identification algorithms, namely the misleading role of the Positive Predictive Value when some proteins of a TC are present either in more than one PC or in none.

Q and QGO trends
Supplementary Figure 5 shows the trend of the functions Q and QGO. These two functions evaluate the global topological and functional properties of each partition. Each red line in Supplementary Figure 5 depicts the value of Q computed with the fast greedy modularity optimization algorithm [4], and can be considered the reference value of the maximum reachable modularity for each network. In the initial iterations the modularity increases quickly and almost reaches its maximum value, while QGO shows a slower, steady increase. After reaching its peak, the modularity decreases, allowing a re-arrangement of the partition that improves the GO quality of the topological modules (as supported by the slow increase of the global QGO). However, at the last iteration, when QGO reaches its maximum value, Q remains positive (i.e. GO quality increases at the expense of the best modularity, but without crippling the topological properties of the partition). MTGO also provides an alternative model $R_D$ based on density optimization, to be used for poorly GO-enriched networks (see MTGO User Manual and Section 1.6 in the main text).

Case Study Figures
Supplementary Figure 6 (Response to metal ion): Detail of the network represented in Figure 11 in the main text. Small nodes represent proteins excluded from the GO term assigned to the topological module. Big nodes (both circles and rectangles) indicate proteins associated with the GO term assigned to the topological module, while rectangles indicate nodes assigned to more than one functional module.

Table 1 shows the running time for networks with different numbers of nodes and edges on an Intel Core i3-4170 (3.7 GHz CPU) PC with 4 GB of RAM. The main aim of MTGO is to evaluate a specific biological context (represented by a PPI network model) by identifying the main biological functions (represented by GO terms) acting at the base of the system. For this reason, it has been designed to analyze small and medium-size PPI networks that represent a specific tissue or biological condition, such as the Myocardial Infarction Network. MTGO scales well with the number of edges, but not with the number of nodes. MTGO is able to provide very good results on big networks as well, as we showed in the Results section, but with a considerable increase in computation time. If needed, the user can take the partial results that MTGO provides every 30 iterations, without waiting for the algorithm to plateau (see the User Manual at https://gitlab.com/d1vella/MTGO).
Computational complexity has been evaluated taking into account the number of nodes ($n$), the number of iterations ($k$), the number of nodes in a module ($g$) (for simplicity, we suppose all the modules contain the same number of nodes), and the number of GO terms ($j$). Simplifying, the method can be seen as the iteration of two main steps, $k\,[\text{Step1} + \text{Step2}]$ (these phases are described in the Methods section). The two main functions of the algorithm have complexity ModularityVariation → $O(ng)$ (S3) and Selection → $O(j)$ (S1). In Step1 the two main processes involve (i) the computation of the Selection for each module (the number of modules is $n/g$) and (ii) the computation of ModularityVariation for each module. Therefore, the time complexity of Step1 is $O(n^2)$. Step2 consists of the computation of ModularityVariation for each node in the Temporary Node List (TNL) (in the worst case the TNL contains all the network nodes) and for each module; therefore the complexity of Step2 is $O(n^3)$. This implies that each of the $k$ iterations has time complexity $O(n^3)$.
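The counting argument can be written out explicitly. The derivation below only restates the reasoning of the text under its own simplification (all modules of size $g$), with the additional assumption that the Selection term does not dominate, i.e. $j \le ng$.

```latex
\begin{align*}
\text{Step1:}\quad
  & \underbrace{\frac{n}{g}\, O(j)}_{\text{Selection, once per module}}
  + \underbrace{\frac{n}{g}\, O(ng)}_{\text{MV, once per module}}
  = O\!\left(\frac{nj}{g}\right) + O(n^{2}) = O(n^{2})
  \quad \text{if } j \le ng \\
\text{Step2:}\quad
  & \underbrace{n}_{\text{TNL nodes, worst case}}
  \cdot \underbrace{\frac{n}{g}}_{\text{modules}}
  \cdot O(ng) = O(n^{3}) \\
\text{Per iteration:}\quad
  & O(n^{2}) + O(n^{3}) = O(n^{3}),
  \qquad \text{so } k \text{ iterations cost } O(k\, n^{3})
\end{align*}
```

The module size $g$ cancels in both steps, which is why the bound depends on the number of nodes and not on how they are grouped.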