Assessment of community efforts to advance network-based prediction of protein–protein interactions

Comprehensive understanding of the human protein-protein interaction (PPI) network, also known as the human interactome, can provide important insights into the molecular mechanisms of complex biological processes and diseases. Despite the remarkable experimental efforts undertaken to date to determine the structure of the human interactome, many PPIs remain unmapped. Computational approaches, especially network-based methods, can facilitate the identification of previously uncharacterized PPIs. Many such methods have been proposed. Yet, a systematic evaluation of existing network-based methods in predicting PPIs is still lacking. Here, we report community efforts initiated by the International Network Medicine Consortium to benchmark the ability of 26 representative network-based methods to predict PPIs across six different interactomes of four different organisms: A. thaliana, C. elegans, S. cerevisiae, and H. sapiens. Through extensive computational and experimental validations, we found that advanced similarity-based methods, which leverage the underlying network characteristics of PPIs, show superior performance over other general link prediction methods in the interactomes we considered.


A. Similarity-based methods
In similarity-based methods, each non-observed node pair (i, j) is assigned a similarity score s_ij, where a higher score is assumed to represent a greater probability of link existence. The similarity score can be defined in many different ways and may rely on local information only, or on some combination of local and global information.
Common Neighbors. The common neighbors algorithm quantifies the overlap or similarity of two nodes as follows [12]:
s_ij = |Γ(i) ∩ Γ(j)|,
where Γ(i) denotes the set of neighbors of node i, ∩ denotes the intersection of two sets, and |X| denotes the cardinality (size) of set X.
Adamic-Adar Index. This similarity index measures the similarity between two entities based on the connectivity of their shared neighbors [13]:
s_ij = Σ_{u ∈ Γ(i) ∩ Γ(j)} 1 / log k_u,
where k_u = |Γ(u)| is the degree of node u.
Jaccard Index. The Jaccard index measures the overlap of two nodes normalized by their total number of neighbors [14]:
s_ij = |Γ(i) ∩ Γ(j)| / |Γ(i) ∪ Γ(j)|,
where ∪ denotes the union of two sets.
Cosine Similarity (i.e., Salton Index). The cosine similarity is closely related to the Jaccard index, with the normalization being the product of the node degrees:
s_ij = |Γ(i) ∩ Γ(j)| / √(k_i k_j).
Preferential Attachment Index. This index assumes that the likelihood of a link between two nodes is proportional to the product of their degrees [15]:
s_ij = k_i × k_j.
Resource Allocation Index. This index is based on a resource allocation process between pairs of nodes [16,17]. The similarity between a node pair (i, j) is defined as the amount of resource that j receives from i through their common neighbors:
s_ij = Σ_{u ∈ Γ(i) ∩ Γ(j)} 1 / k_u.
Here, we assume that each common neighbor holds a unit of resource and distributes it equally among all its neighbors.
Hub Promoted Index. This index promotes link formation between hub nodes and low-degree nodes [18]:
s_ij = |Γ(i) ∩ Γ(j)| / min(k_i, k_j).
Hub Depressed Index. This index, in contrast, depresses link formation adjacent to hub nodes [18]:
s_ij = |Γ(i) ∩ Γ(j)| / max(k_i, k_j).
Sorensen Index. This index is very similar to the Jaccard index but is more robust against outliers [19]:
s_ij = 2|Γ(i) ∩ Γ(j)| / (k_i + k_j).
L3 Index. This index is based on network paths of length three [20]:
s_ij = Σ_{u,v} a_iu a_uv a_vj / √(k_u k_v),
where a_iu = 1 if nodes i and u interact, and zero otherwise.
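A minimal sketch (not the benchmark implementation) of how a few of the local indices above can be computed for all non-observed pairs of a small placeholder network with NetworkX:

```python
# Sketch: Common Neighbors, Jaccard, Resource Allocation and L3 scores
# for every non-adjacent node pair of a toy network.
import itertools
import networkx as nx


def local_similarity_scores(G):
    """Return CN, Jaccard, RA and L3 scores for every non-adjacent node pair."""
    scores = {}
    for i, j in itertools.combinations(G.nodes(), 2):
        if G.has_edge(i, j):
            continue
        Ni, Nj = set(G[i]), set(G[j])
        common = Ni & Nj
        cn = len(common)                                    # Common Neighbors
        jaccard = cn / len(Ni | Nj) if (Ni | Nj) else 0.0   # Jaccard index
        ra = sum(1.0 / G.degree(u) for u in common)         # Resource Allocation
        # L3: degree-normalized paths of length three i - u - v - j
        l3 = sum(
            1.0 / (G.degree(u) * G.degree(v)) ** 0.5
            for u in Ni
            for v in (set(G[u]) & Nj)
        )
        scores[(i, j)] = {"CN": cn, "Jaccard": jaccard, "RA": ra, "L3": l3}
    return scores


G = nx.karate_club_graph()          # placeholder network, not a PPI dataset
ranked = sorted(local_similarity_scores(G).items(),
                key=lambda kv: kv[1]["L3"], reverse=True)
print(ranked[:5])
```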
Sim Index. This index combines the Jaccard index and the L3 index [21]; in its definition, A is the adjacency matrix and J is the Jaccard similarity matrix.
Katz Index. The Katz index is based on a weighted sum over the collection of all paths connecting nodes i and j [18]:
s_ij = Σ_{l=1}^{∞} β^l (A^l)_ij,
where β is a damping factor that gives shorter paths more weight, and A is the adjacency matrix of the network. The N × N similarity matrix S = (s_ij) can be written in compact form as [22]:
S = (I − βA)^{-1} − I,
where I is the identity matrix. The damping factor β is a free parameter and must be smaller than the reciprocal of the absolute value of the largest eigenvalue λ_max of A.
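A minimal sketch of the closed-form Katz similarity matrix; the example network and the choice of β are illustrative.

```python
# Sketch: Katz similarity matrix S = (I - beta*A)^{-1} - I,
# with beta chosen below 1/|lambda_max| as required above.
import numpy as np
import networkx as nx

G = nx.karate_club_graph()                       # placeholder network
A = nx.to_numpy_array(G)
lambda_max = np.max(np.abs(np.linalg.eigvals(A)))
beta = 0.5 / lambda_max                          # damping factor < 1/|lambda_max|

I = np.eye(A.shape[0])
S = np.linalg.inv(I - beta * A) - I              # Katz similarity matrix

# the score of a non-observed pair (i, j) is simply S[i, j]
print(S[0, 33])
```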
Structural Perturbation Method. This method assumes that a group of links is predictable if removing them has only a small effect on the eigenvalues of the network's adjacency matrix, which serve as proxies for the network's more general structural features [23].
This method proceeds by dividing an existing network A into a "training" subset A^R, which is used to calculate the eigenvalues and eigenvectors of the remaining network,
A^R_ij = Σ_{k=1}^{N} λ_k x_ik x_kj,
and a "validation" subset ∆E of size L, which is used to characterize the predictability [23]. The perturbation ∆E is projected onto the eigenvectors,
∆λ_k = (Σ_{l,m} x_kl ∆E_lm x_mk) / (Σ_l x_kl x_lk),
where ∆λ_k is the shift in eigenvalue k that best approximates ∆E. The best-fit adjustments are added to the eigenvalues and multiplied by the eigenvectors to obtain a perturbed adjacency matrix:
Ã_ij = Σ_{k=1}^{N} (λ_k + ∆λ_k) x_ik x_kj.
The L largest elements of Ã_ij among the non-observed pairs are selected, resulting in a set of candidate edges E^L; the structural perturbation score of a non-observed pair (i, j) is given by the corresponding entry Ã_ij. When applying the structural perturbation method to a given network, a byproduct is the so-called structural consistency index,
σ_c = |E^L ∩ ∆E| / L,
which can be used to quantify the predictability of the network.
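A minimal sketch of the structural perturbation step, assuming a symmetric 0/1 adjacency matrix; the toy network and the fraction of perturbed edges are illustrative, not the benchmark setup.

```python
# Sketch: perturbed adjacency matrix from a training network A_R and a
# perturbation set Delta_E, both given as symmetric 0/1 matrices.
import numpy as np


def spm_scores(A_R, Delta_E):
    """Return the perturbed matrix A_tilde used to rank candidate links."""
    eigvals, eigvecs = np.linalg.eigh(A_R)          # columns are eigenvectors x_k
    # shift of each eigenvalue that best approximates Delta_E
    delta_lambda = np.array([
        eigvecs[:, k] @ Delta_E @ eigvecs[:, k] / (eigvecs[:, k] @ eigvecs[:, k])
        for k in range(A_R.shape[0])
    ])
    # A_tilde = sum_k (lambda_k + delta_lambda_k) x_k x_k^T
    return (eigvecs * (eigvals + delta_lambda)) @ eigvecs.T


rng = np.random.default_rng(0)
A = (rng.random((50, 50)) < 0.1).astype(float)
A = np.triu(A, 1); A = A + A.T                      # toy symmetric network
edges = np.argwhere(np.triu(A, 1) > 0)
perturb = edges[rng.random(len(edges)) < 0.1]       # ~10% of edges into Delta_E
Delta_E = np.zeros_like(A)
Delta_E[perturb[:, 0], perturb[:, 1]] = Delta_E[perturb[:, 1], perturb[:, 0]] = 1
A_R = A - Delta_E

A_tilde = spm_scores(A_R, Delta_E)
# candidate links are then ranked by their entries in A_tilde (largest first)
```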
MPS. This method is based on three measures: MaxSimScore_Topological, PAscore, and MaxSimScore_Biological [8,21,24,25]. More formally, given two proteins u and v, their Jaccard index J(u, v) is defined as
J(u, v) = |Γ(u) ∩ Γ(v)| / |Γ(u) ∪ Γ(v)|,
where Γ(u) is the set of u's neighbors in the PPI network. The Jaccard index can be used to identify proteins with similar interfaces, and we can leverage this information to score potential protein interactions.
To better understand how the Jaccard index can be useful in scoring potentially interacting pairs, consider the following scenario: assume we have a protein pair (u, v) for which we want to assess the likelihood of interaction. If u is similar to the neighbors of v (i.e., they largely share the same interaction partners) and v is similar to the neighbors of u, then it is more likely for u and v to share complementary binding sites. Thus, they are more likely to interact.
More formally, given proteins u and v, we define their interaction likelihood, MaxSimScore_Topological(u, v), in terms of the Jaccard similarities between u and the neighbors Γ(v) of v and between v and the neighbors Γ(u) of u. The proposed measure is a modification of the Sim method proposed by Chen et al. [21], in which the sums are replaced by max functions. PAscore is defined as the product of the degrees of a protein pair (u, v), i.e., PAscore = k_u × k_v.
Assuming that proteins that are biologically similar to other interacting proteins are more likely to interact, we designed a framework that leverages proteins' primary sequences and prior knowledge of interacting proteins to identify new candidate interactions. This framework is heavily inspired by the PIPE algorithm [24], a well-known protein interaction prediction engine that leverages recurring short polypeptide sequences between known interacting protein pairs to score candidate pairs. In other words, given a candidate pair (u, v) and a known interacting pair (a, b), if multiple regions of u's primary structure are similar to analogous regions of a and, likewise, regions of v's primary structure resemble regions of b, then u and v are more likely to be an interacting protein pair.
Starting from PIPE [24], we designed a simplified framework in which we do not directly consider all co-occurrences of two subsequence pairs (u_i, v_j), but instead use them to define a protein similarity measure. In more detail, we define the biological similarity of two proteins u and a as J(s_u, s_a), the Jaccard index between the sets of k-mers of the primary structures of proteins u and a. Given a pair of non-interacting proteins (u, v) and the knowledge graph of interacting proteins G(V, E), the algorithm computes the likelihood of the interaction through the following steps:
• Step 1: generate A, the set of proteins that are biologically similar to u.
• Step 2: generate R, the set of proteins that interact with at least one protein in A.
• Step 3: generate B, the set of proteins that are biologically similar to v, defined in the same way as A.
• Step 4: compute the likelihood of the interaction (u, v) as the size of the intersection between R and B, i.e., DirSimScore_Biological(u, v) = |R ∩ B|.
• Step 5: compute DirSimScore_Biological(v, u) by executing steps 1 through 4 again, this time with u and v swapped.
• Step 6: return the combined score, MaxSimScore_Biological(u, v), as a measure of the interaction likelihood of proteins u and v.
For each protein u, our framework pre-computes the set A with a computational cost of O(P^2 L), where P is the number of proteins and L is the average length of a protein's primary sequence. This way, steps 1 and 2 have constant time complexity. Finally, it finds R and computes the size of the intersection in O(P). Thus, the overall time complexity is O(P^2 L), which is much smaller than the O(P^3 L^2) complexity of PIPE. A simplified sketch of these steps is given below.
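A minimal sketch of Steps 1–6, assuming an illustrative k-mer length and a biological-similarity threshold for building the sets A and B (the original definitions of these sets are not reproduced here), and combining the two directed scores with a max (also an assumption):

```python
# Sketch of the sequence-based scoring steps; parameter values are illustrative.
K = 3            # k-mer length (assumption)
THRESHOLD = 0.1  # biological-similarity cutoff for building A and B (assumption)


def kmers(seq, k=K):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bio_similarity(seq_u, seq_a):
    """Jaccard index between the k-mer sets of two primary sequences."""
    su, sa = kmers(seq_u), kmers(seq_a)
    return len(su & sa) / len(su | sa) if su | sa else 0.0


def dir_sim_score(u, v, sequences, interactions):
    """Steps 1-4: |R ∩ B| for an ordered pair (u, v)."""
    # Step 1: proteins biologically similar to u
    A = {p for p in sequences if p != u
         and bio_similarity(sequences[u], sequences[p]) >= THRESHOLD}
    # Step 2: proteins interacting with at least one protein in A
    R = {q for (p, q) in interactions if p in A} | \
        {p for (p, q) in interactions if q in A}
    # Step 3: proteins biologically similar to v
    B = {p for p in sequences if p != v
         and bio_similarity(sequences[v], sequences[p]) >= THRESHOLD}
    # Step 4: size of the intersection
    return len(R & B)


def bio_score(u, v, sequences, interactions):
    """Steps 5-6: symmetrize the two directed scores (max is an assumption)."""
    return max(dir_sim_score(u, v, sequences, interactions),
               dir_sim_score(v, u, sequences, interactions))


# toy data: primary sequences and one known interaction
sequences = {"P1": "MKTAYIAKQR", "P2": "MKTAYLAKQQ", "P3": "GAVLIPFYW", "P4": "MKTAYIAKQN"}
interactions = {("P2", "P3")}
print(bio_score("P1", "P4", sequences, interactions))
```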
Finally, we create a framework that leverages both biological and topological information by aggregating the topological and biological scores of a candidate pair (u, v) via a convex combination. More formally, we define the final likelihood of a candidate pair of proteins (u, v) as
score(u, v) = β · score_Topological(u, v) + (1 − β) · score_Biological(u, v),
where β ∈ [0, 1], and both scores are suitably normalized before the sum is computed.
RNM. This method is an extension of the L3 method, which has previously been shown to be effective in predicting PPIs [20,26]. The L3 method has no free parameters and thus makes predictions irrespective of the amount of incompleteness in the data. In practice, we need fewer predictions when the measured network is almost complete and more when the dataset is missing most connections. We expect that a method with at least one adjustable parameter can account for the amount of incompleteness and should be able to outperform the standard L3 method.
Our starting point is an edge prediction method of the form S = A X A, where A is the n × n adjacency matrix of the observed PPI network, X is an n × n matrix to be determined, and n is the number of proteins in the screen. In the case of L3, X is the degree-normalized adjacency matrix, (X_L3)_uv = a_uv / √(k_u k_v), so that S reproduces the L3 score defined above. Not knowing a priori which method is expected to perform best, we choose to use an average of the predictions of the three methods. For the internal cross-validation, we define r_ij to be the normalized rank of the i-th interaction predicted by the j-th method, with the highest score being r_ij = 1. We then use r̄_i = Σ_j r_ij / 3 to rank the predictions and evaluate against the test set in terms of the required statistics.
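A minimal sketch of the normalized-rank averaging used in the internal cross-validation; the three score vectors below are toy placeholders for the three methods' outputs.

```python
# Sketch: convert each method's scores to normalized ranks (best = 1),
# then average the ranks across methods.
import numpy as np


def normalized_ranks(scores):
    """Map a 1-D array of scores to normalized ranks in (0, 1], best score -> 1."""
    order = np.argsort(np.argsort(scores))          # 0 = lowest score
    return (order + 1) / len(scores)


# toy scores from three hypothetical predictors for the same candidate pairs
scores_by_method = np.array([
    [0.9, 0.1, 0.5, 0.3],
    [0.8, 0.2, 0.6, 0.1],
    [0.7, 0.3, 0.4, 0.2],
])
r = np.array([normalized_ranks(s) for s in scores_by_method])
r_bar = r.mean(axis=0)                              # average normalized rank per pair
print(np.argsort(-r_bar))                           # candidate pairs, best first
```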
For the external validation, predictions are made with all three methods for each of the three assays in the human interactome. Within each assay, the predictions are merged and averaged as described in the preceding paragraph, except that the ranks are unnormalized. Since the assays cover largely disjoint sets of genes, the absence of a prediction for a particular pair in one assay can result either from that assay's failure to cover the pair in question or from a genuine absence of interaction. To merge the predictions of the three assays, we define the rank to be r̄_ik, where k is an index over assays. We take the top 10,000 predicted ranks for each assay and assign r̄_ik = 0 if the i-th pair is queried but missing from these entries in the k-th assay, and r̄_ik = min(R_0, min_{l≠k} r̄_il) with R_0 = 2,500 otherwise. The final rankings are then r̄_i = Σ_k r̄_ik / 3. This treatment of the missing entries prioritizes pairs that appear in multiple assays, as these pairs are likely more robust than the top predictions in each assay, without allowing the absence of a pair in a given assay to increase its average rank.
Other Similarity-Based Methods in the Literature. There are some additional traditional similarity-based methods in the literature, e.g., the Local Leicht-Holme-Newman Index [27], the Individual Attraction Index [28], the Mutual Information [29], the CAR-Based Indices [30], and the Global Leicht-Holme-Newman Index [27], which can also be used to tackle the link prediction problem (see surveys [31][32][33] for details).
Most of the similarity- and diffusion-based link prediction methods are available in the MATLAB implementation of SEAL [34]: github.com/muhanzhang/SEAL/tree/master/MATLAB.

B. Probabilistic methods
These methods are based on the concept of maximum likelihood, assuming that real networks have some structure, e.g., hierarchical or community structure. The goal of these algorithms is to select model parameters that maximize the likelihood of the observed structure.
Stochastic Block Model. As one of the most general network models, the stochastic block model (SBM) assumes that nodes are partitioned into groups and that the probability that two nodes are connected depends solely on the groups to which they belong [35,36]. The SBM assumes that a link with higher reliability is more likely to exist, and the reliability of a link is defined as [37]:
R_ij = (1/Z) Σ_{P ∈ 𝒫} [(l^O_{σ_i σ_j} + 1) / (r_{σ_i σ_j} + 2)] exp[−H(P)],
where 𝒫 represents the space of all possible partitions, σ_i is the group to which node i belongs in partition P, l^O_{σ_i σ_j} is the number of links between groups σ_i and σ_j in the observed network, r_{σ_i σ_j} is the maximum possible number of links between them, Z is a normalization constant, and the function H(P) ≡ Σ_{α≤β} [ln(r_αβ + 1) + ln C(r_αβ, l^O_αβ)], with C(n, m) the binomial coefficient.
Hierarchical Structure Model. Many real networks have a hierarchical structure, which can be represented by a dendrogram D. One can assign a probability p_r to each internal node r of D, with the connecting probability of a pair of leaves given by p_r, where r is the lowest common ancestor of these two leaves. Denote by E_r the number of edges in the network whose endpoints have r as their lowest common ancestor in the dendrogram D, and let L_r and R_r be the numbers of leaves in the left and right subtrees rooted at r, respectively.
The likelihood of D associated with a set of probabilities {p_r} is then given by [38]:
L(D, {p_r}) = Π_r p_r^{E_r} (1 − p_r)^{L_r R_r − E_r}.
For a specific D, the probabilities {p_r} that maximize L(D, {p_r}) are p̄_r = E_r / (L_r R_r), the fraction of possible edges between the two subtrees of r that are present in the network. Evaluating the likelihood L(D, {p_r}) at this maximum yields
L(D) = Π_r [ p̄_r^{p̄_r} (1 − p̄_r)^{1 − p̄_r} ]^{L_r R_r}.
RepGSP. This method is based on the Graph Signal Processing (GSP) technique [39].
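A minimal sketch of evaluating the dendrogram likelihood above at the maximum-likelihood p_r, assuming the internal nodes r are supplied with their counts (E_r, L_r, R_r); the toy dendrogram is illustrative.

```python
# Sketch: log-likelihood of a dendrogram at the maximum-likelihood p_r.
import math


def log_likelihood(internal_nodes):
    """log L(D, {p_r}) with p_r = E_r / (L_r * R_r) for each internal node."""
    logL = 0.0
    for E_r, L_r, R_r in internal_nodes:
        n_pairs = L_r * R_r
        p_r = E_r / n_pairs
        if 0.0 < p_r < 1.0:                 # terms with p_r in {0, 1} contribute 0
            logL += n_pairs * (p_r * math.log(p_r) + (1 - p_r) * math.log(1 - p_r))
    return logL


# toy dendrogram with two internal nodes, each given as (E_r, L_r, R_r)
print(log_likelihood([(3, 2, 3), (1, 1, 4)]))
```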
The method builds a Multiscale Markov Random Field (MRF) model of the interactome.
An MRF is a Markovian technique that models the network's edges based on features associated with the nodes [40]. In order to capture latent information about the network, signals on graphs (SoGs) are exploited. The SoGs are designed to capture the topological patterns of the network by resorting to a Markovian model that takes into account both the paths of length three between two nodes (i.e., proteins) and the community structure of the graph. The method is completely unsupervised and is based only on the structural information of the network. In particular, it weights the attractive or repulsive behaviour of graph nodes belonging to the same community [41]: herein, RepGSP rewards links between "repulsive nodes" (i.e., nodes belonging to different communities).
Other Probabilistic Models in the Literature. There are also many probabilistic model based link prediction methods (which typically have high time complexity [42]), such as the Probabilistic Relational Model [43], the Probabilistic Entity Relationship Model [44], and the Stochastic Relational Model [45], as well as many other methods.

C. Factorization-based methods
These methods posit that the observed network structure can be generated from a lower-dimensional latent space, and they use matrix factorization or completion techniques to embed the original network into a lower-dimensional space such that nodes that are similar in the original network tend to have similar latent representations.
Matrix Factorization. Matrix factorization aims to decompose the observed adjacency matrix or node attribute data into two or more matrices through supervised or unsupervised approaches. Denote by X a data matrix with p rows and n columns, where each column represents a sample and each row represents a particular feature. The data matrix can be factorized as [32,46]:
X ≈ F G^T,
where X ∈ R^{p×n}, F ∈ R^{p×k}, and G ∈ R^{n×k}. Matrix F represents the bases of the latent space, matrix G contains the combination coefficients of the bases, and k is a parameter that specifies the dimension of the latent space (k < n). Matrix factorization methods can be categorized by the sign constraints on the above three matrices, which are denoted by a subscript. For example, singular value decomposition (SVD): X_± ≈ F_± G_±^T; non-negative matrix factorization (NMF): X_+ ≈ F_+ G_+^T. Solving the factorization problem can be modeled as a constrained and potentially regularized least-squares optimization problem.
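A minimal sketch, assuming non-negative matrix factorization of the observed adjacency matrix with scikit-learn and scoring non-observed pairs by the reconstructed entries of F G^T; the network and latent dimension are placeholders.

```python
# Sketch: NMF of the adjacency matrix and link scoring by reconstruction.
import numpy as np
import networkx as nx
from sklearn.decomposition import NMF

G = nx.karate_club_graph()                 # placeholder network
A = nx.to_numpy_array(G)

k = 16                                     # latent dimension (k < n)
nmf = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
F = nmf.fit_transform(A)                   # n x k
Gmat = nmf.components_.T                   # n x k
A_hat = F @ Gmat.T                         # reconstructed adjacency matrix

# score every non-observed pair by its reconstructed value
n = len(A)
candidates = [(i, j, A_hat[i, j]) for i in range(n) for j in range(i + 1, n)
              if A[i, j] == 0]
candidates.sort(key=lambda t: -t[2])
print(candidates[:5])
```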
Low-Rank Matrix Completion. The goal of matrix completion is to recover a low-rank matrix L from a large matrix A, which can be used to infer the missing links of a network. The matrix L can be calculated by solving the convex optimization problem [47]:
min_{L,S} ||L||_* + λ |S|_1   subject to   A = L + S.
Here, S is a sparse matrix, ||·||_* denotes the nuclear norm (sum of singular values) of a matrix, |·|_1 represents the sum of the absolute values of the matrix elements, and λ is a positive parameter. The score matrix is then defined as follows: if two nodes (i, j) are connected in the observed network, the score is S_ij = A_ij; otherwise, S_ij = L_ij.
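A heavily simplified, illustrative sketch of the low-rank plus sparse decomposition A ≈ L + S via alternating singular-value and entrywise soft-thresholding; this heuristic is not the convex solver used in Ref. [47], and the thresholds are placeholders.

```python
# Sketch: alternating proximal steps for a low-rank plus sparse split of A.
import numpy as np


def soft_threshold(X, tau):
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)


def low_rank_completion(A, lam=0.1, tau=1.0, n_iter=100):
    L = np.zeros_like(A)
    S = np.zeros_like(A)
    for _ in range(n_iter):
        # L-update: singular-value thresholding of A - S (nuclear-norm proximal step)
        U, sigma, Vt = np.linalg.svd(A - S, full_matrices=False)
        L = U @ np.diag(soft_threshold(sigma, tau)) @ Vt
        # S-update: entrywise soft-thresholding of A - L (L1 proximal step)
        S = soft_threshold(A - L, lam * tau)
    return L, S


rng = np.random.default_rng(0)
A = (rng.random((40, 40)) < 0.15).astype(float)
A = np.triu(A, 1); A = A + A.T                      # toy symmetric network
L, S = low_rank_completion(A)
# non-observed pairs (A_ij = 0) can then be ranked by the recovered entries L_ij
```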

GLEE. The Geometric Laplacian Eigenmap Embedding (GLEE) method uses the Laplacian matrix to find the embedding using (simplex) geometric properties, rather than spectral properties, by leveraging the simplex geometry of the network [48]. The Geometric Laplacian Eigenmap Embedding is available at: github.com/leotrs/glee. We used the default parameters in GLEE, and the dimensionality is set to d = 128.

D. Diffusion-based methods
For a given network, a diffusion process captures the latent structure in the spread of information over the connections. For example, random walks propagate information by sampling network paths following a stochastic process. Diffusion techniques are used in a variety of settings and are integrated with similarity measures or deep learning methods like the Skip-Gram model [49]. Here, we briefly introduce some widely used diffusion-based algorithms.
Average Commute Time. The average commute time index is motivated by a random walk process on the network [7]:
s_ij = 1 / (l^+_ii + l^+_jj − 2 l^+_ij),
where l^+_ij is the (i, j) entry of the pseudoinverse of the Laplacian matrix L ≡ D − A, D = diag{k_1, k_2, ..., k_N} is the degree matrix, and A is the adjacency matrix.
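A minimal sketch of the average-commute-time similarity using the Laplacian pseudoinverse defined above; the example network is a placeholder.

```python
# Sketch: similarity from the pseudoinverse of the graph Laplacian,
# s_ij = 1 / (l+_ii + l+_jj - 2 l+_ij); smaller commute times give larger scores.
import numpy as np
import networkx as nx

G = nx.karate_club_graph()                   # placeholder network
A = nx.to_numpy_array(G)
D = np.diag(A.sum(axis=1))
L_plus = np.linalg.pinv(D - A)               # pseudoinverse of the Laplacian


def act_similarity(i, j):
    return 1.0 / (L_plus[i, i] + L_plus[j, j] - 2 * L_plus[i, j])


print(act_similarity(0, 33))
```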
RWR. This method involves a similarity measure based on random walks with restart, i.e., random walks in which the walker returns to the starting node with a fixed restart probability at each step [50].
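A minimal sketch of a random walk with restart from a seed node, assuming a column-stochastic transition matrix; the restart probability and tolerance are illustrative choices.

```python
# Sketch: power iteration for the random-walk-with-restart distribution.
import numpy as np
import networkx as nx

G = nx.karate_club_graph()                      # placeholder network
A = nx.to_numpy_array(G)
P = A / A.sum(axis=0, keepdims=True)            # column-normalized transition matrix


def rwr(seed, restart=0.3, tol=1e-8, max_iter=1000):
    n = A.shape[0]
    e = np.zeros(n); e[seed] = 1.0               # restart distribution
    q = e.copy()
    for _ in range(max_iter):
        q_new = (1 - restart) * P @ q + restart * e
        if np.abs(q_new - q).sum() < tol:
            break
        q = q_new
    return q


# a pair (i, j) can be scored, e.g., as q_i(j) + q_j(i)
q0 = rwr(0)
print(q0[33])
```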
SimRank. This method involves a network propagation of node structural-context similarity based on object-to-object relationships [51]. SimRank measures the similarity of the structural context in which objects occur, based on their relationships with other objects.
DeepWalk. DeepWalk uses local information obtained from truncated random walks to learn latent representations by treating walks as the equivalent of sentences; it consists of two main components: random walk sampling and embedding learning [52]. Once the random walk sequences are sampled, they are fed to a Skip-Gram neural network to learn node embeddings [53].
LINE. LINE is a network embedding method with a carefully designed objective function that preserves both the first-order and second-order proximities and that scales to large directed, undirected, and weighted networks [54].
Node2vec. node2vec learns a mapping of nodes to a low-dimensional space of features that maximizes the likelihood of preserving network neighborhoods of nodes. By choosing an appropriate notion of a neighborhood, node2vec can learn representations that organize nodes based on their network roles and/or the communities to which they belong [55].
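A minimal usage sketch with the node2vec package referenced later in this section (github.com/eliorc/node2vec); the parameter values are illustrative and this is not the benchmark configuration.

```python
# Sketch: node2vec embeddings on a toy network, then pairwise cosine similarity.
import networkx as nx
from node2vec import Node2Vec

G = nx.karate_club_graph()                        # placeholder network
n2v = Node2Vec(G, dimensions=64, walk_length=30, num_walks=100, workers=2)
model = n2v.fit(window=10, min_count=1)           # Skip-Gram training (gensim Word2Vec)

# candidate pairs can then be scored, e.g., by the cosine similarity of embeddings;
# this package stores node identifiers as strings
print(model.wv.similarity("0", "33"))
```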
RW2. This method is divided into two steps: representation learning and classification. In the representation learning step, RW2 is applied to the interactome to generate node embeddings. Since RW2 needs categorical features associated with nodes, each node is enriched with the following labels: the node ID and biological properties expressed in terms of GO terms (biological functions, biological processes, and cellular locations). In the classification step, an XGBoost classifier is trained on the embeddings generated by RW2. The method is supervised and is based on the structural and biological information in the network (Level-2).
DNN+node2vec. This method consists of two steps. First, it employs node2vec to learn node embeddings [55]. Next, the node embeddings are fed to a deep neural network for PPI prediction. The deep neural network is invariant to the order of the two nodes in its input. The code of DNN+node2vec is available at: https://github.com/spxuw/PPI-Prediction-Project.

E. Machine learning methods
Machine learning (ML) is a growing field based on learning techniques to extract regular patterns from data. Deep learning (DL) is a sub-field of machine learning composed of multi-layered neural network methods. Here, we introduce classic machine learning methods and deep neural network architectures.
Coding Proteins and K-Nearest Neighbors (Code4+KNN). This method first represents each protein sequence as a vector by using local protein sequence descriptors, and then concatenates the information of the two proteins in the pair into a new pairwise feature vector using four "coding" functions. Finally, a KNN method is trained using the pairwise feature vectors as input [56].
Auto Covariance and Support Vector Machine (AC+SVM). This is a sequence-based method that combines the auto covariance (AC) measure with the support vector machine (SVM) classifier. AC accounts for interactions between residues a certain distance apart in the sequence, thereby capturing local interactions between amino acids [57].
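A minimal sketch of the auto covariance encoding followed by an SVM classifier; the single per-residue property table, the lag, and the toy training pairs are illustrative placeholders rather than the descriptors and data used in Ref. [57].

```python
# Sketch: AC features from two sequences, concatenated and fed to an SVM.
import numpy as np
from sklearn.svm import SVC

# hypothetical per-residue property values (illustrative only)
PROP = {aa: v for aa, v in zip("ACDEFGHIKLMNPQRSTVWY", np.linspace(-1, 1, 20))}
LAG = 5  # maximum lag (illustrative)


def auto_covariance(seq, lag=LAG):
    """Auto covariance of the (centered) property signal at lags 1..lag."""
    x = np.array([PROP[a] for a in seq if a in PROP])
    x = x - x.mean()
    n = len(x)
    return np.array([np.dot(x[:n - d], x[d:]) / (n - d) for d in range(1, lag + 1)])


def pair_features(seq_u, seq_v):
    return np.concatenate([auto_covariance(seq_u), auto_covariance(seq_v)])


# toy training data: (sequence_u, sequence_v, label)
pairs = [("MKTAYIAKQRQISFVKSHFSRQ", "GAVLIMCFYWHKRQNEDST", 1),
         ("MKTAYIAKQRQISFVKSHFSRQ", "PPPPGGGGSSSSAAAA", 0),
         ("GAVLIMCFYWHKRQNEDST", "PPPPGGGGSSSSAAAA", 0),
         ("MKTAYLAKQQQISFVKSHFSRQ", "GAVLIMCFYWHKRQNEDSA", 1)]
X = np.array([pair_features(u, v) for u, v, _ in pairs])
y = np.array([label for _, _, label in pairs])

clf = SVC(kernel="rbf").fit(X, y)
print(clf.decision_function(X))   # larger values indicate predicted interaction
```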
Local Protein Sequence Descriptors and Support Vector Machine. This approach combines local protein sequence descriptors with the SVM classifier. Local descriptors account for the interactions between sequentially distant but spatially close amino acid residues in an attempt to capture multiple overlapping continuous and discontinuous binding patterns within a protein sequence [58].

Multi-Scale Continuous and Discontinuous Feature Representation and Support Vector Machine (MCD+SVM). This approach uses a multi-scale continuous and discontinuous (MCD) feature representation together with a support vector machine (SVM). The MCD representation gives adequate consideration to the interactions between sequentially distant but spatially close amino acid residues; thus, it can sufficiently capture multiple overlapping continuous and discontinuous binding patterns within a protein sequence. An effective feature selection method, mRMR, was employed to construct an optimized and more discriminative feature set by excluding redundant features. Finally, a prediction model is trained and tested using the SVM classifier to predict the interaction probability of protein pairs [59].
Deep-Forest (GcForest). This is a deep-forest-based method for PPI prediction.
PCA-EELM. This method involves a hierarchical principal component analysis–ensemble extreme learning machine (PCA-EELM) model to predict protein–protein interactions using the information contained in protein primary amino acid sequences. In the proposed method, 11,188 PPIs retrieved from the DIP database were encoded into feature vectors by using four kinds of protein sequence descriptors. Next, PCA was employed to construct the most discriminative new feature set. Finally, multiple extreme learning machines were trained and then aggregated into a consensus classifier by majority voting [62].
DeepPPI. This method, called DeepPPI (Deep neural networks for Protein-Protein Interactions prediction), employs deep neural networks to learn effectively the representations of proteins from common protein descriptors [63].
iPPI. iPPI first analyzes the amino acid properties of protein sequences closely related to PPIs and the influence of an imbalanced training set on prediction accuracy. Next, it encodes the primary sequence with the properties of the 20 amino acids and employs a random undersampling strategy to address the class imbalance. In the training phase, it builds a hybrid deep neural network to model the encoded sequence profiles of the bootstrapping-based training datasets [64].
DPPI. This method applies a deep, Siamese (twin-like) convolutional neural network combined with random projection and data augmentation to predict PPIs, leveraging existing high-quality experimental PPI data and the evolutionary information of the protein pair under prediction [65]. This method does not rely on any domain-specific heuristics and works for general undirected or directed complex networks [66].
cGAN. This method, as described by Balogh et al. [67], consists of training a conditional Generative Adversarial Network (cGAN) that uses a Wasserstein distance-based loss with gradient penalty [68][69][70]. The method is divided into two steps: representation learning and adversarial learning. In the representation learning step, the authors generate node embeddings by applying node2vec [55] to the interactome. Next, in the adversarial learning step, they train the cGAN by feeding the node embeddings and the adjacency matrices of the subgraphs (generated starting from each node of the interactome) to the cGAN's generator.
Meanwhile, they feed the same matrices given to the generator, along with the confidence matrices produced by the generator, to the cGAN's discriminator, which serves as a critic scoring the performance of the generator. The cGAN architecture does not perform classification directly for the purpose of link prediction; rather, it provides feedback for the training of the cGAN's generator. Thus, the generator (an unsupervised method) can be trained using the loss of the discriminator (a supervised method). A further refined version, titled cGAN2, was also evaluated; it uses only the adjacency-based topological information, without the node embedding part. The code of cGAN used to generate the presented results is available at: https://github.com/spxuw/PPI-Prediction-Project, and a continuously updated version is available at: https://github.com/semmelweis-pharmacology/ppi_pred.
GraphSAGE. GraphSAGE is a popular graph representation learning method that uses node feature information to generate node embeddings for previously unseen data. Unlike most embedding methods, which train individual embeddings for each node and require all nodes in the graph to be present during training, GraphSAGE learns a function that generates embeddings by sampling and aggregating features from a node's local neighborhood [71].
SEAL. SEAL is a representative GCN-based link prediction method, which combines graph structural features, latent features, and explicit features into a single GCN [34]. In particular, the input to the GCN is a local subgraph around each target link; these local subgraphs capture graph structural features related to link existence. The latent and explicit features can be naturally combined in the GCN by concatenating node embeddings and node attributes in the node information matrix of each subgraph. We used the code found at: github.com/muhanzhang/SEAL/tree/master/Python. The parameters used for SEAL are a number of hops h = 1 and a maximum of 50 nodes per hop; the maximum number of training links is 10,000 for HuRI to reduce the memory requirement, while the default parameters are used for the other four organism-specific interactomes. The embedding features of each node were obtained by node2vec (https://github.com/eliorc/node2vec) with default parameters.
The embedding dimensionality is d = 4 for HuRI and d = 128 for the other four interactomes. The node attributes of each interactome are obtained with the approach proposed in Ref. [61], which developed a method, called FCTP-WSRC, to predict PPIs. Here, we used the first 30 dimensions of the PCA analysis as the node attributes of each protein. The sequences were downloaded from UniProt [72].
CGNN. CGNN uses a variational auto-encoder procedure designed to obtain node embeddings using neural networks [73]. The observed variable A, the evidence features X, and the latent variable Z are connected by the joint distribution representing the probability of A and X given Z. A neural network is used to learn this joint probability via MCMC in a supervised manner.
SkipGNN. SkipGNN is a graph neural network approach for the prediction of molecular interactions [74]. SkipGNN predicts molecular interactions by aggregating information not only from direct interactions but also from second-order interactions, which the authors term skip similarity.
[Figure caption fragments retained from the original: Precision = Positive count / (Positive count + Negative count). Synthetic interactomes of size 5,000 were generated with the duplication-mutation-complementation model [77] under a tunable divergence parameter, with error bars giving the standard deviation over ten realizations of HuRI. Stacking ranking (Dowdall) denotes ranking aggregation with the Dowdall method [79]; the performance of stacking topology on the BioGRID database was not evaluated (marked N/A) due to prohibitive computational cost. Functional modules were discovered by SAFE [82] with a neighborhood radius of 0.15, with Gene Ontology (GO) [83] terms for each gene extracted from FuncAssociate [84]; the PPIs in the first and second modules are mostly predicted by MPS(B&T) and MPS(T), while the remaining PPIs are predicted by the other methods.]