Structural measures of similarity and complementarity in complex networks

The principle of similarity, or homophily, is often used to explain patterns observed in complex networks such as transitivity and the abundance of triangles (3-cycles). However, many phenomena from division of labor to protein-protein interactions (PPI) are driven by complementarity (differences and synergy). Here we show that the principle of complementarity is linked to the abundance of quadrangles (4-cycles) and dense bipartite-like subgraphs. We link both principles to their characteristic motifs and introduce two families of coefficients of: (1) structural similarity, which generalize local clustering and closure coefficients and capture the full spectrum of similarity-driven structures; (2) structural complementarity, defined analogously but based on quadrangles instead of triangles. Using multiple social and biological networks, we demonstrate that the coefficients capture structural properties related to meaningful domain-specific phenomena. We show that they allow distinguishing between different kinds of social relations as well as measuring an increasing structural diversity of PPI networks across the tree of life. Our results indicate that some types of relations are better explained by complementarity than homophily, and may be useful for improving existing link prediction methods. We also introduce a Python package implementing efficient algorithms for calculating the proposed coefficients.

Note that Eq. (5) in the Main Text implies that $\left(t^W_{ij} + t^H_{ij}\right) s_{ij} = 2T_{ij}$. Moreover, since each triangle including $i$ is shared with two other neighbors, we have that:

$$\sum_{j \in N_1(i)} T_{ij} = 2T_i \qquad (S2)$$

On the other hand, $t^W_{ij} + t^H_{ij}$ is the number of 2-paths traversing the $(i, j)$ edge, so it can be written as $t^W_{ij} + t^H_{ij} = d_i + d_j - 2$. Hence, it is easy to see that:

$$\sum_{j \in N_1(i)} \left(t^W_{ij} + t^H_{ij}\right) = \sum_{j \in N_1(i)} \left(d_i + d_j - 2\right) = t^W_i + t^H_i \qquad (S3)$$

Finally, substituting (S2) and (S3) into (S1) we confirm the desired equality. Now, we use a common definition of structural equivalence in terms of the Sørensen Index (normalized Hamming similarity) and note its direct connection to our notion of edgewise structural similarity $s_{ij}$:

$$H_{ij} = \frac{2T_{ij}}{d_i + d_j} = s_{ij}\,\frac{d_i + d_j - 2}{d_i + d_j} \qquad (S4)$$

The above implies that $H_{ij} < s_{ij}$ for all $(i, j)$ edges for which $s_{ij}$ is defined. And since we established that $s_i$ is a weighted average of the $s_{ij}$'s with $j \in N_1(i)$, we have that:

$$\min_{j \in N_1(i)} s_{ij} \le s_i \le \max_{j \in N_1(i)} s_{ij} \qquad (S5)$$

Note that for large values of $d_i + d_j$ the above is approximately equivalent to:

$$\min_{j \in N_1(i)} H_{ij} \lesssim s_i \lesssim \max_{j \in N_1(i)} H_{ij} \qquad (S6)$$

In other words, we showed that the similarity coefficient of a node $i$ is approximately bounded between the minimum and maximum structural equivalence (Sørensen Index) between itself and any of its neighbors.
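These bounds are easy to check numerically. Below is a minimal pure-Python sketch (the toy graph and helper names are ours, not from the paper) that computes the edgewise similarity $s_{ij} = 2T_{ij}/(d_i + d_j - 2)$, the node coefficient $s_i$ as the weighted average of the $s_{ij}$'s, and, assuming the standard Sørensen index of the two neighborhoods, $H_{ij} = 2T_{ij}/(d_i + d_j)$, and verifies the inequalities:

```python
# Toy verification of H_ij <= s_ij and of the weighted-average bounds on s_i.
# The graph below and all helper names are illustrative.
adj = {
    0: {1, 2, 3},
    1: {0, 2, 4},
    2: {0, 1, 3},
    3: {0, 2, 4},
    4: {1, 3},
}

def T(i, j):
    """Triangles containing the (i, j) edge = common neighbors of i and j."""
    return len(adj[i] & adj[j])

def s_edge(i, j):
    """Edgewise similarity s_ij = 2*T_ij / (d_i + d_j - 2)."""
    return 2 * T(i, j) / (len(adj[i]) + len(adj[j]) - 2)

def H(i, j):
    """Sorensen index of the neighborhoods of i and j."""
    return 2 * T(i, j) / (len(adj[i]) + len(adj[j]))

def s_node(i):
    """Node coefficient: weighted average of s_ij with weights d_i + d_j - 2."""
    weights = [len(adj[i]) + len(adj[j]) - 2 for j in adj[i]]
    values = [s_edge(i, j) for j in adj[i]]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

for i in adj:
    for j in adj[i]:
        assert H(i, j) <= s_edge(i, j)                  # Sorensen never exceeds s_ij
    values = [s_edge(i, j) for j in adj[i]]
    assert min(values) <= s_node(i) <= max(values)      # weighted-average bounds
print("bounds verified")
```

On this toy graph, for example, $s_0 = 2/3$, which indeed falls between the minimum (1/3) and maximum (2/3) Sørensen index over node 0's edges.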

Complementarity and structural equivalence
Here we derive the relationship between the complementarity coefficients $c_{ij}$ and $c_i$ and structural equivalence. We start by showing that $c_i$ is a weighted average of the edgewise coefficients $c_{ij}$ for $j \in N_1(i)$, that is:

$$c_i = \frac{\sum_{j \in N_1(i)} \left(q^W_{ij} + q^H_{ij}\right) c_{ij}}{\sum_{j \in N_1(i)} \left(q^W_{ij} + q^H_{ij}\right)} \qquad (S7)$$

Using Eq. (11) from the Main Text we can write $\left(q^W_{ij} + q^H_{ij}\right) c_{ij} = 2Q_{ij}$. Moreover, each strong quadrangle including a node $i$ is shared with exactly two other neighbors. Hence, we have that:

$$\sum_{j \in N_1(i)} Q_{ij} = 2Q_i \qquad (S8)$$

Next, note that each 3-path starting at an $(i, j)$ edge defines a unique ordered quadruple of the form $(i, j, k, l)$ or $(j, i, k, l)$. The first form is counted as a head quadruple of the node $i$ and a wedge quadruple of the node $j$; in the second case the order is reversed. And since $q^W_{ij} + q^H_{ij}$ is the number of 3-paths starting at the $(i, j)$ edge, it must hold that:

$$\sum_{j \in N_1(i)} \left(q^W_{ij} + q^H_{ij}\right) = q^W_i + q^H_i \qquad (S9)$$

Finally, note that (S8) and (S9) jointly mean that (S7) must be true. As a result, for $j \in N_1(i)$ we have that:

$$\min_{j \in N_1(i)} c_{ij} \le c_i \le \max_{j \in N_1(i)} c_{ij} \qquad (S10)$$

Now, in order to derive the connection between complementarity coefficients and structural equivalence, we first need to introduce one additional quantity. For a connected triple $(k, i, j)$ we define the Asymmetric Excess Sørensen Index (S11), which measures how many of the connections of $k$ are also shared by $j$ while disregarding the edges $(i, k)$, $(i, j)$ and $(j, k)$. Next, we also need the notion of weak quadrangles, which allow for any number of chordal edges. Let $W_{ij} \ge Q_{ij}$ be the number of quadrangles with any number of chords incident to the $(i, j)$ edge. We also define the weak edgewise complementarity $h_{ij} = 2W_{ij}/\left(q^W_{ij} + q^H_{ij}\right) \ge c_{ij}$, for which (S12) is easy to see. On the other hand, the number of 3-paths starting at the $(i, j)$ edge is given by (S13), since $q^W_{ij}$ is the number of $(j, i, k, l)$ and $q^H_{ij}$ the number of $(i, j, k, l)$ quadruples; the second equality in (S13) comes from the fact that $n_{ij} = \sum_k a_{ik} a_{jk} = \sum_l a_{il} a_{jl}$.
Now, we can use (S11), (S12) and (S13) to rewrite the weak edgewise complementarity as (S14). As a result, for $k \in N_1(i) - \{j\}$ and $l \in N_1(j) - \{i\}$ we obtain (S15), and using (S10) we can write (S16). Finally, since by definition $c_{ij} \le h_{ij}$, this implies (S17) as well as (S18). In other words, we just showed that $c_i$ is bounded from above by the maximum Asymmetric Excess Sørensen Index between any two of its neighbors, or between itself and any neighbor of its neighbors. Moreover, in the weak case we also have a lower bound of the same nature. We leave a more detailed analysis of the notion of weak complementarity for future work.
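The distinction between strong and weak quadrangles can be made concrete with a short sketch. The toy graph and the function name below are illustrative: quadrangles(i, j) enumerates the 4-cycles passing through the (i, j) edge and reports both the chordless (strong) count and the chord-allowing (weak) count:

```python
# Strong vs weak quadrangles on an edge: a 4-cycle i-j-k-l-i is strong
# when it has no chords, i.e. neither (i, k) nor (j, l) is an edge.
# Graph and names are illustrative.
adj = {
    0: {1, 3, 4},
    1: {0, 2, 4},
    2: {1, 3},
    3: {0, 2},
    4: {0, 1},
}

def quadrangles(i, j):
    """Return (Q_ij, W_ij): chordless and chord-allowing 4-cycles on edge (i, j)."""
    strong = weak = 0
    for k in adj[j] - {i}:
        for l in adj[i] - {j}:
            if k != l and l in adj[k]:                   # closes the cycle i-j-k-l-i
                weak += 1
                if k not in adj[i] and l not in adj[j]:  # no chords: strong
                    strong += 1
    return strong, weak

assert quadrangles(0, 1) == (1, 1)   # the square 0-1-2-3 is chordless

# Adding the chord (0, 2) keeps the weak count but removes the strong one.
adj[0].add(2)
adj[2].add(0)
assert quadrangles(0, 1) == (0, 1)
print("W_ij >= Q_ij illustrated")
```

In the derivation above this corresponds to $W_{ij} \ge Q_{ij}$ and hence $h_{ij} \ge c_{ij}$.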

Formulas and algorithm
Edge-level counts of triples, quadruples, triangles and quadrangles are computed with Algorithm S1. Node and global counts can be obtained by aggregating edge counts. The rules of aggregation are summarized in Table S1. Table S2 presents detailed formulas for all structural coefficients expressed in terms of the aggregated counts. All global measures are equivalent.
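The sharing argument used in the derivations above also explains the aggregation rule for the symmetric counts: each triangle (or quadrangle) containing node $i$ lies on exactly two of $i$'s edges, so node-level counts are half the sums of the edgewise ones. A minimal sketch with illustrative data and names (Table S1 gives the full set of rules, including the asymmetric triple and quadruple counts):

```python
from collections import defaultdict

# Edgewise triangle counts for the complete graph K3: every edge closes
# exactly one triangle. Data and names are illustrative.
edge_triangles = {(0, 1): 1, (0, 2): 1, (1, 2): 1}

def node_triangles(edge_triangles):
    """Aggregate edgewise counts: each triangle at i is counted on two edges."""
    total = defaultdict(int)
    for (i, j), t in edge_triangles.items():
        total[i] += t
        total[j] += t
    return {i: t // 2 for i, t in total.items()}

print(node_triangles(edge_triangles))   # {0: 1, 1: 1, 2: 1}
```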

Algorithm S1 PathCensus algorithm. It takes an undirected graph $G = (V, E)$ with $|V| = n$ and $|E| = m$ as input and returns an array of edgewise counts of wedge and head triples and quadruples, as well as triangles and (strong) quadrangles. For better performance, $E$ can be defined (without loss of generality) so that $d_i \le d_j$ holds for every edge $(i, j)$.
Initialize empty C                                  ▷ m × 8 array for storing path counts
Initialize R such that R_i = 0 ∀i ∈ V               ▷ n × 1 array for keeping track of node roles
Let D be the degree sequence of G                   ▷ n × 1 array
Initialize u = 0
for e = (i, j) ∈ E do                               ▷ the loop may be parallelized
    Set u = u + 1
    Initialize T_ij, t^W_ij, t^H_ij = 0             ▷ counts of triangles and wedge and head triples
    Initialize Q_ij, q^W_ij, q^H_ij = 0             ▷ counts of strong quadrangles and wedge and head quadruples
    Initialize Star_i, Star_j, Tri_ij = ∅           ▷ empty sets for keeping track of nodes with different roles
    for k ∈ N_1(i) − {j} do
        Add k to Star_i and set R_k = 1
    for k ∈ N_1(j) − {i} do
        if R_k = 1 then
            Remove k from Star_i, add k to Tri_ij and set R_k = 3
        else
            Add k to Star_j and set R_k = 2
    Update T_ij, t^W_ij and t^H_ij using |Tri_ij|, D_i and D_j
    for k ∈ Star_i do                               ▷ this internal nested loop determines computational complexity
        for (l ≠ i) ∈ N_1(k) do
            if R_l = 2 then
                Set Q_ij = Q_ij + 1                 ▷ k, l and the (i, j) edge close a strong quadrangle
            Update q^W_ij and q^H_ij
    Store the counts for e in row u of C and reset R for all visited nodes

Calculating counts for reversed edges
Note that in our implementation $t^W_{ij}$ counts the number of $(k, i, j)$ triples and $t^H_{ij}$ tracks $(i, j, k)$ triples; thus $t^W_{ij} = t^H_{ji}$. Similarly, $q^W_{ij}$ counts $(j, i, k, l)$ and $q^H_{ij}$ counts $(i, j, k, l)$ quadruples, so again $q^W_{ij} = q^H_{ji}$. On the other hand, counts of triangles and quadrangles are symmetric. As a result, for the purpose of counting we can assume that all edges are of the form $i < j$ and still count everything correctly. In other words, there is no need to consider each undirected edge twice.
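The per-edge quantities computed by Algorithm S1 can also be expressed in a few lines of plain Python. The sketch below is an illustrative reimplementation based on set operations rather than the role array R and the preallocated array C, so it is simpler but less efficient; the function name and the dictionary-of-sets graph representation are ours:

```python
def edge_counts(adj, i, j):
    """Counts for the undirected edge (i, j): (tW, tH, T, qW, qH, Q).

    tW/tH: wedge (k, i, j) and head (i, j, k) triples,
    qW/qH: wedge (j, i, k, l) and head (i, j, k, l) quadruples,
    T: triangles, Q: strong (chordless) quadrangles on the edge.
    """
    ni, nj = adj[i] - {j}, adj[j] - {i}
    tri = ni & nj                                   # common neighbors close triangles
    star_i, star_j = ni - tri, nj - tri             # exclusive neighbors of i and j
    tW, tH = len(ni), len(nj)                       # 2-paths: d_i - 1 and d_j - 1
    qW = sum(len(adj[k] - {i, j}) for k in ni)      # 3-paths (j, i, k, l)
    qH = sum(len(adj[k] - {i, j}) for k in nj)      # 3-paths (i, j, k, l)
    Q = sum(len(adj[k] & star_j) for k in star_i)   # chordless 4-cycles
    return tW, tH, len(tri), qW, qH, Q

# A chordless square: edge (0, 1) lies on one strong quadrangle.
square = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
assert edge_counts(square, 0, 1) == (1, 1, 0, 1, 1, 1)

# A triangle: edge (0, 1) closes one triangle and no quadrangles.
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
assert edge_counts(triangle, 0, 1) == (1, 1, 1, 0, 0, 0)
print("edge counts verified")
```

On the square this gives $2Q_{ij}/(q^W_{ij} + q^H_{ij}) = 1$, i.e. maximal edgewise complementarity, consistent with the relation $2Q_{ij} = (q^W_{ij} + q^H_{ij})\,c_{ij}$ quoted above.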

Computational complexity
It is clear from the structure of the three nested loops that the asymptotic worst-case computational complexity of both algorithms is $O(m \Delta_S d_{max})$, where $m$ is the number of edges, $\Delta_S$ is the maximum size of a $Star_i$ set (see Algorithm S1) and $d_{max}$ is the maximum node degree. This agrees with the analysis presented by the authors of the more general graphlet counting method 1, which inspired our PathCensus algorithm. However, in practice the runtime can be reduced by enforcing that edges are defined so that $d_i \le d_j$ (note that this can always be done without loss of generality). The impact of this optimization can be quite significant for networks with highly heterogeneous degree distributions. For instance, in the case of the PGP web of trust network 2 ($n = 39796$, $\langle d_i \rangle = 9.91$, $d_{max} = 1696$) it yields an almost 4 times shorter runtime on average.
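The $d_i \le d_j$ convention itself is a one-liner. In the sketch below (with illustrative names), each undirected edge is stored with its lower-degree endpoint first, so the loop seeding $Star_i$ from $N_1(i)$ always iterates over the smaller of the two neighborhoods:

```python
def orient_edges(edges, degree):
    """Return edges as (i, j) pairs with degree[i] <= degree[j]."""
    return [(i, j) if degree[i] <= degree[j] else (j, i) for i, j in edges]

# A small star-like example: the hub node 1 is always placed second.
edges = [(0, 1), (1, 2), (1, 3)]
degree = {0: 1, 1: 3, 2: 1, 3: 1}
print(orient_edges(edges, degree))   # [(0, 1), (2, 1), (3, 1)]
```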

Structural diversity analysis
Here we provide additional details for the corresponding analysis in the Main Text (Section: Structural diversity across the tree of life). We present results for three different choices of the significance level, α = 0.01, 0.05, 0.10, used for detecting nodes with high structural similarity and complementarity. We show that the qualitative results are stable for all values of α, even though quantitative details change in some cases.

Figure S2. General characteristics of the structure of interactomes in different domains/groups, including Spearman correlations with interactome size. Apart from some minor details, the results for all values of α are the same.

Linear model stability and diagnostics
We described the relationship between structural diversity, $y = S_\alpha(G)$, defined in Eq. (11) in the Main Text, and interactome size $n$ (number of proteins) using a transformed linear model in which the response enters through $\eta(y)$, where $\eta(x) = \log\left(x/(1 - x)\right)$ is the logit transformation. This parametric form ensured model predictions bounded in (0, 1), which was necessary as the structural diversity index ranges from 0 to 1. On the other hand, since the logit transformation is not defined for 0's and 1's, we had to drop some of the observations, namely 119, 94 and 88 cases for α = 0.01, 0.05, 0.10 respectively. These observations corresponded almost exclusively to organisms with small interactomes, as indicated by small average numbers of nodes (35.07, 30.98 and 28.15) compared to the overall average of 544.06 nodes. Moreover, all of them were dropped because of $S_\alpha(G) = 0$, which is consistent with the hypothesis that less complex organisms tend to have less structurally diverse interactomes. As evident in Fig. S3, the qualitative trend is the same for all values of α. However, the goodness-of-fit of the model is highest for α = 0.01. This is not surprising: higher values of α correspond to higher type I error rates, meaning that the estimated fractions of nodes with significantly high values of $s_i$ and $c_i$ are noisier. Table S3 presents estimated parameters and other numerical details.
Moreover, since the accuracy of interactome networks may depend on the extent to which a given species has been studied, we also fitted extended models including the logarithm of the number of publications about a given species as a second predictor. This enabled partial control for possible biases induced by differences in the completeness of the data available for more and less frequently studied organisms. As Table S3 shows, the effects of the publication count (b) were not significant in any case.
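The fitting procedure can be sketched as follows. The specific linear dependence on $\log n$ and all names below are illustrative assumptions of this sketch (the actual model specification and estimates are given in Table S3); observations with $S_\alpha(G) \in \{0, 1\}$ are dropped because the logit is undefined there, and predictions are mapped back through the inverse logit so they stay in (0, 1):

```python
import math

def logit(x):
    """eta(x) = log(x / (1 - x)); undefined at x = 0 and x = 1."""
    return math.log(x / (1 - x))

def inv_logit(z):
    return 1 / (1 + math.exp(-z))

def fit_logit_linear(sizes, diversities):
    """OLS fit of logit(y) = b0 + b1 * log(n), skipping y in {0, 1}."""
    pts = [(math.log(n), logit(y))
           for n, y in zip(sizes, diversities) if 0 < y < 1]
    m = len(pts)
    sx = sum(x for x, _ in pts)
    sy = sum(t for _, t in pts)
    sxx = sum(x * x for x, _ in pts)
    sxy = sum(x * t for x, t in pts)
    b1 = (m * sxy - sx * sy) / (m * sxx - sx * sx)
    b0 = (sy - b1 * sx) / m
    return b0, b1

def predict(b0, b1, n):
    """Predicted structural diversity, guaranteed to lie in (0, 1)."""
    return inv_logit(b0 + b1 * math.log(n))

# Noise-free data generated from known coefficients are recovered exactly;
# the y = 0 observation is dropped, as in the analysis above.
sizes = [10, 100, 1000, 5]
diversities = [predict(-2.0, 0.5, n) for n in (10, 100, 1000)] + [0.0]
b0, b1 = fit_logit_linear(sizes, diversities)
assert abs(b0 + 2.0) < 1e-9 and abs(b1 - 0.5) < 1e-9
print(round(b0, 3), round(b1, 3))
```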

Network datasets
All network datasets were downloaded from the Netzschleuder repository 3. In all analyses only the largest connected components were used, and networks were simplified by removing multilinks and self-loops. Moreover, in the case of directed or weighted networks, edge directions and weights were ignored.
Individual network datasets are described below and their unique names within the repository are provided in the following subsection headings. Each dataset can be accessed via a generic link of the form: networks.skewed.de/net/<name> where <name> is a placeholder which should be substituted with the name of a specific dataset.
Below we list datasets in groups corresponding to the three empirical analyses presented in the Main Text. Each section ends with a table with the most important descriptive statistics for all networks. The statistics were calculated for the largest connected component.

Experimental assessment of the performance of PathCensus algorithm
We assessed the performance of our implementation of the PathCensus algorithm by measuring runtimes for each of the 1840 interactome networks studied in the paper as well as their randomized counterparts sampled from the UBCM. For each network, the runtime was averaged over 5 runs, both for the observed network and for a corresponding randomized version (see Fig. S4).
Our analysis shows that the runtime scales approximately linearly with respect to $|E| \Delta_S d_{max}$, which agrees with the previous theoretical analysis of the computational complexity of the PathCensus algorithm. However, the proportionality constant is lower for smaller networks and increases for networks with about 100 nodes, after which the linear scaling appears stable.
The experiment was run on a machine with Ubuntu (20.04.4 LTS) and an Intel(R) Core(TM) i5-8300H CPU @ 2.30GHz. We did not use the parallelized version of the PathCensus algorithm.

Figure S4. Experimental assessment of the performance of the PathCensus algorithm. Computations used the $d_i \le d_j$ optimization discussed in Algorithm S1. (A) Average runtime plotted against the product of the number of edges ($|E|$), the maximum node degree ($d_{max}$) and the maximum size of a $Star_i$ set ($\Delta_S$; see Algorithm S1). (B) Average runtime and network size.