Abstract
The characterization of topology is crucial in understanding network evolution and behavior. This paper presents an innovative approach, the GHuST framework, to describe complex-network topology from graphlet decomposition. This new framework exploits the local information provided by graphlets to give a global explanation of network topology. The GHuST framework comprises 12 metrics that analyze how 2- and 3-node graphlets shape the structure of networks. The main strengths of the GHuST framework are enhanced topological description, size independence, and computational simplicity. It allows for direct comparison among different networks regardless of their size. It also reduces the complexity of graphlet counting, since it does not use 4- and 5-node graphlets. The application of the novel framework to a large set of networks shows that it can classify networks of distinct nature based on their topological properties. To ease network classification and enhance their graphical representation, we reduce the 12 dimensions to their main principal components. Furthermore, the 12 dimensions are easily interpretable. This enables the connection between complex-network analyses and diverse real applications.
Introduction
The analysis of complex-network topology can support the understanding of the principles that guide network evolution and that condition network behavior^{1}. The characterization of network structure has traditionally been done through a set of global or local statistics such as degree distribution or motifs^{2,3}. Global and local metrics complement each other, since different communities may coexist in the same network with different topological properties (what is known as structural subunits)^{4}. Global metrics, such as network diameter or characteristic path length, provide a panoramic view of networks that may have implications for their dynamics. For instance, the particular degree distribution of computing networks, which are scale-free, makes them relatively resistant to accidental failures but vulnerable to targeted attacks^{5}. However, global metrics disregard the complexity of local structures that might be crucial to understanding the behavior of networks, as has been shown for the internet network^{6}. Furthermore, local processes condition the development of network topology^{7}. Consequently, topological analyses should include local statistics that zoom in on the local structure of complex networks.
An example of a local topological statistic is the motif distribution. Motifs are recurring subgraph patterns that appear more often in a given network than in a random one. Motifs were proposed to understand the evolutionary design principles of complex networks from a local perspective^{8}. They search for key local structures that determine network behavior. However, the choice of the null model (the random networks to which a network is compared) to detect motifs in real networks may be misleading^{9}. Furthermore, motifs are partial subgraphs (they do not necessarily include all the connections between a set of considered nodes); this leads to a loss of information that may be essential to understanding network structure^{10}.
Unlike motifs, graphlets allow for network decomposition into small subgraphs that preserve all connections among nodes. Graphlets are small connected induced subgraphs of a large network^{11}. The presence of graphlets in a network is not conditioned by a null model; they can appear at any frequency. This is a strength with respect to motif studies. Although graphlets may comprise an arbitrary number of nodes, the most commonly studied graphlets are 2- to 5-node subgraphs, given that larger graphlets entail higher computational complexity. The automorphism orbit of a graphlet is defined as the set of nodes that are topologically symmetric in the graphlet^{12}. Orbits, therefore, define the relative position of nodes with respect to the rest of the nodes in the graphlets. Figure 1 shows all 2- to 5-node graphlets and their automorphism orbits. Finally, the description of network topology is limited by graphlet size. Although larger graphlets may complete the description of network topology, this would be unmanageable from a computational point of view. Recent works have proposed efficient algorithms for graphlet counting^{13,14,15,16,17}.
Several models developed for the network alignment problem prove the adequacy of graphlets as local topology descriptors^{18,19,20,21}. The network alignment problem aims to find corresponding nodes between different networks, that is, nodes that play a similar role in both networks from a topological point of view. In this field, graphlet decomposition has proved to be a crucial tool for solving the problem. The basis of those models is the degree signature of a graphlet^{12}. The graphlet degree signature is an extension of the node degree that quantifies the number of times a node in the network appears in an orbit (referred to as “touch an orbit”). Consequently, graphlets provide a complete description of local network topology (the orbits each node touches) that enhances the solution of the network alignment problem. Similarly, graphlets might support the comparison among networks or the study of the role played by nodes in the network^{22,23}. Despite being a good descriptor of local properties, the graphlet distribution (or graphlet degree signature) is not enough to gain insight into the global topological properties of networks. Yaveroğlu et al. propose the analysis of orbit correlation to characterize network structure and to ease the interpretation and implications of topological properties in real applications^{24}.
This paper improves the characterization and understanding of network topology by proposing the GHuST framework, which extends graphlet analysis. The advantages of this novel method are enhanced topological description, size independence, and computational simplicity. First, the 12 dimensions fully describe the structure of networks, covering the most relevant aspects of local and global topology in a systematic manner. Second, the GHuST framework explains network properties regardless of network size. This supports the comparison among networks with different numbers of nodes and edges. Third, it only considers 2-node and 3-node graphlets, which follow easily from the adjacency matrix. It reduces computational complexity with respect to prior analyses that require counting higher-order graphlets.
The application of the method to a set of five real networks demonstrates the accuracy of the framework in explaining network topology. Furthermore, this new metric enhances network classification and can be used as a tool to confirm the topological accuracy of synthetic networks. This validation is usually missing in the generation of synthetic power grids, where topological validation is weak or is done only with a few global statistics^{25}. Therefore, this tool can be introduced to compare the topology of both real and synthetic networks in a systematic manner.
The rest of the paper is organized as follows: “Understanding network structure from local properties” section presents the GHuST framework. “Explaining the topology of real networks” section illustrates its application to explain the topological structure of networks of different nature. “A panoramic view offered by local properties” section uses dimensionality reduction methods to evaluate the performance of the proposed metric when applied to a large sample of networks. Finally, “Conclusion” section presents the conclusions of the paper.
Understanding network structure from local properties
As explained above, graphlets can be a convenient tool for explaining the local structure of networks. Unfortunately, graphlet decomposition does not consider any interaction between graphlets. In addition, in large networks, counting graphlets is computationally intensive. It also supplies a substantial number of dimensions that are difficult to interpret (30 graphlets and 73 orbits when using 2- to 5-node graphlets). Motivated by the desire to simplify and improve topological analyses through graphlet decomposition, this section proposes a novel method that reduces the topological analysis of networks to a 12-dimensional metric, the GHuST framework. This metric can be calculated in any non-directed and unweighted network.
The 12 dimensions are obtained from the decomposition of networks into 2-node and 3-node graphlets, comprising three graphlets (\({G}_{0},{G}_{1}\) and \({G}_{2}\)) and four orbits (\({O}_{0}\), \({O}_{1}\), \({O}_{2}\), \({O}_{3}\)). The adjacency matrix succinctly reveals the number of times a node touches those orbits (see Supplementary Information Sect. 1). As explained in the prior section, recent works have focused on counting graphlets efficiently in large networks^{13,14,15,16,17}.
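As an illustration of how the four orbit counts follow from adjacency information, the sketch below computes them per node for a small undirected, unweighted graph. It assumes the standard induced-subgraph definition of orbits (\({O}_{0}\): degree; \({O}_{1}\): end of a three-node path; \({O}_{2}\): center of a three-node path; \({O}_{3}\): triangle vertex). The function name `orbit_counts` and the example edge list are hypothetical; this is not the authors' Matlab implementation.

```python
from itertools import combinations

def orbit_counts(edges):
    """Return {node: (O0, O1, O2, O3)} for an undirected, unweighted simple graph."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # self-loops are ignored (assumption: simple graph)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    counts = {}
    for i, nbrs in adj.items():
        o0 = len(nbrs)  # degree
        # O3: pairs of neighbours that are adjacent to each other (triangles through i)
        o3 = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        # O2: pairs of neighbours that are NOT adjacent (i is the centre of an induced path)
        o2 = o0 * (o0 - 1) // 2 - o3
        # O1: two-step walks i-j-k with k != i, minus those closed by an edge i-k
        o1 = sum(len(adj[j]) - 1 for j in nbrs) - 2 * o3
        counts[i] = (o0, o1, o2, o3)
    return counts

# Hypothetical example: a triangle (0, 1, 2) with a tail 2-3-4-5.
example_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 5)]
example_counts = orbit_counts(example_edges)
print(example_counts[3])  # (2, 3, 1, 0)
```

Each node inspects only pairs of its own neighbors, so the work per node is bounded by the square of its degree.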
In addition, for the four orbits, \({P}_{t,i}\) is a binary variable that is 1 if node \(i\) touches orbit \(t\) at least once and 0 otherwise (1).
To enhance readability, the 12 dimensions are classified into four categories: Global connectivity, Hubs, Strings, and Triangles. Those categories cover different aspects of network structure that might condition network behavior. Furthermore, these categories allow for an intuitive interpretation of topology implications in real-world applications. For instance, in power networks, a higher presence of strings might mean a lower level of network robustness (a higher probability of having energy not supplied in the network in case of line failure, given that a failure in a string affects all the downstream nodes). Similarly, the presence of large strings in an email graph (nodes stand for community members and edges connect the people who send an email with the people who receive it) will show that the community may follow a clearly defined hierarchical structure.
To enhance network comparison, it is desirable that the 12 dimensions of the metric range between 0 and 1. Where a dimension does not, we propose a scaling factor. The 12 dimensions are defined as follows.
Global connectivity
Line-surplus coefficient, \({{{\rho}}}_{1}\)
It stands for the surplus of lines in the network with respect to the minimum number of lines needed to build a connected graph (2). Given a set of \(N\) nodes, the minimum number of lines, \({L}_{0}\), needed to have a connected graph is \({L}_{0}=N-1\); for large networks, \({L}_{0}\approx N\). As we only consider connected graphs, \(N=\sum_{i}{P}_{0,i}\). The number of lines installed in a network is \(\frac{\sum_{i}{O}_{0,i}}{2}\). This dimension is therefore related to the average node degree and it supplies information about line density in a network. In networks with a radial structure (trees), \({\rho {^{\prime}}}_{1}\) tends to zero. The higher the value of \({\rho {^{\prime}}}_{1}\), the more meshed a network is.
We define \({\rho }_{1}\) (3) to scale \({\rho {^{\prime}}}_{1}\) between 0 and 1. Networks with \({\rho }_{1}\) close to 1 have a highly meshed structure.
\({\rho }_{1}\) can be rewritten as (4).
Leaf rate, \({{{\rho}}}_{2}\)
This ratio compares the proportion of nodes with just one connection, known as leaf nodes, to the rest of the nodes in the network that are not vertices of a triangle. This ratio discerns between networks in which edges may form a homogeneous mesh that touches most nodes and networks characterized by the presence of hubs connecting low-degree nodes. This metric is calculated as the complement of the ratio between the number of nodes that touch \({O}_{1}\) but do not touch \({O}_{3}\) and the number of nodes that touch \({O}_{2}\) but do not touch \({O}_{3}\) (5).
All sets of three connected nodes are either in graphlets \({G}_{1}\) or \({G}_{2}\). Nodes that belong to \({G}_{1}\) and are not part of \({G}_{2}\) may touch \({O}_{1}, {O}_{2}\) or both simultaneously. A node is only in \({O}_{2}\) if it is the center of an isolated star, that is, the rest of the network nodes are connected to it. Assuming that networks have a more complex structure, no node can touch \({O}_{2}\) exclusively. However, a node can touch \({O}_{1}\) exclusively. This occurs when nodes have only one connection, or when they are the non-common vertex of two triangles that share one or two vertices. Accordingly, leaf nodes are defined by: \({P}_{1,i}=1\), \({P}_{2,i}=0\) and \({P}_{3,i}=0\). Nodes that are not leaf nodes or vertices of a triangle are defined by: \({P}_{1,i}=1\), \({P}_{2,i}=1\) and \({P}_{3,i}=0\). When \({\rho }_{2}\) is close to one, the presence of leaf nodes is high. The lower this coefficient, the lower the number of nodes that have just one connection; this is characteristic of star graphs.
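The presence patterns described above can be turned into a small node classifier. The following is a minimal sketch, assuming per-node orbit counts are already available; the function names and the labels "intermediate" and "other" are hypothetical and are not terms used by the framework.

```python
def presence(count):
    """P_{t,i}: 1 if the node touches the orbit at least once, else 0."""
    return 1 if count > 0 else 0

def node_class(o1, o2, o3):
    """Classify a node from its O1, O2 and O3 counts using the presence pattern."""
    pattern = (presence(o1), presence(o2), presence(o3))
    if pattern[2] == 1:
        return "triangle vertex"   # P3 = 1
    if pattern == (1, 0, 0):
        return "leaf"              # touches O1 only
    if pattern == (1, 1, 0):
        return "intermediate"      # neither a leaf nor a triangle vertex
    return "other"                 # e.g. centre of an isolated star

print(node_class(1, 0, 0))  # leaf
print(node_class(3, 1, 0))  # intermediate
```

Under the induced-subgraph definition of orbits, a node with \({O}_{2,i}=0\) and \({O}_{3,i}=0\) has at most one neighbor, which is why the pattern (1, 0, 0) identifies leaf nodes.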
Leaf-base strength, \({{{\rho}}}_{3}\)
This ratio analyzes whether leaf nodes are connected to hubs or to low-degree nodes. It is the average number of times leaf nodes touch \({O}_{1}\) (6). The value of \({O}_{1}\) for a leaf node is set by the degree of its neighbor: every other connection of that neighbor forms one three-node path that ends at the leaf. Thus, the higher the value of \({O}_{1}\), the higher the degree of the node to which the leaf is connected. Large values of \({\rho {^{\prime}}}_{3}\) may signal the presence of hubs in the network.
This dimension might be scaled with the maximum value of node degree, \(\text{max}({O}_{0,i})\), in the network (7). \({\rho }_{3}\) can be rewritten as (8). If \({\rho }_{3}\) tends to zero, leaf nodes are connected to low-degree nodes. They may be the end nodes of node strings.
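A minimal sketch of this dimension is given below. It assumes that leaf nodes are the nodes with \({O}_{0,i}=1\) and that "scaled with the maximum value of node degree" means division by \(\text{max}({O}_{0,i})\); the exact form of Eqs. (6) to (8) may differ. The function name and the return value for networks without leaf nodes are hypothetical choices.

```python
def leaf_base_strength(o0, o1):
    """o0, o1: per-node lists of O0 (degree) and O1 counts, in the same node order."""
    leaf_o1 = [b for a, b in zip(o0, o1) if a == 1]  # O1 of the leaf nodes only
    if not leaf_o1:
        return 0.0  # assumption: a network without leaf nodes gets 0
    raw = sum(leaf_o1) / len(leaf_o1)  # unscaled average (rho'_3)
    return raw / max(o0)               # assumed scaling by the maximum degree

# Star with four leaves: the centre has degree 4 and each leaf touches O1 three times.
print(leaf_base_strength([4, 1, 1, 1, 1], [0, 3, 3, 3, 3]))  # 0.75
```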
Hubs
Hub coefficient, \({{{\rho}}}_{4}\)
This dimension studies whether there is a tendency to form hubs in the network or not. It measures the average number of times nodes touch \({O}_{2}\) (9). All nodes touch \({O}_{2}\) except for leaf nodes and nodes that are only in \({G}_{2}\) (they are only vertices of triangles). The larger the number of connections of a node, the larger the value of \({O}_{2,i}\). Large values of \({\rho {^{\prime}}}_{4}\) therefore show that there is a tendency to make hubs in the network. Unlike \({\rho }_{3}\), the hub coefficient does not correlate linearly with node degree; \({O}_{2,i}\) is given by the binomial coefficient \(\left(\genfrac{}{}{0pt}{}{n}{2}\right)\), where \(n\) is the number of non-connected edges attached to node \(i\) when \({O}_{0,i}\) is greater than 2. If two networks have similar values of \({\rho }_{1}\), but different values of \({\rho {^{\prime}}}_{4}\), there is a higher tendency to make hubs in one network than in the other.
To range between 0 and 1, \({\rho {^{\prime}}}_{4}\) can be scaled by the maximum value of \({O}_{2,i}\) in the network (10). \({\rho }_{4}\) can be rewritten as (11).
Hub-connectivity coefficient, \({{{\rho}}}_{5}\)
It analyzes whether hubs tend to connect to one another. This dimension is defined by Spearman’s rank correlation between \({O}_{1}\) and \({O}_{2}\) (12), where \(cov({rg}_{{O}_{1}},{rg}_{{O}_{2}})\) is the covariance of the rank variables of \({O}_{1}\) and \({O}_{2}\), and \({\sigma }_{{rg}_{{O}_{1}}}\),\({\sigma }_{{rg}_{{O}_{2}}}\) are the standard deviations of both rank variables. This is one of the correlations proposed by Yaveroğlu et al.^{24}. If \({\rho {^{\prime}}}_{5}\) tends to 1, nodes with high \({O}_{2}\) are also nodes with high values of \({O}_{1}\). The value of \({O}_{1,i}\) increases with the degree of a node and with its neighbors’ degree. However, the value of \({O}_{2,i}\) only depends on node degree; the higher the number of connections of a node, the higher the value of \({O}_{2,i}\). Consequently, nodes with a high value for \({O}_{1}\) and \({O}_{2}\) have a high node degree, they are hubs, and they are connected to other hubs. Therefore, a value close to 1 means that hubs tend to connect to one another.
This dimension is also scaled to range from 0 to 1 (13). \({\rho }_{5}\) can be rewritten as (14).
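The rank correlation can be sketched with the standard library alone. The code below computes Spearman's coefficient as the Pearson correlation of average ranks (tied values share the mean of their positions). The rescaling \(({\rho {^{\prime}}}_{5}+1)/2\) in `hub_connectivity` is an assumption made for illustration, since the text only states that the dimension is scaled to range from 0 to 1; all function names are hypothetical.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def spearman(x, y):
    """Spearman's rank correlation: covariance of ranks over the product of their deviations."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (sx * sy)

def hub_connectivity(o1, o2):
    """Assumed rescaling of the correlation from [-1, 1] to [0, 1]."""
    return (spearman(o1, o2) + 1) / 2
```

With per-node lists of \({O}_{1}\) and \({O}_{2}\) counts in the same node order, `hub_connectivity(o1, o2)` returns a value near 1 when nodes that rank high in one orbit also rank high in the other.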
Strings
String coefficient, \({{{\rho}}}_{6}\)
This coefficient measures the proportion of nodes in the network that are in the middle of a string. A string is formed by two end nodes (one or both nodes are linked to the rest of the network and there is no edge connecting them) and a set of intermediate nodes that are connected consecutively and have no links with the rest of the network. Consequently, a node is in the middle of a string if it has two connections, it touches \({O}_{2,i}\) only once (\({U}_{2,i}=1)\) and it is not a vertex of a triangle (\({U}_{3,i}=1\)). Therefore, \({\rho }_{6}\) is the ratio between the number of nodes that are in the middle of a node string and the total number of nodes that touch \({O}_{2}\) (15). Not all degree-two nodes touch \({O}_{2}\) once (triangle vertices do not touch \({O}_{2}\)). In addition, not all nodes that touch \({O}_{2}\) once are in the middle of a node string. A node might touch \({O}_{2}\) only once if it is a shared vertex of a triangle (\({O}_{3,i}>0\) and \({U}_{3,i}=0)\), so the node is not part of a string.
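The verbal definition can be sketched directly on per-node orbit counts. The code below treats a node as a middle node when it has two connections, touches \({O}_{2}\) exactly once, and is not a triangle vertex, and divides by the number of nodes that touch \({O}_{2}\). The function name and the return value for networks in which no node touches \({O}_{2}\) are hypothetical choices.

```python
def string_coefficient(o0, o2, o3):
    """o0, o2, o3: per-node lists of O0, O2 and O3 counts, in the same node order."""
    middle = sum(1 for d, c, t in zip(o0, o2, o3) if d == 2 and c == 1 and t == 0)
    touching = sum(1 for c in o2 if c > 0)  # nodes that touch O2 at least once
    return middle / touching if touching else 0.0

# Triangle (0, 1, 2) with a tail 2-3-4-5: nodes 3 and 4 are middle nodes,
# and nodes 2, 3 and 4 touch O2.
print(string_coefficient([2, 2, 3, 2, 2, 1], [0, 0, 2, 1, 1, 0], [1, 1, 1, 0, 0, 0]))
```

A degree-three node shared by two triangles can also touch \({O}_{2}\) once; the degree and triangle conditions exclude it from the numerator, in line with the last remark above.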
Characteristic string length, \({{{\rho}}}_{7}\)
This dimension is the average length of node strings (considering only middle nodes and disregarding the end nodes of the string) in the network as shown in (18), where \(n\) is the number of node strings in the network.
To enhance network comparison, \({\rho {^{\prime}}}_{7}\) is scaled as its inverse (19). If \({\rho }_{7}\) is equal to zero, it means that all node strings have two end nodes and only one middle node.
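The unscaled average string length can be sketched by grouping consecutive middle nodes into chains. The code below is a minimal illustration working from an edge list: middle nodes are nodes with exactly two neighbors that are not adjacent to each other, and each connected group of middle nodes counts as one string. The scaling of Eq. (19) is not reproduced, and the function names and the value returned when there are no strings are hypothetical.

```python
def string_lengths(edges):
    """Number of middle nodes in each node string of an undirected simple graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Middle nodes: exactly two neighbours, and those neighbours are not adjacent.
    middle = set()
    for node, nbrs in adj.items():
        if len(nbrs) == 2:
            a, b = nbrs
            if b not in adj[a]:
                middle.add(node)
    lengths, seen = [], set()
    for start in middle:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:  # depth-first walk restricted to middle nodes
            node = stack.pop()
            size += 1
            for nxt in adj[node]:
                if nxt in middle and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        lengths.append(size)
    return lengths

def characteristic_string_length(edges):
    """Average number of middle nodes per string (0.0 if there are no strings)."""
    lengths = string_lengths(edges)
    return sum(lengths) / len(lengths) if lengths else 0.0

# Triangle (0, 1, 2) with a tail 2-3-4-5: one string with middle nodes 3 and 4.
print(characteristic_string_length([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 5)]))  # 2.0
```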
Triangles
Triangle rate, \({{{\rho}}}_{8}\)
This coefficient studies whether there is a tendency to make triangles in the network or not. It measures the proportion of triangles (\({G}_{2}\)) in a network with respect to the total number of three-node graphlets (20). The number of \({G}_{2}\) in the network is equal to \(\frac{\sum_{i}{O}_{3,i}}{3}\) and the number of \({G}_{1}\) is equal to \(\sum_{i}{O}_{2,i}\). This ratio is similar to the global clustering coefficient. However, many works in the literature use the network average clustering coefficient to analyze network properties. The network average clustering coefficient gives more weight to nodes with a low degree (as discussed in the Supplementary Information Sect. 2). Thus, it is not an appropriate measure to analyze networks with a non-homogeneous degree distribution. The average network clustering coefficient, therefore, differs from the value of \({\rho }_{8}\), which considers the whole topology of the network.
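The two counts given above are enough to sketch the triangle rate. For comparison, the code also computes the global clustering coefficient in its usual form (three times the number of triangles over the number of connected triples), which uses the same orbit sums with a different weighting. The function names are hypothetical, as is the value returned for a network without three-node graphlets.

```python
def triangle_rate(o2, o3):
    """Share of triangles among all three-node graphlets, from per-node O2 and O3 lists."""
    triangles = sum(o3) / 3  # each triangle is counted once per vertex
    paths = sum(o2)          # each induced three-node path has exactly one centre
    total = triangles + paths
    return triangles / total if total else 0.0

def global_clustering(o2, o3):
    """3 x triangles divided by the number of connected triples (open or closed)."""
    closed = sum(o3)
    triples = sum(o2) + sum(o3)
    return closed / triples if triples else 0.0

# Triangle (0, 1, 2) with a tail 2-3-4-5: one triangle and four induced paths.
o2 = [0, 0, 2, 1, 1, 0]
o3 = [1, 1, 1, 0, 0, 0]
print(triangle_rate(o2, o3))      # 0.2
print(global_clustering(o2, o3))  # 3/7
```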
Triangle concentration, \({{{\rho}}}_{9}\)
This coefficient shows whether triangles tend to be concentrated in networks. Triangles are concentrated when there are nodes that are vertices of two or more triangles. The dimension \({\rho }_{9}\) is the complement of the ratio between the number of nodes that are vertices of triangles and the number of triangles in the network (21). The higher the number of triangles that share some vertices, the higher the value of \({\rho }_{9}\). If triangles have no shared vertices, the maximum value of \({O}_{3,i}\) is 1, and \({O}_{3,i}= {P}_{3,i}\). Therefore, the number of nodes that are in a triangle is three times the number of \({G}_{2}\) in the network (\({3 G}_{2}= \sum_{i}{O}_{3,i}=\sum_{i}{P}_{3,i}).\) However, if triangles share vertices, \(\sum_{i}{P}_{3,i}<{3 G}_{2}\). As \({\rho }_{9}\) converges to 0, the number of graphlets of type \({G}_{7}\), \({G}_{8}\), \({G}_{17}\), \({G}_{19}\), \({G}_{22}\), \({G}_{23}\), \({G}_{24}\), \({G}_{25}\), \({G}_{26}\), \({G}_{27}\), \({G}_{28}\) and \({G}_{29}\) (graphlets composed of triangles with shared vertices) converges to 0 too.
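A minimal sketch of this dimension follows. It assumes that the complement is taken as one minus the ratio \(\sum_{i}{P}_{3,i}/\sum_{i}{O}_{3,i}\), which is 0 when no triangles share vertices, consistent with the identities above; the exact form of Eq. (21) may differ. The function name and the value returned for triangle-free networks are hypothetical.

```python
def triangle_concentration(o3):
    """o3: per-node list of O3 counts (number of triangles each node belongs to)."""
    touching = sum(1 for c in o3 if c > 0)  # sum of P3: nodes in at least one triangle
    total = sum(o3)                         # 3 x number of triangles
    return 1 - touching / total if total else 0.0

print(triangle_concentration([1, 1, 1, 0, 0, 0]))  # single triangle: 0.0
print(triangle_concentration([2, 1, 1, 1, 1]))     # two triangles sharing one vertex: 1/6
```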
Triangle pervasiveness, \({{{\rho}}}_{10}\)
This dimension analyzes whether triangles tend to cover the whole network or are concentrated around a few nodes. It measures the proportion of nodes in the network that are vertices of triangles (22). If a node is a vertex of a triangle, \({P}_{3,i}=1.\) As explained, in connected graphs, the number of nodes in a network is \(\sum_{i}{P}_{0,i}\). This coefficient complements \({\rho }_{8}\) and \({\rho }_{9}\), since it sheds light on whether triangles form a mesh that comprises most nodes in a network or not. A high value of \({\rho }_{8}\) might be a consequence of networks in which triangles are connected to hubs and low-degree nodes have a non-meshed structure, or of networks in which all nodes are connected by a triangle mesh. Therefore, \({\rho }_{10}\) allows for the discernment between those types of networks: it has a low value in the first case and is close to one in the second.
Triangle connectivity, \({{{\rho}}}_{11}\)
It measures if triangles are isolated in the network or they are part of a highly meshed structure. A triangle is isolated if one or two of its vertices are not connected to the rest of the network. Consequently, those vertices have only two connections, they touch \({O}_{1,i}\) and \({O}_{3,i}\) and they do not touch \({O}_{2,i}\). Thus, \({\rho }_{11}\) is the ratio between the number of triangle vertices that are not connected to other nodes (\({U}_{2,i}\) = 1) and the total number of nodes that are vertices of triangles (\(\sum_{i}{P}_{3,i}\)) (23). The lower the value of \({\rho }_{11}\), the lower the number of isolated triangles in the network.
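The ratio described above can be sketched from per-node orbit counts, taking the numerator as the triangle vertices that do not touch \({O}_{2}\) and the denominator as all triangle vertices. The function name and the value returned for triangle-free networks are hypothetical.

```python
def triangle_connectivity(o2, o3):
    """o2, o3: per-node lists of O2 and O3 counts, in the same node order."""
    vertices = [c for c, t in zip(o2, o3) if t > 0]  # O2 counts of triangle vertices
    if not vertices:
        return 0.0  # assumption: no triangles -> coefficient taken as 0
    isolated = sum(1 for c in vertices if c == 0)    # triangle vertices not touching O2
    return isolated / len(vertices)

# Triangle (0, 1, 2) with a tail attached to node 2: two of the three vertices
# have no connection outside the triangle.
print(triangle_connectivity([0, 0, 2, 1, 1, 0], [1, 1, 1, 0, 0, 0]))
```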
Triangle degree, \({{{\rho}}}_{12}\)
This dimension shows if triangles tend to be connected to hubs or to lowdegree nodes. It is the average degree of triangle vertices (24). That is the mean value of \({O}_{0,i}\) for those nodes that are in a triangle (\({P}_{3,i}=1\)). High values of \({\rho }_{12}\) mean that triangles are connected to hubs. The lower the value of \({\rho {^{\prime}}}_{12}\), the lower the average node degree of triangle vertices.
To range between 0 and 1, \({\rho {^{\prime}}}_{12}\) is scaled with the maximum value of node degree (25). \({\rho }_{12}\) can be rewritten as (26).
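As a sketch of this dimension, the code below averages \({O}_{0,i}\) over the nodes with \({P}_{3,i}=1\) and divides by the maximum node degree. Reading "scaled with the maximum value of node degree" as a division is an assumption, and the function name and the value returned for triangle-free networks are hypothetical.

```python
def triangle_degree(o0, o3):
    """o0, o3: per-node lists of O0 (degree) and O3 counts, in the same node order."""
    degrees = [d for d, t in zip(o0, o3) if t > 0]  # degrees of triangle vertices
    if not degrees:
        return 0.0  # assumption: no triangles -> coefficient taken as 0
    raw = sum(degrees) / len(degrees)               # unscaled average (rho'_12)
    return raw / max(o0)                            # assumed scaling by the maximum degree

# Triangle (0, 1, 2) with a tail attached to node 2: degrees 2, 2 and 3, maximum degree 3.
print(triangle_degree([2, 2, 3, 2, 2, 1], [1, 1, 1, 0, 0, 0]))
```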
A summary table for the dimensions of the GHuST framework is shown in Supplementary Information Table S1.
Explaining the topology of real networks
To prove the accuracy of the proposed method, this section applies the 12-dimensional metric to a set of five real networks. It tests whether the information provided by \(\rho\) is consistent with the global topological statistics usually used to describe network structure. These five networks have different sizes and display completely different structures, as shown in Supplementary Information Fig. S1. The two social networks and the metabolic network are in the range of 1,000 to 1,500 nodes, and the two infrastructure networks are two and five times larger, respectively. However, the number of edges is much higher in the social networks; in the case of the Facebook network, the number of edges is twenty times larger than in the road networks. Differences in network size obscure the comparison among networks with global statistics. In some cases, as in distance-based metrics, it is not always possible to infer whether a change in a variable is due to network size or network structure.
The five real networks are modeled as non-directed and unweighted networks to apply the GHuST framework. This framework does not consider edge direction or edge weight. Although an extension to weighted and directed networks is beyond the scope of this paper, the GHuST framework could potentially include both properties. On the one hand, edge direction leads to a different graphlet decomposition, as pointed out by Aparício et al.^{26}. They propose 39 non-bidirectional directed graphlets of 2, 3, and 4 nodes. Graphlet and orbit definitions therefore differ between directed and non-directed networks, and new GHuST dimensions would apply. Those new dimensions would incorporate edge direction. While this would increase the complexity of the GHuST framework, it would provide a sounder analysis of network structure in the case of directed graphs. For instance, the inclusion of edge direction in the analysis of power networks would explain the role of leaf nodes in the network, that is, whether leaf nodes inject or withdraw power. Consequently, the explanation given by the leaf rate (\({\rho }_{2}\)) would be completed with the direction of network edges, which might represent power flow through lines. By defining new GHuST dimensions, we could distinguish between radial networks in which leaf nodes withdraw power (demand nodes) and those in which they inject power (power plants).
On the other hand, the inclusion of edge weight in the analysis of graphlet decomposition was covered by Azari and Airoldi^{27}. However, the expansion of the GHuST framework to weighted networks would not be as straightforward as in the case of edge direction. The 12 GHuST dimensions should be completed with a set of coefficients that weight the importance of the edge in the network. Regarding the example above, values for the leaf rate should vary between networks with small power plants (e.g., wind and solar farms) and bigger power plants (e.g., thermal power plants) connected to the network through a single line. However, as stated, the inclusion of edge weight requires additional research to be effectively included in the GHuST framework.
This paper highlights the simplicity of the GHuST framework, which has been coded in Matlab together with the code to count graphlets and orbits. This simplicity results from the need to count only 2-node and 3-node graphlets, that is, four orbits (\({O}_{0}\), \({O}_{1}\), \({O}_{2}\), \({O}_{3}\)). Since the GHuST framework only analyzes graphlets of order 3 and below, it can be calculated in affordable times. As Hočevar and Demšar show, an increase in the number of graphlet nodes sharply increases the computation time in all the graphlet-counting methods covered in their study (FANMOD, GraphCrunch, and Orca)^{15,16,28}. For example, the time needed to count the 5-node graphlets is between 10 and 100 times higher than the time required to count the 4-node graphlets. The same increase is observed by Melckenbeeck et al. in their comparison when going from 5-node to 6-node graphlets^{29}. Keeping the counting to order 3 and below allows for a manageable computational burden. The computational complexity of this counting is approximately of the order of \(O\left(\left|V\right|{d}^{k-1}\right)\), where \(V\) is the set of vertices, \(d\) is the maximum degree and \(k\) is the order of the graphlet counting, therefore yielding \(O\left(\left|V\right|{d}^{2}\right)\) for our case. In addition, the calculation is easily parallelizable, as expressed in published works^{30}.
A thorough analysis of the graphlet distribution of each network has been carried out (see Supplementary Information Sect. 2). The results show that in our case, graphlet distribution is not an accurate tool to infer the topological properties of such complex networks, providing an incomplete description of the underlying network structure.
The proposed method overcomes the limitations of graphlet distributions to explain network topology by means of a 12-dimensional metric. To analyze results, Table 1 shows a set of global statistics used to analyze the five real networks, and Table 2 shows the value of the GHuST framework for those networks. (Values in Table 2 are not scaled; the reader is referred to Supplementary Information Sect. 2 for an in-depth analysis.) The description provided by the GHuST framework of the topological structure of the networks is consistent with global statistics traditionally used in complex networks, overcoming their main drawbacks.
A panoramic view offered by local properties
The previous section has illustrated the application of the proposed metric as a tool for summarizing the main topological features of complex networks. This section aims at evaluating the performance of this technique using a large sample, 1,404 graphs, of real networks from different domains: Autonomous Systems, Enzymes, Facebook, Power Network, Retweet, Roads, and Web.
The autonomous-systems set comprises 733 daily instances of graphs of routers comprising the internet^{31}. The enzymes, Facebook, retweet, roads, web, and some power-network graphs are obtained from an open-access network repository^{32}. The enzyme dataset includes 476 samples (the analysis only considers graphs with more than 20 nodes). The Facebook set consists of 108 networks of friendship connections. The power-network graphs comprise the transmission (220 kV and 400 kV) power networks of fifteen European countries, and a set of power networks (7 graphs) obtained from the open-access repository (voltage levels are not specified)^{32,33}. The retweet networks form a set of 32 graphs. The road set includes 16 instances. Finally, 17 networks are part of the web graphs.
Once we compute the 12-dimensional metric for each network, Principal Component Analysis (PCA) is used to validate the usefulness of the proposed statistic. In addition to enabling visual inspection of our data, it can be used to verify whether some dimensions of the 12-dimensional metric could be removed. PCA is a statistical technique that seeks linear combinations of the original variables such that the maximum variance is explained. This allows us to obtain a low-dimensional representation of the data that captures most of the original information. As there are no null coefficients in the PCA loadings, it can be concluded that there are no redundant dimensions in the proposed metric. Furthermore, any network with unusual topological properties will be highlighted in our analysis, providing a tool for detecting outliers. Varimax rotation was applied to improve the understanding of the PCA results. The varimax technique seeks principal components that are easier to interpret, by rotating the original principal components in such a way that each one is strongly correlated with as few of the original variables as possible. However, results obtained with varimax rotation did not improve the results shown in this section.
Figure 2a shows the proportion of variance explained by the principal components. By selecting the first three components, we are able to capture 93.4% of the variance of the original data, allowing us to obtain a low-dimensional view of the distribution of our data. The weights of the 12 dimensions of our metric for each component are shown in Fig. 2d, and they can be used to obtain an interpretation of each component. The first component (68.7% of variance) accounts for a positive contribution of \({\rho }_{4}\), \({\rho }_{10}\), \({\rho }_{12}\) and a negative contribution of \({\rho }_{2}\), \({\rho }_{6}\) and \({\rho }_{11}\). Therefore, the main topological differences among the networks analyzed lie in the proportion of leaf nodes, the presence of hubs and strings, as well as the triangle pervasiveness and connectivity coefficients and triangle degree. A similar interpretation can be obtained for the second component (19.4% of variance) and the third component (5.3% of variance) based on Fig. 2d.
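The proportion of variance explained by each component can be illustrated with a small, dependency-free sketch. It runs PCA on the covariance matrix of the data (one row per network, one column per dimension) and extracts components by power iteration with deflation. Whether the original analysis standardized the dimensions is not stated, so the use of the raw covariance matrix is an assumption, as are the function names; a numerical library would normally be used for real data. The sketch also assumes the data have nonzero total variance.

```python
import math

def covariance(rows):
    """Sample covariance matrix; rows are observations, columns are dimensions."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def top_eigenpair(mat, iters=500):
    """Largest eigenvalue and eigenvector of a symmetric PSD matrix (power iteration)."""
    d = len(mat)
    vec = [1.0 / (k + 1) for k in range(d)]  # arbitrary non-symmetric start vector
    for _ in range(iters):
        nxt = [sum(mat[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm < 1e-12:  # matrix is numerically zero: no variance left
            return 0.0, vec
        vec = [x / norm for x in nxt]
    mv = [sum(mat[a][b] * vec[b] for b in range(d)) for a in range(d)]
    return sum(x * y for x, y in zip(vec, mv)), vec  # Rayleigh quotient

def explained_variance(rows, k):
    """Fraction of total variance captured by each of the first k principal components."""
    mat = covariance(rows)
    d = len(mat)
    total = sum(mat[a][a] for a in range(d))  # trace = total variance
    ratios = []
    for _ in range(k):
        val, vec = top_eigenpair(mat)
        ratios.append(val / total)
        # Deflation: remove the component just found before extracting the next one.
        mat = [[mat[a][b] - val * vec[a] * vec[b] for b in range(d)] for a in range(d)]
    return ratios

# Hypothetical 2-dimensional data with four times more variance along the first axis.
print(explained_variance([[2, 1], [2, -1], [-2, 1], [-2, -1]], 2))
```

Summing the first entries of the returned list gives the cumulative variance captured by the leading components, which is the quantity reported for the first three components above.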
By projecting the coordinates of our 12-dimensional data on the space spanned by the first 3 principal components, we can visualize the distribution of the metric for each network in this new axis system. As seen in Fig. 2b,c, networks arising from the same process tend to have similar topological properties, hence showing clear groupings in the principal-component space.
The autonomoussystem and Facebook networks form two clearly bounded clusters in the threefirst principalcomponent space. Despite being the category with more instances, all the autonomoussystem instances are close to − 0.5 in the first component and to 0 in the second and third components. Since in the first principal component, \({\rho }_{i}\) have positive and negative loadings, we cannot state if those values close to zero are the consequence of low values of all components, or they are the consequence of the balance between positive and negative loadings. Tables include in Supplementary Information Sect. 4 show the range in which the 12 dimensions vary. The analysis of ranges for each type of network allows for the classification of graphs. In the case of Facebook graphs, most analyzed instances have values of \({\rho }_{3}\), \({\rho }_{4}\), \({\rho }_{6}\), \({\rho }_{7}\), \({\rho }_{8}\), \({\rho }_{11}\), \({\rho }_{12}\) that are close to zero and the values of \({\rho }_{1}\), \({\rho }_{5}\), \({\rho }_{9}\), \({\rho }_{10}\) are close to one. In Supplementary Information, the reader can find a detailed explanation of \({\rho }_{i}\) distribution for each type of network.
Regarding the two infrastructure networks, road and power networks form two independent clusters. Although some road networks are close to some power grids in the space defined by the first and second principal components, they are clearly delimited in the other two projections of the first three principal components.
Both road and power networks have low values for the second component, that is, low values of \({\rho }_{1}\), \({\rho }_{9}\) and \({\rho }_{10}\). Accordingly, the number of connections in comparison with the minimum spanning tree is low, there is a low number of triangles in the network, and they do not tend to share vertices. The instances of road and power networks that have similar values for the second principal component have a similar number of edges per node. They are the power networks with the lowest number of lines per node with respect to other power networks and the road networks with the highest number of lines per node in their category.
Unlike social networks, connections in infrastructure networks are cost-intensive and they are conditioned by topological, morphological, technical, economical, permitting, environmental, managerial, and political factors^{34}. Consequently, the influence of all those factors may lead to different topological properties depending on regions. Furthermore, in the case of power networks, graphs may include different voltage levels or they may be the result of different model assumptions^{35}. This uncertainty leads to a lack of consensus about some of the topological properties of power networks^{36}.
The cluster with the most variation among its members belongs to the enzymes group. This shows that a network cannot be assigned to the enzyme group as clearly as, for instance, to the Facebook group. The green area that shows the range in Fig. 3 covers almost the entire dodecagon. The topological properties of enzymes are clearly case dependent.
Finally, we can also see two clusters for the web and retweet groups. In the case of web networks, there is a large variation in the third component, which ranges from −0.3 to 0.7. This variation is caused by the significant difference in \({\rho }_{11}\) (triangle-connectivity coefficient). Although the median of the analyzed instances has a low value, this coefficient ranges from 0 to 1. In the web case, we also see that although most instances have a triangle coefficient (\({\rho }_{8}\)) close to zero, there is an instance in which \({\rho }_{8}\) tends to 1 (the network is mainly formed by triangles). This coefficient is coherent with the network average clustering coefficient^{32}. Accordingly, this framework also supports the quick detection of potential outliers.
PCA can also be applied to each set of networks independently. In that case, the dimensions with larger loadings in the first components are the ones that exhibit the most variance within each set; hence, those dimensions provide information about the topological differences between networks of the same set. Dimensions that have similar values for all networks in the set will have a low contribution to the first components, as they are characteristic of those networks. The explained variance for each principal component and the coefficients that shape the first component are shown in Figs. 4 and 5, respectively. A low-dimensional representation of the projections of the metrics onto the first three principal components for each set of networks can be seen in Supplementary Information Sect. 3. In the case of the road networks, the first component explains 88.5% of the variance. This component is mainly defined by \({\rho }_{6}\) and \({\rho }_{7}\). Therefore, the difference among road networks lies in the number of nodes that are part of a node string in the network and the length of those strings.
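The statement that a dimension with nearly identical values across a set contributes little to the first components can be checked on a toy example. The sketch below uses three hypothetical dimensions with made-up spreads (none of the numbers come from the paper) and assumes centred, unscaled data.

```python
import numpy as np

rng = np.random.default_rng(1)
n_networks = 30

# Hypothetical set: dimension 0 varies widely, dimension 1 moderately,
# dimension 2 is almost identical for every network in the set.
X = np.column_stack([
    rng.uniform(0.0, 1.0, n_networks),
    rng.uniform(0.4, 0.6, n_networks),
    0.9 + rng.normal(0.0, 1e-3, n_networks),
])

Xc = X - X.mean(axis=0)
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)
first_loading = np.abs(Vt[0])          # absolute weights in the first component
explained = S**2 / np.sum(S**2)

print("first-component loadings:", first_loading.round(3))
print("explained variance:", explained.round(3))
```

In this toy set the first component is dominated by the widely varying dimension, while the near-constant one receives a loading close to zero. Note that this behaviour depends on the assumption that the data are not standardized to unit variance beforehand.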
When analyzing power networks, we observe that the first component only explains 44% of the variance. Consequently, a larger number of coefficients is needed to describe and explain the differences among power networks. The first component is mainly described by \({\rho }_{1}\), \({\rho }_{2}\), \({\rho }_{3}\), \({\rho }_{4}\), \({\rho }_{9}\), \({\rho }_{12}\). Five principal components are needed to explain 95% of the variance of the data. This increases the number of metric dimensions required to have a deep understanding of power-network topology. In the case of Facebook, the first component explains 72% of the variance. In this case, the main differences lie in the leaf coefficient, leaf-connection degree, and triangle degree.
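Counting how many components are required to reach a given share of variance is a simple cumulative sum. The helper below is a hypothetical sketch (the name `components_needed` is not from the paper). The only figures taken from the text are the three pooled ratios reported earlier (68.7%, 19.4% and 5.3%); the per-component ratios of the power-network set are not given in this section, so they are not reproduced.

```python
from itertools import accumulate

def components_needed(explained_ratios, threshold=0.95):
    """Return the smallest number of leading components whose cumulative
    explained-variance ratio reaches `threshold`, or None if the ratios
    supplied never reach it."""
    for k, cumulative in enumerate(accumulate(explained_ratios), start=1):
        if cumulative >= threshold:
            return k
    return None

# Ratios reported in the text for the pooled analysis (first three components only).
pooled = [0.687, 0.194, 0.053]
print(components_needed(pooled, threshold=0.90))   # reached with the third component
print(components_needed(pooled, threshold=0.95))   # not reachable from three ratios alone
```

Applied to the full list of ratios of a given set, the same function would return the number of components behind statements such as the 95% figure quoted for power networks.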
Additionally, two more dimensionality-reduction techniques have been implemented to compare their results with those of the PCA. Firstly, Independent Component Analysis (ICA), as described by Hyvärinen and Oja^{37}, is employed. Unlike PCA, ICA tries to project the original data into a subspace where the components are maximally independent. This technique is often used to uncover hidden structures in the original data. Secondly, a Self-Organizing Map (SOM), proposed by Kohonen^{38}, is fitted, and its results are compared to the PCA low-dimensional representation. The SOM is a competitive learning algorithm that tries to find a low-dimensional representation of the data in such a way that the topological ordering properties of the original data are preserved. These two models are analyzed in Supplementary Information Sect. 5, where both are compared with the PCA study carried out in this section. The results show a similar ordination of the networks in both models, validating the PCA study. The clusters obtained by these two additional methods match those obtained with the principal components, highlighting the capabilities of the proposed method to explain network topology.
The results show the strengths of the proposed method to compare networks of different nature and to find the topological differences among networks of the same nature.
Conclusions
The analysis of network graphlets, a local topological statistic, gives rise to a new description of the global topology of complex networks. This paper introduces an innovative method that analyzes the interaction among graphlets to explain and characterize network topology. This method is based on 2- and 3-node graphlets (three graphlets and four orbits) that are easily derived from the adjacency matrix. Therefore, it overcomes the limitation of counting higher-order graphlets, which might be cost-intensive for large networks.
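To illustrate why these counts are inexpensive, the sketch below derives the four orbits of the 2- and 3-node graphlets directly from an adjacency matrix using standard counting identities. It is not the authors' implementation; the function name `orbit_counts`, the orbit labels and the example graph are assumptions made for illustration.

```python
def orbit_counts(adj):
    """Per-node counts of the four orbits of 2- and 3-node graphlets.

    `adj` is a symmetric 0/1 adjacency matrix (list of lists) without self-loops.
    Returns a list of tuples (edge, path_end, path_middle, triangle), one per node.
    """
    n = len(adj)
    neighbours = [{j for j in range(n) if adj[i][j]} for i in range(n)]
    degree = [len(nb) for nb in neighbours]
    counts = []
    for v in range(n):
        # Triangles through v: adjacent pairs among v's neighbours.
        triangles = sum(len(neighbours[u] & neighbours[v]) for u in neighbours[v]) // 2
        # v in the middle of an open 3-node path: neighbour pairs that are not adjacent.
        path_middle = degree[v] * (degree[v] - 1) // 2 - triangles
        # v at the end of an open 3-node path: two-step walks that do not close a triangle.
        path_end = sum(degree[u] - 1 for u in neighbours[v]) - 2 * triangles
        counts.append((degree[v], path_end, path_middle, triangles))
    return counts

# Small hypothetical graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
for node, c in enumerate(orbit_counts(adj)):
    print(node, c)
```

Global graphlet counts follow from the per-node values: the number of edges is half the sum of the first entries, the number of open 3-node paths is the sum of the `path_middle` entries, and the number of triangles is one third of the sum of the last entries.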
The application of the novel framework to five real networks shows that the proposed method is consistent with the global statistics traditionally used to characterize network structure. Furthermore, it overcomes two of their main drawbacks: the use of metrics based on average values and the application of metrics that do not scale linearly with network size. Accordingly, the comparison among networks of different sizes does not require any analysis of metric scalability.
The proposed method has also been validated with a large-sample study of networks that arise in different fields. The results show that the information provided by this novel metric can be used to identify the underlying topological features of the networks and even to provide a visual tool to distinguish networks with different properties.
Consequently, this method might explain the evolution in both local and global properties of networks in which growth affects the whole structure. It can also be used to compare networks where network growth does not necessarily imply a change in local properties. This is common in infrastructure networks.
Finally, this work sets up a systematic analysis, consisting of a 12-dimensional metric, to explain the properties of the network structure. Moreover, the proposed method allows for the translation of topological properties into other scientific dimensional languages. This is possible because global properties are explained from local structures that are easily interpretable.
References
Boccaletti, S., Latora, V., Moreno, Y., Chavez, M. & Hwang, D.-U. Complex networks: Structure and dynamics. Phys. Rep. 424, 175–308 (2006).
Newman, M. Networks: An Introduction (Oxford University Press, Oxford, 2010).
Alon, U. Network motifs: Theory and experimental approaches. Nat. Rev. Genet. 8, 450–461 (2007).
Palla, G., Derényi, I., Farkas, I. & Vicsek, T. Uncovering the overlapping community structure of complex networks in nature and society. Nature 435, 814–818 (2005).
Barabási, A. & Bonabeau, E. Scale-free networks. Sci. Am. 288, 60–69 (2003).
Haddadi, H., Fay, D., Jamakovic, A., Maennel, O., Moore, A. W., Mortier, R. & Uhlig, S. On the importance of local connectivity for Internet topology models. in 21st International Teletraffic Congress, 1–8 (2009).
Albert, R. & Barabási, A.-L. Topology of evolving networks: Local events and universality. Phys. Rev. Lett. 85, 5234–5237 (2000).
Milo, R. et al. Network motifs: Simple building blocks of complex networks. Science 298, 824–827 (2002).
Artzy-Randrup, Y., Fleishman, S. J., Ben-Tal, N. & Stone, L. Comment on ‘network motifs: Simple building blocks of complex networks’ and ‘superfamilies of evolved and designed networks’. Science 305, 1107–1107 (2004).
Mazurie, A., Bottani, S. & Vergassola, M. An evolutionary and functional assessment of regulatory network motifs. Genome Biol. 6, R35 (2005).
Pržulj, N., Corneil, D. G. & Jurisica, I. Modeling interactome: Scale-free or geometric? Bioinformatics 20, 3508–3515 (2004).
Milenković, T. & Pržulj, N. Uncovering biological network function via graphlet degree signatures. Cancer Inf. 6, 257–273 (2008).
Ahmed, N. K., Neville, J., Rossi, R. A., Duffield, N. G. & Willke, T. L. Graphlet decomposition: Framework, algorithms, and applications. Knowl. Inf. Syst. 50, 689–722 (2017).
Marcus, D. & Shavitt, Y. RAGE—A rapid graphlet enumerator for large networks. Comput. Netw. 56, 810–819 (2012).
Wernicke, S. & Rasche, F. FANMOD: A tool for fast network motif detection. Bioinformatics 22, 1152–1153 (2006).
Hočevar, T. & Demšar, J. A combinatorial approach to graphlet counting. Bioinformatics 30, 559–565 (2014).
Itzhack, R., Mogilevski, Y. & Louzoun, Y. An optimal algorithm for counting network motifs. Phys. A 381, 482–490 (2007).
Milenković, T., Ng, W. L., Hayes, W. & Pržulj, N. Optimal network alignment with graphlet degree vectors. Cancer Inf. 9, 121–137 (2010).
Crawford, J. & Milenković, T. GREAT: GRaphlet Edge-based network AlignmenT. https://arXiv.org/1410.5103 [cs, q-bio] (2014).
Malod-Dognin, N. & Pržulj, N. L-GRAAL: Lagrangian graphlet-based network aligner. Bioinformatics 31, 2182–2189 (2015).
Malod-Dognin, N. & Pržulj, N. GR-Align: Fast and flexible alignment of protein 3D structures using graphlet degree similarity. Bioinformatics 30, 1259–1265 (2014).
Hayes, W., Sun, K. & Pržulj, N. Graphlet-based measures are suitable for biological network comparison. Bioinformatics 29, 483–491 (2013).
Rossi, R. A. & Ahmed, N. K. Role discovery in networks. IEEE Trans. Knowl. Data Eng. 27, 1112–1131 (2015).
Yaveroğlu, Ö. N. et al. Revealing the hidden language of complex networks. Sci. Rep. 4, 4547 (2014).
Espejo, R., Lumbreras, S. & Ramos, A. A complex-network approach to the generation of synthetic power transmission networks. IEEE Syst. J. https://doi.org/10.1109/JSYST.2018.2865104 (2018).
Aparício, D., Ribeiro, P. & Silva, F. Network comparison using directed graphlets. https://arXiv.org/1511.01964 [physics, q-bio] (2015).
Azari, H. & Airoldi, E. M. Graphlet decomposition of a weighted network. 10.
Milenković, T., Lai, J. & Pržulj, N. GraphCrunch: A tool for large network analyses. BMC Bioinform. 9, 70 (2008).
Melckenbeeck, I., Audenaert, P., Colle, D. & Pickavet, M. Efficiently counting all orbits of graphlets of any order in a graph using autogenerated equations. Bioinformatics 34, 1372–1380 (2018).
Ahmed, N. K., Neville, J., Rossi, R. A. & Duffield, N. Efficient graphlet counting for large networks. in 2015 IEEE International Conference on Data Mining, 1–10 (2015). https://doi.org/10.1109/ICDM.2015.141.
SNAP: Network datasets: Autonomous systems—AS733. https://snap.stanford.edu/data/as733.html. Accessed 15 August 2019.
Rossi, R. & Ahmed, N. The network data repository with interactive graph analytics and visualization. in Twenty-Ninth AAAI Conference on Artificial Intelligence (2015).
TYNDP Maps + Data. https://www.entsoe.eu/majorprojects/tenyearnetworkdevelopmentplan/mapsanddata/Pages/default.aspx. Accessed 20 September 2019.
Xie, F. & Levinson, D. Modeling the growth of transportation networks: A comprehensive review. Netw. Spat. Econ. 9, 291–307 (2009).
Espejo, R., Lumbreras, S. & Ramos, A. Analysis of transmission-power-grid topology and scalability, the European case study. Phys. A 509, 383–395 (2018).
Pagani, G. A. & Aiello, M. The power grid as a complex network: A survey. Phys. A 392, 2688–2700 (2013).
Hyvärinen, A. & Oja, E. Independent component analysis: Algorithms and applications. Neural Netw. 13, 411–430 (2000).
Kohonen, T. The self-organizing map. Proc. IEEE 78, 1464–1480 (1990).
Acknowledgements
This work was supported by the Spanish Ministry of Economy, Industry, and Competitiveness; project number PCIN2015150 and by the National Natural Science Foundation of China: 51877181.
Author information
Contributions
R.E. conceived the original idea and wrote the manuscript text with support from G.M. and F.P. R.E., G.M. and F.P. analyzed the data and performed the numerical calculations presented in the paper. A.R. and S.L. encouraged R.E. to investigate this topic, contributing novel ideas and supervising the findings of this work. T.H. and E.B. aided in the interpretation of the results and reviewed the manuscript. All authors provided critical feedback and helped shape the research, analysis, and manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Espejo, R., Mestre, G., Postigo, F. et al. Exploiting graphlet decomposition to explain the structure of complex networks: the GHuST framework. Sci Rep 10, 12884 (2020). https://doi.org/10.1038/s41598020697951