Introduction

Many social and technological networks evolve over time after they are established. Previous studies have revealed that real networks possess many different structural features, such as various degree distributions1, different levels of clustering2, the presence or absence of communities3, assortative or disassortative mixing patterns4, and long or short average shortest distances, which has attracted much attention to building models that mimic network evolution5,6. Meanwhile, many latent mechanisms have been proposed, such as rich-get-richer7, good-get-richer8, stability constraints9, homophily10 and clustering11. However, a single pure mechanism is usually insufficient to depict real-world networks precisely, because these networks differ in so many structural aspects. Therefore, researchers have mixed different mechanisms to obtain better simulations, such as the mixture of clustering and preferential attachment11,12, popularity and randomness13, popularity and similarity14, and topological distance and geographical distance15. In all, networks are likely to be driven by multiple mechanisms, which raises a question: is it possible to measure the contribution of each mechanism to the network evolution?

The most basic way to evaluate a network model or an underlying mechanism is based on the comparison of some selected structural features. It supposes that a model is better than another if its generated network is closer to the target network in terms of those selected features. But such a method cannot be well validated, since there is no fair standard for selecting representative features from the countless structural features available. Without considering any specific structural feature, we previously proposed a method based on likelihood analysis to fairly evaluate network models16. Therein, we calculate the likelihood of appearance of each newly created link according to the model's mechanism and then multiply these values together to get the likelihood of the set of new links. For a group of models, the one giving the highest likelihood is considered the most suitable. This method is inspired by the link prediction approach, which aims at estimating the likelihood of the existence of a link based on the observed links17. According to this definition, if the principle of a link prediction algorithm is consistent with the mechanism of a given network, the algorithm should provide accurate predictions. Therefore, one can also evaluate the latent mechanisms according to the prediction results of the corresponding link prediction algorithms18,19. In this paper, we take both the likelihood analysis and the link prediction methods into consideration because they are both free of any specific structural features. To our knowledge, these methods have only been applied to judge which mechanism is better among a series of mechanisms, and have never been applied to measure the contributions of multiple mechanisms in network evolution.

The core idea of the above methods is to estimate the likelihood of appearance of links, which inspires us to measure the contributions of multiple mechanisms by calculating the likelihood using all the mechanisms simultaneously. We therefore design a formula that re-calculates the likelihood of every link by assigning each mechanism a tunable weight. The optimal group of weights is the one that maximizes the likelihood of all links (likelihood analysis method) or the prediction accuracy (link prediction method). To test the effectiveness, we produce numerous model networks that can be controlled to follow multiple mechanisms, such as popularity, clustering and randomness, with different weights. By comparing the estimated contributions with the known weights, we find that both methods are effective in judging which mechanism is stronger. In particular, the one based on likelihood analysis gives very accurate estimations. Further, we discuss the advantages of the likelihood analysis method and the disadvantages of the link prediction method that lead to its worse performance. Finally, we apply the likelihood analysis method to different kinds of real-world networks to see how popularity and clustering co-evolve in real complex networks. These networks are collected from different domains, including technology networks and social networks, and from different countries, e.g. the USA and China. The results show that most of these networks evolve under both mechanisms but with quite different weights.

The main contributions are twofold. On the theoretical side, we clarify that the multiple mechanisms of complex systems can be measured in a quantitative way, and we provide a unified, efficient and extensible measurement method. In terms of specific conclusions, we find some interesting properties of real-life networks. For example, the clustering mechanism is present in all the social networks studied, but on platforms mainly designed for social activities (Facebook and Flickr) the clustering effect is much stronger than on platforms where the primary demands of users are not social interaction, such as watching videos on Youtube and reading blogs on ScienceNet. In addition, we show that the evolving mechanisms may change remarkably in time for some real networks (e.g., the Internet), so that the links associated with new nodes are created for different reasons than the links between old nodes. This is usually ignored in known models, but is in accordance with some experimental studies on the Internet20,21.

Results

Measurement methods

Consider two snapshots of an evolving undirected network at times $t_1$ and $t_2$ ($t_1 < t_2$), denoted by $G_1(V_1, E_1)$ and $G_2(V_2, E_2)$ respectively, where $V_1$ ($V_2$) and $E_1$ ($E_2$) are the sets of nodes and links respectively. The set of new links is $E^{new} = E_2 \setminus E_1$. In the following we first introduce two previous methods of evaluating underlying mechanisms in network evolution and then present how we measure the contributions of multiple mechanisms.

One method is based on likelihood analysis16, whose key idea is to estimate the likelihood of appearance of each new link by multiplying the probabilities of selecting its two endpoints. For example, if the links are all randomly created, the likelihood of each link depends only on $N$, the number of nodes of the network. Then we can get the likelihood of all the new links as the product $\mathcal{L} = \prod_{(x,y) \in E^{new}} L_{xy}$, where $E^{new}$ is the set of new links and $L_{xy}$ is the likelihood of the new link $(x, y)$. For a group of models, we can calculate $\mathcal{L}$ for each of them, and the one with the highest likelihood is considered the most suitable.

The other method is based on link prediction18,19. A link prediction index assigns a score, following a certain principle, to each non-observed link, including both new links and nonexistent links (the latter form the set $U \setminus E_2$, where $U$ is the universal set containing all possible links and $E_2$ is the link set of the later snapshot). Then we can rank these links in descending order of score. A link prediction index is good if it assigns the new links higher rankings than the nonexistent links. To measure this in a quantified way, we introduce the AUC value (area under the receiver operating characteristic curve17,27), which is discussed in detail in Materials and Methods. We then assume that a mechanism is more suitable to depict the network evolution if the corresponding link prediction algorithm results in a higher AUC.
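As a minimal sketch of the ranking comparison just described, the following Python function estimates the AUC by repeated random comparisons between a new link and a nonexistent link. The name `sampled_auc`, its parameters and the default number of comparisons are illustrative assumptions, not the implementation used in this paper. Ties are counted with weight 0.5, as in the standard definition used in the link prediction literature17.

```python
import random


def sampled_auc(new_scores, non_scores, n=10000, seed=0):
    """Estimate AUC from n random (new link, nonexistent link) comparisons.

    new_scores / non_scores are lists of scores assigned by some link
    prediction index (hypothetical inputs for this sketch).
    """
    rng = random.Random(seed)
    higher = 0  # comparisons in which the new link scores higher
    ties = 0    # comparisons in which both links score the same
    for _ in range(n):
        a = rng.choice(new_scores)
        b = rng.choice(non_scores)
        if a > b:
            higher += 1
        elif a == b:
            ties += 1
    return (higher + 0.5 * ties) / n
```

An index that always ranks new links above nonexistent ones gives a value of 1, while an index that cannot separate them at all gives a value of about 0.5.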

As described above, the key point of both methods is to estimate the likelihoods of links. We are motivated to re-estimate the likelihood by considering all the mechanisms together, with tunable parameters $\lambda_i$ (which must sum to 1) indicating their contributions. According to probability theory, we define the likelihood of a link $(x, y)$ as the expectation of its likelihoods over all the mechanisms, written as

$$L_{xy} = \sum_{i=1}^{m} \lambda_i L_{xy}^{(i)}, \qquad \sum_{i=1}^{m} \lambda_i = 1, \tag{1}$$

where $m$ is the number of considered mechanisms and $L_{xy}^{(i)}$ is the likelihood of the link under mechanism $i$. Thus, for the method based on likelihood analysis, we expect the group of parameters that maximizes the likelihood of all the new links, $\mathcal{L}$, to indicate the contribution of each mechanism. Similarly, for the method based on the link prediction model, the group of parameters that maximizes the prediction accuracy (AUC) would indicate the contribution of each mechanism.
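The weighted likelihood and its maximization can be sketched as follows, assuming the per-mechanism likelihood of every new link has already been computed. The helper names (`log_likelihood`, `simplex_grid`, `best_weights`) and the plain grid search are illustrative choices for this sketch, not the optimizer used in this paper. The sketch sums logarithms instead of multiplying probabilities, which gives the same maximizer and avoids numerical underflow for large link sets.

```python
import math
from itertools import product


def log_likelihood(weights, per_mechanism):
    """Log of the product over links of sum_i weights[i] * L_i(link).

    per_mechanism is a list with one tuple per new link; each tuple holds
    that link's likelihood under every mechanism (hypothetical input).
    """
    total = 0.0
    for link_likelihoods in per_mechanism:
        mixed = sum(w * l for w, l in zip(weights, link_likelihoods))
        if mixed <= 0.0:
            return float("-inf")  # a zero-likelihood link rules out these weights
        total += math.log(mixed)
    return total


def simplex_grid(m, steps):
    """Yield all weight vectors of length m on a regular grid that sum to 1."""
    for combo in product(range(steps + 1), repeat=m - 1):
        if sum(combo) <= steps:
            last = steps - sum(combo)
            yield tuple(c / steps for c in combo) + (last / steps,)


def best_weights(per_mechanism, steps=10):
    """Return the grid point that maximizes the log-likelihood."""
    m = len(per_mechanism[0])
    return max(simplex_grid(m, steps),
               key=lambda w: log_likelihood(w, per_mechanism))
```

For two mechanisms this reduces to scanning a single weight from 0 to 1. For three mechanisms the same code scans the triangle of weight vectors that sum to 1.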

Comparisons between the two methods

To examine the effectiveness of the measurement methods, we apply them to model networks whose evolution can be controlled. Two well-known mechanisms, popularity and clustering, are considered first. Popularity denotes that nodes with higher degree are more attractive, while clustering means that links which form more triangles are preferred. The model network starts from a loop consisting of five nodes and grows following two rules at each step:

  1. add one new node with one new link, which connects this new node to one old node;

  2. add links, but self-loops and multi-links are not accepted.

Every new link is created following either the popularity mechanism or the clustering mechanism, which is controlled by a tunable parameter $\lambda^*$ ranging from 0 to 1. $\lambda^* = 0$ means that all the links are created following the popularity mechanism, while $\lambda^* = 1$ means that all the links are created following the clustering mechanism.

To implement the popularity mechanism, we choose preferential attachment, which was described by Barabási and Albert7. They defined the probability of selecting node $x$ for new links as $k_x / \sum_j k_j$, where $k_x$ is the degree of node $x$. Similarly, for the clustering mechanism we use the number of common neighbors to measure the likelihood of creating a link between $x$ and $y$. In detail, we first select a node $x$ for the new link and then select the other node $y$ preferentially, with probability proportional to the number of common neighbors $|\Gamma(x) \cap \Gamma(y)|$, where $\Gamma(x)$ is the set of neighbors of $x$. Node $x$ is selected randomly, to differ from the popularity mechanism. Notice that the new link added together with the new node at each step cannot be created by the clustering mechanism as defined, because a new node has no common neighbors with any old node. So in that case we randomly select an old node to form this new link, again to differ from preferential attachment. By tuning the model's clustering weight $\lambda^*$ from 0 to 1 with step length 0.1, we produce 100 model networks for every value of $\lambda^*$. The question then reduces to estimating the value of $\lambda^*$ for each model network through equation (1).
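The growth rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the generator used for the experiments: the number of links added in the second rule (`links_per_step`), the inclusion of the newly added node among the candidates of the second rule, and the cap on rejected attempts are all choices made for this sketch. The parameter `lam` plays the role of the model's clustering weight.

```python
import random


def grow_network(steps, lam, links_per_step=3, seed=None):
    """Grow a network from a five-node loop, mixing clustering and popularity."""
    rng = random.Random(seed)
    adj = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}  # five-node loop

    def pick_by_degree():
        # Preferential attachment: probability proportional to degree.
        nodes = list(adj)
        return rng.choices(nodes, weights=[len(adj[v]) for v in nodes])[0]

    def add_link(x, y):
        # Reject self-loops and multi-links.
        if x == y or y in adj[x]:
            return False
        adj[x].add(y)
        adj[y].add(x)
        return True

    def propose_pair():
        if rng.random() < lam:
            # Clustering: random first node, second node chosen with
            # probability proportional to the number of common neighbors.
            x = rng.choice(list(adj))
            cands = [v for v in adj if v != x and v not in adj[x]]
            weights = [len(adj[x] & adj[v]) for v in cands]
            if not cands or sum(weights) == 0:
                return None
            return x, rng.choices(cands, weights=weights)[0]
        # Popularity: both endpoints chosen by degree.
        return pick_by_degree(), pick_by_degree()

    for _ in range(steps):
        # Rule 1: the new node links to a random old node (clustering case)
        # or to a degree-preferred old node (popularity case).
        old = rng.choice(list(adj)) if rng.random() < lam else pick_by_degree()
        new = len(adj)
        adj[new] = set()
        add_link(new, old)
        # Rule 2: add further links, retrying rejected proposals.
        added = attempts = 0
        while added < links_per_step and attempts < 1000:
            attempts += 1
            pair = propose_pair()
            if pair and add_link(*pair):
                added += 1
    return adj
```

Calling `grow_network(steps=1000, lam=0.3, seed=1)` would produce one model network whose links are created by the clustering rule with probability 0.3 and by the popularity rule otherwise.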

Link prediction method

Corresponding to the implementation of the popularity mechanism, a link prediction index named the Preferential Attachment (PA) index7,17,22 has been proposed, defined as the product of the degrees of the two nodes, $s^{PA}_{xy} = k_x k_y$. The Common Neighbor (CN) index22, which accords with the clustering mechanism, is written as $s^{CN}_{xy} = |\Gamma(x) \cap \Gamma(y)|$. Notice that many node pairs have the same number of common neighbors, or no common neighbor at all, which makes them indistinguishable and leads to degeneracy of states23. To tackle this problem while keeping the predictive power of the CN index unchanged, we add a small random number $\epsilon$ to every $s^{CN}_{xy}$, rewritten as $s^{CN}_{xy} = |\Gamma(x) \cap \Gamma(y)| + \epsilon$. Because $s^{PA}_{xy}$ is typically much larger than $s^{CN}_{xy}$, we must normalize both before combining them; otherwise $s^{CN}_{xy}$ would have no effect unless it were strengthened. Thus we define the hybrid index as

$$s^{Hyb}_{xy} = \lambda \tilde{s}^{CN}_{xy} + (1 - \lambda) \tilde{s}^{PA}_{xy}, \tag{2}$$

where $\tilde{s}^{CN}_{xy}$ and $\tilde{s}^{PA}_{xy}$ are the values normalized by the means of $s^{CN}$ and $s^{PA}$ respectively. In detail, $\tilde{s}^{CN}_{xy} = s^{CN}_{xy} / \langle s^{CN} \rangle$ and $\tilde{s}^{PA}_{xy} = s^{PA}_{xy} / \langle s^{PA} \rangle$, where $\langle s \rangle$ is the mean value of $s$. By tuning $\lambda$ from 0 to 1, we can easily find the optimal $\lambda$ that maximizes the prediction accuracy (AUC). Note that the CN index cannot work if any endpoint of a new link appears only after the first snapshot. So we remove all the new links with such nodes when implementing the link prediction method. For consistency, such new links are also ignored when applying the likelihood analysis method.
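A hedged sketch of the hybrid score follows: it scores every unconnected node pair of an observed network by a weighted sum of the mean-normalized CN and PA indices. The function name `hybrid_scores`, the size of the random tie-breaking term `eps`, and the choice to take the means over all unconnected pairs are assumptions of this sketch. The input is assumed to be a dictionary of neighbor sets with no isolated nodes.

```python
import random
from itertools import combinations


def hybrid_scores(adj, lam, eps=1e-6, seed=0):
    """Return {pair: score} for all unconnected pairs of the observed network."""
    rng = random.Random(seed)
    pairs = [(x, y) for x, y in combinations(sorted(adj), 2)
             if y not in adj[x]]
    # CN index plus a small random number to break ties between pairs.
    cn = {p: len(adj[p[0]] & adj[p[1]]) + eps * rng.random() for p in pairs}
    # PA index: product of the two degrees.
    pa = {p: len(adj[p[0]]) * len(adj[p[1]]) for p in pairs}
    mean_cn = sum(cn.values()) / len(pairs)
    mean_pa = sum(pa.values()) / len(pairs)
    # Weighted sum of the mean-normalized indices; lam = 1 is pure CN.
    return {p: lam * cn[p] / mean_cn + (1 - lam) * pa[p] / mean_pa
            for p in pairs}
```

Scanning `lam` over a grid from 0 to 1 and computing the AUC of the resulting scores at each value gives the optimal weight of the link prediction method.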

Likelihood analysis method

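This method16 defines the likelihood of a link $(x, y)$ as the product of the likelihoods of selecting nodes $x$ and $y$. Thus, the likelihood under the popularity mechanism, $L^{PA}_{xy}$, and the likelihood under the clustering mechanism, $L^{CN}_{xy}$, follow from the selection probabilities of the two mechanisms (see Materials and Methods for the case-by-case definitions). Then the likelihood of $(x, y)$ has the form

$$L_{xy} = \lambda L^{CN}_{xy} + (1 - \lambda) L^{PA}_{xy}. \tag{3}$$

This model aims to maximize the likelihood of all the new links, written as

$$\mathcal{L} = \prod_{(x,y) \in E^{new}} L_{xy}. \tag{4}$$

Thus, we can also obtain the optimal $\lambda$ that maximizes $\mathcal{L}$. Notice that if $L_{xy} = 0$ for any new link, $\mathcal{L}$ will be meaningless. Please see the solution in Materials and Methods, where we also define the likelihoods needed when new links are considered without the limitation of new nodes.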

In Fig. 1, we present the trends of the AUC values (subfigures (a) and (b)) and of the likelihood $\mathcal{L}$ (subfigures (c)-(h)) with increasing $\lambda$. The contributions of the popularity and clustering mechanisms can be estimated from the peak values. We can see that the optimal $\lambda$ obtained from both methods increases as the known clustering weight of the model, $\lambda^*$, grows. For a more intuitive view, we plot the correlation between $\lambda^*$ and the optimal $\lambda$ in Fig. 2(a). The likelihood analysis method gives very accurate estimations, while the link prediction method fails when $\lambda^*$ is large. The reasons for this failure are threefold: (i) the CN mechanism embodies the principle of preferential attachment to some extent; (ii) the link prediction method provides too rough a description of the links; (iii) the link prediction model is not appropriate for measuring the mechanisms' contributions.

Figure 1

Measuring popularity and clustering based on the link prediction method and the likelihood analysis method respectively. The contributions are estimated through the peak values. Subfigures (a) and (b) present the average AUC values produced by the link prediction method, obtained by averaging over 100 implementations on 100 model networks. The others present the likelihood values produced by the likelihood analysis method; therein, each curve corresponds to one model network. $\lambda$ corresponds to the coefficient in equation (1). $\lambda^*$ denotes the contribution of the clustering mechanism in the model networks. Because the likelihoods for the networks are not of the same order of magnitude, we use 12xxx instead of the exact values; 12xxx means an unspecified value above 11999 and below 13000.

Figure 2

Correlation between the optimal $\lambda$ and $\lambda^*$. $\lambda^*$ is the known proportion of the clustering mechanism relative to the popularity mechanism. $\lambda$ is the value estimated by the measurement methods in this paper. Subfigure (a) compares the link prediction method and the likelihood analysis method, where no new links with new nodes are considered. Subfigure (b) shows only the results of the likelihood analysis method without the limitation of new nodes.

Firstly, the CN mechanism embodies the principle of preferential attachment, because two nodes with large degrees have a higher chance of having common neighbors. PA, however, never considers the number of common neighbors shared by a node pair. When the clustering weight of the model is small, few new links are restricted to form triangles, and it is easy to distinguish the CN mechanism from the PA mechanism because most new links share few or even no common neighbors. When the clustering weight becomes larger, the formation of triangles becomes common, but the new links with many common neighbors also tend to have high-degree endpoints. There also exist many new links with few common neighbors but high-degree endpoints. These links lead to the failure of the link prediction method. We explain this in detail through an example, together with the third reason. However, this problem, which is caused by the network model, restricts the link prediction method but does not influence the likelihood analysis method. This is due to the advantages of the likelihood analysis method, which are discussed below.

The second reason is that the link prediction method describes the links more roughly than the likelihood analysis method does. For example, suppose there are two pairs of unconnected nodes, $(a, b)$ and $(c, d)$, which both have two common neighbors, but the degrees of $a$ and $b$ are much higher than those of $c$ and $d$. The probabilities that these links appear are obviously quite different, but the CN index assigns them the same value, i.e., a score of 2. In contrast, the likelihood analysis method can clearly distinguish them by applying probabilistic methods. Following the definition, we can get the likelihoods,

and in the similar form, which are proved in Materials and Methods. Obviously, is far different from , because and are much larger than both and .

Finally, in the link prediction method each new link needs to be compared with all the (sampled) nonexistent links, so that we can find the best link prediction index, namely the one that ranks the new links above the nonexistent links. But when we try to improve the rankings of the new links by tuning $\lambda$, there always exist some new links whose rankings fall because the rankings of some nonexistent links improve. That is to say, the nonexistent links, which are indispensable in the link prediction model, become barriers to measuring the mechanisms' contributions. By comparison, the likelihood analysis method optimizes the overall likelihood of the new links as a whole. Many studies have discussed properties that emerge only at the global level and vanish at the individual level, such as the function of organs and the power-law distribution of displacement, which holds at the group level but not at the individual level24. In our case, even when all the new links are created following the CN mechanism, some of them might appear to follow the PA mechanism because they have high-degree endpoints. Unless we consider the overall likelihood of these links, we cannot obtain an accurate estimation. Moreover, the likelihood analysis method is free from the effect of the nonexistent links. In fact, many unconnected node pairs are deemed likely to be linked, and these pairs lead us astray if they are treated as the reference standard, as in the link prediction method.

For clarity, we generate a small network following the CN mechanism to illustrate this failure. As shown in Fig. 3, new links are drawn as red dashed lines and labeled New. We also select six nonexistent links, labeled Non, for comparison. Clearly, a node pair with a high CN score usually also has a high PA score, which is caused by the preferential attachment principle embodied in the CN mechanism. This effect makes the estimation difficult. At first, we rank the links according to the CN score: Non1 and Non2 are behind only New1. Then we introduce the PA score: the rankings of New2 and New3 improve because of their larger PA scores, while Non2, with a lower PA score, gets a lower ranking. The prediction accuracy benefits from these changes. However, we also need to notice the change in Non1, which lowers the accuracy: Non1 has both a high CN score and a high PA score but is a nonexistent link. This is the uncontrollable effect referred to above. With such a link as the reference standard, it is difficult to obtain an accurate estimation.

Figure 3

Example network driven by clustering mechanism only and comparisons between the new links and some selected nonexistent links. Red dash links represent new links which are created following the clustering mechanism. New represents the IDs of new links, while Non represents the IDs of nonexistent links. The two end nodes of the link are labeled as and . is the number of common neighbors between and , corresponding to Common Neighbor Index. is calculated through Preferential Attachment Index. The numbers in “rankCN” column are the rankings based on (corresponding to λ = 1), while those in “rankHyb” column are the rankings based on (corresponding to ).

In summary, the likelihood analysis method wins because of two advantages: its exact description of individual links and its global perspective on all the new links. Both points are indispensable. By comparison, the link prediction method is limited by its rough description of individual links and by the uncontrollable effect of the nonexistent links. To be more stringent, we redefine the CN index to describe individual links more accurately, giving it the same form as the corresponding expression in the likelihood analysis method. But it still fails, as shown in Figure S1 in the Supporting Information. The result implies that the effect of the nonexistent links is the main reason.

In Fig. 2(b), we show another advantage of the likelihood analysis method. Owing to the drawback of the link prediction model, we do not consider the new links with new nodes in Fig. 2(a), but such new links do not limit the effectiveness of the likelihood analysis method. In fact, they slightly improve the accuracy of the estimation.

Verification through model networks with more mechanisms

To check generality, we examine the likelihood analysis method on model networks driven by more mechanisms. We introduce a randomness mechanism, in which the endpoints of new links are all randomly selected. As before, the model networks start evolving from a loop consisting of five nodes. At each step, one new node with one new link and three other links are added. Every link is created following the randomness mechanism, the clustering mechanism or the popularity mechanism, each with its own probability, where the three probabilities lie between 0 and 1 and sum to 1.

By maximizing the likelihood in equation (4), we can plot every group of estimated weights in a three-dimensional space. As shown in Fig. 4, red spots denote the estimated values, while green rectangles show the locations of the theoretical values. The tight fit again reflects the accurate estimation given by the likelihood analysis method.

Figure 4

Agreement between the estimated contributions and the theoretical values of the three mechanism weights. Red spots denote the estimated values obtained from the likelihood analysis method. Green rectangles denote the theoretical values.

Measuring popularity and clustering for real networks

Inspired by the effectiveness of the measurement method, we try to understand how the popularity and clustering mechanisms affect real-world networks. We collected nine networks, including Internet topologies, social networks, communication networks and collaboration networks. Each of them is divided into two parts based on time stamps: observed links and new links (see details in Materials and Methods and Table 1).

Table 1 The basic information of the real networks. is the number of nodes and is the number of links before . and are clustering coefficient25 and assortative coefficient4, respectively. is the average degree of network. denotes the degree heterogeneity defined as . and are the numbers of new nodes and links during . denotes the number of new links among old nodes only.

By calculating the likelihood of new links with equation (4), we can also easily find the optimal $\lambda$ for every real network, indicated by the peaks of the blue dashed curves in Fig. 5. The clustering mechanism is present in all the social networks, but plays different roles. The clustering effect is much stronger on Facebook and Flickr, platforms mainly designed for social activities, where people tend to form clusters. In contrast, on Youtube, ScienceNet and Epinions, the clustering effect is weaker than the popularity effect, because the primary demands of their users are not social interaction but watching videos (Youtube), reading blogs (ScienceNet) and rating products (Epinions). This makes sense, because people who offer better resources (e.g., excellent videos, great blogs) also hold greater appeal. In the collaboration network (Coauthor), clustering and popularity also co-exist. The existence of the clustering mechanism is natural, because many scientists have their own groups, in which advisors and students usually collaborate with each other. The popularity mechanism is also plausible, because famous groups are more competitive in attracting researchers. In the next experiment, we will see that the clustering effect becomes a little stronger after researchers have created their first link.

Figure 5

The optimal $\lambda$ of the likelihood analysis method for real networks. Blue dashed curves represent the likelihood calculated through new links without the limitation of new nodes, while red curves represent the likelihood calculated through new links without new nodes.

We further study the mechanisms for the new links among old nodes only, to observe the effect of new users. As shown by the red curves in Fig. 5, the optimal $\lambda$ tends to fall at different positions compared with the blue dashed curves. The differences are not obvious in the online social platforms, but are significant in the technology networks and the collaboration network. Such differences show that the evolving mechanisms may change remarkably in time and that the links associated with new nodes are created for different reasons than the links between old nodes. The result for the Internet is in accordance with some previous experimental results20,21. Similarly, in the collaboration network, after a researcher joins a new group, he or she will develop more collaborations with other members.

Discussion

Analyzing network evolution is not only a fundamental problem but also a long-standing challenge in network science. Previous studies focused on uncovering new mechanisms or improving known ones. In this paper, we raised a new question: how to quantitatively measure the contributions of multiple mechanisms that simultaneously affect the evolution of complex networks. Motivated by previous studies, we compared two measurement methods, based on link prediction and likelihood analysis respectively. Although the core idea of both is to estimate the likelihood of newly created links, the link prediction method fails in some cases. By analyzing their differences, we found that the likelihood analysis method successfully captures both the characteristics of new links at the individual level and the overall property of new links at the group level. In fact, many studies have discussed features or functions that emerge at the group level but vanish at the individual level, such as the function of organs, the collective behavior of ant colonies, and the power-law distribution of displacement, which holds at the group level but not at the individual level24. As a result, the likelihood analysis method is able to produce very accurate estimations.

The likelihood analysis method is promising because it is highly extensible: given a mechanism, the likelihood of new links can be easily estimated from the probabilities of choosing the two endpoints. Moreover, the method is very efficient. Most of the computing time is consumed by maximizing the likelihood, which is a well-studied optimization problem in engineering. Therefore, it is possible to trace the evolution of complex systems in real time.

From the results on the real-world networks, we can clearly observe the combined action of popularity and clustering. The results match our intuition, but are more informative than structural statistics alone. For example, a network with a high clustering coefficient25 is not necessarily driven by the clustering mechanism; the clustering may instead be the byproduct of another mechanism, such as spatially preferential attachment26. Moreover, the value of the clustering coefficient usually depends on the scale of the network, i.e., large-scale networks usually have smaller clustering coefficients than small-scale networks. Neither of these cases limits the likelihood analysis method, because the measurement of the links is directly based on the probability of selecting the endpoints under the given mechanism. In addition, we showed that the evolving mechanisms may change remarkably in time for some real networks. Owing to the efficiency of the likelihood analysis method, it is possible to trace the evolution of the networks and even of the mechanisms. Our results suggest that the multiple mechanisms of complex networks can be measured in a quantitative, unified and efficient way. In the future, we expect that the framework of this study can provide insights for understanding complex systems.

Materials and Methods

Link Prediction Method

Given the observed network $G_1$, a link prediction index assigns every non-observed link (including the new links $E^{new}$ and the nonexistent links $U \setminus E_2$) a score, according to which we can rank these links in descending order. An index is regarded as better if it gives the links in $E^{new}$ higher rankings than another index does. This is how we seek the optimal $\lambda$ in this paper.

To compare the indices in a quantified way, we introduce the AUC (area under the receiver operating characteristic curve27) to measure the accuracy of prediction based on the rankings. It can be interpreted as the probability that a randomly chosen new link (a link in $E^{new}$) is given a higher score than a randomly chosen nonexistent link. In the implementation, among $n$ independent comparisons, if there are $n'$ times in which the new link has the higher score and $n''$ times in which the new link and the nonexistent link have the same score, we define the AUC value as17:

$$\mathrm{AUC} = \frac{n' + 0.5\, n''}{n}.$$

If all the scores are generated from an independent and identical distribution, the AUC value should be about 0.5. Therefore, the degree to which the AUC value exceeds 0.5 indicates how much better the algorithm performs than pure chance. Note that the calculation of the AUC is based on sampling, so the result of equation (5) will be closer to the true value if we use a larger $n$. We have discussed the proper value of $n$ in the book Link Prediction28: if we expect the AUC value to have an error of less than 0.001 at the 90% confidence level, $n$ should be no less than 672400. We set $n$ in our experiments accordingly. The derivation is presented in the Supplementary Information.
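The stated bound on the number of comparisons can be reproduced with a standard sample-size argument. The derivation below is a sketch under assumptions that are ours and may differ from the one in the Supplementary Information: each comparison contributes a value in $\{0, 0.5, 1\}$, whose variance is at most $1/4$; the sample mean is treated as approximately normal; and the two-sided 90% critical value is rounded to $z \approx 1.64$. Requiring the error to stay below $\varepsilon = 0.001$ then gives:

```latex
z \sqrt{\frac{1/4}{n}} \le \varepsilon
\quad\Longrightarrow\quad
n \ge \left( \frac{z}{2\varepsilon} \right)^{2}
  = \left( \frac{1.64}{2 \times 0.001} \right)^{2}
  = 820^{2}
  = 672400 .
```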

Likelihood Analysis Method

In this method, we need to consider three cases for a chosen link $(x, y)$: (i) either $x$ or $y$ is a new node, i.e., one that appears after the first snapshot; (ii) both $x$ and $y$ are new nodes; (iii) both of them are old nodes.

For popularity mechanism, if one of them is new node, supposed as , then , where . If both of them are new nodes, . And if both of them are old nodes,

For clustering mechanism, once or/and are new nodes, no common neighbor they would share. Then we define, according to the implementation of clustering mechanism, if one of them is new node and if both of them are new nodes. If both of and are old nodes, . Denote that, if and do not share any common neighbors, here needs be modified to keep away from 0. In such case, we re-define due to two reasons: (i) can not be 0, or else the product will be 0 too; (ii) must be small and may be variant for different networks. So we adopt the certain value which is not more than the probability of select one node following popularity mechanism.

For randomness mechanism, if one of and is new node, . If both of them are new nodes, . And if both of them are old nodes, .

Proof of Equation (5)

The proof of can be reduced to proving . The number of common neighbors between and is equal to the number of the 2-steps paths, denoted as , where if the path exists, namely is the common neighbor of and ; otherwise . Then . Given the nodes and , can be considered as the amount of the 2-steps paths (). That is to say, both and must be the neighbors of . Therefore, the amount of the 2-steps paths is equal to because , namely . Moreover, if is not connected to directly, we can eventually prove that .
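The counting argument in this proof can be checked numerically. The sketch below assumes that the identity being established is that, for a fixed node, the total number of common neighbors it shares with all other nodes (the number of 2-step paths starting from it) equals the sum, over its neighbors, of their degree minus one. This reading of the proof, the function name `two_step_path_counts` and the small test graph are our assumptions. The sum here runs over all other nodes, including direct neighbors.

```python
def two_step_path_counts(adj, x):
    """Return both sides of the 2-step path identity for node x.

    adj maps every node to the set of its neighbors (undirected graph).
    """
    # Left side: common neighbors of x with every other node y.
    lhs = sum(len(adj[x] & adj[y]) for y in adj if y != x)
    # Right side: each neighbor z of x leads to (degree of z - 1) nodes
    # other than x itself.
    rhs = sum(len(adj[z]) - 1 for z in adj[x])
    return lhs, rhs


# A triangle (0, 1, 2) with a tail node 3 attached to node 2.
example = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
for node in example:
    left, right = two_step_path_counts(example, node)
    assert left == right
```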

Data Description

We collect nine networks and divide each of them into two parts, observed links and future links (corresponding to $E_1$ and $E^{new}$ respectively, as defined in the previous section), based on the time-stamps. The basic features are listed in Table 1.

  1. AS: An autonomous system (AS) within the Internet is a collection of connected Internet Protocol networks and routers under the control of one entity. The Route-views Project collected the Internet topology at the AS level at many different times; here we use the data of June 2006 to compose the Observed Links and that of December 2006 to compose the Future Links21,29.

  2. Internet: The Internet can be viewed as a collection of autonomous systems (AS), whose snapshots were created weekly by CAIDA (Center for Applied Internet Data Analysis). Mislove downloaded the entire history of their measurements, which covers the period from January 5th, 2004 until July 9th, 200730. In this paper, we choose November 20th, 2006 as the dividing point between Observed Links and Future Links, so that the number of future links is approximately 10% of the number of observed links.

  3. SN: ScienceNet (www.sciencenet.cn) is a virtual community for Chinese-speaking scientists. This data, consisting of two snapshots (July 22nd 2013 and August 12th 2013), was newly crawled from the web site by Xing Yu.

  4. Epinion: Epinions (www.epinions.com) is an online product rating site where users are connected by trust or distrust relationships. In the simplest case, we neglect the types of connections. The earliest link in the initial data31 was collected on September 1st, 2001, while the latest was collected on August 11th, 2003.

  5. Youtube: YouTube (www.youtube.com) is a popular video-sharing site that also involves a social network. The initial data, consisting of links created before Jan. 15th 2007, was collected by Mislove30.

  6. Flickr: Flickr (www.flickr.com) is a photo-sharing site based on a social network. This data was collected by Mislove et al.32. Here we only use a small sample, obtained by selecting the links with time stamps 2006-11-02 and 2006-11-03. The links created on 2006-11-03 are considered as future links and the rest of the links compose the observed network.

  7. FB: Facebook (www.facebook.com) is a social networking service with over one billion users. The initial data in33 were crawled between January 20th, 2009 and January 22nd, 2009. The time of link establishment is recorded as a UNIX time-stamp unless it cannot be determined; we set all undetermined time-stamps to 1.

  8. FBC: This data is also from www.facebook.com but differs from the friendships in FB. In this data, if a user posts on another user's wall on Facebook, a directed link is created from the poster to the owner of the wall. Since users may write multiple posts on a wall, or post on their own wall, the network collected in33 allows multiple edges and loops. In this paper, we remove the loops and the redundant edges (multiple edges that have appeared before).

  9. Coauthor: This is a collaboration network from the e-print arXiv, which covers scientific collaborations between authors whose papers are submitted to the High Energy Physics - Theory category. The data covers papers in the period from January 1993 to April 200334. Notice that two authors may collaborate multiple times, which is simply represented by one unweighted link in this paper. The time-stamps are determined by their first collaboration.

Additional Information

How to cite this article: Zhang, Q.-M. et al. Measuring multiple evolution mechanisms of complex networks. Sci. Rep. 5, 10350; doi: 10.1038/srep10350 (2015).