Measuring multiple evolution mechanisms of complex networks

Numerous concise models such as preferential attachment have been put forward to reveal the evolution mechanisms of real-world networks, which show that real-world networks are usually jointly driven by a hybrid mechanism of multiplex features instead of a single pure mechanism. To simulate real networks accurately, some researchers have proposed hybrid models that mix multiple evolution mechanisms. Nevertheless, how a hybrid mechanism of multiplex features jointly influences network evolution is not very clear. In this study, we introduce two methods (link prediction and likelihood analysis) to measure multiple evolution mechanisms of complex networks. Through extensive experiments on artificial networks, which can be controlled to follow multiple mechanisms with different weights, we find that the method based on likelihood analysis performs much better and gives very accurate estimations. Finally, we apply this method to real-world networks from different domains (including technology networks and social networks) and different countries (e.g., USA and China) to see how popularity and clustering co-evolve. We find that most of them are affected by both popularity and clustering, but with quite different weights.


Introduction
Many social and technological networks evolve over time after they are established. Previous studies have revealed that real networks possess many different structural features, such as various degree distributions [1], different levels of clustering [2], existent or nonexistent communities [3], assortative or disassortative mixing patterns [4], long or short average shortest distances, and so on, which have attracted much attention to building models that mimic network evolution [5,6]. Meanwhile, many latent mechanisms have been proposed, such as rich-get-richer [7], good-get-richer [8], stability constraints [9], homophily [10], clustering [11] and so on. However, a single pure mechanism is usually insufficient to depict real networks precisely because of these different features. Therefore, researchers have mixed different mechanisms in order to get better simulations, such as the mixture of clustering and preferential attachment [11,12], popularity and randomness [13], popularity and similarity [14], topological distance and geographical distance [15], and so on. In all, networks are likely to be driven by multiple mechanisms, which inspires us to raise a question: is it possible to measure the contribution of each mechanism to the network evolution?
The earliest way to evaluate a network model or underlying mechanism is based on the comparison of some selected structural features. It supposes that a model is better than another one if its generated network is closer to the target network in terms of those selected features. But such a method cannot be well validated, since there is no fair standard for selecting representative features from countless structural features. Without considering any specific structural feature, we previously proposed a method based on likelihood analysis to fairly evaluate network models [16]. Given a model, Wang et al. calculate the likelihood of appearance for each newly created link according to the model, and multiply these values together to get the likelihood of the set of new links. For a group of models, the one giving the highest likelihood is considered to be the most suitable one. This method is inspired by the link prediction approach, which aims at estimating the likelihood of the existence of a link based on the observed links [17]. According to this definition, if the principle of a link prediction algorithm is consistent with the mechanism of a given network, this algorithm should provide accurate predictions. Therefore, one can also evaluate the latent mechanisms according to the prediction results of the corresponding link prediction algorithms [18,19]. In this paper, we take the latter two methods into consideration because they are both free of any specific structural features. To our knowledge, these methods have only been applied to judge which mechanism is better among a series of mechanisms, but have never been adopted to measure the contributions of multiple mechanisms to network evolution.
The key idea of the above methods is to estimate the likelihood of appearance of links, which inspires us to measure the contributions of multiple mechanisms by calculating the likelihood using all the mechanisms simultaneously. Therefore, we design a formula to recalculate the likelihood of every link by assigning each mechanism a tunable parameter. The optimal group of parameters is the one maximizing the likelihood of all links (likelihood analysis method) or the prediction accuracy (link prediction method). To test the effectiveness, we produce numerous model networks which can be controlled to follow multiple mechanisms, such as popularity, clustering and randomness, with different weights. By comparing the estimated contributions with the known weights, we find that both methods are effective for judging which mechanism is stronger. In particular, the one based on likelihood analysis gives very accurate estimates of the weights. We then apply this method to some real networks to see how popularity and clustering co-evolve. The results show that most of these networks evolve with both mechanisms but with quite different weights.

Measurement methods
Consider two snapshots of an evolving network at times $t_1$ and $t_2$ ($t_1 < t_2$), denoted by $G(V, E)$ and $G'(V', E')$ respectively, where $V$ ($V'$) and $E$ ($E'$) are the sets of nodes and links. The set of new links is $E_{new} = E' - E$. In the following, we first introduce two previous methods of evaluating underlying mechanisms in network evolution, and then present how we measure the contributions of multiple mechanisms.
One method is based on likelihood analysis [16], whose key idea is to estimate the likelihood of appearance for each new link by multiplying the probabilities of selecting its two endpoints. For example, if the links are all randomly created, the likelihood of each link $(x, y)$ can be calculated by $l_{xy} = \frac{1}{N} \cdot \frac{1}{N}$, where $N$ is the number of nodes of the network. Then, we can get the likelihood of all the new links according to $L = \prod_{(x,y) \in E_{new}} l_{xy}$. For a group of models, we can calculate $L$ for each of them, and the one with the highest likelihood $L$ is considered to be the most suitable one.
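The product form of $L$ can be illustrated with a minimal Python sketch. The function names and the toy link list are hypothetical, and the logarithm of $L$ is used instead of $L$ itself (an implementation choice assumed here to avoid numerical underflow, not something stated in the text).

```python
import math

def random_link_likelihood(num_nodes):
    """Likelihood l_xy = (1/N) * (1/N) when both endpoints are picked uniformly."""
    return lambda x, y: (1.0 / num_nodes) ** 2

def log_likelihood(new_links, link_likelihood):
    """log L = sum of log l_xy over all new links (L itself is their product)."""
    return sum(math.log(link_likelihood(x, y)) for x, y in new_links)

# Hypothetical toy data: three new links in a network of N = 4 nodes.
toy_new_links = [(0, 1), (1, 2), (0, 3)]
toy_log_l = log_likelihood(toy_new_links, random_link_likelihood(4))
print(toy_log_l)  # mathematically equal to 3 * log(1/16)
```

Comparing models then amounts to calling `log_likelihood` with a different `link_likelihood` function for each candidate model and keeping the largest value.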
The other method is based on link prediction [18,19]. The key idea is also to estimate the likelihood, but for all the non-observed links, including the new links $E_{new}$ and the nonexistent links $E_{non} = U - E'$ (where $U$ is the universal set containing all $|V|(|V| - 1)/2$ links). The accuracy of prediction can be measured by the probability that a randomly chosen new link is given a higher score than a randomly chosen nonexistent link, named AUC (area under the receiver operating characteristic curve [17,20]). A mechanism is supposed to be more suitable to depict the network evolution if the corresponding link prediction algorithm results in a higher AUC. More details about the two methods are presented in Materials and Methods.
As described above, the key point of both methods is to estimate the likelihoods of links. We are motivated to re-estimate the likelihood by considering all the mechanisms together, with tunable parameters (which must sum to 1) indicating their contributions. According to probability theory, we define the likelihood of link $(x, y)$ as the expectation of the likelihoods over all the mechanisms, written as

$$l_{xy} = \sum_{i=1}^{k} \lambda_i \, l^{(i)}_{xy}, \qquad \sum_{i=1}^{k} \lambda_i = 1, \qquad (1)$$

where $k$ is the number of considered mechanisms, $l^{(i)}_{xy}$ is the likelihood of $(x, y)$ under mechanism $i$, and $\lambda_i$ is the tunable parameter of mechanism $i$. Thus, for the method based on likelihood analysis, we expect that the group of parameters which maximizes $\prod_{(x,y) \in E_{new}} l_{xy}$ indicates the contribution of each mechanism. Similarly, for the method based on link prediction, the group of parameters which maximizes the prediction accuracy (AUC) indicates the contribution of each mechanism.
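Taking logarithms makes the optimization over the tunable parameters easier to see. The short derivation below is a sketch under notation assumed here ($\lambda_i$ for the weight of mechanism $i$, and $l^{(i)}_{xy}$ for the likelihood of link $(x, y)$ under mechanism $i$ alone); it is a worked restatement, not a formula quoted from the original.

```latex
\ln L(\lambda_1,\dots,\lambda_k)
  = \sum_{(x,y)\in E_{new}} \ln\Big(\sum_{i=1}^{k} \lambda_i\, l^{(i)}_{xy}\Big),
\qquad \lambda_i \ge 0,\quad \sum_{i=1}^{k}\lambda_i = 1 .
```

Each term is the logarithm of a function that is linear in the weights, so $\ln L$ is concave on the simplex of admissible weights and any local maximum is also global. A simple grid or line search over the weights is therefore enough to locate the optimum, and working with $\ln L$ rather than $L$ avoids numerical underflow when $|E_{new}|$ is large.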

Comparisons between the two methods
To examine the effectiveness of the measurement methods, we apply them to model networks whose evolution can be controlled. Two well-known mechanisms, popularity and clustering, are considered first. Popularity means that nodes with higher degree are more attractive, while clustering means that nodes with more common neighbors will be connected with higher probability. The model network starts from a loop consisting of five nodes. It grows following two rules at each step: (i) add one new node with one new link which connects this new node to one old node; (ii) add 3 links, where self-loops and multi-links are not accepted.
Every new link is created following either the popularity mechanism or the clustering mechanism, which is controlled by a tunable parameter $p$ ranging from 0 to 1. $p = 0$ means that all the links are created following the popularity mechanism, while $p = 1$ means that all the links are created following the clustering mechanism.
To implement the popularity mechanism, we choose preferential attachment, which was described by Barabási and Albert in [7]. They defined the probability of selecting node $x$ for new links as $\frac{k_x}{\sum_{z \in V} k_z}$. Similarly, for the clustering mechanism we use the number of common neighbors to measure the likelihood of creating a link between $x$ and $y$. In detail, we first select a node $x$ for the new link, and then select the other node $y$ preferentially according to the probability $\frac{|\Gamma(x) \cap \Gamma(y)|}{\sum_{z \neq x} |\Gamma(x) \cap \Gamma(z)|}$, where $\Gamma(x)$ is the set of neighbors of $x$. Node $x$ is selected randomly, in order to differ from the popularity mechanism. Note that the link added together with the new node at each step cannot be created by the clustering mechanism, because the new node has no common neighbors with any other node. In that case we randomly select an old node to form the new link. By tuning $p$ from 0 to 1 with step length 0.1, we produce 100 model networks for each value of $p$. The question can then be simplified to estimating the value of $p$ for each model network through Eq. (1).
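The growth rules above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `grow_network` is hypothetical, and two details the text leaves open are filled with assumptions that are marked in the comments (invalid popularity links are redrawn, and the clustering mechanism redraws $x$ when it has no admissible partner).

```python
import random

def grow_network(steps, p, seed=None):
    """Grow a model network; each link follows clustering w.p. p, else popularity."""
    rng = random.Random(seed)
    # Initial loop of five nodes, stored as node -> set of neighbors.
    adj = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}

    def add_link(x, y):
        adj[x].add(y)
        adj[y].add(x)

    def pick_by_degree():
        # Preferential attachment: probability k_x / sum_z k_z.
        nodes = list(adj)
        return rng.choices(nodes, weights=[len(adj[v]) for v in nodes])[0]

    def popularity_link():
        # Assumption: self-loops and multi-links are rejected by redrawing.
        while True:
            x, y = pick_by_degree(), pick_by_degree()
            if x != y and y not in adj[x]:
                return x, y

    def clustering_link():
        # x is uniform; y is drawn proportionally to |Gamma(x) & Gamma(y)|.
        # Assumption: if x has no unlinked node sharing a neighbor, redraw x.
        while True:
            x = rng.choice(list(adj))
            cands = [y for y in adj
                     if y != x and y not in adj[x] and adj[x] & adj[y]]
            if cands:
                weights = [len(adj[x] & adj[y]) for y in cands]
                return x, rng.choices(cands, weights=weights)[0]

    for _ in range(steps):
        # Rule (i): one new node attached to one old node. Under clustering
        # the old node is uniform, because the new node has no common neighbors.
        old = rng.choice(list(adj)) if rng.random() < p else pick_by_degree()
        new = len(adj)
        adj[new] = set()
        add_link(new, old)
        # Rule (ii): three further links among the existing nodes.
        for _ in range(3):
            x, y = clustering_link() if rng.random() < p else popularity_link()
            add_link(x, y)
    return adj
```

For example, `grow_network(1000, 0.3, seed=42)` would return the adjacency sets of a network with 1005 nodes in which roughly 30% of the links were created by the clustering mechanism.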
Link prediction method. Corresponding to the implementation of the popularity mechanism, there is a link prediction index named the Preferential Attachment (PA) index, which is defined as the product of the degrees of the two nodes, written as $s^{PA}_{xy} = k_x \times k_y$. There is also the Common Neighbor (CN) index [21], which accords with the clustering mechanism, written as $s^{CN}_{xy} = |\Gamma(x) \cap \Gamma(y)|$. Then the likelihood of $(x, y)$ can be calculated by a hybrid index

$$s_{xy} = \lambda \, s^{CN}_{xy} + (1 - \lambda) \, s^{PA}_{xy},$$

where $s^{PA}_{xy}$ and $s^{CN}_{xy}$ are the normalized values and $\lambda \in [0, 1]$. By increasing $\lambda$ from 0 to 1, we can easily find the optimal $\lambda$ which maximizes the prediction accuracy (AUC). Note that the CN index cannot work if either endpoint of a new link appears after $t_1$. So we remove all the new links with such nodes when implementing the link prediction method. For consistency, such new links are also ignored when applying the likelihood analysis method.
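A hybrid of the PA and CN indices could be computed as in the sketch below. The text only says that the two indices are normalized, so dividing each index by its maximum over the candidate links is an assumption made here for illustration; the function name and the toy graph are also hypothetical.

```python
def hybrid_scores(adj, pairs, lam):
    """Score candidate links by lam * CN + (1 - lam) * PA, both max-normalized."""
    pa = {(x, y): len(adj[x]) * len(adj[y]) for x, y in pairs}   # k_x * k_y
    cn = {(x, y): len(adj[x] & adj[y]) for x, y in pairs}        # common neighbors
    # Assumed normalization: divide by the largest value (or 1 if all are zero).
    pa_max = max(pa.values()) or 1
    cn_max = max(cn.values()) or 1
    return {e: lam * cn[e] / cn_max + (1 - lam) * pa[e] / pa_max for e in pairs}

# Hypothetical toy graph: node -> set of neighbors (node 4 is isolated).
toy_adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}, 4: set()}
toy_pairs = [(1, 3), (2, 3), (1, 4)]
toy_scores = hybrid_scores(toy_adj, toy_pairs, lam=0.5)
print(toy_scores)
```

Scanning `lam` over a grid in [0, 1] and computing the AUC of the resulting scores for each value gives the optimal `lam` in the sense used by the link prediction method.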
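Likelihood analysis method. This method [16] defines the likelihood of a link $(x, y)$ as the product of the likelihoods of selecting nodes $x$ and $y$. Thus, $l^{popu}_{xy}$ can be easily defined as $\frac{k_x}{\sum_i k_i} \times \frac{k_y}{\sum_i k_i}$, and $l^{clus}_{xy}$ can be defined as $\frac{1}{2} \left( \frac{1}{N} \times \frac{|\Gamma(x) \cap \Gamma(y)|}{\sum_{z \neq x} |\Gamma(x) \cap \Gamma(z)|} + \frac{1}{N} \times \frac{|\Gamma(y) \cap \Gamma(x)|}{\sum_{z \neq y} |\Gamma(y) \cap \Gamma(z)|} \right)$. Then the likelihood of $(x, y)$ has the form

$$l_{xy} = \lambda \, l^{clus}_{xy} + (1 - \lambda) \, l^{popu}_{xy}.$$

This model aims to maximize the likelihood of all the new links, written as

$$L = \prod_{(x,y) \in E_{new}} l_{xy}. \qquad (4)$$

Figure 2. Correlation between the optimal $\lambda$ and $p$. $p$ is the known proportion of the clustering mechanism compared to the popularity mechanism. $\lambda$ is the value estimated by the measurement method in this paper. Subfigure (a) represents the comparison between the link prediction method and likelihood analysis, where no new links with new nodes are considered. Subfigure (b) only shows the results of likelihood analysis without the limitation of new nodes.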
Thus, we can also obtain the optimal $\lambda$ which maximizes $L$. Notice that if $|\Gamma(x) \cap \Gamma(y)| = 0$, $l^{clus}_{xy}$ will be meaningless. Please see the solution in Materials and Methods, where we also define $l_{xy}$ for the case in which new links are considered without the limitation of new nodes.
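The search for the optimal $\lambda$ among old nodes can be sketched as below. The function names, the grid resolution and the toy graph are hypothetical; the fallback value used when two nodes share no common neighbor, $\frac{1}{\sum_z k_z} \times \frac{1}{N}$, is an assumed reading of the rule given in Materials and Methods, and every endpoint is assumed to have degree at least 1.

```python
import math

def l_popu(adj, x, y):
    """Popularity likelihood: (k_x / sum k) * (k_y / sum k)."""
    total_degree = sum(len(nb) for nb in adj.values())
    return (len(adj[x]) / total_degree) * (len(adj[y]) / total_degree)

def l_clus(adj, x, y):
    """Clustering likelihood, averaged over which endpoint is picked first."""
    n = len(adj)
    common = len(adj[x] & adj[y])
    if common == 0:
        # Assumed fallback that keeps L away from 0.
        total_degree = sum(len(nb) for nb in adj.values())
        return (1.0 / total_degree) * (1.0 / n)

    def one_side(a):
        # (1/N) * |common| / sum over z != a of |Gamma(a) & Gamma(z)|
        denom = sum(len(adj[a] & adj[z]) for z in adj if z != a)
        return (1.0 / n) * common / denom

    return 0.5 * (one_side(x) + one_side(y))

def best_lambda(adj, new_links, grid_points=101):
    """Grid search for the lambda in [0, 1] that maximizes log L."""
    pairs = [(l_clus(adj, x, y), l_popu(adj, x, y)) for x, y in new_links]
    best_lam, best_log_l = 0.0, -math.inf
    for i in range(grid_points):
        lam = i / (grid_points - 1)
        log_l = sum(math.log(lam * c + (1 - lam) * q) for c, q in pairs)
        if log_l > best_log_l:
            best_lam, best_log_l = lam, log_l
    return best_lam

# Hypothetical observed graph (node -> neighbors) and one new link.
toy_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(best_lambda(toy_adj, [(0, 3)]))
```

With a single new link the optimum sits at an endpoint of the interval; on a model network with many new links the maximum typically falls in the interior, which is what the peaks of the $L$ curves indicate.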
In Figure 1, we present the trends of the AUC values (subfigures (a) and (b)) and of $L$ (subfigures (c)-(h)) with increasing $\lambda$. The contributions of the popularity mechanism and the clustering mechanism can be estimated from the positions of the peaks. We can see that the optimal $\lambda$ obtained from both methods increases as $p$ grows. For a more intuitive view, we plot the correlation between $p$ and the optimal $\lambda$ in Figure 2 (a). When the clustering mechanism plays a more important role (i.e., $p > 0.4$), the estimations from the link prediction method (blue circles) are not so accurate. This is probably because the clustering mechanism involves the idea of preferential attachment, namely a node preferentially connects to a node with which it shares more common neighbors. However, the link prediction method considers only the value of $|\Gamma(x) \cap \Gamma(y)|$ and ignores the information about the endpoints of each link. Therefore, the link prediction method meets difficulties when the clustering mechanism is dominant, because node pairs with higher $|\Gamma(x) \cap \Gamma(y)|$ usually also have higher $k_x \times k_y$. Different from the link prediction method, the method based on likelihood analysis calculates the probability of selecting every endpoint. It therefore provides very accurate estimations for these model networks, as shown by the red lines in Figure 2 (a). Note that new links with new nodes are not considered in Figure 2 (a) due to the drawback of the link prediction method. But such new links do not limit the effectiveness of the likelihood analysis method. As shown in Figure 2 (b), we apply the likelihood analysis method to the new links without the limitation of new nodes and find that it is also very effective.

Verification through model networks with more mechanisms
Without loss of generality, we further examine the better-performing method, likelihood analysis, on model networks driven by more mechanisms. To this end we introduce the randomness mechanism, in which the endpoints of new links are all randomly selected. As before, the model networks start evolving from a loop consisting of five nodes. At each step, one new node with one new link and three other links are added. Every link is created following the randomness mechanism with probability $p_{rand}$, the clustering mechanism with probability $p_{clus}$ or the popularity mechanism with probability $p_{popu}$, where $p_{rand}, p_{clus}, p_{popu} \in [0, 1]$ and $p_{rand} + p_{clus} + p_{popu} = 1$.
By calculating $L$ through Eq. (4), we can plot every group of estimated values $\{p_{rand}, p_{clus}, p_{popu}\}$ in a three-dimensional space. As shown in Figure 3, red spots denote the estimated values, while green rectangles show the locations of the theoretical values. The tight fit again reflects the accurate estimation given by the likelihood analysis method.
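With three mechanisms the weights live on a two-dimensional simplex, and the estimation can be sketched as an exhaustive grid search. The names `simplex_grid` and `estimate_weights`, the grid resolution and the toy likelihood values are hypothetical; each input triple is assumed to hold the likelihood of one new link under the randomness, clustering and popularity mechanisms, in that order.

```python
import math

def simplex_grid(resolution):
    """Yield all weight triples (w1, w2, w3) >= 0 with w1 + w2 + w3 = 1 on a grid."""
    for i in range(resolution + 1):
        for j in range(resolution + 1 - i):
            yield (i / resolution, j / resolution,
                   (resolution - i - j) / resolution)

def estimate_weights(per_link_likelihoods, resolution=20):
    """Return the grid point that maximizes the log-likelihood of the new links."""
    best_weights, best_log_l = None, -math.inf
    for weights in simplex_grid(resolution):
        log_l = 0.0
        for likelihoods in per_link_likelihoods:
            mixed = sum(w * l for w, l in zip(weights, likelihoods))
            if mixed <= 0.0:          # this weight vector rules the link out
                log_l = -math.inf
                break
            log_l += math.log(mixed)
        if log_l > best_log_l:
            best_weights, best_log_l = weights, log_l
    return best_weights

# Hypothetical per-link likelihoods (l_rand, l_clus, l_popu) for three new links.
toy_links = [(0.1, 0.5, 0.2)] * 3
print(estimate_weights(toy_links))
```

A finer `resolution` gives more precise estimates at quadratic cost in the number of grid points; since the log-likelihood is concave in the weights, a coarse grid followed by local refinement would be a reasonable alternative.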

Measuring popularity and clustering for real networks
Encouraged by the effectiveness of the measurement method, we try to understand how the popularity and clustering mechanisms affect real networks. We collected nine networks, including Internet, social, communication and collaboration networks. Each of them is divided into two parts based on time stamps: observed links and new links (see details in Materials and Methods and Table 1). By calculating the likelihood of new links with Eq. (4), we can easily find the optimal $\lambda$ for every real network, indicated by the peaks of the blue dashed curves in Figure 4. The results show that nearly all the networks are affected by the popularity mechanism and the clustering mechanism simultaneously. For example, the communications between people on Facebook (FBC) are affected by both popularity and clustering with nearly equal weights. Autonomous systems prefer to connect to popular sites. As we know, Facebook advertises itself as a platform for making friends, while other social networks aim to attract people by showing something interesting, like photos, movies, blogs and comments. Therefore, the propagation of friendships is clearly observed on Facebook, while people who are more competitive are more attractive in SN (ScienceNet), Epinion, Youtube and Flickr.
We further study the mechanisms for the new links among old nodes only, to observe the effect of new users. As shown by the red curves in Figure 4, the optimal $\lambda$ falls at different positions compared with the blue dashed curves. For almost all the networks, the peaks of the red curves fall to the right of the blue ones, which hints that new individuals are more likely to create links to popular individuals rather than to randomly chosen ones. It also implies the importance of common neighbors when new links are created among old individuals.
Figure 4. The optimal $\lambda$ of likelihood analysis for real networks. Blue dashed curves represent the likelihood calculated through new links without the limitation of new nodes, while red curves represent the likelihood calculated through new links without new nodes.

Conclusion and Discussion
This article aims to measure the multiple evolution mechanisms of complex networks. We compare two measurement methods, which are based on link prediction and likelihood analysis respectively. Although the key idea of both is to estimate the likelihood of newly created links, there are some differences in the details. The link prediction method estimates the mechanisms based only on the scores calculated by link prediction indices, and the scores of different indices are usually positively correlated. For example, node pairs with more common neighbors (Common Neighbor index) are more likely to have a large $k_x \times k_y$ (Preferential Attachment index). In such cases, the estimation of the link prediction method cannot be accurate. In contrast, the likelihood analysis method directly considers the probability of selecting every relevant node. At this level, the likelihoods of each link under different mechanisms can be clearly distinguished. Extensive experimental results on model networks show that the likelihood analysis method can provide very accurate estimations. This work extends our study on evaluating network models in a unified way [16], i.e. without considering any structural features. An advantage of this method is its extensibility, since the likelihood of new links can be easily estimated by calculating the probabilities of choosing the two endpoints once a mechanism is given. The main limitation is that the mechanisms must be known in advance; without them we cannot obtain the probabilities of choosing endpoints. Besides, we also find some evidence that the mechanisms change as networks evolve. In future work, we plan to adopt this method to track the evolution of networks and mechanisms. Hopefully this work can provide some insights into understanding networks.

Link prediction method
Given $G(V, E)$, a link prediction index assigns every non-observed link (including $E_{new}$ and $E_{non}$) a score, according to which we can rank these links in descending order. An index is regarded as better if it ranks the links in $E_{new}$ higher than another index does. This is how we seek the optimal $\lambda$ in this paper.
To compare the indices in a quantified way, we introduce AUC (area under the receiver operating characteristic curve [20]) to measure the accuracy of prediction based on the rankings. It can be interpreted as the probability that a randomly chosen new link (a link in $E_{new}$) is given a higher score than a randomly chosen nonexistent link. In the implementation, if among $n$ independent comparisons there are $n'$ times in which the new link has the higher score and $n''$ times in which the new link and the nonexistent link have the same score, we define the AUC value as [17]:

$$AUC = \frac{n' + 0.5 \, n''}{n}.$$

If all the scores are generated from an independent and identical distribution, the AUC value should be about 0.5. Therefore, the degree to which the AUC value exceeds 0.5 indicates how much better the algorithm performs than pure chance.
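The AUC definition translates directly into code. The sketch below gives both an exhaustive version, which compares every new link with every nonexistent link, and a sampled version with $n$ random comparisons; the function names, the default $n$ and the fixed seed are illustrative assumptions.

```python
import random

def auc_exact(new_scores, nonexistent_scores):
    """AUC = (n' + 0.5 * n'') / n over all (new, nonexistent) score pairs."""
    higher = ties = 0
    for s_new in new_scores:
        for s_non in nonexistent_scores:
            if s_new > s_non:
                higher += 1
            elif s_new == s_non:
                ties += 1
    n = len(new_scores) * len(nonexistent_scores)
    return (higher + 0.5 * ties) / n

def auc_sampled(new_scores, nonexistent_scores, n=10000, seed=0):
    """Estimate the same quantity from n independent random comparisons."""
    rng = random.Random(seed)
    higher = ties = 0
    for _ in range(n):
        s_new = rng.choice(new_scores)
        s_non = rng.choice(nonexistent_scores)
        if s_new > s_non:
            higher += 1
        elif s_new == s_non:
            ties += 1
    return (higher + 0.5 * ties) / n

# Toy scores: three wins and one tie out of four comparisons.
print(auc_exact([3, 2], [1, 2]))  # 0.875
```

The sampled version is the practical choice for large networks, where the number of nonexistent links grows quadratically with the number of nodes.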

Likelihood analysis method
In this method, we need to consider three cases for a chosen link $(x, y)$: (i) either $x$ or $y$ is a new node, which appears after $t_1$; (ii) both $x$ and $y$ are new nodes; (iii) both of them are old nodes. For the popularity mechanism, if one of them is a new node, say $x$, then $l^{popu}_{xy} = 1 \times \frac{k_y}{\sum_z k_z}$, $z \in V$. If both of them are new nodes, $l^{popu}_{xy} = 1$. And if both of them are old nodes, $l^{popu}_{xy} = \frac{k_x}{\sum_z k_z} \times \frac{k_y}{\sum_z k_z}$.
For the clustering mechanism, once $x$ and/or $y$ are new nodes, they share no common neighbors. Then we define, according to the implementation of the clustering mechanism, $l^{clus}_{xy} = 1 \times \frac{1}{N}$ if one of them is a new node, and $l^{clus}_{xy} = 1$ if both of them are new nodes. If both $x$ and $y$ are old nodes, $l^{clus}_{xy} = \frac{1}{2} \left( \frac{1}{N} \times \frac{|\Gamma(x) \cap \Gamma(y)|}{\sum_{z \neq x} |\Gamma(x) \cap \Gamma(z)|} + \frac{1}{N} \times \frac{|\Gamma(y) \cap \Gamma(x)|}{\sum_{z \neq y} |\Gamma(y) \cap \Gamma(z)|} \right)$. Note that if $x$ and $y$ do not share any common neighbors, $l^{clus}_{xy}$ needs to be modified to keep $L$ away from 0. In such a case, we redefine $l^{clus}_{xy} = \frac{1}{\sum_z k_z} \times \frac{1}{N}$ for two reasons: (i) $l^{clus}_{xy}$ cannot be 0, or else the product $L$ will be 0 too; (ii) $l^{clus}_{xy}$ must be small and may vary between networks. So we adopt a value which is not more than the probability of selecting one node following the popularity mechanism.
For the randomness mechanism, if one of $x$ and $y$ is a new node, $l^{rand}_{xy} = 1 \times \frac{1}{N}$. If both of them are new nodes, $l^{rand}_{xy} = 1$. And if both of them are old nodes, $l^{rand}_{xy} = \frac{1}{N} \times \frac{1}{N}$.

Data Description
We collect nine networks and divide each of them into two parts, observed links and future links (corresponding to $E$ and $E_{new}$ respectively, as defined in the previous section), based on the time-stamps. The basic features are listed in Table 1.
(1) AS - An autonomous system (AS) within the Internet is a collection of connected Internet Protocol networks and routers under the control of one entity. The Route-views Project collected the Internet at the AS level at many different times, and here we use the data of June 2006 to compose the Observed Links and that of December 2006 to compose the Future Links [22,23].
(2) Internet - The Internet can be viewed as a collection of autonomous systems (AS), whose snapshots were created weekly by CAIDA [24]. Mislove downloaded the entire history of their measurements, which covered the period from January 5th, 2004 until July 9th, 2007 [25]. In this paper, we choose November 20th, 2006 as the dividing date between Observed Links and Future Links, so that the number of future links is approximately 10% of the number of observed links.
(3) SN - ScienceNet (www.sciencenet.cn) is a virtual community for Chinese-speaking scientists. This data, consisting of two snapshots (July 22nd 2013 and August 12th 2013), was newly crawled from the web site by Xing Yu.
(4) Epinion - Epinions (www.epinions.com) is an online product rating site where users are connected by trust or distrust relationships. For simplicity, we neglect the types of connections. The earliest link in the initial data [26] was collected on September 1st, 2001, while the latest was on August 11th, 2003.
(5) Youtube - YouTube (www.youtube.com) is a popular video-sharing site that also involves a social network. The initial data, consisting of links created before Jan. 15th 2007, was collected by Mislove [25].
(6) Flickr - Flickr (www.flickr.com) is a photo-sharing site based on a social network. This data was collected by Mislove et al. [27] and consists of 2570535 users and 33140018 links in total. Here we only use a small sample, chosen by taking the links with time stamps 2006-11-02 and 2006-11-03. The links created on 2006-11-03 are considered as future links and the rest of the links compose the observed network.
(7) FB - Facebook (www.facebook.com) is a social networking service and has over one billion users. The initial data in [28] were crawled between January 20th, 2009 and January 22nd, 2009. The time of link establishment is recorded as a UNIX time-stamp unless it cannot be determined. We set all the undetermined time-stamps to 1.
(8) FBC - This data is from www.facebook.com but differs from the friendships in FB. In this data, if a user u posts to another user v's wall on Facebook, a directed link is created from u to v. Since users may write multiple posts on a wall or on their own wall, the network collected in [28] allowed multiple edges and loops. In this paper, we remove the loops and redundant edges (multiple edges which have appeared before).
(9) Coauthor - This is a collaboration network from the e-print arXiv, which covers scientific collaborations between authors whose papers are submitted to the High Energy Physics - Theory category. The data covers papers in the period from January 1993 to April 2003 [29]. Notice that two authors may collaborate multiple times, which is simply represented by an unweighted link in this paper. The time-stamps are determined by their first collaboration.
Table 1. The basic information of the real networks. $|V|$ is the number of nodes and $|E|$ is the number of links before $t_1$. $C$ and $r$ are the clustering coefficient [30] and the assortative coefficient [4], respectively. $\langle k \rangle$ is the average degree of the network. $H$ denotes the degree heterogeneity, defined as $H = \frac{\langle k^2 \rangle}{\langle k \rangle^2}$. $|V_{new}| = |V' - V|$ and $|E_{new}| = |E' - E|$ are the numbers of new nodes and links during $(t_1, t_2)$. $|E'_{new}|$ denotes the number of new links among old nodes only.

Figure 1. Measuring popularity and clustering based on the link prediction method and likelihood analysis respectively. The contributions are estimated through the peak values. Subfigures (a) and (b) present the average values of AUC given by the link prediction method, which are obtained by averaging 100 implementations over 100 model networks. The others present the values of $L$ resulting from likelihood analysis. Therein, each curve corresponds to one model network. $\lambda$ corresponds to the coefficient in Eq. (1). $p$ denotes the contribution of the clustering mechanism in the model networks. Because the likelihoods for the networks are not of the same order of magnitude, we use 12xxx instead of the exact values. 12xxx means an uncertain value above 11999 and below 13000.

Figure 3. The fitting degree of the estimated contributions and the theoretical values $p_{rand}$, $p_{clus}$ and $p_{popu}$. Red spots denote the estimated values resulting from the likelihood analysis method. Green rectangles denote the theoretical values.