Abstract
Complex network growth across diverse fields of science is hypothesized to be driven in the main by a combination of preferential attachment and node fitness processes. For measuring the respective influences of these processes, previous approaches make strong and untested assumptions on the functional forms of either the preferential attachment function or fitness function or both. We introduce a Bayesian statistical method called PAFit to estimate preferential attachment and node fitness without imposing such functional constraints; it works by maximizing a log-likelihood function with suitably added regularization terms. We use PAFit to investigate the interplay between preferential attachment and node fitness processes in a Facebook wall-post network. While we uncover evidence for both preferential attachment and node fitness, thus validating the hypothesis that these processes together drive complex network evolution, we also find that node fitness plays the bigger role in determining the degree of a node. This is the first validation of its kind on real-world network data. Surprisingly, the rate of preferential attachment is found to deviate from the conventional log-linear form when node fitness is taken into account. The proposed method is implemented in the R package PAFit.
Introduction
The study of complex network evolution is a hallmark of network science. Research in this discipline is inspired by empirical observations underscoring the widespread nature of certain structural features, such as the small-world property^{1}, a high clustering coefficient^{2}, a heavy tail in the degree distribution^{3}, assortative mixing patterns among nodes^{4} and community structure^{5} in a multitude of biological, societal and technological networks^{6,7,8,9,10,11}. Network scientists actively seek to explain these sorts of structural features held in common among complex networks across diverse domains of learning in terms of the ordinary operation of simple mechanistic processes.
An extensive body of literature on the mechanisms of complex network evolution has been amassed in the time since the subject first began to flourish around the turn of the century^{12,13,14}. Various mechanisms have been advanced, including preferential attachment^{15}, node fitness^{16}, node duplication combined with edge duplication and divergence^{17}, homophily^{18}, topological distance^{19} and node birth/death processes^{20}. Among them, preferential attachment and node fitness have garnered special attention, not only because they are the first mechanisms that were proposed to explain structural features observed in real-world complex networks, but also for their easy and attractive interpretations. Preferential attachment (PA) is a “rich-get-richer” mechanism^{21,22} according to which the amount of some quantity distributed among the members of a population increases with the amount of the quantity they already possess. This is in contrast to fitness, which is a “fit-get-richer” effect, whereby the ability of individuals in a population to acquire a given quantity is determined by intrinsic qualities. In this process, the larger the fitness an individual has, the more likely it will be that the individual prospers. Individual node fitness may differ and thus represent heterogeneity in a population.
Network scientists rely on a class of network models, known as generative network models, or sometimes evolving or growing network models, to investigate possible mechanisms underlying complex network formation. In this modelling paradigm, complex networks are generated by means of the incremental addition and deletion of nodes and edges to a seed network over a long sequence of timesteps. This sequence is denoted by G_{0}, G_{1}, …, G_{T}, with G_{0} the seed and G_{T} the final network. Figure 1 shows an example of a growing network, which is a special kind of generative network model that is defined by a sequence of additions of nodes and edges. The mechanisms according to which a complex network evolves are captured by transition rules governing how G_{t−1} transits to G_{t} at timestep t for t ≥ 1. The rationale behind the study of these models is that if the mechanisms governing node/edge dynamics in a given model produce networks with structural features similar on average to those observed in real networks, then it is within the bounds of possibility that the same mechanisms are also operative in their real-world counterparts.
The Barabási-Albert (BA) model^{15}, which is closely related to the older Price’s model^{23}, is the most widely known PA-based growing network model. It is defined by a simple form of PA in which the probability that a node v_{i} of degree k_{i}(t) = k acquires an edge at timestep t is defined to be proportional to A_{k} = k. The time-independent function A_{k} is known as the PA function. Historically, the term PA was often used to refer to this special case. But any A_{k} that increases with k on average is in keeping with the spirit of “preferential attachment”. Thus in this paper, we will use the terms rich-get-richer and PA interchangeably to describe the situation when A_{k} is a function that increases with k on average. The functional form of A_{k} has been shown to affect network structure, in particular the degree distribution. In a generalisation of the BA model where A_{k} takes the popular log-linear form k^{α} for attachment exponent α > 0, it has been shown that each of the linear (α = 1), sublinear (α < 1) and superlinear (α > 1) subcases results in networks with different asymptotic degree distributions^{11,15,24}. In particular, the case α = 1 generates scale-free networks, a class of networks in which the frequency of nodes with degree k takes the power-law functional form k^{−γ} with some positive scaling exponent γ. Although there is some debate as to whether real-world networks really are scale-free^{3,25,26,27}, the scale-free property nevertheless serves as an important and founding notion when discussing structural properties of complex networks.
Generative network models based on the fitness mechanism have also been shown to give rise to scale-free networks^{16,28,29,30}. The model of Caldarelli et al.^{28} is the most basic model of this kind. In mathematical terms, each node v_{i} acquires new connections with probability proportional to η_{i}. The time-independent fitness η_{i} is conventionally interpreted as the intrinsic excellence of node v_{i}. It is important to note that η_{i} is assumed to be independent of any graph-theoretic properties, such as node degree. In this paper, we will use the terms fit-get-richer and fitness mechanism interchangeably.
Attempts have been made to unify PA and node fitness in a single model. Bianconi and Barabási (BB)^{16} model both PA and node fitness; however, the definition of PA is restricted to that of the original BA model. The General Temporal (GT) model^{31} stochastically models both rich-get-richer and fit-get-richer processes by defining the probability that a node v_{i} with degree k_{i}(t) = k receives new links at timestep t to be proportional to the product:
A_{k} × η_{i},
where A_{k} is a function of the degree k and η_{i} is the fitness of node v_{i}. Note that while A_{k} and η_{i} are assumed to be time-invariant, that is, A_{k}(t) = A_{k} and η_{i}(t) = η_{i} for every degree k, node i and timestep t, the number of new edges m(t) and number of new nodes n(t) at each timestep are free to vary. The GT model includes all of the models mentioned above and more as special cases^{15,16,24,28,32,33}. The landscape of these models is surveyed in Table 1. Holme^{34} provides a recent review of some other temporal network models.
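To make the attachment rule concrete, the following is a minimal Python sketch (not the PAFit implementation) that normalizes the product A_{k} × η_{i} over all existing nodes. The function name and the toy degrees and fitnesses are illustrative assumptions; the guard max(k, 1) for degree-zero nodes is also an assumption of this sketch.

```python
def attachment_probabilities(degrees, fitness, pa_function):
    """Return the probability that each node receives the next edge,
    proportional to A_k * eta_i (the GT attachment rule)."""
    weights = [pa_function(k) * eta for k, eta in zip(degrees, fitness)]
    total = sum(weights)
    return [w / total for w in weights]


def linear_pa(k):
    # BA-type PA function A_k = k; max(k, 1) is an assumed guard for k = 0.
    return max(k, 1)


def constant_pa(k):
    # No rich-get-richer effect: fitness alone decides (Caldarelli-type case).
    return 1.0


# Hypothetical toy network: four nodes with these degrees and fitnesses.
degrees = [0, 1, 4, 4]
fitness = [1.0, 1.0, 1.0, 2.0]
probs_both = attachment_probabilities(degrees, fitness, linear_pa)
probs_fitness_only = attachment_probabilities(degrees, fitness, constant_pa)
```

With A_{k} held constant the rule reduces to pure fitness, and with all η_{i} equal it reduces to pure PA, which is how the special cases listed above arise.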
It is generally assumed that a mixture of PA and fitness drives complex network evolution^{16,35,36,37}. But any such mechanism, or combination thereof, no matter how plausible, must be empirically validated using specially designed statistical techniques in order to meet the burdens of science. However, the current crop of statistical estimation methods assumes one of these special cases of the GT model, but never the full model itself. As a result they either ignore the effect of PA or node fitness completely^{19,31,38,39,40,41,42,43,44,45}, or otherwise assume the existence of one in a highly constrained form and work to estimate the other^{29,35,46}. For the problem of estimating fitness in the time-invariant case, which is the closest to our setting here, Kong et al.’s growth method^{29} is the only existing method we know of that estimates η_{i}, albeit under the assumption that A_{k} = k. More details on related works are provided in the Supplementary Information Section S2.1.
The questions as to how PA and node fitness mechanisms could be validated and quantified boil down to the following statistical estimation problem: how are the PA function A_{k} and node fitnesses η_{i} to be estimated from observed network data? It is important to note that no existing work considers the detection or estimation of the joint presence of these rich-get-richer and fit-get-richer effects.
Contrary to previous work, by assuming the GT model in its general form, we let the data speak for itself as regards the quantification of both rich-get-richer and fit-get-richer effects without imposing any assumptions on the functional forms of A_{k} and the fitness distribution P(η). For example, we address such questions as: is there evidence for PA in real-world networks even after having taken node fitness into account and vice versa? Another motivation for estimating these effects is that even a rough understanding of the functional forms of A_{k} and P(η) is liable to provide valuable insights into the global characteristics of complex networks. An important theoretical question then arises as to whether the widely accepted log-linear form in k is true of real-world networks, or whether A_{k} takes other, more exotic forms.
Analogous questions arise in the context of fitness. When A_{k} is linear, it has been shown that bounded distributions of node fitness give rise to a power-law degree distribution with different scaling exponents, while unbounded distributions lead to a “winner-takes-all” scenario, in which a single node absorbs all the newly incoming edges^{16,29,30}. So it is only natural to ask what kind of empirical distributions of node fitness exist in real-world complex networks, after we have allowed the simultaneous estimation of A_{k} free of any assumption on its functional form.
Last but not least, the jointly estimated A_{k} and η_{i} may more accurately reflect the evolutionary mechanisms of a network than those obtained from a method that estimates either A_{k} or η_{i} in isolation, and can be exploited in practical problems. For example, using the estimation result, we are able to calculate the probability that a given node receives new links in link prediction problems^{49,50}. Moreover, the η_{i}’s are of particular interest in their own right. Using the η_{i}’s, it is possible to identify the nodes that are really “attractive” based on their intrinsic excellence, after having accounted for the rich-get-richer effect described by the A_{k} function. This might be of considerable interest, for example, in identifying research papers that have real value^{35,37}.
Our main contributions are twofold. The first contribution is a statistical method called PAFit to simultaneously estimate the PA and node fitness functions without imposing any assumptions on their functional forms. To the best of our knowledge, PAFit is the first method in the literature that can do so. Even though there are recent works^{35,44,45,46} that employ a time-varying PA function or node fitness, which at first glance appears to be more general than our time-invariant setting, all of these works assumed the presence of PA and fitness with functional forms imposed a priori and thus cannot answer the very question about the coexistence of PA and fitness, as well as their true functional forms. While our time-invariant setting may seem to be restrictive, the nonparametric nature of our method makes it an important first step towards a truly nonparametric time-varying method, if such a method is possible.
In PAFit, we take a Bayesian approach and formulate the estimation problem as the maximization of the log-likelihood function of the GT model with suitably added regularization terms to avoid overfitting. The regularization terms can be interpreted as Bayesian prior distributions of the parameters. Thus the estimated (A_{k}, η_{i}) is the Maximum-a-Posteriori (MAP) estimate from Bayesian inference. For statistically reliable results, we also implement logarithmic binning over the degrees when estimating the PA function^{31}. We then provide a Minorize-Maximization (MM) algorithm^{51} to efficiently solve the maximization problem. Using the inverse of the negative Hessian matrix of the log posterior calculated at the MAP, our method can also provide approximate credible intervals for the estimated values. The proposed method is implemented in the R package PAFit^{52}. For a tutorial on how to use the package, see the accompanying vignette^{53}.
Our PAFit method contains two regularization parameters: r (PA regularization parameter) and s (fitness regularization parameter). The parameter r controls the amount of regularization for the PA function in so far as the bigger the value of r, the more A_{k} assumes the form k^{α}. On the other hand, 1/s is the variance of a gamma prior distribution over P(η) with mean 1. As will be shown in the Methods Section, each scenario of the coexistence of PA and fitness (e.g. PA only, fitness only, or both PA and fitness and their assumed functional forms) corresponds to a particular combination of the regularization parameters r and s (see Table 2).
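The role of s can be illustrated with a short Python sketch. A gamma distribution with mean 1 and variance 1/s has shape s and scale 1/s, so larger s pulls the fitnesses more tightly towards 1. The function names below are illustrative assumptions and the log density is written up to an additive constant; this is not the code used in the PAFit package.

```python
import math
import random


def sample_fitness(s, n, rng):
    """Draw n fitness values from a gamma distribution with mean 1, variance 1/s.
    random.gammavariate(alpha, beta) uses shape alpha and scale beta,
    so mean = alpha * beta = 1 and variance = alpha * beta**2 = 1 / s."""
    return [rng.gammavariate(s, 1.0 / s) for _ in range(n)]


def log_gamma_prior(eta, s):
    """Log density (up to a constant) of Gamma(shape=s, rate=s) at eta.
    Added to the log-likelihood, this acts as the fitness regularization term."""
    return (s - 1.0) * math.log(eta) - s * eta


rng = random.Random(0)
etas = sample_fitness(s=2.0, n=20000, rng=rng)
sample_mean = sum(etas) / len(etas)
sample_var = sum((e - sample_mean) ** 2 for e in etas) / (len(etas) - 1)
```

As s grows the prior variance 1/s shrinks, so the penalty increasingly forces all η_{i} towards 1, which corresponds to the "PA only" end of the spectrum described above.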
In order to choose the optimal r and s for a particular dataset, we use the common approach of splitting the dataset into two parts: a learning part and a testing part. Recall that the full dataset consists of timesteps collected sequentially. In this paper, we set the value of p, that is, the ratio of the number of new edges in the learning data to that in the full data, to 0.75. This can be done by taking the first three quarters of the full dataset (in terms of number of new edges) as the learning data and taking the remaining last quarter as the testing data. We estimate A_{k} and η_{i} of the GT model for every combination of r and s on some grid D using the learning data and then measure the likelihood of the testing data. It is important to note that the testing part is unseen in the learning phase. Thus a model with a large number of parameters does not necessarily give a higher likelihood in the testing part than a model with a smaller number of parameters. The workflow of the PAFit method is summarized in Fig. 2. More details are provided in the Methods Section.
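The selection procedure can be sketched in Python as follows. This is a schematic outline under stated assumptions, not the package's code: `fit` and `test_loglik` are hypothetical callables standing in for the MAP estimation on the learning data and the likelihood evaluation on the testing data, and the split is made at the first timestep where the cumulative edge count reaches the fraction p.

```python
def split_by_new_edges(edges_per_timestep, p=0.75):
    """Return the number of leading timesteps that form the learning data,
    chosen so that they contain (at least) a fraction p of all new edges."""
    total = sum(edges_per_timestep)
    cumulative = 0
    for t, m_t in enumerate(edges_per_timestep):
        cumulative += m_t
        if cumulative >= p * total:
            return t + 1
    return len(edges_per_timestep)


def select_regularization(r_grid, s_grid, fit, test_loglik):
    """Grid search over (r, s): fit on the learning data, score on the testing data.
    `fit` and `test_loglik` are hypothetical placeholders supplied by the caller."""
    best = None
    for r in r_grid:
        for s in s_grid:
            model = fit(r, s)            # uses the learning data only
            score = test_loglik(model)   # held-out log-likelihood
            if best is None or score > best[0]:
                best = (score, r, s)
    return best[1], best[2]


# Eight timesteps with five new edges each: the first six form the learning data.
n_learning = split_by_new_edges([5] * 8, p=0.75)
```

Because the score is computed on data unseen during fitting, a weakly regularized model gains nothing from merely fitting the learning data more closely.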
In our second contribution, we report the first evidence of the coexistence of PA and fitness mechanisms, or, in other words, rich-get-richer and fit-get-richer effects in the growth of a Facebook wall-post dataset^{54}. While this confirms our expectation that there can be a mixture of two effects driving complex network evolution, we go further and show that, in this dataset, the fit-get-richer effect is actually the stronger of the two in governing the degree of a node. We also show that, contrary to the popular assumption of a log-linear PA function, the estimated A_{k} turned out to be highly non-log-linear. The estimated A_{k} becomes flat in the high-degree region. This might indicate a limit in our capacity to make new acquaintances or new collaborations^{55}. Given that most existing works have modeled the PA function as log-linear in k at best and a substantial body of previous works even assume A_{k} to be linear, this important finding calls for more general functional forms to be considered.
Results
An illustrative example
Here we present two simulated examples to demonstrate the workings of our proposed methodology. In the first example, the true PA function is A_{k} = max(k, 1), which is the widely popular linear PA function. The second example uses the true PA function A_{k} = 3(log max(k, 1))^{2} + 1, which is a non-log-linear function that deviates from conventional assumptions. Other examples with different functional forms are considered in the Supplementary Information Section S1.1. Note that these are the true functions used for simulation; PAFit does not use any information about them in the estimation. Starting from a seed network with 100 nodes, m(t) = 5 new edges and n(t) = 1 new node are added at each timestep t until the total number of nodes reaches N = 10000. The true underlying node fitnesses are sampled from a gamma distribution with mean 1 and variance 1/s*. Here we set s* = 1. We note that in this case the distribution is also an exponential distribution with mean 1.
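A simulation of this kind can be sketched in Python as below. Several details are assumptions of the sketch rather than facts from the text: the seed network is taken to have no edges, edge destinations are drawn with replacement using the weights at the start of each timestep, and only the in-degree of the destination node is tracked. The default size is reduced so that the example runs quickly.

```python
import math
import random


def linear_pa(k):
    return max(k, 1)                               # A_k = max(k, 1)


def nonloglinear_pa(k):
    return 3 * math.log(max(k, 1)) ** 2 + 1        # A_k = 3 (log max(k, 1))^2 + 1


def simulate_gt_network(pa_function, n_seed=100, n_final=1000, m=5,
                        s_star=1.0, seed=0):
    """Grow a network under the GT rule and return (degrees, fitnesses).
    Assumed details: empty seed network, multi-edges allowed, in-degree only."""
    rng = random.Random(seed)
    fitness = [rng.gammavariate(s_star, 1.0 / s_star) for _ in range(n_seed)]
    degree = [0] * n_seed
    while len(degree) < n_final:
        # Attachment weights A_k * eta_i for all existing nodes.
        weights = [pa_function(k) * eta for k, eta in zip(degree, fitness)]
        for target in rng.choices(range(len(degree)), weights=weights, k=m):
            degree[target] += 1
        # One new node joins per timestep with a freshly drawn fitness.
        degree.append(0)
        fitness.append(rng.gammavariate(s_star, 1.0 / s_star))
    return degree, fitness


degree_lin, fitness_lin = simulate_gt_network(linear_pa, n_final=500)
degree_log, fitness_log = simulate_gt_network(nonloglinear_pa, n_final=500)
```

Setting n_final=10000 matches the size described above, at the cost of a running time that is quadratic in the number of nodes for this naive implementation.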
We compare PAFit with the growth method of Kong et al.^{29}, which is designed to estimate node fitness, albeit under the assumption that A_{k} is equal to k. The growth method is the closest existing work to our setting. We use the following three metrics to measure how well the methods perform: the average relative error e_{η} in estimating node fitness, taken over the n nodes for which we estimated fitness; the average relative error e_{A} in estimating the PA function, taken over degrees up to K, the maximum degree that appears in the growth process of the network; and, finally, the correlation r_{η} between true and estimated fitness. In both methods we only estimate the fitness of nodes that acquired at least five new edges in the growth process.
In each example, we follow the workflow of PAFit shown in Fig. 2 over a grid D with r in (0, 0.25, 0.5, 1, 2, 5, 10, 20) and s in (0.1, 0.5, 0.75, 1, 1.25, 1.5, 2, 5, 10). For the A_{k} = max(k, 1) example, the optimal combination is (r, s) = (5, 2). For the A_{k} = 3(log max(k, 1))^{2} + 1 example, the optimal one is (r, s) = (0.25, 2). The final estimators are shown in Fig. 3b,c,e,f.
Let us first consider the results of the growth method shown in Fig. 3a,d. In the case of the linear PA function, the growth method gave e_{η} = 0.16 and r_{η} = 0.74. For the non-log-linear PA function A_{k} = 3(log max(k, 1))^{2} + 1, the growth method gave e_{η} = 0.26 and r_{η} = 0.57. It is encouraging to note that the growth method performed better in the linear case, which is precisely the situation for which it is designed. Although the growth method performed acceptably well in both cases, one can see that the estimated fitness does not follow the true fitness closely, especially when A_{k} is non-log-linear.
Turning our attention to the results of PAFit shown in Fig. 3b,c,e,f, it gave e_{η} = 0.08, r_{η} = 0.84, e_{A} = 0.0007 when A_{k} is linear; and e_{η} = 0.09, r_{η} = 0.9, e_{A} = 0.004 when A_{k} = 3(log max(k, 1))^{2} + 1. We can see that PAFit succeeded in the simultaneous recovery of A_{k} and η_{i} in both cases and clearly outperformed the growth method. We note that one advantage of PAFit is that it can naturally estimate confidence intervals for the estimated results.
To find out whether joint estimation of PA and fitness is needed in estimating the PA function, we compare PAFit with a method we named “constant η”, in which we also use PAFit, but assume the model of Krapivsky et al.^{24} with η fixed at 1. The constant η method gave e_{A} = 0.003 when A_{k} is linear and e_{A} = 0.04 when A_{k} is non-log-linear. These two numbers, which are much worse than those of the simultaneous estimation results (e_{A} = 0.0007 when A_{k} is linear and e_{A} = 0.004 when A_{k} is non-log-linear), clearly indicate the need for simultaneous estimation of PA and fitness.
We note that for the PAFit method there is a tendency such that the more new edges a node acquires in the growth process, the better its fitness can be estimated. The simple reason for this is that the number of new edges a node acquires corresponds to the amount of observed data for that node.
We make some remarks about the chosen values of r and s. The chosen r correctly reflects the fact that it is the regularization parameter that enforces the log-linear form k^{α}. In the log-linear example, the chosen r is large (r = 5), while in the non-log-linear example, the chosen r is small (r = 0.25). Although PAFit did not recover the true parameter s* of the underlying gamma distribution of node fitnesses, we note that in both examples the chosen s’s are very close to s*. Due to random fluctuations, s* does not necessarily best represent the observed data. Indeed, the estimated PA functions and node fitnesses in both examples agree well with the true values. In Supplementary Information Section S1.1, we give more examples of choosing the regularization parameters in six simulated networks and show that in all cases PAFit succeeds in recovering both PA and fitness simultaneously.
In these two simulated examples, the true distribution of node fitnesses is the same as the prior distribution of node fitnesses in PAFit, i.e., both are gamma distributions. On the other hand, the growth method is a distribution-free method. In Supplementary Information Section S1.2, we show four examples where PAFit outperforms the growth method when the true distribution of node fitnesses is log-normal or power-law, which are more heavy-tailed than the gamma distribution.
Finally, in Supplementary Section S1.3, we perform a simulation study with 48 combinations of different s* and different true functional forms of A_{k}, where each combination consists of 100 simulated networks. We show that PAFit generally outperforms existing methods in estimating PA and node fitness.
Real-world dataset
We apply PAFit to a real-world network: a directed multiple network representing wall posts between a subset of Facebook users from 2005 to 2009^{54}. A directed edge in the network represents a post from one user to another user’s wall. One might speculate that the following factors are important for a user to attract posts to his/her wall: a) how much information about his/her life he/she publicises (birthday, engagement, promotion, etc.); b) how influential and/or authoritative his/her own posts are, which call for further discussion from other people; and c) how responsive the user is in responding to existing wall posts. We can then hypothesize fitness η_{i} to be a combination of these three factors averaged over time. On the other hand, PA can be interpreted as a herding effect of some kind: A_{k} captures the averaged pattern of how people will post more on a wall based solely on how many wall posts it already has, regardless of all other factors such as the wall owner’s characteristics, the content of existing posts and so on.
We choose the network at the onset of year 2007 as the initial network and use the data added from 2007 to 2009 to estimate A_{k} and η_{i}. We also grouped the edges into daily timesteps as has previously been done in other social network datasets^{56,57}. The total number of nodes V and total number of edges E in the final snapshot of the network are 46952 and 876993, respectively. Meanwhile, T = 754 is the number of observed timesteps, while ΔV = 37967 and ΔE = 803930 are the increments of nodes and edges after timestep t = 0, respectively. We fit the power-law distribution k^{−γ} to the in-degree distribution of the final snapshot by the MLE method^{26}. We choose 40 as the starting degree from which the distribution is assumed to be power-law and find the estimated γ to be 2.3. K = 1428 is the maximum degree that appears in the growth process. Finally, we use B = 50 logarithmic bins for the PA function.
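For readers who want to reproduce a tail-exponent fit of this kind, the sketch below uses the standard closed-form approximation to the discrete power-law MLE, γ ≈ 1 + n / Σ ln(k_{i} / (k_{min} − 1/2)), applied to degrees at or above a chosen starting degree. Whether the fit reported above used this approximation or an exact numerical maximization is not stated in the text, so this is an assumption; the sampler is included only to generate synthetic test data.

```python
import math
import random


def powerlaw_exponent_mle(degrees, k_min):
    """Approximate MLE of gamma for a discrete power law k^(-gamma), k >= k_min.
    Assumption: the closed-form approximation is adequate (it works best
    for k_min well above 1, as is the case for k_min = 40)."""
    tail = [k for k in degrees if k >= k_min]
    log_sum = sum(math.log(k / (k_min - 0.5)) for k in tail)
    return 1.0 + len(tail) / log_sum


def sample_discrete_powerlaw(gamma, k_min, n, rng):
    """Synthetic data only: round a continuous power law to integers >= k_min."""
    samples = []
    for _ in range(n):
        u = 1.0 - rng.random()                      # u lies in (0, 1]
        x = (k_min - 0.5) * u ** (-1.0 / (gamma - 1.0))
        samples.append(int(x + 0.5))
    return samples


rng = random.Random(1)
synthetic_degrees = sample_discrete_powerlaw(gamma=2.3, k_min=40, n=20000, rng=rng)
gamma_hat = powerlaw_exponent_mle(synthetic_degrees, k_min=40)
```

The estimate depends on the choice of starting degree, so in practice it is worth checking how it varies as k_min moves around the chosen value.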
Coexistence of PA and fitness
We found that for the Facebook dataset, the optimum combination of the regularization parameters is (r, s) = (0.29, 4.64). As can be seen from the density plot in Fig. 4, this point is well inside the area of the GT model. This indicates the necessity of simultaneous estimation of both fitness and PA free of any assumptions. Estimating either η_{i} or A_{k} in isolation, or estimating the attachment exponent α and node fitness η_{i} jointly with the assumption A_{k} = k^{α} as in the extended BB model, gave a much worse log-likelihood of the testing data than the best combination.
Figure 5a shows the estimated A_{k} when fitness is ignored, while Fig. 5b,c show the estimated A_{k} and the distribution P(η) of the estimated η_{i} in the case of joint estimation, respectively. We also ran PAFit for a number of other combinations of r and s around the maximum point (0.29, 4.64), as well as for a number of different values of r when s is held fixed at 4.64. We found that the estimation results in these cases are similar to the estimation results when we use the best combination (figures not shown). This indicates that our method is robust to the choice of regularization parameters near the optimum. We also note that, reassuringly, with the optimum combination of parameters, the estimation results of A_{k} and η_{i} when using only the learning data are similar to the estimation results when we use the full dataset (see Supplementary Information Section S1.4 and Supplementary Fig. S6). This assures us that the growth mechanisms of the network in the learning data and in the full data are reasonably similar, which in turn implies that the use of the learning data and the testing data to choose the regularization parameters as in our aforementioned procedure is sound. We also note that the main findings in this section do not change if we change the ratio between learning data and full data from 0.75 to 0.5 or 0.9 (see Supplementary Information Section S1.4).
Inspecting the estimated A_{k} in Fig. 5b, we observe several important findings. Firstly, the estimated A_{k} is an increasing function, and thus clearly signals the existence of the rich-get-richer phenomenon (corresponding to an increasing A_{k} on average). Secondly, the estimated A_{k} is highly nonlinear on a log scale, which is different from the widely assumed log-linear model A_{k} = k^{α}. This reinforces the need to consider non-log-linear functional forms when modelling the PA function^{31,47}. Since the estimated A_{k} is nearly log-linear when fitness is ignored (Fig. 5a), this dataset shows the need for joint estimation of PA and node fitness. Finally, the form of the PA function gradually becomes flat when the degree is large, which might indicate a limit in our capacity to make new acquaintances or new collaborations^{55}.
To get a sense of the growth rate of the estimated PA function in comparison with the conventional log-linear form, we fitted the function A_{k} = k^{α} to the estimated A_{k} by a weighted least squares method where the weights are inversely proportional to the estimated variance of the estimates^{31}. Using this procedure, we found that α ≈ 0.43, which implies that the PA function is sublinear in this dataset. Finally, compared with the estimated PA function in the case of constant node fitness in Fig. 5a, the estimated PA function in Fig. 5b became lower. This indicates that the rich-get-richer effect became weaker when the fit-get-richer effect was taken into account, which is expected since a portion of a node’s ability to attract new edges could then be explained by its fitness.
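A weighted least squares fit of this kind can be sketched in Python as follows. The sketch regresses log A_{k} on log k through the origin, which assumes the estimated PA function has been normalized so that A_{1} = 1, and it takes the variances as given on the log scale; both points are assumptions of the sketch, as are the function and argument names.

```python
import math


def fit_attachment_exponent(degrees, a_hat, variances):
    """Weighted least squares fit of log A_k = alpha * log k (no intercept).
    Weights are 1 / variance; the closed form is
    alpha = sum(w * x * y) / sum(w * x^2) with x = log k and y = log A_k."""
    numerator = 0.0
    denominator = 0.0
    for k, a, var in zip(degrees, a_hat, variances):
        if k < 1 or a <= 0:
            continue                     # log is undefined for these points
        x, y, w = math.log(k), math.log(a), 1.0 / var
        numerator += w * x * y
        denominator += w * x * x
    return numerator / denominator


# Synthetic check: an exactly log-linear PA function is recovered.
ks = list(range(1, 101))
alpha_hat = fit_attachment_exponent(ks, [k ** 0.43 for k in ks], [1.0] * len(ks))
```

Down-weighting high-variance estimates matters here because the highest degrees are observed least often and would otherwise dominate the fit through their large log k values.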
Turning our attention to the estimated node fitness in Fig. 5c, while almost all node fitnesses are concentrated around the mean, which is 1, there are some nodes with very high fitness. This high nonuniformity of the fitness distribution is a clear signal of the fit-get-richer phenomenon.
Fitness dominates PA
Now with evidence decisively pointing to the coexistence of rich-get-richer and fit-get-richer phenomena, one cannot help but ask exactly which of the two effects played the greater role in governing the evolution of node degree over the growth of the network. To investigate the relation between fitness and the degree of a node, in Fig. 6a–c, we drew the degree growth curves of 200 randomly chosen nodes from each of three groups: low-fitness, medium-fitness and high-fitness nodes, respectively. We also plot theoretical degree growth curves of a generic node with fitness η = 8, 4, 2, 1, 0.5 and 0.25 to serve as anchors (see Supplementary Information Section S1.6 for the way to calculate these curves).
In Fig. 6, the degree of a high-fitness node tends to grow faster than that of a low-fitness node. This results in a general trend: curves in Fig. 6a mostly have a near-horizontal orientation, while those in Fig. 6b have mild upward slopes and most of those in Fig. 6c have steep slopes. These observations clearly indicate the fit-get-richer effect. We also note that the real-world data curves generally agree well with the theoretical curves, which implies that the estimated fitness of PAFit is consistent with the underlying GT model. We perform some additional analyses on the degree growth curves in Supplementary Information Section S1.5.
To further investigate the intertwined effects of the PA function and node fitness, in Fig. 7 we plot the number of new edges acquired by a node versus its estimated fitness for three groups of nodes with different initial degrees (degree at time 0). We found that in the Facebook dataset, fitness plays the major role in deciding the number of edges a node acquired. In Fig. 7a, the difference in the number of new edges a node acquired is largely explained by its fitness. While the initial degree, and hence the PA function, does have a visible effect, its effect is small, since the three groups overlap substantially. A plausible explanation for this phenomenon is that, in the Facebook dataset, the estimated PA function is rather weak (as mentioned earlier, the estimated attachment exponent α is about 0.43). To check this explanation, we generate two simulated networks as controlled experiments. In both simulations, we set the initial network G_{0} and the numbers of new edges and new nodes at each timestep to be the same as those observed in the Facebook dataset. We also use the variance of Facebook’s estimated fitness (Fig. 5c) for the variance of the gamma distribution to generate true node fitness. On the one hand, Fig. 7b shows the situation when we use the same estimated PA function of the Facebook dataset (Fig. 5b). We can spot a similarity with Fig. 7a: the number of new edges of a node is largely explained by its fitness, not by its initial degree. On the other hand, in Fig. 7c we show the plot when we use the much stronger PA functional form A_{k} = k. This time the three groups are clearly separated by their initial degrees. This shows what the situation would look like if a strong PA function dominated fitness. These two simulated examples strongly imply that a weak PA function is the reason for the dominance of fitness in the Facebook dataset.
Discussion
We have proposed a statistically sound Bayesian method, called PAFit, for estimating both the PA function (A_{k}) and node fitness (η_{i}) in growing complex networks. PAFit is nonparametric in the sense that it does not fix any particular functional form for either A_{k} or η_{i}, so that it is able to detect different types of functional forms.
PAFit uses a PA regularization term and a fitness regularization term to avoid overfitting. The fitness regularization term is equivalent to placing a gamma prior distribution on each fitness. There is the question of how well PAFit performs when the true distribution differs from a gamma distribution. Although an extensive study involving different types of true fitness distributions is needed to answer this question, as a first step, we show by four simulated examples that our method performs well even when the true fitness distribution follows a power-law or log-normal form, which is more heavy-tailed than the gamma distribution.
We use the likelihood of the testing data for choosing the PAFit regularization parameters. Some well-known statistical criteria such as the Bayesian Information Criterion or the Bayes factor are not known to be applicable in our situation, since not only are the data here not independent and identically distributed, but the number of parameters in our model (A and η) is also a random variable that grows with the size of the network and the number of timesteps. This differs from a standard statistical setting. While the risk of overfitting still remains in PAFit, we contend that our method serves as an important first step before more involved statistical procedures can be developed for our model.
We reported clear evidence for the joint presence of the “rich-get-richer” phenomenon (corresponding to an A_{k} that is increasing on average) and the “fit-get-richer” phenomenon in a Facebook wall-post network. The functional form of the PA function A_{k} differs from the conventional log-linear form, A_{k} = k^{α}. We also observed that the distribution of node fitnesses is heavy-tailed, with a number of nodes having very high fitness.
We found that in the Facebook wall-post network, fitness plays the major role in deciding the number of future edges a node acquires, while the PA function has comparatively little effect. We caution that our analysis of the roles of PA and fitness here is rather qualitative. A more conclusive answer may require a quantitative method for measuring the respective contributions of PA and fitness.
In this paper, we set the ratio p between the learning data and the full data to be 0.75. Although this choice may seem arbitrary, we showed that the results in the Facebook dataset do not change if we use p = 0.5 or p = 0.9. As discussed in the Methods section, given the bias-variance trade-off in choosing p, we contend that our choice of p = 0.75 in PAFit represents a reasonable balance between the two extremes of this trade-off.
Although the above contributions are established entirely in the setting of growing networks with time-invariant PA and node fitness functions, one potential merit of our estimated A_{k} and η_{i} is that, since they can be interpreted as time-averaged versions of some time-varying A_{k}(t) and η_{i}(t), they are arguably more robust to network fluctuations, as well as to changes in the number of new edges m(t) and new nodes n(t) at each timestep. At a minimum, our method stands as a first step towards the full resolution of the estimation of time-dependent A_{k}(t) and η_{i}(t).
Our method requires a grid D over which to search for the optimal pair of r and s. As can be seen from Fig. 4, the log-likelihood of the testing data has only one peak and changes only gradually on a log scale. We also reported that the final estimator would hardly change if we used a different (r, s) around the optimal pair. We note that we made the same observations on simulated networks. Thus, for the initial probing, we recommend using a coarse grid on a logarithmic scale in order to quickly cover a large range. One might then use a second, finer logarithmic grid around the peak of the first search for local exploration.
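The two-stage search recommended above can be sketched as follows. This is a minimal illustration rather than the PAFit implementation: the helper names, the grid ranges, the zoom factor and the toy scoring function (a stand-in for the log-likelihood of the testing data) are all assumptions made for the example.

```python
import math

def log_grid(lo, hi, n):
    """n points evenly spaced on a log10 scale between lo and hi (inclusive)."""
    a, b = math.log10(lo), math.log10(hi)
    return [10 ** (a + (b - a) * i / (n - 1)) for i in range(n)]

def grid_argmax(score, r_values, s_values):
    """Return the (r, s) pair with the highest score on the given grid."""
    return max(((r, s) for r in r_values for s in s_values),
               key=lambda rs: score(*rs))

def coarse_to_fine(score, lo=1e-3, hi=1e3, n=7, zoom=10.0):
    # Stage 1: coarse logarithmic grid covering a wide range.
    grid = log_grid(lo, hi, n)
    r0, s0 = grid_argmax(score, grid, grid)
    # Stage 2: finer logarithmic grid centred on the coarse optimum.
    r_fine = log_grid(r0 / zoom, r0 * zoom, n)
    s_fine = log_grid(s0 / zoom, s0 * zoom, n)
    return grid_argmax(score, r_fine, s_fine)

# Hypothetical stand-in for the testing-data log-likelihood: a single peak
# at (r, s) = (0.5, 20) that varies smoothly on the log scale.
def toy_score(r, s):
    return -(math.log10(r / 0.5) ** 2 + math.log10(s / 20.0) ** 2)

best_r, best_s = coarse_to_fine(toy_score)
```

In practice `toy_score` would be replaced by a function that fits the model on the learning data for a given (r, s) and returns the log-likelihood of the testing data.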
There are various directions for future research. First, given the new findings we obtained, it is only natural to conduct a large-scale application of PAFit to public data to discover the extent to which our findings in the Facebook wall-post dataset generalize to other complex networks. Secondly, convergence of the PAFit method, as well as consistency and asymptotic normality of the MLE, are open research questions. Thirdly, there are some immediate extensions of the PAFit framework worth pursuing. For example, since PAFit assumes the time-invariant case of a growing network, it would be interesting to see if one can extend the methodology to the time-varying case with not only addition, but also deletion of nodes and edges. Another interesting extension is to use more heavy-tailed distributions such as the log-normal or power-law as prior distributions for node fitness. Finally, the PAFit method assumes that we fully observe the sequence of network snapshots. However, there are situations where we can only observe the final network snapshot, namely G_{T}, but none of the preceding snapshots. Making PAFit able to jointly estimate the PA function and node fitness in such situations would enable us to ask the core question of the coexistence of rich-get-richer and fit-get-richer, as well as all other questions concerning the functional forms of the PA function and node fitness, for these networks too.
Methods
The General Temporal model
The PAFit method assumes the GT model, which is a generative network model for both directed and undirected growing networks^{31}. According to the GT model, a network is generated by starting from some seed network G_{0}; then, at each timestep t ≥ 1, m(t) new edges and n(t) new nodes are added to G_{t−1} to form G_{t}. Note that m(t) may consist of both new edges that emanate from the n(t) new nodes and emergent new edges between existing nodes. This allows wide applications of PAFit in real-world situations, where new edges do emerge between existing nodes.
Here we state the GT model for directed networks. The details of the undirected GT model are provided in Supplementary Information Section S2.2. When a new edge is added to the network G_{t−1}, it will connect to an existing node v_{i} with probability

$$P_{i}(t) \propto A_{k_{i}(t)}\,\eta_{i}, \qquad (1)$$
where k_{i}(t) is the in-degree of node v_{i} at the onset of time t. For a directed network, given m(t) and n(t), Eq. (1) does not completely determine G_{t}, since it ignores the source nodes of the edges. But the quantities A_{k} and η_{i} are by definition concerned with the ability of nodes to acquire new edges and thus are independent of the out-degrees of the source nodes in the directed case. Therefore, modelling only the destination node as in Eq. (1) is actually enough for the estimation of A_{k} and η_{i}. The GT model includes a number of important generative network models as special cases, as can be seen from Table 1.
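To make the destination-node mechanism concrete, the following sketch draws the destinations of m(t) new edges for one timestep, assuming that the attachment probability of node v_{i} is proportional to A_{k_{i}(t)} × η_{i}. The function names, the choice A_{0} = 1 and the rule that new nodes start with in-degree 0 are illustrative assumptions, not part of the PAFit package.

```python
import random

def attachment_probs(in_degrees, fitness, pa):
    """Normalised probability that a new edge lands on each existing node.

    in_degrees[i] is k_i(t), fitness[i] is eta_i, and pa(k) returns A_k.
    """
    weights = [pa(k) * eta for k, eta in zip(in_degrees, fitness)]
    total = sum(weights)
    return [w / total for w in weights]

def gt_step(in_degrees, fitness, pa, m, n_new, new_fitness, rng):
    """Add m edges, then n_new nodes, returning the updated state.

    All m destinations are drawn using the degrees at the onset of the
    timestep, so degrees are only updated after every edge is placed.
    """
    probs = attachment_probs(in_degrees, fitness, pa)
    targets = rng.choices(range(len(in_degrees)), weights=probs, k=m)
    degrees = list(in_degrees)
    for i in targets:
        degrees[i] += 1
    # Assumption of this sketch: new nodes start with in-degree 0.
    return degrees + [0] * n_new, list(fitness) + list(new_fitness)

rng = random.Random(0)
pa = lambda k: max(k, 1) ** 0.5       # A_k = k^0.5, with A_0 set to 1
degrees, eta = [3, 1, 0], [1.0, 2.0, 0.5]
degrees, eta = gt_step(degrees, eta, pa, m=5, n_new=2,
                       new_fitness=[1.0, 1.0], rng=rng)
```

Only destination nodes are sampled here, mirroring the observation that the source nodes are not needed for estimating A_{k} and η_{i}.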
Finally, it is important to note that, although the GT model in this paper contains only the addition of nodes and edges, this is purely for simplicity and clarity of exposition. The PAFit method is easily extendable to handle the case when there are deletions, as long as the probabilistic mechanism of deletions is independent of the addition mechanism and does not involve A_{k} and η_{i}.
Bayesian estimation
Here we provide a brief discussion of the Bayesian estimation for the directed GT model. The case of the undirected GT model is treated in a similar way. The full details of both cases are described in the Supplementary Information Sections S2.3 and S2.4. Our observed data is the sequence of networks G_{0}, G_{1}, …, G_{T}. Let K and N be the maximum degree and the final number of nodes in a GT model network, respectively. Let A = [A_{0}, A_{1}, …, A_{K}] and η = [η_{1}, η_{2}, …, η_{N}] be the parameter vectors we want to estimate.
Adopting a Bayesian approach, PAFit maximizes the following objective function:

$$h(\eta, A) = l(A, \eta) + \mathrm{reg}_{A} + \mathrm{reg}_{\eta}. \qquad (2)$$
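l(A, η) is the log-likelihood function of the data. Up to an additive constant that does not depend on A or η, and with nodes not yet present in G_{t−1} omitted from the sums at time t, it is

$$l(A, \eta) = \sum_{t=1}^{T} \sum_{i=1}^{N} z_{i}(t) \left[ \log A_{k_{i}(t)} + \log \eta_{i} - \log \sum_{j=1}^{N} A_{k_{j}(t)}\,\eta_{j} \right], \qquad (3)$$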
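where z_{i}(t) is the number of new edges that connect to node v_{i} at the onset of time t. reg_{A} is the following regularization term for the PA function:

$$\mathrm{reg}_{A} = -r \sum_{k=1}^{K-1} w_{k} \left( \log A_{k+1} + \log A_{k-1} - 2 \log A_{k} \right)^{2}, \qquad (4)$$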
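with w_{k} = Σ_{t=1}^{T} m_{k}(t), where m_{k}(t) is the number of edges that connect to a degree k node at time t. reg_{η} is the following regularization term for node fitness:

$$\mathrm{reg}_{\eta} = \sum_{i=1}^{N} \left[ (s - 1) \log \eta_{i} - s\,\eta_{i} \right]. \qquad (5)$$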
These two regularization terms are equivalent to Bayesian prior distributions for A_{k} and η_{i}. Thus r of reg_{A} and s of reg_{η} are hyperparameters in the Bayesian interpretation and the estimated (A_{k}, η_{i}) is the MAP estimate.
By using reg_{A} in Eq. (4), we estimate A_{k} without any assumptions on its functional form, but are able to fall back to the widely assumed functional form A_{k} = k^{α} when needed, since this regularization term becomes approximately 0 when A_{k} = k^{α} and is negative otherwise. Note that in order to balance the strength of the regularization against the observed data, each quadratic term in Eq. (4) is weighted by the number of observed data points w_{k} of degree k. If r is 0, then we estimate the PA function without any prior assumptions. The larger the value of r, the more closely the form of the estimated A_{k} approaches k^{α}. When r = ∞, the strength of Eq. (4) overwhelms the observed data and forces A_{k} to be k^{α}. We note that the regularization term in Eq. (4) is the same as in ref. 31.
We derive this regularization term as follows. Starting from A_{k} = k^{α}, for nonzero log k this is equivalent to log A_{k}/log k = α. Now using the same formula but with k replaced by k + 1 and k − 1 yields log A_{k+1}/log(k + 1) = α and log A_{k−1}/log(k − 1) = α. This implies log A_{k+1}/log(k + 1) − log A_{k}/log k = log A_{k}/log k − log A_{k−1}/log(k − 1). This is equivalent to log A_{k+1}/log(k + 1) + log A_{k−1}/log(k − 1) − 2 log A_{k}/log k = 0. For moderately large k, since log(k + 1) ≈ log(k − 1) ≈ log k, the last equation leads to log A_{k+1} + log A_{k−1} − 2 log A_{k} = 0, whose left hand side forms the quadratic terms of Eq. (4).
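As a numerical illustration of this derivation, the sketch below evaluates the quantity log A_{k+1} + log A_{k−1} − 2 log A_{k} for a power-law A_{k} and for a deliberately jagged alternative. The penalty function assumes the form described in the text (a weighted sum of squared terms multiplied by −r); the uniform weights and the exponent 0.8 are made-up values, and the function names are hypothetical.

```python
import math

def second_differences(log_a):
    """log A_{k+1} + log A_{k-1} - 2 log A_k for every interior index k."""
    return [log_a[k + 1] + log_a[k - 1] - 2.0 * log_a[k]
            for k in range(1, len(log_a) - 1)]

def pa_penalty(log_a, weights, r):
    """-r * sum_k w_k * (second difference at k)^2, one weight per interior k."""
    diffs = second_differences(log_a)
    return -r * sum(w * d * d for w, d in zip(weights, diffs))

degrees = list(range(1, 51))                      # k = 1, ..., 50
power_law = [0.8 * math.log(k) for k in degrees]  # log A_k for A_k = k^0.8
bumpy = [v + (0.3 if k % 2 else -0.3) for k, v in zip(degrees, power_law)]
uniform_w = [1.0] * (len(degrees) - 2)            # hypothetical weights w_k

penalty_power_law = pa_penalty(power_law, uniform_w, r=1.0)
penalty_bumpy = pa_penalty(bumpy, uniform_w, r=1.0)
```

The power-law sequence is penalized only slightly, and almost entirely at the smallest degrees where log(k + 1) ≈ log k is a poor approximation, whereas the jagged sequence incurs a much more negative penalty.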
For node fitness, the regularization term reg_{η} has the same effect as placing a gamma prior with shape and rate parameters both equal to s on each η_{i}, since it is the logarithm of the density function of this gamma distribution. This prior setting is viable, given that the η_{i}’s are positive real numbers. Gamma priors have been used extensively for the rating parameters of the Plackett-Luce model, whose likelihood function consists of multinomial probabilities just as in our GT model^{58,59,60}. In the context of growing complex networks, to the best of our knowledge only the gamma distribution has been explored as a fitness prior^{46}. So in this paper we follow convention and adopt a gamma prior. We note that in large datasets, the likelihood is likely to dominate the prior’s information, so a different prior setting for node fitness is unlikely to change the numerical result significantly.
The mean and variance of our gamma prior are 1 and 1/s, respectively. Thus the larger the value of s, the smaller the variance of the node fitness. In the limiting case when s = ∞, all the η_{i}’s take the value 1. Thus s = ∞ is effectively equivalent to the case when we fix all η_{i} at 1 and only estimate A_{k}, i.e. the Krapivsky et al. model in Table 1.
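The stated mean and variance can be checked directly from the standard formulas for a gamma distribution whose shape and rate both equal s. The symbol p(η_{i} | s) is notation introduced here only for this worked check.

```latex
\begin{align*}
p(\eta_i \mid s) &= \frac{s^{s}}{\Gamma(s)}\,\eta_i^{\,s-1} e^{-s\eta_i}, \qquad \eta_i > 0, \\
\log p(\eta_i \mid s) &= (s-1)\log\eta_i - s\,\eta_i + s\log s - \log\Gamma(s), \\
\mathbb{E}[\eta_i] &= \frac{\text{shape}}{\text{rate}} = \frac{s}{s} = 1, \\
\operatorname{Var}[\eta_i] &= \frac{\text{shape}}{\text{rate}^{2}} = \frac{s}{s^{2}} = \frac{1}{s}.
\end{align*}
```

The last two terms of the log-density do not depend on η_{i}, so they have no influence on where the objective function attains its maximum.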
The objective function in Eq. (2) can be efficiently maximized by a Minorize-Maximization (MM) algorithm^{49}, which in this case is also known as a Concave-Convex Procedure^{61}. Starting from some initial value (A^{(0)}, η^{(0)}) at iteration q = 0, the proposed algorithm iteratively calculates (A^{(q+1)}, η^{(q+1)}) from (A^{(q)}, η^{(q)}) until some convergence condition is met (for example, the relative difference between successive values of the objective function falls below some threshold). At each iteration q, the proposed algorithm decomposes the multivariate maximization problem into many one-dimensional problems in such a way that the value of h(η, A) is guaranteed not to decrease after each iteration. The one-dimensionality of these subproblems allows them to be solved efficiently in parallel. We implemented the algorithm in the R package PAFit^{52}.
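The flavour of such an MM iteration can be seen in a simplified setting. The sketch below holds the PA values fixed and updates only the fitnesses, using the tangent-line bound −log x ≥ −log x₀ − (x − x₀)/x₀ to replace each −log(sum) term by a linear function, so that every η_{i} has a closed-form one-dimensional update. It is an illustration under these assumptions and not necessarily the exact update used in PAFit; the data, the function names and the value of s are made up.

```python
import math

def objective(eta, z_tot, s, steps):
    """Log-likelihood plus gamma log-prior (shape = rate = s), up to constants.

    steps is a list of (m_t, a_t) pairs: m_t edges were added at timestep t,
    and a_t[i] is the PA value of node i at the onset of that timestep.
    """
    value = sum((z + s - 1.0) * math.log(e) - s * e for z, e in zip(z_tot, eta))
    for m_t, a_t in steps:
        value -= m_t * math.log(sum(a * e for a, e in zip(a_t, eta)))
    return value

def mm_update(eta, z_tot, s, steps):
    """One MM step: linearise each -log(sum) term at the current eta."""
    denom = [s] * len(eta)
    for m_t, a_t in steps:
        norm = sum(a * e for a, e in zip(a_t, eta))
        for i, a in enumerate(a_t):
            denom[i] += m_t * a / norm
    # Each coordinate now has a closed-form maximiser, independent of the rest.
    return [(z + s - 1.0) / d for z, d in zip(z_tot, denom)]

# Toy data: 3 nodes, 2 timesteps; all numbers are made up for illustration.
steps = [(4, [1.0, 2.0, 1.0]), (6, [2.0, 3.0, 1.0])]
z_tot = [3, 6, 1]            # total new edges received by each node
s = 2.0                      # the update needs z + s - 1 > 0 for every node
eta = [1.0, 1.0, 1.0]
history = [objective(eta, z_tot, s, steps)]
for _ in range(50):
    eta = mm_update(eta, z_tot, s, steps)
    history.append(objective(eta, z_tot, s, steps))
```

Because the surrogate separates across nodes, the updates inside `mm_update` could be computed in parallel, and the recorded objective values in `history` never decrease.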
Lastly, although we use A_{k}’s in all equations and algorithms in this paper for ease of exposition, in practice one invariably needs to perform binning on the degrees for more reliable results. In binning, the A_{k}’s are set to ω_{i} for all k in the i-th bin, and ω_{1}, …, ω_{B} are then taken as the parameters to be estimated. Here B is the number of bins. All the equations and algorithms described in this paper remain valid with the A_{k}’s replaced by ω_{i}’s. The number of k’s inside a bin is determined by that bin’s width. In PAFit, we choose logarithmic binning in order to create small-width bins in low-degree regions, where we have many data points for each degree, and large-width bins in high-degree regions, where we have few data points for each degree^{31}. In our experience, 20 to 200 is a good range for the number of bins, B.
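One simple way to build logarithmic degree bins is sketched below. The exact binning rule used inside PAFit may differ; the function names, the separate bin for degree 0 and the rounding scheme are assumptions made for this example.

```python
import bisect

def log_bin_edges(max_degree, n_bins):
    """Left edges of up to n_bins degree bins with geometrically growing widths.

    Degree 0 always gets its own bin. Duplicate edges produced by rounding
    are dropped, so the number of bins can be smaller than requested.
    """
    ratio = (max_degree + 1) ** (1.0 / (n_bins - 1))
    raw = [int(ratio ** j) for j in range(n_bins - 1)]
    return [0] + sorted(set(raw))

def bin_index(k, edges):
    """Index of the bin whose range [edges[b], edges[b+1]) contains degree k."""
    return bisect.bisect_right(edges, k) - 1

edges = log_bin_edges(max_degree=1000, n_bins=20)
widths = [b - a for a, b in zip(edges, edges[1:])]
```

With these edges, every degree k maps to one bin via `bin_index`, and all degrees in the same bin would share a single parameter ω_{i}; the last bin is open-ended and absorbs the highest degrees.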
Choosing regularization parameters by testing data
Here we give more details on the workflow shown in Fig. 2. In this paper, we use 0.75 as the value for p, the ratio of the number of new edges in the learning data to that in the full data. In other words, T_{learn}, the final timestep in the learning data, is chosen so that the number of new edges in the learning data, Σ_{t=1}^{T_{learn}} Σ_{i} z_{i}(t), is approximately three times the number of new edges in the testing data, Σ_{t=T_{learn}+1}^{T} Σ_{i} z_{i}(t). Here recall that z_{i}(t) is the number of new edges that connect to node v_{i} at the onset of time t. When we calculate the log-likelihood of the testing data, we use Eq. (3) but with the set {1, …, N} restricted to the set of nodes that appeared in the learning data, since we do not have η_{i} for the nodes v_{i} that newly appear in the testing data.
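The choice of T_{learn} described above amounts to finding the timestep at which the cumulative number of new edges reaches a fraction p of the total. A minimal sketch, with a hypothetical function name and made-up edge counts:

```python
from itertools import accumulate

def choose_t_learn(edges_per_step, p=0.75):
    """Smallest timestep T_learn whose cumulative edge share reaches p.

    edges_per_step[t - 1] is the number of new edges added at timestep t.
    Returns a 1-based timestep index.
    """
    total = sum(edges_per_step)
    for t, running in enumerate(accumulate(edges_per_step), start=1):
        if running >= p * total:
            return t
    return len(edges_per_step)

m = [10, 20, 30, 40, 50, 50]          # made-up edge counts, total 200
t_learn = choose_t_learn(m, p=0.75)
learn_edges = sum(m[:t_learn])
test_edges = sum(m[t_learn:])
```

In this toy example the learning data contain exactly three times as many new edges as the testing data; with real data the ratio will generally only be approximate, since edges arrive in whole timesteps.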
We note here the inherent bias-variance trade-off in choosing p, the ratio between the learning data and the full data. When p is large, the bias of the estimated A_{k} and η_{i} is small, but the variance is large. To understand this statement, let us take p = 0.99 as an example. In this case, our estimated A_{k} and η_{i} using only the learning data are very close to those obtained when we use the full data, since almost all of the full data is learning data. This means the bias is small. But since the testing data, which is the remaining one percent of the full data, has so few observations, any small random fluctuation can greatly change the optimal pair of r and s and thus change the estimated A_{k} and η_{i}. This means the variance is large. When p is small, the reverse situation occurs: the variance is small, but the bias is large.
While we do not have a theoretical reason to support our choice of p = 0.75 in this paper, we argue that this value of p represents a reasonable balance between the two extremes of the bias-variance trade-off. On the one hand, Supplementary Fig. S6 suggests that there is a sense of convergence of the result when p approaches 1: the estimated results when p = 0.75 and p = 0.9 are very similar, and thus the result is not sensitive to the choice of p in this region. On the other hand, the same figure also shows that p = 0.5 is too small to get a reliable result.
It is important to stress that the above approach not only provides a statistically sound way to determine the regularization parameters r and s, but also answers the fundamental question: which of the models in Table 1 best describes the evolving process of a network? To answer this question, we fit each of the models in Table 1 to the learning dataset and evaluate their log-likelihoods on the testing dataset.
Additional Information
How to cite this article: Pham, T. et al. Joint estimation of preferential attachment and node fitness in growing complex networks. Sci. Rep. 6, 32558; doi: 10.1038/srep32558 (2016).
References
Watts, D. J. & Strogatz, S. H. Collective dynamics of ‘small-world’ networks. Nature 393, 440–442 (1998).
Szabó, G., Alava, M. & Kertész, J. Clustering in complex networks. In Ben-Naim, E., Frauenfelder, H. & Toroczkai, Z. (eds) Complex Networks, vol. 650 of Lecture Notes in Physics, 139–162 (Springer: Berlin Heidelberg, 2004).
Clauset, A., Shalizi, C. R. & Newman, M. E. J. Power-law distributions in empirical data. SIAM Review 51, 661–703 (2009).
Newman, M. E. J. Mixing patterns in networks. Phys. Rev. E 67, 026126 (2003).
Newman, M. E. J. & Girvan, M. Finding and evaluating community structure in networks. Phys. Rev. E 69, 026113 (2004).
Barabási, A.-L., Albert, R. & Jeong, H. Scale-free characteristics of random networks: the topology of the world-wide web. Physica A: Statistical Mechanics and its Applications 281, 69–77 (2000).
Adamic, L. A. & Huberman, B. A. Power-law distribution of the World Wide Web. Science 287, 2115 (2000).
Jeong, H., Tombor, B., Albert, R., Oltvai, Z. & Barabási, A. The large-scale organization of metabolic networks. Nature 407, 651–654 (2000).
Vespignani, A. Modelling dynamical processes in complex socio-technical systems. Nat Phys 8, 32–39 (2012).
Redner, S. How popular is your paper? An empirical study of the citation distribution. The European Physical Journal B - Condensed Matter and Complex Systems 4, 131–134 (1998).
Newman, M. E. J. The structure and function of complex networks. SIAM Review 45, 167–256 (2003).
Dorogovtsev, S. N. & Mendes, J. F. F. Evolution of Networks: From Biological Nets to the Internet and WWW (Physics) (Oxford University Press, Inc., New York, NY, USA, 2003).
Boccaletti, S., Latora, V., Moreno, Y., Chavez, M. & Hwang, D.-U. Complex networks: Structure and dynamics. Physics Reports 424, 175–308 (2006).
Newman, M. Networks: An Introduction (Oxford University Press, Inc., New York, NY, USA, 2010).
Albert, R. & Barabási, A. Emergence of scaling in random networks. Science 286, 509–512 (1999).
Bianconi, G. & Barabási, A. Competition and multiscaling in evolving networks. Europhys. Lett. 54, 436 (2001).
Pastor-Satorras, R., Smith, E. & Solé, R. V. Evolving protein interaction networks through gene duplication. Journal of Theoretical Biology 222, 199–210 (2003).
McPherson, M., Lovin, L. S. & Cook, J. M. Birds of a Feather: Homophily in Social Networks. Annual Review of Sociology 27, 415–444 (2001).
Newman, M. Clustering and preferential attachment in growing networks. Physical Review E 64, 025102 (2001).
Dorogovtsev, S. N. & Mendes, J. F. F. Evolution of networks with aging of sites. Physical Review E 62, 1842–1845 (2000).
Yule, G. U. A mathematical theory of evolution, based on the conclusions of Dr. J.C. Willis, F.R.S. Philosophical Transactions of the Royal Society of London B: Biological Sciences 213, 21–87 (1925).
Simon, H. A. On a class of skew distribution functions. Biometrika 42, 425–440 (1955).
Price, D. D. S. A general theory of bibliometric and other cumulative advantage processes. Journal of the American Society for Information Science 27, 292–306 (1976).
Krapivsky, P., Rodgers, G. & Redner, S. Organization of growing networks. Physical Review E 066123 (2001).
Mitzenmacher, M. A brief history of generative models for power law and lognormal distributions. Internet Math. 1, 226–251 (2003).
Newman, M. Power laws, Pareto distributions and Zipf’s law. Contemporary Physics 46, 323–351 (2005).
Lima-Mendez, G. & van Helden, J. The powerful law of the power law and other myths in network biology. Mol. BioSyst. 5, 1482–1493 (2009).
Caldarelli, G., Capocci, A., De Los Rios, P. & Muñoz, M. A. Scale-free networks from varying vertex intrinsic fitness. Phys. Rev. Lett. 89, 258702 (2002).
Kong, J., Sarshar, N. & Roychowdhury, V. Experience versus talent shapes the structure of the web. Proceedings of the National Academy of Sciences of the USA 37, 105 (2008).
Borgs, C., Chayes, J., Daskalakis, C. & Roch, S. First to market is not everything: an analysis of preferential attachment with fitness. In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing (2007).
Pham, T., Sheridan, P. & Shimodaira, H. PAFit: A statistical method for measuring preferential attachment in temporal complex networks. Plos One e0137796 (2015).
Krapivsky, P. L., Redner, S. & Leyvraz, F. Connectivity of growing random networks. Phys. Rev. Lett. 85, 4629–4632 (2000).
Callaway, D. S., Hopcroft, J. E., Kleinberg, J. M., Newman, M. E. J. & Strogatz, S. H. Are randomly grown graphs really random? Phys. Rev. E 64, 041902 (2001).
Holme, P. Modern temporal network theory: a colloquium. The European Physical Journal B 88, 1–30 (2015).
Wang, D., Song, C. & Barabási, A.-L. Quantifying long-term scientific impact. Science 342, 127–132 (2013).
Blasio, B. F. d., Seierstad, T. G. & Aalen, O. O. Frailty effects in networks: comparison and identification of individual heterogeneity versus preferential attachment in evolving networks. Journal of the Royal Statistical Society: Series C (Applied Statistics) 60, 239–259 (2011).
Ke, Q., Ferrara, E., Radicchi, F. & Flammini, A. Defining and identifying sleeping beauties in science. Proceedings of the National Academy of Sciences 112, 7426–7431 (2015).
Jeong, H., Néda, Z. & Barabási, A. Measuring preferential attachment in evolving networks. Europhysics Letters 61, 567–572 (2003).
Massen, C. & Jonathan, P. Preferential attachment during the evolution of a potential energy landscape. The Journal of Chemical Physics 127, 114306 (2007).
Sheridan, P., Yagahara, Y. & Shimodaira, H. Measuring preferential attachment in growing networks with missing-timelines using Markov chain Monte Carlo. Physica A: Statistical Mechanics and its Applications 391, 5031–5040 (2012).
Gómez, V., Kappen, H. J. & Kaltenbrunner, A. Modeling the structure and evolution of discussion cascades. In Proceedings of the 22nd ACM Conference on Hypertext and Hypermedia, HT ’11, 181–190 (ACM, New York, NY, USA, 2011).
Kunegis, J., Blattner, M. & Moser, C. Preferential attachment in online networks: Measurement and explanations. In Proceedings of the 5th Annual ACM Web Science Conference, WebSci ’13, 205–214 (ACM, New York, NY, USA, 2013).
Csardi, G., Strandburg, K., Zalanyi, L., Tobochnik, J. & Erdi, P. Modeling innovation by a kinetic description of the patent citation system. Physica A 374, 783–793 (2007).
Medo, M., Cimini, G. & Gualdi, S. Temporal effects in the growth of networks. Phys. Rev. Lett. 107, 238701 (2011).
Wang, M., Yu, G. & Yu, D. Measuring the preferential attachment mechanism in citation networks. Physica A: Statistical Mechanics and its Applications 387, 4692–4698 (2008).
Shen, H.-W., Wang, D., Song, C. & Barabási, A. Modeling and predicting popularity dynamics via reinforced Poisson processes. In Proceedings of The Twenty-Eighth AAAI Conference on Artificial Intelligence (2014).
Pham, T., Sheridan, P. & Shimodaira, H. Nonparametric Estimation of the Preferential Attachment Function in Complex Networks: Evidence of Deviations from Log Linearity, 141–153 (Springer International Publishing, Cham, 2016).
Erdös, P. & Rényi, A. On random graphs. Publicationes Mathematicae Debrecen 6, 290–297 (1959).
Lü, L. & Zhou, T. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications 390, 1150–1170 (2011).
Lü, L., Pan, L., Zhou, T., Zhang, Y.-C. & Stanley, H. E. Toward link predictability of complex networks. Proceedings of the National Academy of Sciences 112, 2325–2330 (2015).
Hunter, D. & Lange, K. Quantile regression via an MM algorithm. J. Comput. Graphical Stat 60–77 (2000).
Pham, T., Sheridan, P. & Shimodaira, H. PAFit: Nonparametric Estimation of Preferential Attachment and Node Fitness in Temporal Complex Networks, R package version 0.7.5 (2015).
Pham, T., Sheridan, P. & Shimodaira, H. PAFit: Nonparametric Estimation of Preferential Attachment and Node Fitness in Temporal Complex Networks, URL: https://cran.r-project.org/web/packages/PAFit/vignettes/Tutorial.pdf. Package PAFit vignette (2016).
Viswanath, B., Mislove, A., Cha, M. & Gummadi, K. On the evolution of user interaction in Facebook. In Proc. Workshop on Online Social Networks, 37–42 (2009).
Dunbar, R. Neocortex size as a constraint on group size in primates. Journal of Human Evolution 22, 469–493 (1992).
Mislove, A., Koppula, H., Gummadi, K., Druschel, P. & Bhattacharjee, B. Growth of the Flickr social network. In Proc. Workshop on Online Social Networks, 25–30 (2008).
Mislove, A. Online Social Networks: Measurement, Analysis and Applications to Distributed Information System. Ph.D. thesis, Rice University (2009).
Gormley, I. C. & Murphy, T. B. A grade of membership model for rank data. Bayesian Anal. 4, 265–295 (2009).
Guiver, J. & Snelson, E. Bayesian inference for Plackett-Luce ranking models. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, 377–384 (ACM, New York, NY, USA, 2009).
Caron, F. & Doucet, A. Efficient Bayesian inference for generalized Bradley–Terry models. Journal of Computational and Graphical Statistics 21, 174–196 (2012).
Yuille, A. L. & Rangarajan, A. The concave-convex procedure. Neural Comput. 15, 915–936 (2003).
Acknowledgements
This work was supported by grants from the Japan Society for the Promotion of Science KAKENHI [JP16J03918 to T.P. and 26120523, 24300106, 16H01547 to H.S.].
Author information
Contributions
All authors designed the research, T.P. and H.S. developed the statistical method, T.P. implemented the software, all authors designed the experiments, T.P. and P.S. performed the experiments, all authors analysed the results. All authors wrote and reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Rights and permissions
This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/