Joint estimation of preferential attachment and node fitness in growing complex networks

Complex network growth across diverse fields of science is hypothesized to be driven mainly by a combination of preferential attachment and node fitness processes. To measure the respective influences of these processes, previous approaches make strong and untested assumptions on the functional forms of either the preferential attachment function or the fitness function or both. We introduce a Bayesian statistical method called PAFit that estimates preferential attachment and node fitness without imposing such functional constraints; it works by maximizing a log-likelihood function with suitably added regularization terms. We use PAFit to investigate the interplay between preferential attachment and node fitness processes in a Facebook wall-post network. While we uncover evidence for both preferential attachment and node fitness, thus validating the hypothesis that these processes together drive complex network evolution, we also find that node fitness plays the bigger role in determining the degree of a node. This is the first validation of its kind on real-world network data. Surprisingly, however, the rate of preferential attachment is found to deviate from the conventional log-linear form when node fitness is taken into account. The proposed method is implemented in the R package PAFit.

Supplementary Figure S1. Examples of choosing the optimal regularization parameters. In all these networks, the true distribution of node fitnesses is a gamma distribution with mean 1 and variance 1/s*. For each combination of r and s, A_k and η_i are estimated using the learning data, and then the log-likelihood of the unseen testing data is calculated. The optimal combination of r and s is the one that gives the highest log-likelihood of the unseen testing data. The light-blue band at estimated points represents the estimated two-sigma confidence intervals of the associated point estimates.
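The selection procedure in the caption above amounts to a grid search over (r, s) scored by held-out log-likelihood. The sketch below is illustrative only: `heldout_loglik` is a hypothetical callback standing in for fitting PAFit on the learning data and scoring the unseen testing data, and the toy surface is not from the paper.

```python
import itertools

def select_regularization(r_grid, s_grid, heldout_loglik):
    """Return the (r, s) pair maximizing held-out log-likelihood.

    `heldout_loglik(r, s)` is a hypothetical callback that fits A_k and
    eta_i on the learning data with regularization (r, s) and returns
    the log-likelihood of the unseen testing data."""
    return max(itertools.product(r_grid, s_grid),
               key=lambda rs: heldout_loglik(*rs))

# toy quadratic surface peaking at (r, s) = (0.1, 1.0), for illustration only
best = select_regularization([0.01, 0.1, 1.0], [0.1, 1.0, 10.0],
                             lambda r, s: -(r - 0.1) ** 2 - (s - 1.0) ** 2)
```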
Next we specify the regularization parameters of A_k (the parameter r) and η_i (the parameter s). For the regularization of A_k, we set r = 0.1, as Pham et al. 2 have shown through simulations that a small amount of regularization helps reduce the error in estimating A_k. To investigate the effect of mis-specifying the true distribution of node fitnesses, we consider three different values for the regularization parameter s: 0.1, 1 and 10.
To summarize, the simulation settings are as follows. For each of the 48 combinations of true functions A_k and true distributions of η_i, we generate 100 networks according to the GT model. Starting from a seed network with 20 nodes, m(t) = 5 new edges and n(t) = 1 new node are added at each time-step until a total of 5000 nodes is reached. We then apply PAFit to each generated network. Regarding η_i, we consider three cases: s = 0.1, 1 and 10. Regarding A_k, the parameter r is fixed at 0.1. The A_k values are grouped into logarithmic bins, with the number of bins B set to 20. Supplementary Fig. S5 shows the relative errors e_η and e_A averaged over the 100 repetitions.
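As an illustration of this simulation setting, the following sketch grows a network under a GT-style model. It assumes a log-linear PA function A_k = max(k, 1)^α (the paper's simulations cover 16 different PA functions) and gamma-distributed fitnesses with mean 1 and variance 1/s*; all function and argument names are ours, not the package's.

```python
import random

def generate_gt_network(T=200, m_t=5, seed_nodes=20, alpha=1.0, s_true=1.0):
    """Grow a network under a GT-style model (sketch): at each time-step,
    one new node and m_t new edges arrive; the receiving endpoint of each
    edge is chosen with probability proportional to A_k * eta, where
    A_k = max(k, 1) ** alpha (log-linear PA, an assumption of this sketch)
    and eta ~ Gamma(shape=s_true, scale=1/s_true) (mean 1, variance 1/s_true)."""
    rng = random.Random(1)  # fixed seed for reproducibility
    deg = [0] * seed_nodes
    eta = [rng.gammavariate(s_true, 1.0 / s_true) for _ in range(seed_nodes)]
    for _ in range(T):
        deg.append(0)                                  # n(t) = 1 new node
        eta.append(rng.gammavariate(s_true, 1.0 / s_true))
        weights = [max(k, 1) ** alpha * f for k, f in zip(deg, eta)]
        for _ in range(m_t):                           # m(t) new edges
            i = rng.choices(range(len(deg)), weights=weights)[0]
            deg[i] += 1
    return deg, eta
```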
First, consider the error e_η in estimating η_i. We see that PAFit outperforms the growth method even when the regularization parameter s is mis-specified. The effect of mis-specifying s is nonetheless clear: the case where s equals the true parameter s* almost always gives the lowest error.
Next, let us consider the performance of PAFit through e_A. Since there are no existing methods that estimate A_k while accounting for node fitness, we introduce the following two variants of PAFit to serve as reasonable baselines. The first variant, called baseline1, estimates A_k while fixing the fitness of node v_i at the number of new edges it has acquired in the growth process, i.e. letting η_i = ∑_t z_i(t). Since PAFit markedly outperforms baseline1 in all situations, the seemingly reasonable heuristic of using the number of acquired new edges as fitness results in surprisingly poor estimation. In the second variant, called baseline2, we essentially ignore the effect of fitness by fixing η_i = 1 for each node v_i. Note that this is equivalent to the method proposed in our previous work 2. PAFit outperforms baseline2 in almost all cases, the exception being A_k = k^α with α near 1.5.
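The baseline1 heuristic amounts to a single pass over the edge counts; `z` here is an illustrative mapping from node ids to their per-time-step new-edge counts, not an interface from the package.

```python
def baseline1_fitness(z):
    """baseline1 heuristic: fix each node's fitness at the total number of
    new edges it acquired during growth, eta_i = sum_t z_i(t).
    `z` maps node id -> list of per-time-step new-edge counts z_i(t)."""
    return {i: sum(z_t) for i, z_t in z.items()}
```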
With such a large value of the attachment exponent, the winner-take-all effect becomes prominent even in finite data. As explained in the Introduction of the main text, winner-take-all describes a situation in which only a handful of nodes, perhaps even a single node, acquire new edges, while the vast majority of nodes acquire no new edges at all. This effect drastically skews the amount of data available for each node. Note also that e_A generally increases with α in this model. This phenomenon, which has also been confirmed in the case of estimating A_k in isolation 2, indicates that the estimation problem gets "harder" as the degree of the winner-take-all effect increases. The simulation results suggest that fixing η_i = 1 for all i in such cases, which serves as a regularization that helps to reduce the dimension of the problem, actually results in better estimation of A_k. The effect of mis-specifying s on estimating A_k is surprising: the value s = 0.1 gives better estimation of A_k than other values of s, even when the true parameter s* is not 0.1. Note that in the log-linear model with α ≥ 1, the specification of s seems to be unimportant, since all values of s give almost the same error.

Supplementary Figure. ... when the true node fitnesses are sampled from a log-normal distribution. First row: A_k = max(k, 1). Second row: A_k = 3 (log max(k, 1))^2 + 1. The plots are on a log-log scale. The light-blue band at estimated points represents the estimated two-sigma confidence intervals of the associated point estimates. For the first row, the growth method gave a correlation coefficient between estimated and true fitness of r_η = 0.74 (the higher the correlation coefficient, the better the estimation) and a relative error of e_η = 0.27 (the lower the relative error, the better the estimation). With the chosen (r, s) ...
Overall, we can conclude that PAFit can indeed estimate both A_k and η_i successfully. PAFit also performs better than the growth method in estimating η_i. The specification of s seems to be important, but when the mis-specification is not too great (about a factor of 10, as in this example), the estimated results for both η_i and A_k do not worsen much.

S1.4 Estimation results when varying the ratio between learning data and full data
While we used p = 0.75 in the main text, here we present estimation results for two additional cases in which the ratio p between the learning data and the full data is 0.5 and 0.9. When p = 0.5, the optimal combination of regularization parameters is (r, s) = (0.28, 32.47), while it is (r, s) = (0.03, 2.66) when p = 0.9. When the ratio between the learning data and the full data is p, recall that Â_full,p and η̂_full,p are the estimated PA function and the estimated fitness using the full data with the optimal (r, s), respectively (see Fig. 2). Similarly, here we let Â_learn,p be the estimated PA function using the learning data with the same optimal (r, s). Supplementary Fig. S6a shows Â_learn,p and Supplementary Fig. S6b shows Â_full,p for the three different values of p. Supplementary Fig. S7 shows η̂_full,0.75 in comparison with η̂_full,0.5 and η̂_full,0.9 for nodes whose number of acquired edges is at least 50.
Overall, the results concerning the PA function do not change significantly if we use p = 0.9 or p = 0.5. In Supplementary Fig. S6, we notice that Â_learn,0.75 and Â_full,0.75 are reasonably similar, and both are strikingly close to Â_learn,0.9 and Â_full,0.9. Since we use Â_full,p for subsequent analyses, the closeness of Â_full,0.75 and Â_full,0.9 indicates there is almost no change if we choose p = 0.9. While Â_learn,0.5 is substantially different from Â_full,0.5, Â_full,0.5 is reassuringly close to Â_full,0.75. This closeness again implies that the results concerning the PA function do not change significantly even if p = 0.5.
The results concerning node fitness also do not change significantly if we use p = 0.9 or p = 0.5. In Supplementary Fig. S7, the points (η̂_full,0.75, η̂_full,0.9) are strikingly close to the y = x line. The R² coefficient, whose values close to 1 often indicate a linear relationship, is 0.99 in this case. Although the points (η̂_full,0.75, η̂_full,0.5) deviate a bit from the y = x line, the R² coefficient is 0.91, which can still be considered high.

Supplementary Table S1. Summary of true PA functions used in the Monte Carlo simulations. There is a total of 16 different functions.

S1.5 Additional data analyses in the Facebook wall-post dataset
We show in Supplementary Fig. S8 the same degree growth curves as in Fig. 6, but this time with the horizontal axis being time. The same trend in which nodes with high fitness have steep growth curves is still visible here, though it is harder to see due to the differences in the birth times of the curves. Nevertheless, this figure reveals some additional information that cannot be seen in Fig. 6. One might notice a sudden increase in the number of new nodes at each time-step from around time-step 600. This increase can be seen more clearly from the change in the slope of the number of new nodes n(t) in Supplementary Fig. S9a. The increase in new nodes leads to an increase in the number of node fitnesses we have to estimate. One might worry that there might not be enough data to do this adequately.
Fortunately, since this increase in n(t) is accompanied by an increase in the number of new edges m(t) (Supplementary Fig. S9b), the normalized number of new edges is stable (Supplementary Fig. S9c), and thus the whole system is in fact stable. The normalized number of new edges, defined as µ(t) = m(t) / ∑_{j: v_j exists at t} A_{k_j(t)} η_j, can be intuitively viewed as the amount of new data available at time-step t, after taking into consideration the differences in network sizes, degree distributions, etc. at different time-steps. Here µ(t) is calculated using the observed k_i(t), while in Section S1.6 it is calculated using the expected k_i(t).
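The definition of µ(t) above can be computed directly from the per-time-step state of the network. The argument names and dict-based layout in this sketch are illustrative, not an interface from the PAFit package.

```python
def normalized_new_edges(m_t, degree_t, eta, A):
    """mu(t) = m(t) / sum over nodes existing at t of A_{k_j(t)} * eta_j.
    `degree_t` maps each node existing at time t to its degree k_j(t);
    `eta` maps node -> fitness; `A` maps degree -> attachment value A_k."""
    denom = sum(A[k] * eta[j] for j, k in degree_t.items())
    return m_t / denom
```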
We note that the oscillations observed in µ(t) are caused by the oscillations in m(t). The normalized number of new edges µ(t), by definition, can be affected by m(t) and ∑ j:v j exists at t A k j (t) η j , and the latter in turn can be affected by n(t). But while there is no periodicity in n(t) (Supplementary Fig. S9d), a weekly period can be seen clearly in both the autocorrelation function of m(t) (Supplementary Fig. S9e) and µ(t) (Supplementary Fig. S9f). We found that the cause of this periodicity is that m(t), the number of new wall-posts, tends to increase toward the middle of the week and decrease on weekends ( Supplementary  Fig. S10).
Although Eq. (S1) implies that we can only compare the growth rates of two curves in Supplementary Fig. S8 at the same time-step, it is possible to use µ(t) to compare the growth rates of two degree growth curves of nodes v_1 and v_2 (v_1 and v_2 can be the same node) at two different time-steps t_1 and t_2. Letting η_1 and η_2 be the fitnesses of v_1 and v_2 respectively, the corresponding growth rates of the two curves are A_{k_1(t_1)} × µ(t_1) × η_1 and A_{k_2(t_2)} × µ(t_2) × η_2. Furthermore, if k_1(t_1) = k_2(t_2), i.e. the curve of v_1 at time t_1 is on the same horizontal level as the curve of v_2 at time t_2 in Supplementary Fig. S8, then we can ignore the PA factor when comparing them, and the two (relative) growth rates reduce to µ(t_1) × η_1 and µ(t_2) × η_2.
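The comparison described above is a product of three factors; when the two curves sit at the same degree level, the common A_k factor cancels. The numbers below are made up for illustration.

```python
def relative_growth_rate(A_k, mu_t, eta_i):
    """Growth rate of a degree curve at one time-step: A_{k(t)} * mu(t) * eta_i.
    For two curves at the same degree, A_k is common to both, so the
    comparison reduces to mu(t) * eta."""
    return A_k * mu_t * eta_i

# two curves at the same degree level (so A_k = 2.0 for both), different steps
r1 = relative_growth_rate(2.0, 0.5, 4.0)  # node v1 at t1
r2 = relative_growth_rate(2.0, 1.0, 1.0)  # node v2 at t2
```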

S1.6 Calculating expected degree growth curves
In short, the theoretical degree growth curves in Fig. 6 are expected degree growth curves calculated by evolving the whole network from the initial time-step using the estimated PA function (Fig. 5b) and the estimated fitness of all real nodes (Fig. 5c), with the numbers of new edges and new nodes kept exactly the same as in the observed evolution process. Specifically, for each η = 8, 4, 2, 1, 0.5 and 0.25, we repeat the following procedure:
• Step 0: Include a generic "ghost" node v_ghost with fitness η in the network at the initial time with initial degree k_ghost(0) = 10. Set t = 0.
• Step 1: If t = T − 1, go to Step 3. Otherwise, for every node v_i that appears in the network at time t, we calculate ...

Supplementary Figure S6. Estimated PA functions in the Facebook wall-post dataset using the learning data (Â_learn,p) and the full data (Â_full,p) for three cases in which p, the ratio between learning data and full data, is 0.5, 0.75 and 0.9. The closeness of Â_full,0.75, Â_full,0.9 and Â_full,0.5 implies that the results concerning the PA function in the main text do not change significantly if p = 0.9 or p = 0.5.

S2 Supplementary Methods
In Section S2.1, we discuss related work on estimating PA and node fitness. We provide the definition of the undirected GT model in Section S2.2. The derivation of the log likelihood function is shown in Section S2.3. Finally, Section S2.4 provides the MM algorithm for estimation.

S2.1 Related work
Here we discuss existing work that is related to our proposed method. First consider the case in which neither A_k nor η_i changes with time, which is the situation we assume in this paper. In this case, the problem of estimating the PA function A_k while fixing η_i = 1 for all i has attracted the attention of many researchers, and there are a number of estimation methods [2][3][4][5][6][7][8][9]. While most of these methods assume the log-linear form A_k = k^α, Pham et al. 2 recently provided a statistical method that does not assume any functional form for A_k. Assuming linear PA (A_k = k) upfront, Zhang et al. 10 proposed a method to find the optimal weight for combining the PA and clustering mechanisms. Middendorf et al. 11 used discriminative classification techniques to classify an evolving network according to one of many network-evolving mechanisms, including the linear PA mechanism. For the problem of estimating fitness in the time-invariant case, Kong et al.'s growth method 1 is the only existing method we know of that estimates η_i, albeit under the assumption that A_k = k. Their method estimates η_i by using an asymptotic formula, Eq. (S2), relating the degree of a node to its fitness: the slope of the least-squares fit of log k_i(t) against log t gives the fitness η_i. The fact that the growth method is an asymptotic method imposes several limits on the resulting estimation. Firstly, we have no assurance of how the method will behave on finite data. For example, the slope of the least-squares line of best fit in Eq. (S2) might very well be negative, and thus lead to a nonsensical estimated value for η_i. Secondly, the assumptions needed to derive the asymptotic formula can be wrong in real-world networks. Most strikingly, the linear A_k assumption might not be supported by the observed data, as shown in the main text. From another viewpoint, Blasio et al. 12 consider the problem of distinguishing the effects of PA and fitness through statistical tests. They assumed that the PA effect works only in the linear form A_k = k, and did not consider the problem of estimating fitness.

Supplementary Figure S7. Estimated fitness in the Facebook wall-post dataset using the full data, for three cases in which p, the ratio between learning data and full data, is 0.5, 0.75 and 0.9. η̂_full,p is the estimated fitness using the full data when the ratio between the learning data and the full data is p; p = 0.75 is the value used for the results in the main text. The three cases give similar fitness values.
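A minimal sketch of the growth method's estimator, assuming we observe a node's degree history at the listed time-steps: plain least squares on the log-log scale, whose slope is the fitness estimate (valid only asymptotically and under linear PA, as discussed above).

```python
import math

def growth_method_fitness(times, degrees):
    """Growth-method sketch (Kong et al.): estimate eta_i as the slope of
    the least-squares fit of log k_i(t) against log t.
    `times` and `degrees` are the observed time-steps and degrees of one node;
    both must be strictly positive for the logs to exist."""
    xs = [math.log(t) for t in times]
    ys = [math.log(k) for k in degrees]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den
```

Note that nothing prevents the slope from coming out negative on finite data, which is exactly the pathology discussed in the text.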
If we allow time-variation in either the PA function or node fitness, there is a growing body of methods for estimating A_k or η_i. Compared with the scope of our paper, however, most of these estimation methods focus on a narrower class of complex networks, namely citation networks, where the effects of time on the probability that a paper or a patent will be cited have been substantially studied and modeled. The time effect in these methods is aptly called aging, describing the widely used model in which A_k(t) or η_i(t) gradually decreases as the time t increases [13][14][15]. For the estimation of A_k with aging, various methods have been proposed [16][17][18]; all of them assume that η_i = 1 for all i. For estimating η_i with aging, some notable methods 19,20 have been provided in the context of citation networks. Note again that the PA effect in these methods is assumed to be linear.
Lastly, generalizations of the ER random network model have long been proposed to model static networks. Since static networks do not grow, PA does not enter the picture, and the mechanism driving the birth of new edges is node fitness. For example, in the β-model 21, or the p_1-model as it was originally called 22, the probability of an edge is proportional to the product of the fitnesses of its two end nodes. This model is closely related to the well-known Bradley-Terry model 23, which is often used to model paired comparisons. Statistical aspects of these models, including estimation methods and asymptotic properties, have been studied extensively 21,[24][25][26][27]. From another related direction, asymptotic properties of some important statistics of the BA model, such as the maximum degree K, have also been researched rigorously 28,29. All these works may be important references when we investigate asymptotic properties of the PAFit method.

Supplementary Figure S8. Degree growth curves in Fig. 6 with time as the horizontal axis. The dashed lines are theoretical growth curves of a generic node with true fitness η = 8, 4, 2, 1, 0.5 and 0.25, based on the GT model. These curves are added as visual guides, and are calculated using the procedure described in Supplementary Information Section S1.6. Nodes with high fitness can still be seen to have dominating growth curves. Using this figure, we can compare the growth rates of curves at the same time-step. Growth rates of curves at two different time-steps can also be compared using the information from µ(t), the normalized number of new edges. a: the same curves as in Fig. 6a (200 randomly chosen curves from nodes with η̂ < 1). b: the same curves as in Fig. 6b (200 randomly chosen curves from nodes with 1 ≤ η̂ < 2). c: the same curves as in Fig. 6c (200 randomly chosen curves from nodes with η̂ ≥ 2).

S2.2 The undirected GT model
For an undirected network, the GT model defines the probability π i, j (t) of a new edge e i, j as follows.
for i ≠ j, where k_i(t) and k_j(t) denote the degrees of nodes v_i and v_j at time-step t.

S2.3 Derivation of the log-likelihood function
Suppose that we observe the sequence {G t } T t=0 of networks. Let K and N be the maximum degree and the final number of nodes in a GT model network, respectively. Let A = [A 0 A 1 · · · A K−1 ] and η = [η 1 η 2 · · · η N ] be the parameter vectors we want to estimate.
The likelihood of the whole dataset can be built up sequentially from the likelihood of the data at each time-step. For the initial graph G_0, we assume that its distribution is governed by some parameter vector θ* that does not involve A and η. For example, if G_0 is an Erdős–Rényi random graph in which the numbers of nodes and edges both follow a Poisson distribution with mean 1, then θ* is the vector [1 1]. Conditioning on G_{t−1} (t ≥ 1), the probability P(G_t | G_{t−1}) is the product of two terms: P(m(t), n(t) | G_{t−1}, θ(t)) and P(G_t | G_{t−1}, m(t), n(t), A, η). The first term is the probability of the numbers of new edges and new nodes given G_{t−1}, which is assumed to be governed by a parameter vector θ(t) that does not involve A and η; the vector θ(t) can be, for example, the mean parameters of Poisson distributions. The second term is the probability of how the new edges and new nodes form G_t given G_{t−1}, m(t) and n(t).
Since θ* and θ(t) do not involve A and η by assumption, we can ignore all but the first term on the right-hand side of Eq. (S5) when deriving the MLE of A and η. We denote this term by l(A, η), the log-likelihood of the data with respect to A and η.

Supplementary Figure S9. ... This µ(t), which can be intuitively viewed as the amount of new data available at each time-step, is actually quite stable in the region around time-step 600. d: Autocorrelation function of n(t). There is no periodicity here. e: Autocorrelation function of m(t). There is a clear weekly period. f: Autocorrelation function of µ(t). There is also a clear weekly period. Thus the periodicity of µ(t) is caused by the periodicity of m(t), which in turn is caused by the tendency of m(t) to increase toward the middle of the week and to decrease on weekends.
At the onset of time-step t, let z_i(t) denote the number of new edges that connect to node v_i. First consider the directed GT model. Although the directed GT model defined by Eq. (2) does not completely determine G_t, it completely determines the in-degree statistics, which are sufficient for the estimation of A_k and η_i. We therefore abuse the notation slightly and write P(G_t | G_{t−1}, m(t), n(t), A, η) for the probability of the in-degree statistics in G_t given the conditioning variables. With this notation, the term P(G_t | G_{t−1}, m(t), n(t), A, η) is a multinomial probability. This follows from the observation that, given m(t), the quantities z_1(t), z_2(t), ..., z_N(t) follow a multinomial distribution with parameters π_1(t), π_2(t), ..., π_N(t), where π_i(t), the probability that a newly added edge at time t connects to node v_i, is

π_i(t) = A_{k_i(t)} η_i / ∑_{j: v_j exists at t} A_{k_j(t)} η_j. (S6)

Here we use two conventions: if node v_j does not exist in the network at time t, then k_j(t) = −1 and A_{−1} = 0. Note that both A_k and η_i are identifiable only up to a multiplicative constant, as can be seen from Eq. (S6). For the formal treatment in this section, we assume that A_0 = 1 and η_1 = 1. In practice, one can normalize A_k and η_i separately as one deems fit. We can then write down the log-likelihood function:

l(A, η) = ∑_{t=1}^{T} ∑_i z_i(t) [log A_{k_i(t)} + log η_i] − ∑_{t=1}^{T} m(t) log ( ∑_{j: v_j exists at t} A_{k_j(t)} η_j ) + C, (S7)

with C a constant that depends on neither A nor η. We note a point about the concavity of the log-likelihood function. In general, l(A, η) is not jointly concave in A and η. However, using the same calculation as in Pham et al. 2, we can show that for any fixed value A* of A, l(A*, η) is concave in η. Similarly, for any fixed value η* of η, we can show the concavity of l(A, η*).
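The multinomial structure described above gives a compact log-likelihood up to an additive constant. This sketch evaluates it on small dict-based inputs; the data layout is ours, chosen for illustration.

```python
import math

def log_likelihood(z, degrees, A, eta):
    """Multinomial log-likelihood of the GT model up to an additive constant:
    l(A, eta) = sum_t sum_i z_i(t) * log pi_i(t), where
    pi_i(t) = A_{k_i(t)} * eta_i / sum_j A_{k_j(t)} * eta_j.
    `z[t]` maps node -> new edges received at step t;
    `degrees[t]` maps each node existing at t to its degree."""
    ll = 0.0
    for z_t, k_t in zip(z, degrees):
        denom = sum(A[k_t[j]] * eta[j] for j in k_t)
        for i, z_i in z_t.items():
            if z_i > 0:
                ll += z_i * math.log(A[k_t[i]] * eta[i] / denom)
    return ll
```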
In the undirected GT model, the probability of an edge e i, j given by Eqs. (S3) and (S4) can be transformed as follows.
where I is the indicator function and π i (t) is defined in Eq. (S6).
For the case i ≠ j, by viewing an undirected edge e_{i,j} as two directed edges connecting to v_i and v_j, we can see that the log-likelihood function in Eq. (S7) correctly captures the log-likelihood of e_{i,j}, up to an additive constant of log 2. For the case i = j, since we interpret a loop as one directed edge, we miss an additional term of log π_i(t) in the log-likelihood function. Denote by l_undirect(A, η) the log-likelihood of the undirected GT model, and by ζ_i(t) the number of self-loops of node v_i at time t. Recall that l(A, η) is the log-likelihood of the converted directed network under the directed GT model. Then we have ... , so l(A, η) is exactly the log-likelihood of the undirected network, and the algorithms described in the next section can be applied without any modifications.

S2.4 Algorithms for maximizing the objective function
Here we describe an MM algorithm for maximizing the objective function h(A, η) in Eq. (3). The strategy is the same as in Pham et al. 2 for the case of estimating A_k in isolation. First we find an MM algorithm for maximizing the log-likelihood function (Eq. (4)); the algorithm for maximizing h(A, η) is then based on this algorithm. Solving the likelihood equations ∂l/∂A = 0 and ∂l/∂η = 0, we obtain

A_k = ∑_t ∑_{i: k_i(t) = k} z_i(t) / ∑_t [ m(t) ∑_{i: k_i(t) = k} η_i / ∑_j A_{k_j(t)} η_j ], (S9)

η_i = ∑_t z_i(t) / ∑_t [ m(t) A_{k_i(t)} / ∑_j A_{k_j(t)} η_j ], for i from 2 to N. (S10)

Since A_k and η_i appear on both sides of Eqs. (S9) and (S10), an explicit solution for A and η is difficult to obtain. In order to maximize Eq. (S7), we start from some initial values of A and η at step q = 0 and then iteratively update

A_k^{(q+1)} = ∑_t ∑_{i: k_i(t) = k} z_i(t) / ∑_t [ m(t) ∑_{i: k_i(t) = k} η_i^{(q)} / ∑_j A_{k_j(t)}^{(q)} η_j^{(q)} ], (S11)

η_i^{(q+1)} = ∑_t z_i(t) / ∑_t [ m(t) A_{k_i(t)}^{(q+1)} / ∑_j A_{k_j(t)}^{(q+1)} η_j^{(q)} ], for i from 2 to N, (S12)

until some convergence condition is met. The running time of each iteration depends on the number of η_i and A_k we have to estimate. Recall that we often perform binning, in which we set A_k = ω_i for all k in the i-th bin; all algorithms described in this section remain valid with the A_k's replaced by the ω_i's. Assuming that the number of bins is B, the running time of each iteration is O(TN + TB). The number of iterations needed depends on the convergence condition and the particular dataset.
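A sketch of one cyclic update pass, under the assumption that the updates take the fixed-point form implied by the multinomial likelihood (A first, then η with the new A). The data layout and function name are illustrative, not the package's API.

```python
def mm_cycle(z, degrees, m, A, eta):
    """One cyclic pass of the fixed-point updates (a sketch):
      A_k  <- (edges received by degree-k nodes) /
              sum_t m(t) * (sum of eta over degree-k nodes at t) / N(t)
      eta_i <- (edges received by node i) /
              sum_t m(t) * A_{k_i(t)} / N(t)
    with N(t) = sum_j A_{k_j(t)} * eta_j at the current iterate.
    `z[t]` maps node -> edges received at step t; `degrees[t]` maps node ->
    degree at t; `m[t]` is the number of new edges at step t."""
    T = len(degrees)
    N = [sum(A[k_t[j]] * eta[j] for j in k_t) for k_t in degrees]
    new_A = {}
    for k in A:
        num = sum(z[t].get(i, 0) for t in range(T)
                  for i in degrees[t] if degrees[t][i] == k)
        den = sum(m[t] * sum(eta[i] for i in degrees[t]
                             if degrees[t][i] == k) / N[t]
                  for t in range(T))
        new_A[k] = num / den if den > 0 else A[k]
    # recompute N with the updated A before updating eta (cyclic order)
    N = [sum(new_A[k_t[j]] * eta[j] for j in k_t) for k_t in degrees]
    new_eta = {}
    for i in eta:
        num = sum(z[t].get(i, 0) for t in range(T))
        den = sum(m[t] * new_A[degrees[t][i]] / N[t]
                  for t in range(T) if i in degrees[t])
        new_eta[i] = num / den if den > 0 else eta[i]
    return new_A, new_eta
```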
In order to show that the algorithm consisting of Eqs. (S11) and (S12) can indeed increase the log-likelihood function l(A, η) at each iteration, we first find a minorize function Q(A, η) of l at the point (A^(q), η^(q)). Such a function Q, by definition 30, satisfies Q ≤ l at all points and Q(A^(q), η^(q)) = l(A^(q), η^(q)). Consider the function ... One can check that Q defined by Eq. (S13) is a minorize function of l. Next, it can be easily verified that Eq. (S11) is in fact the solution of ∂Q/∂A = 0, thus Q(A, η^(q)) ≤ Q(A^(q+1), η^(q)). Finally, we have l(A^(q), η^(q)) = Q(A^(q), η^(q)) ≤ Q(A^(q+1), η^(q)) ≤ l(A^(q+1), η^(q)), which shows that Eq. (S11) indeed increases the log-likelihood. In the same way we can show that Eq. (S12) further increases l from l(A^(q+1), η^(q)).

Note that the current algorithm (Eqs. (S11) and (S12)) first calculates A^(q+1), and then uses A^(q+1) to calculate η^(q+1). It is possible to calculate A^(q+1) and η^(q+1) in parallel by modifying the current algorithm. Define ... ; then Q′ is a minorize function of l at the point (A^(q), η^(q)). The key point here is that the two sets of equations ∂Q′/∂A = 0 and ∂Q′/∂η = 0 are separable, so we can calculate A^(q+1) and η^(q+1) in parallel by using Q′ instead of Q. Nevertheless, we prefer the current algorithm in Eqs. (S11) and (S12), since cyclic updating of A and η is potentially faster than updating in parallel 30. Furthermore, the update formulas in Eqs. (S11) and (S12) are conceptually simpler. Next we derive the update equation for η_i for the objective function in Eq. (3). The new formula for η_i^(q+1) is similar to Eq. (S12), but with the discount from s added: ... , except that when s < 1, the estimated fitness of a node with ∑_{t=1}^{T} z_i(t) = 0 is simply 0. Finally, we derive the update for A_k when the regularization term in Eq. (5) is added. Although the same derivation can be found in Ref. 2, we record it here for completeness. A_k^(q+1) is the solution of a univariate equation, which can be derived as follows. First we need to find a minorize function of Eq. (5); the strategy is the same as in Ref. 30. For a concave function g(x), we have the following inequality, which comes directly from the definition of a concave function. (S14) We can easily verify that the right-hand side of Eq. (S14) is a minorize function of Eq. (5). Combining this function with the minorize function of the log-likelihood (see Eq. (S13)), we obtain a minorize function Q_A of h(A, η) at iteration q. The main point of deriving Eq. (S14) is that, by using it, at each iteration we can turn the (K + 1)-variable maximization problem ∂Q_A/∂A = 0 into K + 1 univariate problems ∂Q_A/∂A_k = 0, each of which can be easily solved in parallel.