Weighted Betweenness Preferential Attachment: A New Mechanism Explaining Social Network Formation and Evolution

The dynamics of social networks is a complex process, as many factors contribute to the formation and evolution of social links. While the degree-driven preferential attachment model captures certain real-world properties, it cannot fully explain social network dynamics. Indeed, important properties such as dynamic community formation, link weight evolution, or degree saturation cannot be completely and simultaneously described by state-of-the-art models. In this paper, we explore the distribution of social network parameters and centralities and argue that node degree is not the main attractor of new social links. Consequently, as node betweenness proves to be paramount to attracting new links (as well as to strengthening existing ones), we propose the new Weighted Betweenness Preferential Attachment (WBPA) model, which renders quantitatively robust results on realistic network metrics. Moreover, we support our WBPA model with a socio-psychological interpretation that offers a deeper understanding of the mechanics behind social network dynamics.

Power-law interpolation slopes γ for the distribution of node degree (deg), node betweenness (btw), and link weight for real-world datasets, obtained using the poweRlaw package in R.

Dataset    Node-deg γ    Node-btw γ    Link-weight γ
Facebook

Statistical analysis of the betweenness-degree relationship in empirical social networks shows a non-linear dependency between the two node centralities: betweenness rises polynomially or exponentially with degree, taken over the same nodes. This aspect is detailed in Figure 1. Each panel contains the best-approximating interpolation function for the betweenness-degree correlation.
In order to analyze whether there is a natural attraction between nodes with high fitness (specifically degree and betweenness) and link weights, we use a Pareto approach. Theories like the "80/20" principle 1 or the "distribution of wealth" (e.g., 10% of the people own 90% of the wealth) 7 are examples in which there is no direct, linear correlation between a node's property (cause) and its contextual value (effect), but associations become observable when a population is divided into percentiles. In an analogous manner, we systematically filter our weighted datasets by keeping the top 1, 2, 5, 10, 25, 50, 75 and 100% percentiles in terms of link weights. For each percentile, we measure the accumulated fitness and weight. These associations are quantified in Table 2 and depicted in Figure 2.
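The percentile filtering described above can be illustrated with a minimal Python sketch. It assumes that the filter keeps the heaviest links, and that a node's fitness is accumulated once if any retained link touches it; both choices, as well as the function names and the toy data, are illustrative assumptions and not the procedure used to produce Table 2.

```python
import math

def top_weight_links(edges, percent):
    """Keep the top `percent`% of links by weight (at least one link)."""
    ranked = sorted(edges, key=lambda e: e[2], reverse=True)
    keep = max(1, math.ceil(len(ranked) * percent / 100))
    return ranked[:keep]

def accumulated_shares(edges, fitness, percent):
    """Return (weight share, fitness share) captured by the top-weight links.

    Assumption: a node's fitness counts once if any retained link touches it.
    """
    kept = top_weight_links(edges, percent)
    weight_share = sum(w for _, _, w in kept) / sum(w for _, _, w in edges)
    nodes = {n for u, v, _ in kept for n in (u, v)}
    fitness_share = sum(fitness[n] for n in nodes) / sum(fitness.values())
    return weight_share, fitness_share

# Hypothetical toy data: (u, v, weight) triples and a degree-based fitness.
edges = [("a", "b", 9.0), ("a", "c", 4.0), ("b", "c", 3.0),
         ("c", "d", 2.0), ("d", "e", 1.0), ("e", "f", 1.0)]
degree = {}
for u, v, _ in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1

for p in (1, 2, 5, 10, 25, 50, 75, 100):
    w_share, f_share = accumulated_shares(edges, degree, p)
    print(f"top {p:>3}%: weight {w_share:.2f}, fitness {f_share:.2f}")
```

Any other node fitness (for example betweenness) can be passed in place of the degree dictionary.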
The association methodology implies summing up the weights on all incident links for every node in the network, and then correlating each sum with the measured fitness of the node.

Based on the statistics of betweenness, we argue that the betweenness/degree (B/D) ratio in a social network should have a uniform distribution. As such, Figure 3 presents how the B/D ratios are centered around the value of B/D = 1, in the interval (0.1, 1000). It can be observed that all empirical data, weighted or unweighted, have a specific distribution pattern that is spread evenly over the selected interval. However, all synthetic datasets (random, small-world, scale-free) used for null-model validation have a considerably narrower interval for the B/D ratio.

Google Plus users network 2 (g_GP = 0.4820).
c. Online social network 4 (g_OSN = 0.5921).
d. Scientific collaboration network 5 (g_Geom = 0.610).
e. Weighted POK online user network 6 (g_POK = 4879).
f. Co-authorships network 3 (g_CoAu = 0.4392).
g. Random network 9 (g_rand = 0.9374 ± 0.0013).
h. Small-world network 10 (g_SW = 0.8771 ± 0.0451).
i. DPA network 1 (g_DPA = 0.7828 ± 0.0182).
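As an illustration of the B/D ratio discussed above, the following Python sketch computes node betweenness with Brandes' algorithm on an unweighted, undirected graph and divides it by node degree. It assumes unnormalized betweenness (the number of shortest paths passing through a node); the normalization actually used for Figure 3 is not stated in this section, and the function names and toy graph are hypothetical.

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm for unweighted, undirected graphs (unnormalized)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack, preds = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate pair dependencies in order of decreasing distance.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each undirected pair was counted from both endpoints.
    return {v: b / 2 for v, b in bc.items()}

def bd_ratios(adj):
    """Betweenness/degree ratio for every node with non-zero degree."""
    bc = betweenness(adj)
    return {v: bc[v] / len(adj[v]) for v in adj if adj[v]}

# Hypothetical toy graph: a path a-b-c-d with a pendant node e attached to c.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d", "e"], "d": ["c"], "e": ["c"]}
print(bd_ratios(adj))
```

In this toy graph only the two bridging nodes b and c have a non-zero ratio, which is the kind of spread a histogram of B/D values would then summarize.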

SI.2. Best fitness for preferential attachment
The realism assessment, based on comparing the synthetic preferential attachment (PA) networks with the real-world datasets, is done through individual graph metric comparison, as well as through the composite statistical fidelity metric 11. First, we analyze the distribution of the following six graph metrics on the real-world datasets: average degree (AD), average path length (APL), average clustering coefficient (ACC), modularity (Mod), graph diameter (Dmt), and graph density (Dns). The evolution of these metrics from network sizes N = 590 to N = 364K nodes is presented in Figure 4.
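For reference, three of the six metrics listed above have simple closed forms. The sketch below computes AD, Dns and ACC with their standard textbook definitions for an undirected graph stored as a dictionary of neighbour sets; it is an illustration under that assumption, not the tooling used for the measurements in this paper, and the toy graph is hypothetical.

```python
def average_degree(adj):
    """AD = 2E / N for an undirected graph stored as node -> set of neighbours."""
    return sum(len(nb) for nb in adj.values()) / len(adj)

def density(adj):
    """Dns = 2E / (N (N - 1)): fraction of possible links that exist."""
    n = len(adj)
    return sum(len(nb) for nb in adj.values()) / (n * (n - 1))

def average_clustering(adj):
    """ACC: mean over nodes of (links among neighbours) / (possible such links)."""
    total = 0.0
    for v, nb in adj.items():
        k = len(nb)
        if k < 2:
            continue  # local clustering is taken as 0 for degree < 2
        links = sum(1 for u in nb for w in nb if u < w and w in adj[u])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)

# Hypothetical toy graph: a triangle 0-1-2 with a pendant node 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(average_degree(adj), density(adj), average_clustering(adj))
```

APL, Dmt and Mod require shortest-path and community-detection routines and are omitted from this sketch.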
The diversity of the datasets is high, so the purpose of the trend lines present in Figure 4 is not to suggest that graph metrics abide by a strict confidence interval, but rather to show how each graph metric evolves with increasing network size (i.e., from N = 590 to N = 364K). As such, we notice the following trends:
• AD increases from 10 (N ≈ 1,000) to 25 (N ≈ 100,000), within AD ≈ [3 − 33] and standard deviation σ = 8.904.
• APL increases slowly from 4 (N ≈ 1,000) to 4.5 (N ≈ 100,000), within AP ≈ […]

[…] Table 5 in the main manuscript for acronyms). Datasets on the OX axis are ordered by increasing network size, from N = 590 to N = 364K nodes. The trend line is suggested with a red dotted line.
Having established a general overview of how graph metrics evolve in real data based on network size, we further generate WBPA networks (based on betweenness, Btw, as node fitness) and compare them to DPA (based on degree, Deg), ECPA (based on eigenvector centrality, EC), ClsPA (based on closeness, Cls) and CCPA (based on clustering coefficient, CC). To have a comparison base with the real-world results in Figure 4, we make measurements and then average them over 10 networks for each centrality, for network sizes of N = 1K, 2K, 5K, 10K, 50K, and 100K nodes. In Figure 5 we represent the evolution of the same six graph metrics in preferential attachment (PA) networks. To keep the visualization intuitive, we highlight with error bars the minimum and maximum measurements only for DPA and WBPA networks. We also show, with a dotted orange line, the corresponding metric evolution for the random networks that we use as a null model. The random networks are generated with the same number of nodes and average degree as the corresponding PA networks.

By analyzing solely the synthetic results, we conclude that:
• AD drops from 6.2 to 3.3 for all networks as network size N increases.
• ACC decreases for all networks; however, there is a notable difference between Btw and all other networks. Indeed, Btw networks (WBPA) are the only ones capable of producing significant clustering in the network, starting from 0.25 ± 0.016 (N ≈ 1,000) down to 0.15 ± 0.011 (N ≈ 100,000). The other networks start at around 0.05 ± 0.016 (N ≈ 1,000) and decrease to 0 (N ≈ 100,000).

• Mod increases and converges to distinct values for each centrality.
• Dmt increases for all networks, yet every single centrality has its own characteristic increase. In general, the Dmt is similar to that of the null model, which increases from 5 (N ≈ 1,000) to 16 (N ≈ 100,000). CC and Btw produce the shortest diameter networks, while Cls and Deg produce the longest diameters.
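The family of PA networks compared above differs only in the node fitness that drives attachment. The following Python sketch shows a generic growth loop with a pluggable fitness function, using degree (the DPA baseline) as the example. It is a simplified illustration and not the WBPA algorithm of the main manuscript: the seed clique, the +1 smoothing of the fitness, the absence of link weights, and all names are assumptions made here.

```python
import random

def grow_pa_network(n, m, fitness, seed=None):
    """Grow an undirected network where each new node links to m existing
    nodes chosen with probability proportional to fitness(adj, node) + 1."""
    rng = random.Random(seed)
    # Seed graph (assumption): a small clique of m + 1 nodes.
    adj = {v: {u for u in range(m + 1) if u != v} for v in range(m + 1)}
    for new in range(m + 1, n):
        nodes = list(adj)
        weights = [fitness(adj, v) + 1 for v in nodes]  # +1 avoids zero weights
        targets = set()
        while len(targets) < m:
            targets.add(rng.choices(nodes, weights=weights)[0])
        adj[new] = set()
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
    return adj

def degree_fitness(adj, v):
    """Degree as node fitness, i.e. the DPA baseline."""
    return len(adj[v])

net = grow_pa_network(n=200, m=2, fitness=degree_fitness, seed=42)
print(len(net), sum(len(nb) for nb in net.values()) // 2)
```

Replacing `degree_fitness` with a function that returns a node's betweenness, closeness, eigenvector centrality or clustering coefficient gives the corresponding variant, at the cost of recomputing that centrality as the network grows.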

SI.3. UPT-Social: an emerging online social network use case
The objective of this section is to briefly present empirical evidence that node betweenness is a better centrality than node degree in terms of attracting new social ties. In this context, we use a dataset with timestamped dynamical data 12. UPT-social captures the birth (launch) and initial growth, over 6 weeks, of a newly launched online social platform. The dataset contains detailed data about each node over multiple snapshots in time, and reaches a size of 351 users in 44 days after launch. The UPT-social dataset provides several snapshots at relative moments in time after launch (day 0), namely: days 3, 7, 15, 24, and 44. For simplicity, we refer to these moments in time as T_1 to T_5. For each T_i we define a weighted correlation function, for both node centralities, called link attractiveness α, in which k_i is the degree of a node v_i and b_i its betweenness; r_i is the number of new links received (attracted) in T_i since T_{i−1}. The * superscript indicates that all three metrics (k_i, b_i and r_i) are normalized. The obtained α-sums are given in Table 3, and represent weighted correlations between a node's centrality and its ability to attract new links, summed up over the whole network. We obtain consistent results that betweenness has a higher attractiveness than degree, within a range of roughly +6% to +49%.

Table 3. Evolution of degree (k) and betweenness (b) attractiveness α in the UPT-social dataset over 4 moments T_i in time.

A second analysis we present studies the evolution and correlation in time of the same three metrics: degree, betweenness and received links. We obtain consistent results for the most relevant nodes in the network, and present them in Figure 6 for the top 3 nodes, in terms of received links, for snapshots T_3 to T_5.
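The defining equation of the link attractiveness α is not reproduced in this text. Purely as an illustration, the sketch below assumes one plausible reading of the verbal description: α is the sum over all nodes of the product between the normalized centrality and the normalized number of received links, with each metric normalized by its network-wide sum. Both the functional form and the normalization are assumptions, and the per-node values are hypothetical rather than taken from UPT-social.

```python
def normalize(values):
    """Scale values to sum to 1 (assumed normalization; all-zero stays zero)."""
    total = sum(values)
    return [v / total if total else 0.0 for v in values]

def attractiveness(centrality, received):
    """Assumed form: alpha = sum_i c*_i * r*_i over all nodes."""
    c_star, r_star = normalize(centrality), normalize(received)
    return sum(c * r for c, r in zip(c_star, r_star))

# Hypothetical snapshot: per-node degree, betweenness and newly received links.
k = [5, 3, 2, 1]
b = [2.0, 6.0, 0.5, 0.0]
r = [1, 4, 0, 0]

alpha_k, alpha_b = attractiveness(k, r), attractiveness(b, r)
print(f"alpha_k = {alpha_k:.3f}, alpha_b = {alpha_b:.3f}")
```

Under this reading, a centrality scores a higher α when the nodes that rank high on it are also the ones that receive most of the new links in the following snapshot.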
The left-most vertical panel in Figure 6 corresponds to the oldest receivers, which were in the top 3 during T_2; these nodes present the full evolution of betweenness (rise, peak, fall and stabilization) and, respectively, the rise and saturation of degree, as explained through the social evolution cycle in the Discussion section (Figure 7). The middle and right panels of Figure 6 represent similar evolutions of the two centralities, but capture the nodes in the middle stages of their evolution. Both numerical and visual results are presented to support the higher potential of betweenness centrality as a driver in social network emergence, and further explain the social evolution cycle (and the implicit degree-betweenness dependency) we propose in the paper.

SI.4. State of the art network model comparison
In order to extend our realism assessment of WBPA, we introduce several social network models, established in the field, as a comparison baseline. As such, we make use of the Cellular 13, Holme-Kim 14, Toivonen 15, and WSDD 16 networks. Their averaged metric measurements are presented in Table 4, alongside the standard deviations (±σ), after generating 10 networks of size N = 10,000 nodes for each model.
Cellular networks are composed of semi-independent cells of small sizes, each with one node acting as a cell leader. Only leaders may connect to other cells, via their respective leaders, resulting in a highly decentralized topology. Inspired by covert networks, this is a non-traditional organizational configuration, without a hierarchical structure.
The Holme-Kim model extends the BA scale-free network model 1 to include a triad formation step. This model yields a high degree of realism because it retains the characteristics of standard scale-free networks, such as a power-law degree distribution and a small average path length, while adding high clustering.
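The triad formation step can be sketched in Python as follows. This is a minimal illustration based on the general description of the model: after each preferential attachment (PA) step, the next link is, with some probability, attached to a neighbour of the node just chosen, which closes a triangle. The parameter names (`p_triad`, `seed`), the seed graph of m isolated nodes, and the degree-proportional sampling pool are assumptions of this sketch, not the exact setup used for Table 4.

```python
import random

def holme_kim(n, m, p_triad, seed=None):
    """Grow a network by preferential attachment (PA) plus triad formation."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(m)}  # m isolated seed nodes (assumption)
    pool = list(range(m))               # nodes repeated roughly by degree
    for new in range(m, n):
        adj[new] = set()
        last_pa = None  # target of the most recent PA step
        for _ in range(m):
            target = None
            if last_pa is not None and rng.random() < p_triad:
                # Triad formation: link to a neighbour of the last PA target.
                options = [u for u in adj[last_pa]
                           if u != new and u not in adj[new]]
                if options:
                    target = rng.choice(options)
            if target is None:
                # PA step: draw from the pool until an unlinked node is found.
                target = rng.choice(pool)
                while target in adj[new]:
                    target = rng.choice(pool)
                last_pa = target
            adj[new].add(target)
            adj[target].add(new)
        pool.extend(adj[new])    # each target gained one link
        pool.extend([new] * m)   # the new node has m links
    return adj

net = holme_kim(n=300, m=3, p_triad=0.8, seed=7)
print(len(net), sum(len(nb) for nb in net.values()) // 2)
```

Setting `p_triad` to 0 reduces the sketch to plain degree-driven attachment, while larger values trade PA steps for triangle-closing links.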
The Toivonen model starts from a different perspective, namely that real-world social networks are divided into communities with dense internal connections, resulting in consistently high values of the clustering coefficient. Also, the authors consider the observed degree assortativity, and the broad degree distribution. The Toivonen model is capable of reproducing realistic synthetic networks based on a mixture of random attachment and implicit preferential attachment.
Finally, the Watts-Strogatz model with degree distribution (WSDD) is a small-world network with enhanced preferential attachment. The creation algorithm relies on generating a set of independent communities inside which preferential attachment is applied; then, each community is connected to other communities, as if they were unconnected nodes, with the Watts-Strogatz algorithm.

Table 4. Mean values of average degree (AD), average path length (APL), average clustering coefficient (ACC), modularity (Mod), diameter (Dmt), and density (Dns) averaged (± standard deviation σ) for synthetic state-of-the-art network models of size N = 10,000 nodes.

The described state-of-the-art network models are compared, alongside WBPA, DPA and the random network model, with multiple real-world datasets in terms of their ability to reproduce similar graph metrics. The comparison is expressed using the fidelity metric φ, and is available in Table 3 of the main manuscript. Table 5 provides the averaged measures for the WBPA, DPA and random network models, for graphs of size N = 10,000 nodes.

Table 5. Mean values of average degree (AD), average path length (APL), average clustering coefficient (ACC), modularity (Mod), diameter (Dmt), and density (Dns) averaged (± standard deviation σ) for the WBPA, DPA, and random network models of size N = 10,000 nodes.

SI.5. Datasets availability
In Table 6 we detail the nature of nodes and links, and provide direct URLs for acquiring each real-world dataset used in this paper for the accuracy tests.