An interpretable approach for social network formation among heterogeneous agents

Understanding the mechanisms of network formation is central to social network analysis. Network formation has been studied in many research fields, each with its own focus; for example, network embedding algorithms in the machine learning literature accommodate broad heterogeneity among agents, while the social sciences emphasize the interpretability of link formation mechanisms. Here we propose a social network formation model that integrates methods from multiple disciplines and retains both heterogeneity and interpretability. We represent each agent by an "endowment vector" that encapsulates their features and use game-theoretical methods to model the utility of link formation. After applying machine learning methods, we further analyze our model by examining micro- and macro-level properties of social networks, as most agent-based models do. Our work contributes to the literature on network formation by combining methods from game theory, agent-based modeling, machine learning, and computational sociology.


Supplementary Note 1: Pairwise Stability
The proof of Proposition (1) follows directly from the definition of pairwise stability. On the one hand, removing an agent $j \in S^*_i$ leads to a decrease in $U_i$, i.e.,
$$U_i(S^*_i \setminus \{j\}) \leq U_i(S^*_i) \quad \forall j \in S^*_i.$$
In our specific form (equation (5) in the main text), this becomes
$$\Delta u_i(j) \geq 0 \quad \forall j \in S^*_i.$$
On the other hand, forming a new link does not increase the utility of both agents at the same time, i.e., for $l \notin S^*_i$, if
$$\Delta u_i(l) > 0,$$
then
$$\Delta u_l(i) < 0.$$
Equivalently, between $\Delta u_i(l)$ and $\Delta u_l(i)$, at least one must be less than zero. Therefore, we have
$$\min\{\Delta u_i(l),\, \Delta u_l(i)\} < 0 \quad \forall l \notin S^*_i.$$
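For concreteness, the following is a minimal sketch of how these two conditions could be checked numerically, assuming a hypothetical marginal-utility function `delta_u(i, j)` and a dictionary of neighborhoods $S^*_i$; it illustrates the definition and is not part of the estimation procedure.

```python
import itertools

def is_pairwise_stable(neighbors, delta_u):
    """Check the two conditions of Proposition (1).

    neighbors: dict mapping each agent i to its neighbor set S*_i.
    delta_u:   hypothetical function delta_u(i, j) giving the marginal
               utility agent i derives from a link with j.
    """
    agents = list(neighbors)
    # Condition 1: every existing link yields non-negative marginal utility.
    for i in agents:
        if any(delta_u(i, j) < 0 for j in neighbors[i]):
            return False
    # Condition 2: for every absent link, at least one side strictly loses.
    for i, l in itertools.combinations(agents, 2):
        if l not in neighbors[i] and min(delta_u(i, l), delta_u(l, i)) >= 0:
            return False
    return True
```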

Supplementary Note 2: Data description
Here we describe in detail the datasets that we utilize, with a focus on the individual characteristics to predict. Degree distributions are presented in Supplementary Figure 1.

Description of Andorra Dataset
We begin with a brief illustration of the format of the dataset. Each entry contains the anonymized phone number that initiates a record (call/text/cellular), the starting time of the record, the duration of the record, the cell tower ID of the initiator, the type of the record (call/text/cellular), the cell phone carrier (containing the country code), the anonymized phone number that receives the call or text, and the type allocation code (containing the phone type). Summary statistics of individual characteristics are presented in Supplementary Table 1. Because we lack fine-grained information about users, we use the following three variables as proxies for individual characteristics.
• Phone type. In Andorra, the first and second most common phone types are Samsung and Apple, respectively, so we distinguish these two phone types from all others. We believe that phone type is (weakly) correlated with individual income and social status; we therefore employ it as one of the variables to predict.
• Location. Andorra Telecom records the cell tower location to which each call connects. We are then able to label each user by the place where she appears most frequently. Andorra Telecom classifies cell towers into seven clusters (as shown in Supplementary Table 1). Note that the clusters do not strictly correspond to the seven districts in Andorra.
• Internet usage. For Internet usage, we show the minimum, the first quartile, the median, the third quartile, and the maximum in the dataset. Young people tend to use the Internet more frequently than older people, so Internet usage should be a decent proxy for age, given that we lack more fine-grained individual characteristics.

Description of Movie Dataset
The TMDB is a public dataset 1 . The dataset contains 4,803 movies in total. For each movie, participants are classified as either cast or crew, and each individual is identified by a unique ID. We extract only the actors/actresses and directors, and then establish links between the director and the cast, meaning that "the director invited the actor/actress to the collaboration and the actor/actress agreed". If an individual serves as the director in more than half of the movies that she has engaged in, we label her as "director"; otherwise, "actor/actress". The gender of each individual is also identified in the dataset.
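As a sketch, the link-building and labeling rule described above could be implemented as follows; the input format (per-movie `directors` and `cast` lists) is a hypothetical pre-parsed form of the TMDB credits.

```python
from collections import Counter

def build_movie_network(movies):
    """movies: list of dicts like {"directors": [...], "cast": [...]}
    (a hypothetical pre-parsed form of the TMDB credits)."""
    edges = set()
    directed = Counter()   # movies in which a person directed
    appeared = Counter()   # movies in which a person took part
    for m in movies:
        for person in set(m["directors"]) | set(m["cast"]):
            appeared[person] += 1
        for d in m["directors"]:
            directed[d] += 1
            for a in m["cast"]:
                if a != d:                     # avoid self-loops
                    edges.add(frozenset((d, a)))
    # Label as "director" those who directed more than half of their movies.
    labels = {p: "director" if directed[p] > appeared[p] / 2 else "actor/actress"
              for p in appeared}
    return edges, labels
```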

Description of Company Dataset
The company dataset is also a public dataset (MobileD 2 ). We examine this network because it shows collaboration within a company and provides labels of individual characteristics (managers or subordinates). In this network, managers' neighbors are mostly managers, and subordinates' neighbors are mostly subordinates. Therefore, strong homophily (caused by reduced coordination costs) exists on the "job" dimension, but there are also exchanges between the two types of employees. We predict whether an employee is a manager or a subordinate.

Description of Trade Dataset
Because we do not consider the strength of a link, we would obtain a very densely connected network if we did not filter out some links. We therefore filter out links with export amounts lower than one billion; countries with high trade volumes thus retain more links. We also remove countries without any remaining link, because we then have no information with which to estimate endowments for these countries. This process yields a network of 100 countries.
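A minimal sketch of this filtering step, assuming the raw data comes as a dictionary of bilateral export amounts (a hypothetical format):

```python
import networkx as nx

def build_trade_network(export_amounts, threshold=1e9):
    """export_amounts: {(exporter, importer): amount} (assumed format).
    Keep a link when the export amount reaches the threshold, then
    remove countries left without any link."""
    G = nx.Graph()
    G.add_nodes_from(c for pair in export_amounts for c in pair)
    for (a, b), amount in export_amounts.items():
        if amount >= threshold:
            G.add_edge(a, b)
    # Countries without any surviving link carry no information.
    G.remove_nodes_from(list(nx.isolates(G)))
    return G
```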
We extract three features for the 100 countries (or regions) in the network:
1. Continent. We predict whether a country belongs to Africa, America, Asia/Pacific, or Europe, and measure performance by the average AUC of these four predictions. We have 14 African countries, 19 American countries, 33 Asian or Pacific countries, and 33 European countries in this network.
2. GDP. We use the GDP in 2014 as the second variable to predict, labeling countries with a GDP above the median as 1 and the rest as 0.
3. Export complexity index (ECI). ECI measures the economic complexity of a country; for example, Japan has a high ECI because it exports products with high uniqueness. As with GDP, we label countries with an ECI above the median as 1 and the rest as 0.

Description of Synthetic Dataset
As mentioned in the main article, we generate the network with two types of agents (e.g., buyers and sellers) who explore exchanges in their neighborhoods. We predict two individual characteristics in this dataset:
1. Type. As mentioned previously, we predict the type of each agent (a binary variable). The probability that agent $i$ is type A (e.g., a buyer) is 0.5, and agents draw their types independently.
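A minimal sketch of the type assignment (the exchange dynamics that generate links are described in the main article and omitted here):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
side = 50                                    # the 50x50 grid
# Independent Bernoulli(0.5) draws: True -> type A (e.g., buyer).
is_type_a = rng.random((side, side)) < 0.5
```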

Supplementary Note 3: Learning
Learning Method

Random noise. Since real-world social networks may incorporate random noise, we add an i.i.d. random shock $\epsilon_{ij}$ to the marginal utility of each pair $(i, j)$.
Following the conditions in Proposition (1), for $j \in S^*_i$,
$$\Delta u_i(j) + \epsilon_{ij} \geq 0,$$
and for $l \notin S^*_i$,
$$\min\{\Delta u_i(l) + \epsilon_{il},\; \Delta u_l(i) + \epsilon_{li}\} < 0.$$
For computational simplicity, we assume that the CDF of the random shock $\epsilon_{ij}$ is a sigmoid function,
$$F(x) = \frac{1}{1 + e^{-x}}.$$
We then establish the loss function and learn $W$, $b$, and $c$ by minimizing it. $u_0$ can be considered a baseline benefit or cost of forming a link (a reserved utility); it is learned along with $b$ and $c$ but omitted in later equations.
Loss. We next specify the loss function, which combines a fitting term for connected pairs, a term for sampled unconnected pairs, a false-positive term, and a regularization term (the exact forms are given in the main text):
• The fitting term measures how well connected pairs satisfy the stability condition: for each $j \in S^*_i$, the marginal utility $\Delta u_i(j) + u_0$ should be positive. $u_0$ is an unknown scalar to be learned along with $b$ and $c$. Multiplying by $|S^*_i|$ accounts for the fact that when an agent has many neighbors (a large $|S^*_i|$), the marginal benefit from each neighbor ($\Delta u_i(j)$) does not need to be very large. In other words, we require $\Delta u_i(j)$ to be large if $i$ has few neighbors but not necessarily large if $i$ has many neighbors.
• $L_{\text{reg}} = \lambda_{\text{reg}} \lVert \theta \rVert_1$. The Lasso ($\ell_1$) penalty helps to eliminate unnecessary dimensions and to determine the choice of $K$: if we learn a $b_k$ or a $c_k$ close to zero, we can remove this dimension and find an optimal $K^*$.
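The exact loss is given in the main text; the sketch below shows one plausible NumPy form consistent with the ingredients above (sigmoid shock CDF, the $|S^*_i|$ weighting, and the $\ell_1$ penalty). The function names and the precise placement of the $|S^*_i|$ factor are our assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss(du_pos, deg_pos, du_neg, theta, u0, lam_reg):
    """Illustrative loss; the exact form is in the main text.

    du_pos:  Delta u_i(j) for connected pairs, j in S*_i
    deg_pos: |S*_i| for the i side of each connected pair
    du_neg:  Delta u for sampled unconnected pairs
    """
    # Connected pairs should have positive marginal utility; the |S*_i|
    # weight relaxes the requirement for high-degree agents.
    l_pos = -np.mean(np.log(sigmoid(deg_pos * (du_pos + u0))))
    # Sampled unconnected pairs should fail the linking condition.
    l_neg = -np.mean(np.log(1.0 - sigmoid(du_neg + u0)))
    # Lasso penalty that prunes degenerate dimensions.
    l_reg = lam_reg * np.abs(theta).sum()
    return l_pos + l_neg + l_reg
```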
Dimensionality selection. We next provide a method to select the optimal dimensionality $K^*$ (along with $K^*_{\text{bnf}}$ and $K^*_{\text{cst}}$); Algorithm 1 presents the method. The idea is that for each $K$, we enumerate all possible $K_{\text{bnf}}$. "SGD" denotes the algorithm for learning $W$ and $\theta$ via stochastic gradient descent and is discussed below. We stop increasing $K$ once the $K_{\text{bnf}}$ that gives rise to the best AUC for the current $K$ yields a $\theta$ containing a zero-valued $\theta_k$ ($< 0.01$ in practice). A zero-valued $\theta_k$ means that the $k$-th dimension contributes to neither benefits nor costs, so we can drop that dimension without impacting the performance of the model; we call such dimensions "degenerate dimensions". After removing them, we obtain the optimal dimensionality and the values of the corresponding variables.
Result: $W^*$, $\theta^*$, $K^*$, $K^*_{\text{bnf}}$, $K^*_{\text{cst}}$
Initialize $K$ to a small number;
repeat
  $K = K + 1$;
  for $K_{\text{bnf}} = 1, \dots, K$ do
    Run SGD($K$, $K_{\text{bnf}}$); let $L(K, K_{\text{bnf}})$ be the current loss;
    if $L(K, K_{\text{bnf}}) < L^*(K)$ then $L^*(K) = L(K, K_{\text{bnf}})$; $K^*_{\text{bnf}}(K) = K_{\text{bnf}}$;
until $\theta(K, K^*_{\text{bnf}}(K))$ contains a zero-valued $\theta_k$;
$K^* = K$ (after removing the $k$-th dimensions with $\theta_k = 0$); $K^*_{\text{bnf}} = K^*_{\text{bnf}}(K^*)$; $\theta^* = \theta(K^*, K^*_{\text{bnf}})$;
Algorithm 1: The procedure of learning endowment vectors and other parameters.

Stochastic gradient descent (SGD). Here we describe the algorithm for SGD given $K$ and $K_{\text{bnf}}$. We use TensorFlow (1) to implement the learning algorithm.
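A runnable Python rendering of Algorithm 1, assuming a hypothetical wrapper `sgd(K, K_bnf)` that returns `(W, theta, loss)` for one learning run:

```python
def select_dimensionality(sgd, k_max=16, tol=0.01):
    """Sketch of Algorithm 1; `sgd(K, K_bnf)` is a hypothetical wrapper
    returning (W, theta, loss) for one learning run."""
    K = 1
    while K < k_max:
        K += 1
        # Enumerate all K_bnf for the current K and keep the best run.
        runs = [(k_bnf,) + sgd(K, k_bnf) for k_bnf in range(1, K + 1)]
        k_bnf, W, theta, loss = min(runs, key=lambda r: r[3])
        degenerate = [k for k, t in enumerate(theta) if abs(t) < tol]
        if degenerate:
            # Drop degenerate dimensions and stop increasing K.
            return W, theta, K - len(degenerate), k_bnf
    return W, theta, K, k_bnf
```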
Specifically, we use the Adam optimizer (2). We run a large number of iterations (5,000) to guarantee the convergence of learning. Importantly, because memory typically cannot accommodate all $N^2$ pairs as the input for one iteration, we retain all connected pairs as input while randomly sampling unconnected pairs. Specifically, we sample ten times the number of connected pairs in each iteration (four times for Andorra because of the memory limit); we nevertheless obtain a consistent estimate of $L_{\text{neg}}$ in expectation. This technique is called "negative sampling" and is used in many machine learning methods. For a typical network with a density of 0.1%, each iteration samples approximately 1% of all pairs, so every 100 iterations cover all pairs in expectation. Note also that the stochastic nature of this learning process reduces the likelihood of converging to unwanted local minima. To approximate the optimum, we select the run with the best AUC for link fitting.
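A sketch of the negative-sampling step, assuming edges are given as a set of index pairs:

```python
import numpy as np

def sample_pairs(edges, n_nodes, ratio=10, rng=None):
    """One training batch: all connected pairs plus `ratio` times as many
    randomly drawn unconnected pairs (negative sampling)."""
    rng = rng or np.random.default_rng()
    pos = list(edges)
    connected = set(pos) | {(j, i) for i, j in pos}
    neg = []
    while len(neg) < ratio * len(pos):
        i, j = map(int, rng.integers(n_nodes, size=2))
        if i != j and (i, j) not in connected:
            neg.append((i, j))
    return pos, neg
```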
Speed-up for the Andorra dataset. When the network is large (more than 10,000 nodes, e.g., Andorra), the computation of $L_{\text{fp}}$ is very time-consuming. To speed up the learning, we can drop this term because, empirically, it is small for large-scale networks: when learning endowment vectors for the Andorra dataset, this term accounts for less than 0.1% of the total loss even when it is not optimized. Therefore, we drop $L_{\text{fp}}$ for the Andorra dataset to speed up the learning but retain it for the other datasets to improve learning performance.
Choice of $\lambda_{\text{reg}}$ and $\lambda_{\text{fp}}$. The value of $\lambda_{\text{reg}}$ constrains the dimensionality and avoids overfitting. A large $\lambda_{\text{reg}}$ drives all $b_k$ and $c_k$ to zero, while a zero-valued $\lambda_{\text{reg}}$ fails to limit the dimensionality, and the model may eventually overfit the data with an extremely high $K$. We select $\lambda_{\text{reg}}$ a priori based on the complexity of the network: when the number of nodes is smaller than 1,000, we let $\lambda_{\text{reg}} = 0.1$; when it is between 1,000 and 10,000, we set $\lambda_{\text{reg}} = 0.05$; otherwise we set $\lambda_{\text{reg}} = 0.01$. We set $\lambda_{\text{fp}} = 0.01$ for all datasets. The good fitting and prediction performance demonstrate that this selection is reasonable; note that an optimal $\lambda_{\text{reg}}$ or $\lambda_{\text{fp}}$ is unnecessary as long as they yield satisfactory endowment-vector estimates.
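The a priori rule above, written out as a trivial helper for reproducibility:

```python
def choose_lambda_reg(n_nodes):
    """Encodes the a priori rule stated above; lambda_fp is fixed at 0.01."""
    if n_nodes < 1_000:
        return 0.1
    if n_nodes < 10_000:
        return 0.05
    return 0.01
```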
Learning rate and the number of iterations. There is no specific a priori guidance for selecting these two terms. Empirically, 0.1 is a reasonable initial learning rate, decayed by 5% every 100 epochs, and 5,000 is a reasonable number of iterations. We show in the next section that these choices yield sensible results.
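The schedule described above, as a small helper (a sketch; the exact decay implementation may differ):

```python
def learning_rate(epoch, base=0.1, decay=0.95, every=100):
    """Initial rate 0.1, decayed by 5% every 100 epochs."""
    return base * decay ** (epoch // every)
```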

Learning curves
Here we plot the learning curves for all datasets under the optimal settings. As described in Methods, we start from five random initial points to approximate the global optimum. As shown in Supplementary Figure 2, the learning curves converge under these settings.

Results of dimensionality selection
For each $K$, we have a $K^*_{\text{bnf}}(K)$ that results in the smallest loss. If a zero-valued $b_k$ or $c_k$ is learned at the setting $(K, K^*_{\text{bnf}}(K))$, we stop increasing $K$. Here we plot the losses of the best run for each pair $(K, K_{\text{bnf}})$ for all datasets in the Supplementary Figures; the resulting optima are (7, 4) for the Movie dataset and (11, 4) for the Company dataset (the optimum for the Synthetic dataset is reported in the corresponding figure). For Andorra, because of limited computational resources, we are unable to enumerate all $(K, K_{\text{bnf}})$. Instead, we set an arbitrarily large $K = 16$ and reduce the dimensionality via the regularization term $L_{\text{reg}}$, where we find one costly degenerate dimension; removing it gives $K = 15$. We do not argue that 15 is the optimal choice for the Andorra dataset, but it is good enough for the subsequent agent-based modeling.

Illustration for the trade network
Since the trade network has only 100 countries, we can visualize the learned endowment vectors at a dimensionality close to the optimum: $K = 6$ and $K_{\text{bnf}} = 4$ (keeping both $K_{\text{bnf}}$ and $K_{\text{cst}}$ even is convenient for visualization).
Supplementary Figure 5 depicts the distribution of endowment vectors for all countries in the network. The first four dimensions are beneficial dimensions (with $b_k > 0$); we find that China, the United Arab Emirates, the US, and Germany are high in these four dimensions, respectively. The last two dimensions are costly dimensions (with $c_k > 0$); we find that geographically close countries cluster in this space, meaning that the distance between countries plays an important role in the costs of network formation.

Illustration for synthetic dataset
As mentioned in the main article, nodes are located on a 50×50 grid, and each node is either a seller or a buyer (with probability 0.5, independently). We plot the learned endowments in Supplementary Figure 6. The plots show that we have recovered the data-generating process. In the left panel, red and blue nodes are randomly distributed, consistent with the fact that buyers and sellers are randomly distributed on the grid. In the right panel, we observe that the degree of blue increases along the x-axis and the degree of red increases along the y-axis: when both x and y are small, black is shown; when both are large, purple is shown. In sum, the right panel demonstrates that we have successfully recovered the data-generating process for the costly dimensions.

Supplementary Note 5: Comparison with network embedding algorithms
Although we have emphasized that the proposed model is not intended for dimension reduction of graph structures, we compare it with a classical network embedding algorithm, DeepWalk. Our goal is not to invent an algorithm that outperforms DeepWalk or other network embedding algorithms. In addition, to retain interpretability (the link formation mechanism) and model simplicity, we do not enforce similarity among nodes on randomly sampled paths, which is why DeepWalk and other network embedding algorithms sometimes perform better than our model in node classification tasks, especially when similar nodes are clustered on the networks. The comparison is shown in Supplementary Table 2.

Micro- and macro-level properties for the other datasets are reported in Supplementary Table 3. Because the synthetic network is a bipartite graph and the movie network is almost a bipartite graph (some agents serve as both directors and cast members), clustering coefficients are inapplicable. For most datasets, we find results consistent with those for the Andorra dataset. For the company data, we do not observe a significant correlation between social power and degree; we conjecture that this is because high-status members of the company (managers) do not need many connections to maintain their status.

In addition, we show that even if we do not force some $b_k$ or $c_k$ to be zero-valued in the "non-split" condition, we still learn $b^*$ and $c^*$ such that for most $k$, either $b_k = 0$ or $c_k = 0$.
Supplementary Note 9: Robustness check: Variations of the functional form

We show the robustness of our model by using two variations of the functional form.
The first variation (Abs) changes the norm in $G_i$ from $\ell_2$ to $\ell_1$. The second (Smooth) uses a smooth form of the ReLU function $\max(x, 0)$ to assist the optimization:
$$\max(x, 0) \approx \log(1 + \exp(x)).$$
Note that when $x$ is far from 0, $\max(x, 0)$ is very close to $\log(1 + \exp(x))$; when $x$ is around 0, $\log(1 + \exp(x))$ has a smoother gradient than $\max(x, 0)$, which is more convenient for the optimization. We can also view $\log(1 + \exp(x))$ as a "noisy" form of $\max(x, 0)$: when $x$ is close to zero, a small random shock determines whether there is a benefit.
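A quick numeric check of the approximation (values rounded):

```python
import numpy as np

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
relu = np.maximum(x, 0.0)         # [0.      0.      0.      1.      10.     ]
softplus = np.log1p(np.exp(x))    # [0.0000  0.3133  0.6931  1.3133  10.0000 ]
# Far from zero the two coincide; near zero softplus stays smooth.
```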
To show that these two functional forms do not affect the fitting ability or the predictive power of the learned endowments, we present the fitting performance in the Supplementary Tables.