Inferring personal economic status from social network location

It is commonly believed that patterns of social ties affect individuals' economic status. Here we translate this concept into an operational definition at the network level, which allows us to infer the economic well-being of individuals through a measure of their location and influence in the social network. We analyse two large-scale sources: telecommunications and financial data of a whole country's population. Our results show that an individual's location, measured as the optimal collective influence to the structural integrity of the social network, is highly correlated with personal economic status. The observed social network patterns of influence mimic the patterns of economic inequality. For pragmatic use and validation, we carry out a marketing campaign that shows a threefold increase in response rate by targeting individuals identified by our social network metrics as compared to random targeting. Our strategy can also be useful in maximizing the effects of large-scale economic stimulus policies.

limited. However, the financial cost of using phone services makes it possible that there is a systematic bias in how much wealthy individuals use the phone services relative to people that have less money to spend on phone calls. At this point, with the present data, we cannot rule out this possibility.
The financial dataset from a major bank in the same country was collected during the same time period as the mobile dataset. These data record financial details of 1.23 × 10^6 clients assigned unique anonymized identifiers over the same three-month period as the mobile network. The dataset consists of records of the bank clients' age, gender, credit score, total transaction amount during each billing period, credit limit of each credit card, balance of cards (including debit and credit), zip code of billing address, and encrypted registered phone number. A subset of 5.02 × 10^5 clients have an encrypted mobile phone number, thus enabling them to be matched with the mobile communication dataset. The phone numbers are encrypted in the same way as in the mobile dataset, which guarantees that the two datasets are matched. Excluding the information on credit lines, all other personal information is erased. We sum up the credit limits of all the credit cards of each account owner to represent the total credit limit of each individual.
In the absence of direct access to an individual's income and total assets, evaluating an individual's financial status remains an open question. In this dataset, we can access the following factors. The first is the transaction amount, which directly reflects an individual's consumption patterns.
However, since it is common that one holds multiple accounts in different banks, and some of these may not be used at all, records in only one bank might not correctly reflect the real spending ability of an individual. Similar reasoning can be applied to total credit card balance per month, which could also lose its ability to measure one's financial status.
Credit scores assigned to individuals by credit scoring agencies are also good indicators of financial status. However, the values of credit scores are quite limited, ranging from 300 to 850. This limited range makes the credit score a low-resolution indicator of wealth that does not allow us to correctly classify a large number of people into well-defined financial classes.
On the other hand, the credit limit ranges over three orders of magnitude, allowing us to correctly classify the entire population. Considering the weaknesses of the other features, total credit limit is the most convenient measure of personal financial status in the present dataset.
Instead of transaction amounts and credit scores, we therefore choose the total credit limit, which is assigned by the bank after a comprehensive evaluation of an individual's financial status, as a proxy for financial status. Since detailed information on how the credit limit is assigned is not provided, there are several possible factors that could cause bias in inferring an individual's real economic status. These include the delay with which the credit limit reflects a change in an individual's financial status, a possible correlation with the age of the account, and so on. In fact, the credit limit might be capturing the amount of information the bank has about the customer, instead of his/her actual income.

Supplementary Note 2 - Removing non-human-operated lines
Inferring social network structure through mobile phone data requires the removal of lines operated by non-humans. Due to privacy restrictions, we could not filter business landlines and spawn spreaders at the outset. Several ways of filtering the landlines were applied in previous works, including setting a cut-off threshold degree [1] or only considering reciprocal phone calls [2]. However, these methods usually also cut off some important human communication behavior in that particular window of observation. All communication events should be considered in evaluating the social network. Therefore, the key problem is to find a method to distinguish human- and non-human-operated lines while retaining maximal information about individuals' communication patterns.
Although we do not have a human/non-human label for all of the phone lines, which would allow us to separate the non-human-operated lines at the outset, we are in possession of the set of phone numbers registered with the bank dataset. These human-operated lines make it possible to supervise a machine learning process that learns the human behavior separating them from robots and non-human-operated lines. We set up a hypothesis test by modeling the human-operated lines based on several variables. We first cluster the human-operated lines in a hyperspace. A new unlabeled node is then assigned a p-value according to its distance to the cluster. By carefully choosing a threshold for the p-values, we can label the node according to whether we accept or reject the hypothesis that the line is operated for personal use.
A training set was set up consisting of the phone lines in the bank database (1.23 × 10^6 nodes), which is around 1% of all of the data in the entire network (1.10 × 10^8 nodes). We define a call or message from phone number $i$ to $j$ as a 'communication event', and denote the total number of communication events on the link as $W_{i \to j}$. The key assumptions of the model are the following:
1. Communication between lines of personal use is usually (but not always) reciprocal. This means that the fraction of paired communication events on human-operated lines is generally higher than that of unpaired ones. Namely, although the communication load difference $D_i$ on every line should increase with degree $k$, it should be bounded by an upper limit in the case of human-operated lines. Numbers operated for non-personal use, like business hubs and spawn spreaders, may have very large $D_i$ because they are usually operated only for sending or only for receiving phone calls, but not for both at the same time.
2. Other types of business hubs may have large numbers of paired communications despite their limited $D_i$. These business hubs include the phone numbers for company landlines, roadside assistance, or other services requiring instant follow-up by the recipient of the phone call. To filter out these hubs we assume that the paired communication $R_i$ also increases with $k$, but is limited for lines for personal use. The decay of the tail is supposed to follow a power law due to the preferential attachment rule [2].
3. Most phone numbers in the network are for personal use, so the number of non-human-operated lines is small.
After we introduce these basic assumptions, empirical analysis can be applied to build a model describing human-operated line behavior. The model simplifies to a parametric probability distribution depending on two random variables, $D_i$ and $R_i$, and a variable maximum degree $k$ which controls the parameters. Under the preferential attachment rule of assumption 2, it is reasonable to assume that the distributions of both $D_i$ and $R_i$ for a given $k$ deviate from a maximum entropy distribution and show a power-law tail. A good approximation is the log-logistic distribution with scale $\alpha(k)$ and shape $\beta(k)$. This also suggests that the logarithm of both metrics follows a normal-like but exponential-tailed logistic distribution with location $\mu(k) = \log(\alpha(k))$ and scale $s(k) = 1/\beta(k)$. Based on the knowledge we have, this distribution is the best choice even though we cannot provide an exact fit. However, the fitting results strongly support the approximation geometrically (Supplementary Figure 1).
After validating the assumptions, we are able to implement the learning process by performing a hypothesis test:
1. Fit the model to the training data and obtain the sequence of estimates $\hat{\mu}_D(k)$, $\hat{s}_D(k)$, $\hat{\mu}_R(k)$, and $\hat{s}_R(k)$.
2. For each node $i$ with given difference $d_i$, number of communication pairs $r_i$ and degree $k_i$, calculate the p-values $p_D(i) = P(D < d_i \mid k_i)$ and $p_R(i) = P(R < r_i \mid k_i)$.
3. Set a threshold $p$ and classify the nodes: if both p-values fall within the acceptance range set by $p$, then $i$ is a human-operated line. Otherwise, the null hypothesis $H_0$: "$i$ is a human-operated line" is rejected, and the node is labeled as a non-human-operated business hub due to its extraordinarily unbalanced communication pattern or large volume of communication events.
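The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fitted parameters are made-up numbers, the function names are hypothetical, and the rejection region (either p-value above 1 - p) is an assumed one-sided rule, since the exact test is not reproduced here. The only ingredient taken as given is that the logarithm of a log-logistic variable is logistic with $\mu = \log(\alpha)$ and $s = 1/\beta$.

```python
import math

def logistic_cdf(x, mu, s):
    """CDF of a logistic distribution with location mu and scale s."""
    return 1.0 / (1.0 + math.exp(-(x - mu) / s))

def p_value(value, mu, s):
    """P(X < value) when log(X) is logistic(mu, s), i.e. X is log-logistic.

    Uses mu = log(alpha) and s = 1 / beta.
    """
    if value <= 0:
        return 0.0  # a log-logistic variable has no mass at or below zero
    return logistic_cdf(math.log(value), mu, s)

def is_human_operated(d_i, r_i, k_i, params, p=0.01):
    """Accept H0 (human-operated) unless D or R lies in the upper tail.

    ASSUMPTION: H0 is rejected when p_D > 1 - p or p_R > 1 - p.
    `params` maps a degree k to hypothetical fitted (mu_D, s_D, mu_R, s_R).
    """
    mu_d, s_d, mu_r, s_r = params[k_i]
    p_d = p_value(d_i, mu_d, s_d)
    p_r = p_value(r_i, mu_r, s_r)
    return p_d <= 1.0 - p and p_r <= 1.0 - p

# Made-up parameters for degree k = 10, for illustration only.
params = {10: (math.log(5.0), 0.5, math.log(20.0), 0.5)}

typical = is_human_operated(d_i=5.0, r_i=20.0, k_i=10, params=params)
hub = is_human_operated(d_i=5000.0, r_i=20.0, k_i=10, params=params)
print(typical, hub)
```

In this toy setting a line sitting at the median of both distributions is accepted, while a line whose load difference is a thousand times the median is rejected.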
Last but not least, the threshold p should be optimized. Suppose the network follows the exact distribution given by the model above. of a single node is ∼ 10K including messages and calls, which is reasonable for a person who is active in business contacts during a three-month period.

Supplementary Note 3 - Entropy Analysis
In order to explore the structural differences between people with different levels of credit limits, we performed an entropy analysis. First, we chose people within the top 5% and the bottom 5 to 10% credit limit percentiles, representative of the wealthy and poor populations respectively. Then, we randomly divided both groups into 20 small subgroups, each containing $N(0) \sim 2700$ bank clients. Next, we expanded each subgroup's contacts by a distance $\ell$ to get a subnetwork and clustered the nodes in the subnetwork through modularity analysis (Supplementary Note 7) into different communities, finally counting the number of nodes inside each community ($n_i$). The entropy of this subnetwork is defined as $S = -\sum_i p_i \log p_i$, where $p_i = n_i / \sum_i n_i$ is the fractional size of community $i$. Also, we introduced two indicators, $R_n(\ell)$ and $R_c(\ell)$. Table 1 shows the results of entropy $S$, $R_n(\ell)$ and $R_c(\ell)$ averaged over the 20 subgroups, with uncertainties.
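As a small illustration of the entropy of a community partition, $S = -\sum_i p_i \log p_i$ with $p_i = n_i / \sum_i n_i$, the sketch below compares a balanced and a skewed set of community sizes. The sizes are invented for the example, the function name is hypothetical, and the natural logarithm is an assumption, since the base is not stated.

```python
import math

def community_entropy(sizes):
    """Shannon entropy of community sizes: S = -sum_i p_i * log(p_i)."""
    total = sum(sizes)
    # Empty communities contribute nothing and would break log(0).
    probs = [n / total for n in sizes if n > 0]
    return -sum(p * math.log(p) for p in probs)

# Invented community sizes n_i for two subnetworks of 100 nodes each.
balanced = community_entropy([25, 25, 25, 25])
skewed = community_entropy([85, 5, 5, 5])
print(balanced, skewed)
```

Four equal communities give the maximum value log(4), while one dominant community lowers the entropy, which is the sense in which lower entropy corresponds to larger, more concentrated communities.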
The entropy in subnetworks generated from the poor population is higher than in subnetworks generated from the wealthy population, while the numbers of both the total communities and nodes are smaller. This suggests that the sizes of the communities in the subnetwork of poor people are relatively more balanced than in the wealthy population.
Namely, wealthy people are more likely to form larger and more closely-connected communities, which results in relatively low entropy. The results for $R_n$ and $R_c$ show a significant difference between the size and diversity of the subnetworks of the wealthy and poor populations. By expanding their contacts, people with higher credit limits 'collect' more people and more communities. Such differences exist even when we increase the value of $\ell$ to 4.
The result of the entropy analysis implies that the network structure of these two groups may be significantly different. Wealthy people have higher diversity in mobile contacts and are centrally located, surrounded by other highly-connected people (network hubs).
Entropy analysis results also provide evidence of homophily, which implies that there exists a higher probability that two wealthy individuals are connected than that a wealthy individual and an extremely poor individual are connected. Since society is known to have this strong stratification property embedded in social networks, we would expect that this feature is expressed in our network. For example, if wealth implies higher degree, then homophily will lead to degree correlations, higher k-shell scores for wealthy individuals, and higher CI. Thus, part of the effect we observe in the present study might be due to the effects of homophily. However, the exact picture of how homophily affects the wealthy population is still to be discovered.

Supplementary Note 4 - Social Network Metrics
In order to capture the analytical evidence describing the effects shown in Figs. 1a-d, we introduce four different metrics to evaluate network influence [3,4].
1. Degree centrality $k_i$ is the simplest evaluation of an individual's local contact size.
It requires minimum information and is easy to calculate. Other centralities such as betweenness centrality cannot be efficiently calculated in our networks due to their nonlinear running times with system size.
2. k-core and k-shell index $k_s$ [5] capture the centrality of a node in the global network by the method of k-shell decomposition. In this method, nodes are removed iteratively if their degree $k_i < k$ until all the remaining nodes have degree equal to or greater than $k$. The remaining nodes form the k-core of index $k$. The largest $k$ for which a node belongs to the k-core is its k-shell index $k_s$, which means the node is in the 'shell' of the $k$-th core but outside the $(k+1)$-th core. The k-shell or k-core number is a global metric. It has been shown to be efficient in identifying single influencers through the SIR model [5]. The k-shell index requires information about the whole network. It does not allow one to classify the nodes with high resolution: there usually exist only a few k-shells in the whole system, each containing many of the nodes in the network. Fig. 1c is a schematic example of a k-shell in a network.
3. PageRank [6] is an eigenvector centrality metric used to evaluate the probability that information or knowledge will visit a node through a random walk. PageRank is calculated through an iterative algorithm in which nodes collect PageRank values from their neighbors in every iteration. For simplicity, each node is initially assigned a value of $PR(i) = 1$. During each iteration, node $i$ collects a PageRank value through each link pointing to it from a neighbor $j$ ($j \to i$), equal to the PageRank of that neighbor divided by its outbound degree $k_j^{out}$. Namely, $PR(i) = (1 - d) + d \sum_{j \in \partial i \to i} PR(j) / k_j^{out}$. Here $\partial i \to i$ is the set of nodes which have outbound links to $i$, and $d$ is a damping factor which we choose as 0.7 in our work. When a convergence threshold ($10^{-4}$) is reached, the iteration stops and outputs the final PageRank values.
Although PageRank was originally proposed for ranking websites, it has also been applied in social network analysis. Given the assumption that senders of messages or makers of phone calls are likely to be the ones providing the information being communicated, PageRank is a good metric to evaluate the likelihood that an individual captures the information spreading in the network. Similarly to k-shell, PageRank requires the global information of the whole network. However, it is easy to update when the network changes.
4. Collective Influence (CI) is an algorithm to identify the most influential nodes via optimal percolation [7]. Unlike the heuristic metrics above, Collective Influence is derived from a theoretical approximation of the solution to the problem of influence maximization in locally tree-like social networks [8]. CI minimizes the largest eigenvalue of a modified non-backtracking matrix of the network in order to find the minimal set of nodes to disintegrate the network. It has been shown that this process maximizes the spread of information via a threshold model of spreading and also provides the most important nodes for the integrity of the network (optimal percolation). Each node is associated with a CI value, and those with the top CI values are the most influential nodes in the network. The definition of CI is given by $CI_\ell(i) = (k_i - 1) \sum_{j \in \partial Ball(i, \ell)} (k_j - 1)$, where $Ball(i, \ell)$ is defined in the text. We should note that the mobile communications network is a typical small-world network (average path length $\langle \ell \rangle \sim 8.9$), and the radius of the ball is limited by the network diameter.
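To make the CI definition concrete, the following minimal sketch evaluates $CI_\ell(i) = (k_i - 1) \sum_{j \in \partial Ball(i, \ell)} (k_j - 1)$ on a toy graph, taking $\partial Ball(i, \ell)$ to be the nodes at shortest-path distance exactly $\ell$ from $i$, as in [7]. The graph, the adjacency-list representation and the function name are assumptions made for illustration; this is not the implementation used on the full network.

```python
from collections import deque

def collective_influence(adj, node, radius):
    """CI_l(i): (k_i - 1) times the sum of (k_j - 1) over nodes j that lie
    at shortest-path distance exactly `radius` from `node`."""
    dist = {node: 0}
    queue = deque([node])
    frontier_sum = 0
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            frontier_sum += len(adj[u]) - 1
            continue  # do not expand beyond the boundary of the ball
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return (len(adj[node]) - 1) * frontier_sum

# Toy undirected graph given as symmetric adjacency lists (illustrative only).
adj = {
    0: [1, 2, 3],
    1: [0, 4, 5],
    2: [0, 6],
    3: [0],
    4: [1],
    5: [1],
    6: [2],
}
print(collective_influence(adj, 0, 1))  # 2 * (2 + 1 + 0)
print(collective_influence(adj, 1, 1))  # 2 * (2 + 0 + 0)
```

Because only a breadth-first search up to distance $\ell$ is needed for each node, the cost stays local, which is the property that makes the metric practical on very large networks.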
Of the metrics we investigated, CI draws our attention since, in practice, it has advantages in resolution, correlation with wealth, and scalability to massively large social networks. On the "global versus local" issue, we point out that while CI comes from a global theory of maximization of influence, it represents a local approximation in a sphere of influence of finite radius $\ell$. Thus, it is a convenient way to quantify influence in large social networks due to its scalability. Furthermore, in cases where the whole picture of global connectivity is incomplete, the local connectivity up to a few layers might be enough to define network influence and predict the financial status of an individual. On the other hand, we have shown that global quantities like the k-core are also good for capturing an individual's financial status. Indeed, the global k-core contains nested structures of relatively large degrees, which resemble the concentric spheres of influence of a high-CI node. However, the k-core suffers from resolution problems: wealthy people might be located preferentially in the core of the network, but this core is too large to locate them with accuracy. For instance, there are only 25 k-cores in the whole network (Fig. 2b) to separate one hundred million people, while CI has a higher resolution, spanning eight orders of magnitude. Thus, in practical terms, CI presents advantages both in resolution and in high correlation with wealth.
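The limited resolution of the k-shell index follows from how it is computed: every node pruned at the same stage receives the same integer value. A minimal sketch of the decomposition is given below, assuming an undirected graph stored as symmetric adjacency lists; the toy graph and the function name are illustrative only.

```python
def k_shell_index(adj):
    """Return {node: k_s} by iteratively pruning low-degree nodes."""
    degree = {u: len(nbrs) for u, nbrs in adj.items()}
    remaining = set(adj)
    shell = {}
    k = 0
    while remaining:
        # Nodes with degree <= k cannot belong to the (k+1)-core.
        to_remove = [u for u in remaining if degree[u] <= k]
        if not to_remove:
            k += 1
            continue
        while to_remove:
            u = to_remove.pop()
            if u not in remaining:
                continue  # already pruned through another neighbor
            remaining.discard(u)
            shell[u] = k
            for v in adj[u]:
                if v in remaining:
                    degree[v] -= 1
                    if degree[v] <= k:
                        to_remove.append(v)  # pruning cascades
    return shell

# Toy graph: a triangle (0, 1, 2), a pendant node 3, and an isolated node 4.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0], 4: []}
shell = k_shell_index(adj)
print(shell)
```

In the toy graph the three triangle nodes share the same index even though node 0 has a higher degree, which illustrates why a handful of shells cannot rank individuals finely.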
Also, CI represents a balance between a global maximization of influence and its local approximation in successive layers, allowing one to use the CI metric in large-scale datasets composed of hundreds of millions of individuals. Overall, we emphasize that CI is just a useful strategy for the reasons shown above, but by no means the only or best way to express the wealth of individuals. More generally, supervised machine learning can be applied to the problem of predicting an individual's credit score based on a number of features.
These methods could include not only CI but also the other measures discussed, along with many other standard network metrics. Augmenting these measures for determining feature importance could allow us to better assess which features are important to determine the wealth of individuals with higher accuracy than that shown by CI in the present study. The prediction model will give standard measures of features' importance in further studies when we have access to more data. Future work will follow this promising direction.
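For reference, the PageRank iteration described in item 3 of this note can be sketched as follows. The update rule $PR(i) = (1 - d) + d \sum_{j \to i} PR(j) / k_j^{out}$ with all values initialized to 1 is assumed here, as are the convergence criterion (largest absolute change below the threshold) and the treatment of nodes without outbound links, which simply pass on nothing. The toy graphs and the function name are illustrative.

```python
def pagerank(out_links, d=0.7, tol=1e-4, max_iter=1000):
    """Iterate PR(i) = (1 - d) + d * sum over j -> i of PR(j) / k_out(j)."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    pr = {u: 1.0 for u in nodes}  # every node starts with PR = 1
    for _ in range(max_iter):
        new = {u: 1.0 - d for u in nodes}
        for j, targets in out_links.items():
            if targets:
                share = d * pr[j] / len(targets)
                for i in targets:
                    new[i] += share
        change = max(abs(new[u] - pr[u]) for u in nodes)
        pr = new
        if change < tol:  # assumed stopping rule
            break
    return pr

# Directed toy graphs: caller -> list of callees (illustrative only).
cycle = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
star = pagerank({"b": ["a"], "c": ["a"], "a": ["b"]})
print(cycle, star)
```

In the symmetric cycle every node keeps the initial value, while in the second graph the node that receives links from both others ends with the highest score and the node with no inbound links with the lowest.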

Supplementary Note 5 - Financial parameters and other factors
We use the following statistics to identify economic effects: First, we separate the individuals into groups on sampling grids in variable space (in 1D as segment bins and in 2D as grids). In each group (with more than 10 people for statistical significance), we count the fraction of wealthy individuals, defined as those individuals in the top quartile (Q > 0.75) or who have a total credit limit greater than USD 4,000 (converted).
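The 1D version of this statistic can be sketched as below. Equal-width bins are an assumption made for simplicity (a heavy-tailed metric would more naturally use quantile or logarithmic bins), the data are synthetic, and the function name is hypothetical. Only the two rules stated above are taken from the text: groups need more than 10 people, and an individual counts as wealthy above the credit limit threshold.

```python
from collections import defaultdict

def wealthy_fraction_by_bin(metric, credit_limit, n_bins,
                            threshold=4000, min_count=10):
    """Bin individuals by a network metric; return {bin: fraction wealthy}.

    Bins holding min_count people or fewer are dropped.
    """
    lo, hi = min(metric), max(metric)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # all values identical: everything falls in bin 0
    size = defaultdict(int)
    wealthy = defaultdict(int)
    for value, limit in zip(metric, credit_limit):
        b = min(int((value - lo) / width), n_bins - 1)
        size[b] += 1
        if limit > threshold:
            wealthy[b] += 1
    return {b: wealthy[b] / n for b, n in sorted(size.items()) if n > min_count}

# Synthetic data: 40 individuals whose credit limit rises with the metric.
metric = list(range(40))
credit_limit = [5000 if v >= 30 else 1000 for v in metric]

coarse = wealthy_fraction_by_bin(metric, credit_limit, n_bins=2)
fine = wealthy_fraction_by_bin(metric, credit_limit, n_bins=4)
print(coarse, fine)
```

With two bins each group holds 20 people and is kept; with four bins each group holds exactly 10 people and is discarded by the minimum-size rule.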
Besides the credit limit, transaction amount, and credit score, the bank data also provide the clients' birth years. Age as a variable is independent of the network metrics (Supplementary Table 2) and correlates with the percentile-ranking credit limit (r = 0.42). However, we do not know the model used by the bank to assign the credit limit, so age may be a complex reflection of the mixed effects of both increased income and increased account history. Thus, the correlation between age and credit limit might be capturing not only variation in actual wealth but also the amount of information the bank has about the customer.
To quantitatively evaluate the variance caused by network metrics when combined with other factors, we employed Analysis of Covariance (ANCOVA) [9]. ANCOVA is an analysis method which conducts regressions between a covariate (CV) and a dependent variable (DV) under different groups of categorical independent variables (IV). In this case, the regression was made between the covariate CI and the dependent variable, the fraction of wealthy individuals. As in Fig. 2d, CI is divided into 100 partitions. Based on the information to which we have access, ANCOVA was applied separately among the following independent variables: gender, age, and residential communities. Gender was naturally divided into two groups. Age was grouped year by year from 18 to 65 in a total of 48 groups. The communities were identified by their registered zip code. To reduce the dimensionality of the problem and directly quantify the effect of geographical location, we first sorted the communities by the fraction of wealthy people inside and divided them into 50 balanced groups. We assigned to every community an 'Index of Community Wealth' (ICW), which is the quantile ranking of the group that the community belongs to.
The correlations between the IVs and the CV are shown in Supplementary Table 3.
1. All IVs' effects are significant (p < 0.001); namely, the fraction of wealthy people is different among different groups of gender, age or communities.
2. Inside most groups of each IV, the variation caused by CI is also significant (p < 0.001).
The only exception is that CI's effect is only significant when the clients are older than 24 years (Supplementary Figure 6b). This result indicates that the effect of network metrics, in most cases, is independent from the other known factors. is also similar under different thresholds. Therefore, we focus our results on a given quantile threshold Q = 0.75 for the remainder of the study. Although the violation of homogeneity in 3 prevents us from making a direct comparison between variables, these results imply that CI significantly and independently affects the fraction of the wealthy population.

Supplementary Note 6 - Correlation between network metrics and financial status
To relate the social metrics to the economic status of individuals, we need to identify the metric that best describes the effect of network location and influence. We aggregate all the age groups and consider the effect of each network metric in turn.
We use the aggregated model instead of direct correlations at the individual level because regression models at the individual level rest on certain assumptions that are not satisfied by our data. Thus, we were unable to apply regression models at the individual level, and instead provide data at an aggregated level. The failure of regression models at the individual level is due to two reasons: 1. The distribution of credit limit (CL) for a given level of ANC [which is a log-normal-like distribution with several peaks located at integers such as 50,000 or 100,000 (Supplementary
Thus, we adjust our statistical model to reflect the complexity of economic effects from network metrics and aggregate the data as follows: First, we separate the individuals into groups on sampling grids in a variable space (in 1D as segment bins and in 2D as grids). In each group (with more than 10 people for statistical significance), we count the fraction of wealthy individuals, defined as those individuals in the top quartile (Q > 0.75) or who have a total credit limit greater than the equivalent of USD 4,000. The dependence of our results on different wealth thresholds is provided in Supplementary Note 5.
Besides the degree, the volume of communication may have correlations with economic status since we could not eliminate the systematic bias caused by phone call service fees.
We investigate the correlation between the fraction of wealthy people and the average communication load per link, $AVL_i = W_i / k_i$, where $W_i$ is the volume of communication events and $k_i$ is the degree of node $i$. The regression in Supplementary Figure 9 shows no significant correlation between the average communication volume per link and the fraction of wealthy individuals. Therefore, the effect of communication volume is negligible in comparison with the other variables considered in this study.
Supplementary Figure 8 shows the results. The large fluctuation in degree for higher quantiles in Supplementary Figure 8a implies that the effect of degree involves complex social patterns rather than only the local properties of the degree of the node. Thus, we do not use degree as an indicator in further study. The k-shell index shows a positive correlation with the fraction of wealthy individuals. However, due to the limited number of values of the k-core, it cannot provide finer resolution for prediction (Supplementary Figure 8b). Therefore, k-shell is also not considered for further studies as an indicator. PageRank (Supplementary Figure 8c) shows a slightly negative correlation, which suggests that it is not the optimal variable to rank economic status, and thus it is not considered herein. We notice a non-monotonic oscillatory behavior of the fraction of wealthy people when using k and CI as variates (Supplementary Figures 8a and 8d). This effect is complex and cannot be captured by either the degree or CI, and may not be limited to local properties.
The oscillation is reduced when using CI in the analysis, and this is one of our reasons for choosing CI as a potential predictor. We will continue investigating the non-monotonic pattern in future work.

Supplementary Note 7 - Modularity and Diversity ratio
Additional research on modularity was implemented as follows. Personal structural hole [10] effects were evaluated by the ratio of the total weights attached to nodes outside a community, $k_{out}$, to those inside a community, $k_{in}$. A fast community detection algorithm introduced by Blondel et al. [11] was implemented in this work. The algorithm aims to maximize the modularity function [11,12]: $Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$, where $A_{ij}$ is the weight of the link between $i$ and $j$, $k_i = \sum_j A_{ij}$, $m = \frac{1}{2} \sum_{i,j} A_{ij}$, and $c_i$ is the community of node $i$. After we label the network with its communities, we can evaluate an individual's structural hole effect [10] by introducing the diversity ratio DR. DR is defined as the ratio of total communication events with people outside one's own community, $W_{out}$, to those with people inside the community, $W_{in}$; namely, $DR = W_{out}/W_{in}$. The ratio is weakly correlated with CI (r = 0.4). The same composite-ranking statistic was applied as for CI, with the same number of segments and composite factor α = 0.5 as in the text. The result (Fig. 3d) shows that the structural hole effect also has a strong correlation with the distribution of affluent individuals while it is weakly dependent on CI. This result confirms the importance of the ability to communicate with outside communities via "weak ties" for personal economic development [13].
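Once community labels are available, the diversity ratio $DR = W_{out}/W_{in}$ is a simple per-node tally. The sketch below assumes the labels have already been produced by a community detection step (not shown), represents communication as undirected weighted edges, and returns infinity for a node with no within-community events; all three choices, the data and the function name are assumptions for illustration.

```python
import math

def diversity_ratio(edges, community):
    """edges: iterable of (i, j, w), w = communication events between i and j.
    community: {node: community label}. Returns {node: W_out / W_in}."""
    w_in, w_out = {}, {}
    for i, j, w in edges:
        bucket = w_in if community[i] == community[j] else w_out
        for u in (i, j):
            bucket[u] = bucket.get(u, 0) + w
    nodes = set(w_in) | set(w_out)
    return {
        # ASSUMPTION: DR is reported as infinity when W_in is zero.
        u: w_out.get(u, 0) / w_in[u] if w_in.get(u) else math.inf
        for u in nodes
    }

# Toy data: a and b share a community, c belongs to another one.
community = {"a": 0, "b": 0, "c": 1}
edges = [("a", "b", 10), ("a", "c", 5)]
dr = diversity_ratio(edges, community)
print(dr)
```

Here node a splits its communication between the two communities, node b communicates only within its own community, and node c only across the boundary.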

Supplementary Note 8 - Marketing Campaign
In the marketing campaign, clients were approached by SMS messages offering a benefit.
In the text we sent during the campaign, we did not mention a specific product. Instead, the only information we provided was to notify the client that he/she was eligible for an offer from the bank. This helped to eliminate the bias caused by the nature of a product, which may have a different appeal to wealthy or poor people. We sent the following messages:
1. Request your credit card with benefits from (Bank name) by calling at (Bank phone number). Fees and requirements at (Bank url).
2. (Bank name) has a special offer for you. If you're interested call at (Bank phone number). Fees and requirements at (Bank url).
3. (Bank name) has a credit card fit for you. Request it by calling at (Bank phone number). Fees and requirements at (Bank url).
4. (Bank name) has a credit card with benefits. Request it at (Bank phone number). Fees and requirements at (Bank url).