Mobile phone calling is one of the most widely used communication methods in modern society. The records of calls among mobile phone users provide us a valuable proxy for the understanding of human communication patterns embedded in social networks. Mobile phone users call each other forming a directed calling network. If only reciprocal calls are considered, we obtain an undirected mutual calling network. The preferential communication behavior between two connected users can be statistically tested and it results in two Bonferroni networks with statistically validated edges. We perform a comparative analysis of the statistical properties of these four networks, which are constructed from the calling records of more than nine million individuals in Shanghai over a period of 110 days. We find that these networks share many common structural properties and also exhibit idiosyncratic features when compared with previously studied large mobile calling networks. The empirical findings provide us an intriguing picture of a representative large social network that might shed new lights on the modelling of large social networks.
In the past two decades, the number of mobile phone users in China increased dramatically. There were 47.5 thousand users in 1991. This number increased to 84.5 million in 2000. In October 2013, it was released by the Ministry of Industry and Information Technology of China that there were 1.216 billion mobile phone users. The number of people using mobile phones is certainly less than that number, because it is not uncommon that a person owns two or more mobile phone numbers. Nevertheless, the actual population of mobile phone users is huge. Hence, the records of mobile phone calls provide us great opportunities to study human's mobility patterns, communication dynamics, and social structure.
González et al. studied 16,264,308 displacements between successive mobile phone calls of 100,000 individuals randomly selected from a sample of more than 6 million anonymized mobile phone users over a six-month period in a European country and found that the density function follows a shifted power law with an exponential cutoff1. An analysis of human movements based on the trajectories of 464,670 dollar bills obtained from a bill-tracking system in the United States shows that jumps in human trajectories are distributed as a power law2. In contrast, there is evidence showing that intra-urban human mobility does not follow a power law but an exponential distribution according to mobile phone records3 and taxi trips data4,5,6. It is crucial to point out that, when compared to human's mobility patterns at the aggregate level, individuals' patterns might not be homogeneous but exhibit different features7. In addition, different data from different regions may also give different results8. Intriguingly, human's mobility patterns are largely predictable9,10,11. It is unmistakable to foresee that mobile phone data will play a more significant role along the progress of constructing smart cities.
Understanding the temporal patterns of human's communication dynamics is essential in the tracking and management of information spreading and social contagion. According to the analysis of durations between two consecutive calls1,12,13,14,15 and short-message correspondences16,17,18, it is well documented that the distribution of inter-communication durations has a fat tail and human interactions exhibit non-Poissonian characteristics. The non-Poissonian communication patterns are also observed in other situations such as email communications19,20 and letter correspondences21,22,23.
Mobile phone communication data also provide a useful channel for the study of social structure from the perspective of complex networks24,25,26,27. For instance, one can infer friendship network structure by using mobile phone data, offering an alternative method to the standard self-report survey28. The investigation of temporal motifs in the calling networks unveils the existence of temporal homophily in the sense that similar individuals tend to participate in mutual communications29. The topological properties of a large calling network constructed from European data have been investigated in detail30, which enhanced our understanding of human social networks and shed new light on modelling weighted large-scale social networks.
In this paper, we investigate the statistical properties of four calling networks (directed calling network (DCN), mutual calling network (MCN), statistically validated directed calling network (SVDCN) and statistically validated mutual calling network (SVMCN), see Methods section for the details of network construction) constructed from the detailed call records of more than nine million different mobile phone numbers from a mobile service provider in China. The DCN is a calling network, in which all mobile phone users in our dataset are treated as nodes and a directed link is drawn from a call initiator to a call receiver. The MCN is a bidirectional calling network in which an edge is only drawn between any two mobile users with reciprocal calls. We also extract from DCN and MCN two Bonferroni networks SVDCN and SVMCN in which the links are filtered by a statistical validation test31.
Size distribution of isolated components and the small-world effect
Since we can only access the calling records of one mobile service provider, the constructed networks are fragmented into isolated subnetworks or “components”. The original calling network (DCN) contains 236,738 components and its statistically validated calling network (SVDCN) has 468,138 components. There are 3,456,437 nodes and 16,269,689 edges in the giant component of the DCN (GCDCN) and 1,044,522 nodes and 1,440,366 edges for the giant component of the SVDCN, respectively. In contrast, there are 260,799 components in the mutual calling network (MCN) and 198,323 components in the statistically validated mutual calling network (SVMCN). The giant component of the MCN, denoted GCMCN, has 1,978,680 nodes and 4,677,642 edges, while the giant component of the SVMCN has 526,234 nodes and 765,213 edges. We summarize this information in Table 1.
Panels (a) and (b) of Fig. 1 show the empirical distributions of component size s, which is defined as the number of nodes in a component for the four communication networks. In Fig. 1, the giant components are not included. It is found that the four distributions exhibit an asymptotic power-law decay where the tail exponent α is 2.89 for the DCN, 2.60 for the SVDCN, 2.75 for the MCN, and 2.58 for the SVMCN. The statistical validated networks SVDCN and SVMCN have a shallower slope and therefore a wider distribution of component sizes than the DCN and MCN. This observation is due to the fact that the giant component of each original network has been segmented by removing the edges that are not statistical validated as illustrated in Fig. 1(c). We also find that the component size distributions of the statistically validated networks of GCDCN and GCMCN have power-law tails and both tail exponents are α = 2.54.
We now turn to investigate the local structure of DCN and MCN through their ego networks13. For a randomly chosen source node, its ego network of distance ℓ contains all the nodes whose distance to the source node is not longer than ℓ. An example of ego network extracted from the GCMCN is illustrated in Fig. 1(c).The number of nodes of an ego network of distance ℓ, Ns(ℓ), is plotted as a function of ℓ in Fig. 1(d) and (e) for several random chosen source nodes and their average. It can be seen that the number of nodes increases exponentially when ℓ ≤ 6 and saturates to the size of the whole network with a slower growth rate when ℓ > 6. Hence, the two giant components GCDCN and GCMCN exhibit a small-world effect32.
Since the DCN and the SVDCN are directed, we investigate their in-degree and out-degree distributions as shown in Fig. 2(a). All the four probability distributions can be well fitted by an exponentially truncated power law33: where and for the out-degree distribution of the DCN, and for the in-degree distribution of the DCN, and for the out-degree distribution of the SVDCN, and and for the in-degree distribution of the SVDCN. Figure 2(b) plots the in-degree and out-degree distributions of the giant components of the DCN and the SVDCN, denoted GCDCN and GCSVDCN in the legend. These distributions can also be nicely fitted by the exponentially truncated power law of Eq. (2). The estimated parameters are and for the out-degree distribution of the GCDCN, and for the in-degree distribution of the GCDCN, and for the out-degree distribution of the GCSVDCN, and and for the in-degree distribution of the GCSVDCN.
According to Fig. 2(a) and (b), the corresponding distributions of a network and its giant component are very similar and share quite a few common features. The first feature is that there is no evident difference between the in-degree and out-degree distributions for all the four networks. However, the distribution of an original network exhibits a much heavier tail than its statistically validated network. For instance, the average degree of the two giant components (GCMCN and GCSVMCN) are 〈kGCMCN〉 = 4.73 and 〈kGCSVMCN〉 = 2.91, which means that a mobile phone user on average reciprocally exchanges calls with more than 4 people of whom about 3 people are frequent contacts. However, there are outliers with very large in-degrees and out-degrees that cannot be modelled by the exponentially truncated power law. In addition, there are users characterized by a very large number of out-calls and a small or average number of in-calls. Most of these outliers are not typical mobile phone users but hot lines or “robots”15. After filtering out the edges that do not pass the statistical validation, the number of outliers reduces significantly in the distributions of Bonferroni networks.
In Fig. 2(c) and (d), we present the degree distributions of the MCN, of the SVMCN, and of the two giant components of these two networks (GCMCN and GCSVMCN). These four networks are not directed since the edges stand for reciprocal calls between any two users. These degree distributions can also be well fitted by the exponentially truncated power law of Eq. (2). The estimated parameters are γk = 1.46 and kc = 20.81 for the MCN, γk = 1.20 and kc = 4.27 for the SVMCN, γk = 1.40 and kc = 21.00 for the GCMCN, and γk = 0.40 and kc = 3.35 for the GCSVMCN. For comparison, we note that the degree distribution for the European GCDCN has a shifted power-law form with k0 = 10.9 and γk = 8.425. Most of the features of the MCN networks are similar to those of the DCN networks. An interesting difference is that the right-end tails become much narrower, because the reciprocal calling criterion for the construction of MCN has the ability to filter out most of those abnormal calls associated with hot lines and robots which are often unidirectional.
The mixing patterns of complex networks have significant implications on the structure and function of the underlying complex systems34. Most social networks are reported to be assortative, i.e., people with many contacts are connected to others who also have many contacts. This may lead to a positive degree-degree correlation in the network, suggesting that the degree of a node positively correlates to the average degree of its neighborhood. The average nearest neighbors degree of a node i is defined as , where is the neighbor nodes set of i. In the calculation of knn for the DCN and the SVDCN, we do not consider the direction of the edges. By averaging this value over all nodes in the network for a given degree k, one can calculate the average nearest neighbors degree denoted by 〈knn|k〉. A network is said to be assortatively mixed if 〈knn|k〉 increases with k and disassortatively mixed if it decreases as a function of k.
In Fig. 3(a) and (b), we show the dependence of 〈knn|k〉 as a function of k for the giant components of the four networks. We find that 〈knn|k = 1〉 > 〈knn|k = 2〉 for all curves. This observation was also present in the investigation of a large European dataset30. For k values larger than 2, the 〈knn|k〉 function exhibits an evident increasing trend to reach a maximum. After the maximum, there is a decreasing region for large k. We notice that the overall shape of the two curves for the two MCN networks is qualitatively similar to that observed in the investigation of the European dataset30. A closer scrutiny of the GCDCN curve unveils an approximate plateau for the largest degrees. This can be partly explained by the fact that the nodes with the largest degrees usually correspond to hot lines or robots who receive calls from or call to diverse people. Figure 3(a) shows that mobile phone users with a “reasonable” number of contacts form an assortative network, while users with an abnormally large number of contacts exhibit a disassortative mixing pattern.
We also compute two weighted average nearest neighbors degrees defined as and to measure the degree-degree correlations30,35, where is the number of calls between i and j, is the total number of calls between i and her contacts, is the call duration between i and j, and is the total call duration between i and her contacts. In the calculation of and for the DCN and the SVDCN, we do not consider the direction of the edges. We average these two weighted degrees over all users with the same degree k to get and . We show in Fig. 3(c) and (d) the weighted average nearest neighbour degrees for the four giant components of the four networks. We note that there is no significant difference between the two curves with number and duration weights for each network. The weight-based curves in Fig. 3(c) and (d) exhibit almost the same behaviors as the unweighted results in Fig. 3(a) and (b).
Edge weight distribution
For a calling network, we have defined two kinds of weight for each edge between two users. For the DCN and the SVDCN, the number-based edge weight is the number of calls occurred between user i and user j and the duration-based edge weight is the total time elapsed during users i and i talk to each other through their mobile phones. Following Ref. 30, we focus on the giant components of the four networks. For the GCSVMCN, two connected users talked with each other on average 〈wN〉 ≈ 23.98 times and 〈wD〉 ≈ 2234 seconds (37 minutes). Figure 4 shows the distributions of the giant components of the four networks.
Figure 4(a) shows that the distribution for the GCSVDCN exhibits an obvious kink at wN ≈ 120. It is not clear why there is such a kink. We can use a bi-power-law distribution to fit the data where the two power-law exponents are α1 = 1.79 and α2 = 2.97. In contrast, the distribution for the GCDCN can be fitted by an exponentially truncated power law where and . The two distributions in Fig. 4(b) can also be fitted by Eq. (5), except for the region defined by wN < 10. The estimated parameters are and for the GCMCN and and for the GCSVMCN. We note that, rather than using the maximum likelihood estimation as in Ref. 33, the least-squares regression approach is adopted to fit the curves throughout this work. Indeed the method proposed in Ref. 33 cannot be applied straightforwardly to truncated power-laws.
The distributions of duration-based edge weight wD for the giant component of the four networks are shown in Fig. 4(c) and (d). The distributions for the original network and its corresponding statistically validated network are very similar. There is a maximum in each distribution occurring for a value, which is close to 100 seconds. These distributions cannot be well fitted by the exponentially truncated power law expressed in Eq. (5), nor a power law.
Correlations between edge weights
One would expect that there is a positive correlation between the weights of call number and call duration . In Fig. 5(a) and (b), we illustrate the scatter plots of duration-based weights and number-based weights of a random sample of 5000 edges selected in the giant component of the MCN and the SVMCN. The two weights are strongly correlated as expected. The Pearson's linear correlation coefficient r between and is for the GCMCN and for the GCSVMCN, indicating the existence of a strong positive correlation. The relationship between the two link weights can also be characterized by Spearman's rank correlation coefficient ρ, which is a non-parametric measure of the statistical dependence between two variables. We obtain that for the GCMCN and for the GCSVMCN. Since Spearman's correlation is higher than Pearson's correlation, the correlation has a nonlinear component in spite of the presence of a linear trend in the association between and . The results for the GCDCN and the GCSVDCN are similar.
To analyze in more detail the correlation, we equally partition the interval [min(wN), max(wN)] into 30 intervals by logarithmic binning and assign each link into one of the 30 groups. We obtain the call number weight wD as a function of call duration weight wN by calculating the mean and standard deviation of wN and wD in each group. Specifically, we plot 〈wD/wN〉 as a function of wN for the GCDCN and the GCSVDCN in Fig. 5(c) and for the GCMCN and the GCSVMCN in Fig. 5(d). The average duration of a call is close to 200 seconds for all the networks and it is almost independent of the number of calls. We observe that the statistical validated networks have lower 〈wD/wN〉 values and lower fluctuations. Another interesting feature is that the call duration fluctuates more for small or large number of calls.
Node strength distribution
For each user, we define two node strengths based on the number and duration of calls. The number-based node strength is the total number of calls the user made, while the duration-based node strength is the total duration of calls the user performed. For the directed networks, we can further distinguish incoming and outgoing node strengths.
The distributions of number-based node strength are shown in Fig. 6(a) and (b) for the giant components of the four networks. The distributions for the GCDCN, the GCMCN and the GCSVMCN can be fitted by an exponentially truncated power-law function The fitted curves are shown as dashed red lines in Fig. 6(a) and (b). We estimate that , , and for the GCDCN, and for the GCMCN, and and for the GCSVMCN. For the GCSVDCN, the distribution curves are not smooth and the fitting would be of poor quality.
The distributions of duration-based node strength are shown in Fig. 6(c) and (d) for the giant components of the four networks. These distributions share a very similar shape, which is reminiscent of the inter-call durations at the population level12,14,15. For the directed networks, there is no difference between incoming and outgoing call durations. Figure 6(d) shows that the statistical validation method is able to filter out the nodes with very short or very long mutual call durations.
Correlations between node strength
For nodes, besides the degree-degree correlation, we also study the correlation between node strength. The number-based and duration-based correlation of node strengths are calculated as follows: and . The results for the giant components of the four networks are illustrated in Fig. 7(a)–(d). In the figure, all curves show a very slow increase for small sD and sN values and then a more pronounced increase for large values of s. For small s < sx, independence can be observed, whereas a dependence of the form 〈snn|s〉 ~ sα is observed for large sD and sN values. For strong ties with wD > 2 × 104, which form 1.6% of all edges of the original giant component, the strength of both adjacent nodes depends almost on the weight of this single edge (si = wij = sj). This explains the linear trend in the strength-strength correlation of the original GC network (αD,MCN ≈ 1). In contrast, we find that αN,MCN ≈ 0.5 when sN,MCN > 200, αD,MCN ≈ 1 when sD,MCN > 104, αN,SVMCN ≈ 0.5 when sN,SVMCN > 50, and αD,SVMCN ≈ 0.67 when sD,SV MCN > 200.
Similar to Fig. 5(c) and (d), we calculate and plot 〈sD/sN〉 as a function of sN for the giant components in Fig. 7(e) and (f). It is found that all the curves exhibit a slight decreasing trend both in the mean and in the standard deviation as a function of node strength. In addition, the curves for the statistically validated networks are lower than their original counterparts.
Cross-correlations between node strength, edge weight and node degree
We now turn to the cross-correlations between node strength, edge weight and node degree. Figure 8(a) and (b) plot the dependence of the node strengths on the node degree for the giant components of the four networks. The curves have a power-law dependence: 〈s|k〉 ~ kα. The best fitting power-law exponents are the following: αN,out = αN,in ≈ 1.0 and αD,out = αD,in ≈ 0.85 for the GCDCN, αN,out = αN,in ≈ 1.1 and αD,out = αD,out ≈ 0.95 for the GCSVDCN, αN ≈ 0.95 and αD ≈ 0.86 for the GCMCN, and αN ≈ 1.01 and αD ≈ 1.23 for the GCSVMCN. For the GCMCN, the average call durations of individuals who have high degrees are slightly less than that of individuals with low degrees. These results confirm that the statistical validation procedure filters out communications occurring between users linked by underlying social relationships. We also observe that αN > αD for the GCMCN, suggesting that individuals who talk to a large quantity of users appear to spend less time per callee than those who spend less time on phone.
We present the correlation between strength product sisj and degree product kikj in Fig. 8(c) and (d). Also in this case we observe a clear power-law dependence 〈sisj|kikj〉 ~ 〈kikj〉β. According to Ref. 30, if 〈si〉 = ki〈w〉, one would expect that 〈sisj|kikj〉 = 〈w〉2〈kikj〉. Differently than expected, we obtain that βN ≈ 1.12 and βD ≈ 1 for the GCDCN, βN ≈ 1.35 and βD ≈ 1.48 for the GCSVDCN, βN ≈ 0.9, βD ≈ 0.8 for the GCMCN, and βN ≈ 1.2 and βD ≈ 1 for the GCSVMCN. The discrepancy of β ≠ 1 indicates that there are correlations between node degree and the weights of the edges adjacent to the node.
We also study the correlation between edge weight and node degree product (Fig. 8(e) and (f)) and the correlation between edge weight and node strength product (Fig. 8(g) and (h)). The curve and the curve are very similar for each network, and there are evident difference between the 〈wij|kikj〉 curves of an original network and its statistically validated network. However, the dependence of the 〈wij|kikj〉 curves on the degree product kikj is weak. In contrast, the curves increase rapidly and exhibit roughly power laws: 〈wij|sisj〉 ~ (sisj)δ, where δN ≈ 0.43 and δD ≈ 0.44 for the GC DCN, δN ≈ 0.42 and δD ≈ 0.47 for the GCSVDCN, δN ≈ 0.3 and δD ≈ 0.45 for the GCMCN, and δN ≈ 0.4 and δD ≈ 0.5 for the GCSVMCN.
Clustering coefficient is an important metric of complex networks. It represents the local cohesiveness around a node. The clustering coefficient of node i is defined as Ci = 2ti/[ki(ki − 1)], where ti is the number of triangles of node i with its neighbours. For the directed networks (DCN and SVDCN), we treat edges as undirected. We find that the average clustering coefficients of the giant components of the four networks are 0.11 (DCN), 0.02 (SVDCN), 0.12 (MCN), and 0.11 (SVMCN). The relatively small values of the average clustering coefficients suggest that tree-shaped subgraphs are quite frequent in the local structure of the four networks. Indeed, the clustering coefficient of about 72.5% of the users is zero. It is worth noting that the clustering coefficients of the communication networks of European users are also small30. We also observe that the average clustering coefficient in the SVDCN is smaller than in the DCN network. This observation reflects the fact that the statistical validation approach, while minimizing the presence of links not related to an underlying social relationship, may also remove edges with meaningful social relationships. See the Methods section for a more detailed discussion of this aspect.
Panels (a) and (b) of Fig. 9 show the dependence of 〈C|k〉 on k for the four networks. Surprisingly, we do not observe a power-law decay as observed for the European users30. On the contrary, high-degree users have a relatively high clustering coefficient. This can be partially explained by the fact that one main promotion strategy of the mobile phone service provider is to make contract with institutions with lower communication prices. The users with more contacts are usually “secretaries” and their contacts also call each other frequently. Figure 9(c) and (d) present the dependence of average weighted clustering coefficient 30 on s for the four networks. The increasing trend in these curves is also observed in the European case30.
Topological overlap of two connected nodes
The topological overlap of the neighborhood of two connected nodes i and j is estimated by considering the relative overlap of their common neighbors30, where ki and kj are the degrees of the two nodes and nij is the number of neighbors common to both nodes i and j. Overlap is the fraction of common neighbors that a pair of connected nodes has, which is different from the edges-clustering coefficient reflecting the probability that a pair of connected vertices has a common neighbor36. In the calculation of overlap for the directed networks, we ignore the direction of edges and treat the directed networks as undirected networks.
Fig. 10(a) illustrates the average overlap 〈O|wN〉 as a function of the number-based edge weight wN for the four networks. The two curves for the MCN and the SVMCN are similar, while the curve for the DCN is higher than that for the SVDCN indicating that a significant fraction of common neighbors have been removed by the statistical validation method. In addition, all the curves exhibit an increasing trend and the two blue curves seemingly decrease after wN ≈ 2000. The curve for the MCN is very different from the European case with a bell shape curve with a maximum at wN ≈ 5030. Fig. 10(b) illustrates the average overlap 〈O|wD〉 as a function of duration-based edge weight wD for the four networks. Similar to the European case30, all the average overlap curves 〈O|wD〉 increase up to wD ≈ 2 × 104, whereas after that they decline quickly. In Fig. 10(c) and (d), we show the average overlap 〈O|Pc(wN)〉 and 〈O|Pc(wD)〉 against the cumulative edge weight Pc(wN) and Pc(wD) respectively. Different from the behavior observed in the European case30, all the curves increase. Fig. 10 shows that the statistical validation method does not change much the overlap structure of the mutual calling networks. However, the overlap reduces remarkably after applying the statistical validation method on the edges of the directed calling networks.
We have constructed and investigated four calling networks from a dataset of more than nine million phone users. These networks are the directed calling network, the mutual calling network and their statistically validated networks. The statistical properties of these four calling networks have been investigated in a comparative way. Specifically, we have considered the distributions of the degree, the edge weight, the node strength, the relative overlap of two connected nodes, and their mutual dependence. We found that these networks share many common topological properties and also exhibit idiosyncratic characteristics in both qualitative and quantitative ways. When compared with the results observed for a mutual calling network of an European dataset of mobile phone users30, the results obtained for the Shanghai dataset exhibit some different communication behaviors.
The differences between the two original calling networks (DCN and MCN) and their statistically validated networks are of great interest. We have observed that the size of statistically validated network is significantly smaller than its original network. Also, the Bonferroni networks have thinner degree distributions, indicating fewer highly connected nodes. This finding suggests that a large proportion of edges in high-degree nodes might not be directly associated with an underlying social motivation, which is consistent with the finding that there are hot lines and robots calling a large number of different users and characterized by an ultra low number of incoming calls15. For the original networks the average call durations of high-degree users are slightly less than that of low-degree users, while for the statistical validated networks, we observe the opposite situation that high-degree individuals have larger average call durations than low-degree individuals. Our comparative analysis shows the importance of investigating statistically validated networks because the original networks contain users whose communication patterns do not reflect a social motivation. The calling profile of these users makes it difficult to uncover the true communication behavior of the system.
It is natural that the networks for the Shanghai dataset and the European dataset share many common topological properties. However, we also observed several discrepancies. The differences are of crucial interest as they might point to different mechanisms at play in mobile communication networks (and more generally in social networks) located at different parts of the world. For instance, we observed a different broadness of degree distributions, which might originate from different dynamics in social ties formation and disappearance37. The different behaviors that might explain formation and deletion of social ties indicate the presence of different elementary mechanisms governing social dynamics under different cultures and social norms. However, a detailed investigation of these issues is beyond the scope of this work.
The setting of the statistical validation and its threshold depends on the problem investigated. One can choose to use a more or less conservative threshold (as it is done when one chooses a 0.05 or 0.01 or 0.001 univariate threshold). To investigate the possible impacts of different thresholds, we repeat all the analyses by using a Bonferroni threshold 0.01/NE, where NE is the number of pairs of subscribers that had at least one call over the entire period for the DCN or the number of pairs of subscribers with mutual calls in the MCN. In this way we have two new Bonferroni networks for the DCN and MCN networks obtained with the least restrictive Bonferroni threshold we can set. It is obvious that the new Bonferroni networks have larger sizes. We find that the results are qualitatively the same as the more restrictive Bonferroni threshold. The differences are only quantitative. For instance, the degree distributions become broader simply because there are more nodes with higher degrees.
It might be worth discussing more in detail the implications of the statistical validation. While the statistical validation is useful to filter edges like hot lines and robots, it also removes a consistent fraction of possible edges with meaningful social relationships. We argue that any other filtering methods also suffer a similar shortcoming. For instance the filtering method keeping bidirectional links while removing unidirectional links30 or the method extracting the “multiscale backbone” of the original network in which the edges are statistically validated by identifying which links of a node carry disproportionate fraction of the weights38. For such large social networks, we will never be able to identify all the true social ties but rather any filtering procedure will present false positive links, that is, links present in the filtered network but without a social origin, and false negative links, that is, links that are absent in the filtered network but have a social origin. In this respect, we can say that our statistical validation method minimizes the number of false positive links while does not put constraints on the number of false negative links. For example, a similar approach has been pursued (i) in Ref. 39 to investigate preferential credit links between banks and firms based on their mutual credit relationships or (ii) in Ref. 40 to identify clusters of investors from their real trading activity in a financial market. Further details about the methodology, specifically applied to mobile phone data, can also be found in Ref. 41, where interesting evolution patterns of triadic motifs are discussed.
Our dataset comprises the detailed call records of more than nine million different mobile phone numbers from one mobile operator in Shanghai during two separated periods. One is from 28 June 2010 to 24 July 2010 and the other is from 1 October 2010 to 31 December 2010. Because the records in several hours are missing on October 12, November 6, 21, 27, and December 6, 8, 21, 22, these days are excluded from our sample. The sample has a total of 110 days of call records. Each entry of the records contains the following information, caller number, callee number, call starting time, call length, as well as call status. The caller and callee numbers are encrypted for protecting personal privacy. Call status with a value of 1 means that the call gets through successfully and is terminated normally. When we construct communication networks, only the calls with the call status equaling to 1 are considered.
Construction of networks
There are three mobile operators in mainland China. We only have access to the entire call records used for billing purpose of one operator. We thus focus on the calling networks between mobile phone users that are costumers of the operator. We construct four calling networks as follows.
The directed calling network (DCN) is composed of all users. If user i calls user j, a directed edge is assigned from i to j. There are a total of 4,032,884 nodes and 16,753,635 directed edges in the DCN. The mutual calling network (MCN) contains part of the users. An edge is drawn between user i and user j if and only if there are reciprocal calls between them. All isolate nodes are not included in the MCN. There are totally 2,615,247 nodes and 5,065,397 edges in the MCN. One can see that about 70% edges are not reciprocal in the DCN.
We then perform statistical validation on each edge of the DCN and the MCN as described below. Edges that are consistent with the null hypothesis of random selection of the receiver are removed together with the nodes that become isolated. With our procedure we obtain a statistical validated directed calling network (SVDCN) which has 2,410,757 nodes and 2,453,678 edges and a statistical validated mutual calling network (SVMCN) which has 1,042,751 nodes and 1,099,254 edges. The sizes of the two original networks reduce significantly.
Statistical validation of edges
The statistical validation is performed by comparing the number of calls observed between each pair of caller and receiver with a null hypothesis of random matching between the caller and the receiver. The null hypothesis takes into account the heterogeneity in the number of calls performed by subscribers. The method is a variant of the approach originally proposed in Ref. 31 and used in different systems39,40,41. Here the statistical validation is done by considering the number of calls done by a caller, the number of calls received by a receiver and checking whether or not the number of calls exchanged between them is compatible with the null hypothesis that these calls were made by selecting randomly the receiver. The test is performed as detailed hereafter. The test allows to assign a p-value to each pair of caller and receiver. The p-values are then compared with a statistical threshold set at 1%. However, since the null hypothesis of random pairing is tested for all pairs of subscribers, we have to perform a multiple hypothesis test correction in order to control the number of false positives. In this work, we use the Bonferroni correction which is the most restrictive amongst all possible corrections minimizing the number of false positives. When a link between two subscribers i and j has a p-value less than the Bonferroni threshold we assume that the calls from i to j have a social origin.
The p-value is obtained as follows. Let us denote N as the total number of phone calls of all users in the calling network, Nic the number of calls initiated by individual i and Njr the number of calls received by individual j. Assuming that X = Nicjr is the number of calls initiated by individual i and received by individual j. The probability of observing X co-occurrences is given as follows31,42: where is a binomial coefficient. We can associate a p-value to the observed Nicjr as follows: The Bonferroni correction for the multiple testing hypothesis is pb = 0.01/NT where NT is the number of performed tests. For the DCN, we perform NT = 16,753,635 tests. If the estimated p(Nicjr) is less than pb, we conclude that the calls between user i and user j cannot be explained by a null hypothesis of random calls from i to j performed according to the heterogeneity of the caller and the receiver. When the test does not reject the null hypothesis, the directed edge from i to j is removed.
In the validation of the MCN network, we need to estimate the p-value of the number of calls Njcir initiated by j and received by i in a similar way. For the MCN, we need to conduct NT = 2 × 5,065,397 = 10,130,794 tests. The Bonferroni correction for the multiple hypothesis test is again pb = 0.01/NT. If the estimated p(Nicjr) is less than pb, we can conclude that i preferentially calls j. We also need to estimate the p-value of the number of calls Njcir initiated by j and received by i in a similar way. The edge between i and j is included in the statistically validated network if and only if the two directional links are both validated.
To illustrate how this method works, we present a few quantitative examples with typical values of calls extracted from Fig. 1(c). We consider the DCN, in which pb = 0.01/NT = 1/16,753,635 = 9.87 × 10−10. The root node (square) is linked to one node close to it. For calls the root node made, Nic = 14, Njr = 81, and Nicjr = 14, leading to p(Nicjr) = 6.36 × 10−88. For calls the root node received, Nic = 58, Njr = 81, and Nicjr = 14, resulting in p(Nicjr) = 0. These two directed links between the root node and her unique contact are thus statistically significant. Consider the dashed link connecting a node in the lime green cluster and a node in the gray cluster, located in the southeast of Fig. 1(c). For the direct link from the lime green node to the gray node, Nic = 400, Njr = 824, Nicjr = 2, and p(Nicjr) = 3.41 × 10−6. For the direct link from the lime green node to the gray node, Nic = 289, Njr = 459, Nicjr = 1, and p(Nicjr) = 1.05 × 10−3. In spite of the low p-values, these two directed links are statistically compatible with the null hypothesis of random selection of the receiver when a Bonferroni correction is applied. More information about the distribution of p-values can be found in Ref. 41.
It is worth pointing out that many of the links not statistically validated might also be associated with a social origin. In fact, with our choice of the Bonferroni correction we primarily control the absence of false positives. This is done at the cost of observing an admittedly high level of false negative. The motivation behind our choice is that we aim to detect with our methodology a backbone of social interaction that is not affected by the presence of false positives.
This work was partially supported by the National Natural Science Foundation of China (11205057), the Humanities and Social Sciences Fund of the Ministry of Education of China (09YJCZH040), the Fok Ying Tong Education Foundation (132013), and the Fundamental Research Funds for the Central Universities.