Quantifying social segregation in large-scale networks

We present a measure of social segregation that combines mobile phone data and income register data in Oslo, Norway. In addition to measuring the extent of social segregation, our study shows that social segregation is strong and robust, and that social networks are particularly clustered among the richest. Using location data on the areas where people work, we also examine whether exposure to other social strata weakens measured segregation. Lastly, we extend our analysis to a large South Asian city and show that our main results hold across two widely different societies.

The expected amount of communication

Some towers are more heavily used than others, among other things because of their location and the number of nearby towers. Hence more communication goes in and out of some towers than others, and some tower links are more intense simply because of the importance of the two towers. To separate this effect from effects due to segregation, we construct a measure of expected communication between two towers, based on the amount of outgoing communication from the sending tower and incoming communication at the receiving tower.
The expected amount of communication is calculated as follows: Let $T$ be the total number of communication events in the whole network, $T_A$ the number of events originating at tower A, and $T_B$ the number of events directed at tower B. Then the probability that an event originates from tower A is $T_A / T$ and the probability that an event is directed at B is $T_B / T$. If the two events are independent, the probability that a particular event goes from A to B is hence $T_A T_B / T^2$, and the expected number of events between the two towers is $T \cdot T_A T_B / T^2 = T_A T_B / T$.

Coefficients are reported with t-values in parentheses, and *, **, and *** denote significance at the 10 percent, 5 percent, and 1 percent levels.
Columns (1) and (2) report results from OLS regressions, whereas Column (3) reports results from a regression that includes tower fixed effects for the sending and receiving towers. In this specification, standard errors are clustered two-way on sending and receiving tower. Columns (4) and (5) report results from negative binomial regressions.

Note. The table shows how communication intensity can be explained by differences in income. In column 1 ("Total events"), the dependent variable is the number of communication events between the two cell phone towers; in column 2 ("Extensive margin"), it is a dummy variable for communication occurring; and in column 3 ("Intensive margin"), it is the number of events conditional on communication occurring. Coefficients are reported with t-values in parentheses, and *, **, and *** denote significance at the 10 percent, 5 percent, and 1 percent levels. A range of control variables is included, such as the geographical distance between the two cell towers (up to a fourth-order polynomial), the income level of the sending and receiving towers, the total tower traffic level, and the expected tower traffic level.
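
As an illustration of the expected tower traffic level used as a control, the independence formula given earlier ($T_A T_B / T$) can be computed directly from tower totals. The following is a minimal sketch and not the authors' code: it assumes events are available as (sending tower, receiving tower) pairs, and the function name `expected_events` and the toy data are hypothetical.

```python
from collections import Counter


def expected_events(events):
    """Expected number of events for each tower pair under independence.

    `events` is a list of (sending_tower, receiving_tower) pairs
    (an assumed input format, chosen for illustration only).
    """
    total = len(events)                               # T
    out_counts = Counter(a for a, _ in events)        # T_A: events originating at A
    in_counts = Counter(b for _, b in events)         # T_B: events directed at B
    # Expected count for the ordered pair (A, B) is T_A * T_B / T.
    # Pairs with A == B are included; drop them if self-links are excluded.
    return {
        (a, b): out_counts[a] * in_counts[b] / total
        for a in out_counts
        for b in in_counts
    }


# Hypothetical toy data: five events among three towers.
events = [("A", "B"), ("A", "B"), ("A", "C"), ("B", "A"), ("C", "B")]
expected = expected_events(events)
observed = Counter(events)
```

Comparing `observed[(a, b)]` with `expected[(a, b)]` shows whether a tower link carries more or less traffic than the size of the two towers alone would predict. By construction, the expected counts sum to the total number of events.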

[Figure: panels "Men" and "Women"]
Note. The figure shows the relationship between social segregation and age. Communication from an individual to a cell phone tower is regressed on the log absolute income difference between the sender and receiver tower. The regression coefficient is modelled as a gender-specific 7-dimensional polynomial in sender age. A lower coefficient indicates more segregation. Grey areas are 95 percent confidence bands.

Note. The table shows how communication intensity can be explained by differences in income, measured as the tower average and as the estimated individual income. The latter is estimated by imputing group averages at the basic unit level based on gender and six age groups, for a total of 12 demographic groups.
Coefficients are reported with t-values in parentheses, and *, **, and *** denote significance at the 10%, 5%, and 1% levels. A range of control variables is included, such as the geographical distance between the two cell towers (up to a fourth-order polynomial), the income level of the sending and receiving towers, the total tower traffic level, and the expected tower traffic level.