Bayesian method for inferring the impact of geographical distance on intensity of communication

Spatially-embedded networks represent a large class of real-world networks of great scientific and societal interest. For example, transportation networks (such as railways), communication networks (such as Internet routers), and biological networks (such as fungal foraging networks) are all spatially embedded. Both the density of interactions (presence of edges) and intensity of interactions (edge weights) are typically found to decrease as a function of spatial separation of nodes in these networks. Communication and mobility of groups of individuals have also been shown to decline with their spatial separation, and the so-called gravity model postulates that this decline takes the form of a power-law holding at all distances. There is however some evidence that the rate of decline might change as the distance increases beyond a certain value, called a change point, but there have been few statistically principled methods for determining the existence and location of change points or assessing the change in intensity of interactions associated with them. We introduce such a method within the Bayesian paradigm and apply it to anonymized mobile call detail records (CDRs). Our results are potentially useful in settings where understanding social and spatial mixing of people is important, such as in the design of cluster randomized trials for studying interventions for infectious diseases, but we also anticipate the method to be useful for investigating more generally how distance may affect tie strengths in general in spatially embedded networks.

Spatially embedded networks are networks in which each node has been assigned a fixed location in some underlying Euclidean space. Although this description could include embedding of nodes in a covariate space (e.g., representing fitness of nodes), here we focus on geographically embedded networks, i.e., networks that have been embedded in a two-dimensional Euclidean space where the positions of the nodes can be interpreted as geographical locations. Although this interpretation is not necessary for the formulation or use of the method, it applies to our specific application.
With the rise of communication and social network technologies, the role of spatial distance on establishing and maintaining social ties is constantly changing [1][2][3] . Knowing that two individuals communicate with one another using a specific channel or mode of communication makes them more likely to use also another [4][5][6] . For example, people who speak on the phone frequently also interact in person 7 . For researchers studying infectious diseases, such as HIV/AIDS or Malaria, the structure of social interactions in a population can provide valuable insights into how pathogens are transmitted among members of that population [8][9][10] . Another context for which the interplay between social ties and geography is important is in the delivery of healthcare. Patterns of care delivery can be naturally represented as networks, wherein two physicians are connected to one another if they share one or more patients 11 . The clusters of physicians in these networks often do not coincide with institutional boundaries but instead extend across them 12 . The literature on geographic variations literature in healthcare costs and outcomes was launched by Wenberg and Gittelsohn 13 , and has since become the central empirical argument for the inefficiency of the health care system in the Unites States. Because geography places constraints on patient-sharing relationships of physicians, a principled way to assess the impact of distance on intensity of connections in these networks might lead to a more complete examination of the sources of variability in provision of healthcare. Although we do not pursue this application here, the methods we introduce could also be used to address the role of geography also in healthcare delivery.
Because traditional surveys are resource intensive and scale poorly, mobile phone data, or more specifically call detail records (CDRs), have emerged as an alternative for inferring the structure of underlying interpersonal open Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA. * email: degrut@hsph. harvard.edu

Results
Data. We aggregated the dataset in two ways. First, we aggregated the daily call counts over the 3-month period, resulting in a single call count for each distinct pair of users. We distinguish between the caller and the receiver; hence, the count for each call between each pair is directed. Second, we aggregated the data from the level of individuals to the level of counties; the resulting dataset describes communication intensity for calls among the counties. There were records for a total of 2,511,035 users; 359,759 of them resided in the largest county and 136, in the smallest. The number of calls from one county to another ranged from 0 to 266,199 with 21,016,548 calls in total. There were 2,646 distinct zip codes nested within 427 counties. The geographical location of each county was calculated by first identifying the latitude and longitude of each zip code centroid and then taking the mean of the these coordinates over all zip codes that were nested within a given county. For each county we thus obtained the number of resident users; and for each pair of counties, we obtained the spatial distance between them and the number of calls made and received by users in those counties over the 3-month period. As discussed in the section, Computational complexity, we reduce computational burden by selecting a subset of data that arose from 65 counties with the greatest numbers of users; in this subset, the number of calls ranged from 7,879 to 359,759. The corresponding call counts between pairs of counties ranged from 2 to 266,226. Multiple calls between any pair of users were included as one number in the call count. Figure 1 demonstrates the decay in intensity with distance as well as the distribution of number of calls; the log transformed call numbers appear to be roughly normal in distribution.
The distance is calculated at a coarser level (county) rather than at the zipcode level to protect user privacy; call counts between zipcodes might reveal user identity, especially between those for which the number of users and calls is small. We also note that although our analysis is of the locations of calls (not residences of callers), using a larger geographical unit will make these more likely to be the same, and perhaps thereby add to the interpretability of the analyses. We comment on this issue in the discussion.
Gravity model and our extension. Analyses of the data described above is based on the gravity model.
Adapting the notation from 26 , this model can be written as where G ij specifies the communication intensity from source location i to destination location j, K is a constant, m i is the population of the source location i, n j is the population of the destination location j, and d ij is the distance between source i and destination j.
A related article 25 provided an extension to this model: where n i and n j are the number of users in county i and j; d ij is the distance between the two in kilometers; Y ij = g(G ij ) and g(·) is a transformation function, in the gravity model, g(·) = log(·) ; µ is the intercept; θ i represents the location of the change point measured on the logarithmic scale for communication initiated from location i; β 3,i represents the distance effect before change point θ i ; β 4,i specifies the difference of distance effect before and after the change point; and S is the number of locations under consideration. When β 4,i = 0 , the www.nature.com/scientificreports/ difference is 0, i.e. the rate of decay does not change over the observed range. We denote the size of the population at location i as n i and refer to the model with β 4,i as the full model and the model that sets β 4,i to 0 as the reduced model. By definition, is the indicator function. It takes value 0 before the change point θ i and d ij − θ i after the change point. We assume that ǫ ij iid ∼N(0, σ 2 ) . This formulation provides a straightforward way to compare the two nested models with regard to the effect of distance effect; the reduced model has the constraint β 4,i = 0 . In this formulation, model selection only involves variable selection; we perform the latter using LASSO 31 . We also estimate θ i and quantify its uncertainty as described in Methods below. We note that the above formulation assumes that the full and nested models share the same intercept and population size effects-an assumption that might not hold in practice. To address this concern, we consider two distinct settings, case I, which refers to the setting where the assumption holds, and case II, where it does not. For the latter, we extend the model by allowing different intercepts and population size effects for models with and without change points. In Methods, we describe how inference on this model is achieved.

Analysis of call records data
As illustrated by the scatter plot in Fig. 1, the relationship between natural log of call counts and natural log of geographical distances appears to follow a linear relationship both before or after the break point. We also note that Fig. 1 is consistent with our assumptions of continuous calling intensity and normality of natural log of the number of calls. We used the preliminary binary assignments of change points based on BIC in a simple linear regression to assess whether there is variability across counties in intercepts and population size effects. Both models with only main effects (indicator variable of group assignments, log population sizes, log distance-before/ after change point) and those with main effects and interaction terms showed evidence (p value < 0.05) of such variability. Hence we applied the method described below (in the Simulation study section) for the analysis of the cell phone data. The variability in intercepts and population size effects is true both for the general population from all 427 counties and for the user subpopulation we described above.
In the analysis of call records (Figs. 2 and 3), we note that the slopes for source locations in the northeast appear to be less steep; that slopes near the capital city, where the population is dense, are more likely to have change points No such patterns were observed for slopes of other locations, either before or after the change points. Model estimates revealed that locations with no change point tended to be in the north while those with change points were concentrated in the south around the capital area. For diagnosis on convergence, Fig. 4 shows a trend of PSRF 2 approaching 1 very quickly and a PSRF 1 fluctuating below 1.5, which is acceptable.

Discussion
To analyze the decline in communication intensity with geographical distance, we extended the gravity model by allowing for change points in this relationship. We addressed the issue of the existence of change points for each source location and quantified associated uncertainty using a Bayesian model. We also provided estimates of the slopes before and after each change point. We investigated the geographical pattern of the existence of change points and noted differences in these patterns between rural and urban areas.
We apply our method to an anonymized dataset of call detail records, using the number of mobile phone calls in as the measure of communication intensity between a pair of counties. The outcomes are log-transformed counts; the regression model we specify treats the transformed outcomes as continuous-a choice that is most appropriate when the number of calls between two locations is large (Fig. 1). In settings with 0 or very small counts, one could consider alternative models (e.g., negative binomial) or the addition of an arbitrary small positive number to 0, although the latter approach can add bias 32,33 . In this setting, a negative binomial model might be a better fit, though the interpretation of the parameters is less straightforward. Using Bayesian methods in a setting where the data are assumed to be negative binomial distributed requires non-standard approaches even without inclusion of change points into models. Some research has provided useful tools for sequentially updating the parameters using Gibbs sampler by augmenting the posterior distribution with auxiliary parameters [34][35][36] . When the number of counts is large, the negative binomial approach may not be computationally feasible; fitting negative binomial outcomes in Bayesian LASSO needs further investigation. One possible direction is to extend the methods based on the conditional normal distribution 36 by transforming the variance matrix so that normal-distribution based LASSO method can be employed.
Another extension of our method would allow for aggregation of results across different subsamples; currently the number of locations we can analyze is limited by computational capacity. Developing a method to obtain consistent results from different overlapping sets of nodes, perhaps in a meta-analysis framework, would alleviate the computational concerns, but is challenging. Some potentially useful approaches are provided [37][38][39][40] . In particular, the stability selection 41 may be used to assess the properties of the meta-analytic results. An example of the use of LASSO in analyses that combine across subsamples arose from analyses intended to discover adverse drug reactions 42 . Another potentially useful approach is the use of path of partial posteriors 43 . In this approach, the resampling procedure resembles the bootstrap, but with smaller resampling sizes. Because standard bootstrapping of the LASSO estimator of the regression parameter for variance inference is known to yield inconsistent estimates 44,45 , modified bootstrapping must be used 46 . Nonetheless, Bayesian LASSO procedures provide straightforward and valid estimates for standard errors.
The findings from our analysis of mobile phone communication intensity illustrate how such information might be used. For example, should such communication networks prove to be accurate proxies for contact networks, such analyses might help guide the design of cluster randomized trials for infectious disease. Randomized trials ideally enroll participants in a way that minimizes the extent to which the treatment assignment of one subject affects the outcome of another. For interventions in which such interference occurs at the individual but not the cluster level (e.g., through contacts among randomized subjects), cluster randomization can be Scientific RepoRtS | (2020) 10:11775 | https://doi.org/10.1038/s41598-020-68583-1 www.nature.com/scientificreports/ useful 47 . Clusters may be comprised of participants in the same geographical location, institution (e.g. school) or administrative unit (village). Cell phone data could potentially aid in the identification of appropriate clusters by providing information about the probability of interference. When mixing across clusters cannot be eliminated, identification of treatment effects requires modeling of the mixing process 48 . The impact of interference across randomized units on power of a clinical trial to detect effects of an intervention in preventing spread of infectious disease is investigated 49,50 . As geographical distance is likely to affect contact networks, knowing the relationship between communication and distance may be useful not only for identification of clusters, but also for aiding in development of appropriate mixing models.

Methods
To estimate the parameter of interest, θ i , and quantify its uncertainty we employ a Metropolis Hastings algorithm in Bayesian framework. We consider a Metropolis sampling block for θ i and a Bayesian LASSO block dealing with β 4,i . To allow different intercepts and population size effects for models with and without change points, we employ a Reversible Jump Monte Marlo Markov Chain algorithm. To implement it, we chooose (RJMCMC) option in the blasso function in R package monomvn. We use the default non-informative priors for unknown parameters in both simulation and data analysis. This approach allows for statistical inference using Bayesian LASSO. RJMCMC is a general version of the Metropolis-Hastings algorithm 51 , which allows transitions between

Metropolis block and Bayesian LASSO.
Case I: Assuming same intercept and population size effects across all source locations With Bayesian LASSO, the model is specified as which can be written as Y = µ1 + Xβ + ǫ using matrix notation. µ is not included in the Bayesian LASSO penalty term 52 ; 1 is the vector of 1s; X is the model matrix consisting of logarithmic population sizes and distances, and β is the vector of βs.
In general, LASSO 31 solves an unconstrained optimization problem subject to a given bound on the L 1 norm of the parameter vector that is equivalent to where Ỹ = Y − µ1 is the centered outcome vector; p is the number of parameters after excluding the intercept. In the Bayesian setting, solution to Eq. (7) provides the posterior mode estimates when β j has i.i.d. double exponential priors. Conditional double exponential priors are used in the formulation to avoid multiple modes 52 . They can be expressed hierarchically as The entire sampling procedure is available using function blasso in R package monomvn with the option for RJMCMC specified as False. To incorporate a Metropolis block for change point estimation, we alternate between the Metropolis and Bayesian LASSO blocks. Validity of this approach is established by regarding it as two components of a Gibbs sampling algorithm 53 . In summary, conditional on change points, our inferential problem becomes one of a variable selection; conditional on other parameters, change point sampling is a straightforward application of a Metropolis algorithm.
Thus after obtaining the initial values µ (0) , β (0) , θ (0) and σ 2 (0) , we proceed as follows: 1. At iteration t for each source location i, update change point θ (t+1) i using Metropolis algorithm with a normal proposal N(θ (t) i , σ 2 θ ) . The range of θ i is determined empirically from data, i.e., the posterior likelihood of θ i has an indicator function term in the product that is 0 if the proposed θ (t+1) i is out of the observed empirical log-distance range, thereby assuring that any out-of-range proposal will be rejected. 2. For each location i, if there are fewer than 5% of data points on either side of θ (t+1) i for the subset of data, i.e., Y i , we consider it to be on the boundary, specify β (t+1) 4,i = 0 , and remove it from the model in the next estimation step. We denote the number of locations belonging to the boundary sets as b (t+1) . 3. Create the corresponding S(S − 1) × (2 + 2S − b (t+1) ) covariate matrix (intercept column is not included) based on θ (t+1) . Together with the data, β (t) (after β (t+1) 4,i = 0 are removed), σ (t)2 and (t) , input the covariate matrix into the blasso function for h iterations (2 or more). The output intercept is µ (t+1) . From the output we also get β (t+1) ( β (t+1) 4,i = 0 are put back), σ (t+1)2 and (t+1) .

Case II: Allowing different intercepts and population size effects for models with and without change points.
When there is evidence of the presence of change points, we estimate these parameters separately in two different models. In this case, estimates of intercepts and population size effects depend on the set of source locations whose data contribute to the estimation in any given iteration. We denote the mean model as η (t) for iteration t to maintain consistency with the notation we introduced earlier.
As mentioned above, estimation makes use of the Reversible Jump MCMC option in the blasso function. In our setting, different models imply different specification of zeros in β (t) 4 , and are characterized by η (t) , where η (t) RJMCMC is a general version of the Metropolis-Hastings algorithm 51 , which allows transitions between different states or models of different dimensions. A thorough review of RJMCMC with more recent comments can be found in a review article 54 .
Use of RJMCMC yields the following sampling scheme: Y |µ, X, β, σ 2 ∼ N(µ1 + Xβ, σ 2 I), = 0 and remove it from the model in the next estimation step. 2. Conditional on θ (t+1) , create the s(s − 1) × (5 + 2s − b (t+1) ) covariate matrix (intercept column is not included). Data from each source location contribute to their own group's estimation of intercept and population size effects, which depends on η (t) i . All data and parameter values from the previous iteration t (including σ (t)2 and (t) ) are used in the blasso function with RJMCMC for 3 iterations. 3 is the minimum number of iterations to avoid the situation in which zeros in the previous iteration are carried forward. 3. From Step 2 we get the updated β (t+1) , σ (t+1)2 , µ (t+1) and (t+1) . Now update the η (t+1) : η Diagnostics for assessment of convergence. The usual diagnostic framework for Bayesian LASSO [55][56][57] includes trace plots for different chains and calculation of the Potential Scale Reduction Factor (PSRF). Diagnostics for RJMCMC can be developed by extending that framework to include within-model and between-model variations in the parameters.
We make use of Castello and Zimmerman 58 , which defines two PSRFs in the assessment. For a chosen parameter, PSRF 1 is the ratio between total variation V and variation within chains W c ; PSRF 2 is the ratio between variation within models W m and variation within models and chains W m W c . V , W c , W m and W m W c are defined as follows: where θ r cm , θ .. . , θ c. . , θ .m . and θ cm . are the rth appearance of θ in model m chain c, mean θ across all models and chains, mean θ within chain c across all models in that chain, mean θ within model m across all chains, mean θ within chain c and model m, respectively. R cm is number of θ in chain c model m. C and M are the number of chains and distinct models, respectively. We follow the strategy provided by Castello and Zimmerman 58 to assess convergence and, for simplicity, illustrate this approach by considering a scalar. We choose σ 2 , the variance of the error terms, for this illustration, as its interpretation remains the same across the models. Each chain is divided into batches of equal length. A sequence of PSRF 1 and PSRF 2 is calculated for each batch. A desirable result is that the two quantities move toward 1 as the iteration proceeds. In the simulation study below, we illustrate the use of diagnostic graphs for evaluating convergence; further details on this subject can be found in Brooks and Giudici 59 . interpretation. Under the assumption that intercept and population size effects are identical across source locations, we obtain a sample of β 4,i as well as its 95% credible interval rather than an estimate of the probability that each source location has a change point. Intervals that do not cover 0 imply the presence of a change point by providing evidence against the null hypothesis that the difference of the two slopes is zero. Approaches that allow variability in intercepts and population size effects yield a sample of models and their corresponding parameter estimates. For prediction, we make use of the models that RJMCMC has sampled in the estimation process; the estimated mean for predicted outcomes is a weighted average of the predicted outcomes of all models. computational complexity. Because of the computational burden of these methods, we consider an analysis of a subset of data. Simulation studies (Fig. 6 in Appendix) show that computation time for the Bayesian LASSO function blasso increases sharply as the number of locations increases. We note that the size of the covariate matrix increases at O(S 3 ) where S specifies the number of locations. It has been showed that for the least angle regression formulation of the problem, the computational complexity is O(m 3 + m 2 n) 60 , where m is the number of features and n is the number of the outcomes. In our setting, the situation is even more challenging in that the number of outcomes grows quadratically with S, which renders the overall computational complexity to be O(S 4 ).
Simulation study. We conducted the following simulations to assess the performance of our models compared with naïve approaches as well as to check the effect of the tuning parameter σ 2 θ . The values of the parameters in the data generation process were selected to be the estimates from the preliminary data analysis using σ 2 θ = 0.03 . The observed geographical distances between counties were used. We assessed the performance of the gravity model, the naïve fit based on BIC and grid search, and the Bayesian LASSO model on scenarios with The diagnostic graphs in Appendix show that convergence was generally achieved. We assessed the model fit and the effect of the tuning parameter based on the prediction error (PE), which is defined as follows: where L is the model, M is the number of data points, y new is the observed outcome in the test dataset, y new is the fitted value using model estimated on the old dataset.
One hundred new datasets were generated using the same covariates and parameters for each variance category. The findings are shown in Table 1.
As expected, estimates based both on BIC and Bayesian LASSO performed better than those of the gravity model with respect to prediction error in low, medium, and high error variances. The choice of tuning parameter had little effect; use of 0.2 in data analysis appears reasonable as this choice leads to a mean acceptance rate for the Metropolis algorithm on change points in the range of 20-25% 57 , as shown in Table 2. The 95% credible interval coverages for change points, as shown in Fig. 5 and Table 3, also reached high values at tuning parameter 0.2. The crude model based on BIC and Bayesian LASSO estimates are comparable. This is demonstrated in Fig. 5, which shows the crude estimates and Bayesian LASSO estimates to be similar. An advantage of the latter however is its ability to provide interval estimates on the change points and its smaller number of required parameters; Fig. 5 provides the 95% credible interval. These results imply that predictive power was not reduced because of the estimation of location of change points. Bayesian LASSO does require greater computation time: Computation time for 15,000 iterations takes around 9-10 h, whereas the BIC approach requires only a few minutes. For further information about runtime from simulation studies, see Fig. 6.  www.nature.com/scientificreports/    www.nature.com/scientificreports/

Data availability
The data that support the findings of this study is available from Telenor and was obtained for this research by Dr. Onnela. Restrictions apply to the availability of these data, and so are not publicly available.

Appendix: Discussion of model choices
In addition to the gravity model, other models to study the impact of spatial distance on communication intensity, such as the radiation model 25 , have been proposed. This model predicts commuting flux between locations, and the rank-based friendship model 30 , which ranks friendships based on the geographical distance between them. Both models reduce to Eq. (4) with certain constraints on their parameters or under certain assumptions. The radiation model 25 uses the following specification where T ij is the average commuting or mobility flux from location i to j (for simplicity, we denote average flux as T ij to keep the notation consistent), T i = j� =i T ij is the total number of commuters from i, and s ij is the population living in the circle centered at the source with a radius of r ij (not including m i ). Adopting this notation, Taking the logarithm yields, In radiation model 25 , we note that Eq. (13) reduces to Eq. (4) with α + β = 1 and γ = 4 when the population is uniformly distributed such that m = n and s ij ≈ m i r 2 ij . The model is mechanistic and has no parameter to fit. The rank-based friendship model 30 is formulated as follows. Let u and v be two individuals. Then define rank u (v) = |{w : d(u, w) d(u, v)}| , where d(u, w) is the distance between individual u and individual w. The probability of u and v being friends is modeled as As rank u (v) ≈ d(u, v) 2 when the the population is uniformly distributed, Eq. (14) reduces to Eq. (4) with m = n = 1 and γ = 2.
Both the gravity and radiation models are based on strict assumptions of the underlying mechanism, which are hard to validate. The gravity model, which uses the same parameters for each pair of locations, implicitly assumes a homogeneous effect of distance for the intensity function. The radiation model addresses this issue by modeling the intrinsic heterogeneity of the geographical distribution of population by incorporating s ij in the model. However, subject to its strict assumption and 'parameter-free' property, it allows little room for other factors. The rank-based model deals with the heterogeneity by substituting distance with rank, which seems to have a similar role as the s ij in the radiation model. Thus the rank function in Eq. (14) can be regarded as an implicit function of distance and population distribution. We can make Eq. (14) parametric by incorporating a parameter for the power of the rank. If the population is uniformly distributed across the area, this is equivalent to the gravity model with parameter γ for the distance r ij .
We note here that even though the rank-based approach sheds some lights on the question of interest, to move from the individual level to zip code or county level requires a completely different set of assumptions. Therefore, a rank-based gravity model cannot be seen as a simple extension of the rank-based friendship model.