A Deep Gravity model for mobility flows generation

The movements of individuals within and among cities influence critical aspects of our society, such as well-being, the spreading of epidemics, and the quality of the environment. When information about mobility flows is not available for a particular region of interest, we must rely on mathematical models to generate them. In this work, we propose Deep Gravity, an effective model to generate flow probabilities that exploits many features (e.g., land use, road network, transport, food, health facilities) extracted from voluntary geographic data, and uses deep neural networks to discover non-linear relationships between those features and mobility flows. Our experiments, conducted on mobility flows in England, Italy, and New York State, show that Deep Gravity achieves a significant increase in performance, especially in densely populated regions of interest, with respect to the classic gravity model and models that do not use deep neural networks or geographic data. Deep Gravity has good generalization capability, generating realistic flows also for geographic areas for which there is no data availability for training. Finally, we show how flows generated by Deep Gravity may be explained in terms of the geographic features and highlight crucial differences among the three considered countries interpreting the model’s prediction with explainable AI techniques.


Introduction
Cities are complex and dynamic ecosystems that define where people live, how they move around, whom they interact with, and how they consume services [1][2][3][4][5] . Most of the world's population now lives in urban areas, whose evolution in structure and size influences crucial aspects of our society, such as objective and subjective well-being 6-11 and the diffusion of innovations 4, 12, 13 .
Traffic congestion, domestic migration, and the spread of infectious diseases are processes in which the presence of mobility flows induces a net change of the spatial distribution of some quantity of interest (e.g., vehicles, population, pathogens). The ability to accurately describe the dynamics of these processes depends on our understanding of the characteristics of the underlying spatial flows and it is crucial to make cities and human settlements inclusive, safe, resilient and sustainable [43][44][45] .
Among all relevant problems in the study of human mobility, the generation of mobility flows, also known as flow generation 14,15,17 , is particularly challenging. In simple terms, this problem consists of generating the flows between a set of locations (e.g., how many people move from one location to another) given their demographic and geographic characteristics (e.g., population and distance), and without any historical information about the real flows.
Flow generation has attracted interest for a long time. Notably, in 1946 George K. Zipf proposed a model to estimate mobility flows, drawing an analogy with Newton's law of universal gravitation 46 . This model, known as the gravity model, is based on the assumption that the number of travelers between two locations (the flow) increases with the locations' populations and decreases with the distance between them 15,47 . Given its ability to generate spatial flows and traffic demand between locations, the gravity model has been used in various contexts such as transport planning 48 , spatial economics 18, 49, 50 , and the modeling of epidemic spreading patterns [51][52][53][54] . Although the gravity model has the clear advantages of being interpretable by design and of requiring few parameters, it suffers from several drawbacks, such as its inability to accurately capture the structure of real flows and the fact that real flows are more variable than the model expects 15,47,55 .
Since the gravity model relies on a restricted set of variables, usually just the population and the distance between locations, flows are generated without considering information that is essential to account for the complexity of the geographical landscape, such as land use, the diversity of points of interest (POIs), and the transportation network [56][57][58][59] . Therefore, we need more detailed input data and more flexible models to generate more realistic mobility flows. The former can be achieved by extracting a rich set of geographical features from data sources freely available online; the latter by using powerful non-linear models like deep artificial neural networks. Deep learning approaches exist for a related but distinct problem, namely flow prediction: they use historical flows between geographic locations to forecast future ones, but they are not able to generate flows between pairs of locations for which historical flows are not available 14,[33][34][35][36][37][60][61][62][63][64][65] .
To what extent deep learning can generate realistic flows without any knowledge of historical ones is barely explored in the literature 14 . Moreover, since deep learning models are not transparent, explainability is crucial to gain a deeper understanding of the patterns underlying mobility flows.
We may achieve this goal using explainable AI techniques [66][67][68][69] , which unveil the most important variables overall as well as explain single flows between locations on the basis of their geographic characteristics.
We design an approach to flow generation that considers a large set of variables extracted from OpenStreetMap 70, 71 , a public and voluntary geographic information system. These variables describe essential aspects of urban areas such as land use, road network features, transportation, food, health, education, and retail facilities. We use these geographical features to train a deep neural network, namely the Deep Gravity model, to estimate mobility flows between census areas in England, Italy, and New York State. We prefer neural networks over other machine learning models because they are the natural extension of the state-of-the-art model for flow generation, i.e., the singly-constrained gravity model 15,47 , which corresponds to a multinomial logistic regression that is formally equivalent to a linear neural network with one softmax layer. Our approach is based on a non-linear variant of the multinomial logistic regression obtained by adding hidden layers, which introduce non-linearities and build more complex representations of the input geographic features.
We find that Deep Gravity outperforms flow generation models that use shallow neural networks or that do not exploit complex geographic features, with a relative improvement in the realism with respect to the classic gravity model of up to 66% (Italy), 246% (England), and 1076% (New York State) in highly populated areas, where flows are harder to predict because of the high number of relevant locations. Deep Gravity also has a good generalization capability, making it applicable to areas that are geographically disjoint from those used for training the model. Finally, we show how to explain Deep Gravity's predictions on the basis of the collected geographic features. This allows us to observe that, while in Italy and New York State the nonlinear relationship between population and distance captured by the model provides the strongest contribution to predict the flow probability, in England the interplay between the various geographic features plays a key role in boosting the model's predictions.

Results
We define a geographical region of interest, R, as the portion of territory for which we are interested in generating the flows. Over the region of interest, a set of geographical polygons called tessellation, T , is defined with the following properties: (1) the tessellation contains a finite number of polygons, l_i, called locations, T = {l_i : i = 1, ..., n}; (2) the locations are non-overlapping, l_i ∩ l_j = ∅, ∀ i ≠ j; (3) the union of all locations completely covers the region of interest, ∪_{i=1}^{n} l_i ⊇ R.

The flow, y(l_i, l_j), between locations l_i and l_j denotes the total number of people moving for any reason from location l_i to location l_j per unit time. As a concrete example, if the region of interest is England and we consider commuting (i.e., home to work) trips between England postcodes, a flow y(SW1W0NY, PO167GZ) may be the total number of people that commute every day between location (postcode) SW1W0NY and location PO167GZ. The total outflow, O_i, from location l_i is the total number of trips per unit time originating from location l_i, i.e., O_i = Σ_j y(l_i, l_j).

Given a tessellation, T , over a region of interest R, and the total outflows from all locations in T , we aim to estimate the flows, y, between any two locations in T . Note that this problem definition does not allow us to use flows within the region of interest as input data. That is, we cannot use a subset of the flows between the locations in the region of interest, nor historical information, to generate other flows in the same region. This means that a model tested to predict flows in region R must have been trained on a different region R', non-overlapping with R.
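As a concrete sketch of this setup, the total outflows O_i can be computed from a table of observed flows. The dictionary-based flow representation and the function name `total_outflows` are illustrative choices, not part of the paper's code:

```python
def total_outflows(flows):
    """Return O_i = sum_j y(l_i, l_j) for every origin location l_i.

    `flows` maps (origin, destination) location ids to trip counts.
    """
    outflows = {}
    for (origin, _destination), trips in flows.items():
        outflows[origin] = outflows.get(origin, 0) + trips
    return outflows

# Toy flows between England postcodes, echoing the example in the text.
flows = {("SW1W0NY", "PO167GZ"): 30,
         ("SW1W0NY", "E14AB"): 70,
         ("PO167GZ", "SW1W0NY"): 10}
O = total_outflows(flows)
# O["SW1W0NY"] == 100, O["PO167GZ"] == 10
```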
The most common metric used to evaluate the performance of flow generation models is the Sørensen-Dice index, also called Common Part of Commuters (CPC) 14,15,47 , a well-established measure of the similarity between real flows, y^r, and generated flows, y^g:

CPC = 2 Σ_{i,j} min( y^r(l_i, l_j), y^g(l_i, l_j) ) / ( Σ_{i,j} y^r(l_i, l_j) + Σ_{i,j} y^g(l_i, l_j) ).

CPC lies in the closed interval [0, 1], with 1 indicating a perfect match between the generated flows and the ground truth and 0 indicating no overlap. Note that when the generated total outflow is equal to the real total outflow, as for all the models we consider in this paper, CPC is equivalent to the accuracy, i.e., the fraction of trips' destinations correctly predicted by the model. In fact, in that case the denominator becomes 2 Σ_{i,j} y^r(l_i, l_j) and the CPC measures the fraction of all trips that were assigned to the correct destination, i.e., the fraction of correct predictions, or accuracy.
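The CPC above can be implemented in a few lines; the dictionary-based flow representation and the function name `cpc` are our own illustrative choices:

```python
def cpc(real, generated):
    """Common Part of Commuters (Sørensen-Dice index) between two flow dicts
    mapping (origin, destination) pairs to trip counts."""
    pairs = set(real) | set(generated)
    common = sum(min(real.get(p, 0), generated.get(p, 0)) for p in pairs)
    total = sum(real.values()) + sum(generated.values())
    return 2 * common / total

y_r = {("a", "b"): 10, ("a", "c"): 0}
y_g = {("a", "b"): 5, ("a", "c"): 5}
# common part = min(10,5) + min(0,5) = 5; cpc = 2*5 / (10+10) = 0.5
```

Identical flow matrices give CPC = 1, fully disjoint ones give CPC = 0, matching the bounds stated above.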
For a more comprehensive evaluation, we also use the Pearson correlation coefficient, the Normalized Root Mean Squared Error (NRMSE) and the Jensen-Shannon divergence (JSD), which measure the linear correlation, the error, and the dissimilarity between the distributions of the real and the generated flows, respectively 14 (see Supplementary Note 1 for details).
Derivation of Deep Gravity. Deep Gravity originates from the observation that the state-of-the-art model of flow generation, the gravity model 15,46,47,72 , is equivalent to a shallow linear neural network. Based on this equivalence, we naturally define Deep Gravity by adding non-linearity and hidden layers to the gravity model, as well as considering additional geographical features.
The singly-constrained gravity model 15,47 prescribes that the expected flow, ȳ, between an origin location l_i and a destination location l_j is generated according to the following equation:

ȳ(l_i, l_j) = O_i p_{ij} = O_i [ m_j^{β_1} f(r_{ij}) / Σ_k m_k^{β_1} f(r_{ik}) ],   (2)

where m_j is the resident population of location l_j, p_{ij} is the probability to observe a trip (unit flow) from location l_i to location l_j, β_1 is a parameter and f(r_{ij}) is called the deterrence function. Typically, the deterrence function f(r_{ij}) is either an exponential, f(r) = e^{β_2 r}, or a power-law function, f(r) = r^{β_2}, where β_2 is another parameter. In these two cases, the gravity model can be formulated as a Generalized Linear Model with a multinomial distribution 73 . Thanks to the linearity of the model, the maximum likelihood estimates of parameters β_1 and β_2 in Equation (2) can be found efficiently, for example using Newton's method, by maximizing the model's log-likelihood:

L(β) = Σ_{i,j} y(l_i, l_j) ln p_{ij},  with  p_{ij} = e^{β · x(l_i, l_j)} / Σ_k e^{β · x(l_i, l_k)},   (3)

where y is the matrix of observed flows, β = [β_1, β_2] is the vector of parameters, and the input feature vector is x(l_i, l_j) = concat[x_j, r_{ij}] for the exponential deterrence function (x(l_i, l_j) = concat[x_j, ln r_{ij}] for the power-law deterrence function), with x_j = ln m_j. Note that the negative of the log-likelihood in Equation (3) is proportional to the cross-entropy loss of a shallow neural network with an input of dimension two and a single linear layer followed by a softmax layer.
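A minimal sketch of the singly-constrained gravity model in this softmax form; all names are illustrative, and the max-subtraction is a standard numerical-stability trick rather than something prescribed by the paper:

```python
import math

def gravity_probabilities(origin, destinations, population, distance, beta1, beta2):
    """Singly-constrained gravity model with exponential deterrence:
    p_ij ∝ m_j**beta1 * exp(beta2 * r_ij), i.e., a softmax over the
    linear scores beta1 * ln(m_j) + beta2 * r_ij."""
    scores = [beta1 * math.log(population[j]) + beta2 * distance[(origin, j)]
              for j in destinations]
    mx = max(scores)                         # subtract max for numerical stability
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    return {j: e / total for j, e in zip(destinations, exps)}

population = {"b": 100, "c": 100}
distance = {("a", "b"): 1.0, ("a", "c"): 5.0}
p = gravity_probabilities("a", ["b", "c"], population, distance, beta1=1.0, beta2=-0.5)
# with beta2 < 0, the closer destination "b" gets the higher probability
```

The expected flow of Equation (2) is then O_i multiplied by each p_ij.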
This equivalence suggests to interpret the flow generation problem as a classification problem, where each observation (trip or unit flow from an origin location) should be assigned to the correct class (the actual location of destination) chosen among all possible classes (all locations in tessellation T ). In practice, for each possible destination in the tessellation, the model outputs the probability that an individual from a given origin would move to that destination. To compute the average flows from an origin, these probabilities are multiplied by the origin's total outflow.
According to this interpretation, the gravity model is a linear classifier based on two explanatory variables, i.e., population and distance. The interpretation of the flow generation problem as a classification problem allows us to naturally extend the gravity model's shallow neural network introducing hidden layers and non-linearities.
Architecture of Deep Gravity. To generate the flows from a given origin location (e.g., l i ), Deep Gravity uses a number of input features to compute the probability p i,j that any of the n locations in the region of interest (e.g., l j ) is the destination of a trip from l i . Specifically, the model output is a n-dimensional vector of probabilities p i,j for j = 1, ..., n. These probabilities are computed in three steps (see Figure 1b).
First, the input vectors x(l_i, l_j) = concat[x_i, x_j, r_{i,j}] for j = 1, ..., n are obtained by concatenating the following input features: x_i, the feature vector of the origin location l_i; x_j, the feature vector of the destination location l_j; and the distance between origin and destination, r_{i,j}. For each origin location (e.g., l_i), n input vectors x(l_i, l_j) with j = 1, ..., n are created, one for each location in the region of interest that could be a potential destination.
Second, the input vectors x(l i , l j ) are fed in parallel to the same feed-forward neural network.
The network has 15 hidden layers of dimensions 256 (the bottom six layers) and 128 (the other layers) with a LeakyReLU 74 activation function, a. Specifically, the output of hidden layer h is given by the vector z^{(0)}(l_i, l_j) = a(W^{(0)} · x(l_i, l_j)) for the first layer (h = 0) and z^{(h)}(l_i, l_j) = a(W^{(h)} · z^{(h−1)}(l_i, l_j)) for h > 0, where the W^{(h)} are matrices whose entries are parameters learned during training.
The output of the last layer is a scalar s(l_i, l_j) ∈ (−∞, +∞) called the score: the higher the score for a pair of locations (l_i, l_j), the higher the probability to observe a trip from l_i to l_j according to the model. Finally, scores are transformed into probabilities using a softmax function, p_{i,j} = e^{s(l_i, l_j)} / Σ_k e^{s(l_i, l_k)}, which maps all scores into positive numbers that sum up to one. The generated flow between two locations is then obtained by multiplying the probability (i.e., the model's output) by the origin's total outflow.
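The three steps above can be sketched as a plain forward pass. The layer sizes and random weights below are illustrative stand-ins, much smaller than the paper's 15-layer network, and the function names are our own:

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    """LeakyReLU activation: identity for positive inputs, small slope otherwise."""
    return np.where(x > 0, x, slope * x)

def deep_gravity_scores(X, weights):
    """Forward pass sketch: X has one row per candidate destination (the
    concatenated vectors x(l_i, l_j)); each hidden layer applies LeakyReLU,
    and the last matrix maps to a scalar score per destination."""
    z = X
    for W in weights[:-1]:
        z = leaky_relu(z @ W)
    return (z @ weights[-1]).ravel()

def destination_probabilities(scores):
    """Softmax over the scores of all candidate destinations."""
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 39))                      # 5 candidate destinations, 39 features each
weights = [0.05 * rng.normal(size=(39, 256)),     # small random weights for the sketch
           0.05 * rng.normal(size=(256, 128)),
           0.05 * rng.normal(size=(128, 1))]
p = destination_probabilities(deep_gravity_scores(X, weights))
# the generated flows from the origin are p multiplied by its total outflow O_i
```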
The location feature vector x i provides a spatial representation of an area, and it contains features describing some properties of location l i , e.g., the total length of residential roads or the number of restaurants therein. Its dimension, d, is equal to the total number of features considered.
The location features we use include the population size of each location and geographical features extracted from OpenStreetMap 70,71 belonging to the following categories:
• Land use areas (5 features): total area (in km^2) for each possible land use class, i.e., residential, commercial, industrial, retail and natural;
• Road network (3 features): total length (in km) for each different type of road, i.e., residential, main and other;
• Transport facilities (2 features): total count of Points Of Interest (POIs) and buildings related to transport facilities, e.g., bus/train station, bus stop, car parking;
• Food facilities (2 features): total count of POIs and buildings related to food facilities, e.g., bar, cafe, restaurant;
• Health facilities (2 features): total count of POIs and buildings related to health facilities, e.g., clinic, hospital, pharmacy;
• Education facilities (2 features): total count of POIs and buildings related to education facilities, e.g., school, college, kindergarten;
• Retail facilities (2 features): total count of POIs and buildings related to retail facilities, e.g., supermarket, department store, mall.
In addition, we include as a feature of Deep Gravity the geographic distance, r_{i,j}, between two locations l_i and l_j, defined as the distance measured along the surface of the earth between the centroids of the two polygons representing the locations. All feature values for a given location (excluding distance) are normalized by dividing them by the location's area.
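The surface distance between two centroids can be approximated with the standard haversine formula; this is a common choice for great-circle distance, though the paper does not specify its exact implementation:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance (in km) between two points on the earth's
    surface, as a sketch of the centroid-to-centroid distance r_ij."""
    R = 6371.0  # mean earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * R * math.asin(math.sqrt(a))
```

For instance, one degree of longitude at the equator corresponds to roughly 111 km.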
Each flow in Deep Gravity is hence described by 39 features (18 geographic features of the origin and 18 of the destination, distance between origin and destination, and their populations).
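A hypothetical sketch of how such a 39-dimensional input vector might be assembled, assuming each location contributes 19 area-normalized values (its 18 geographic features plus population); the function name and interface are illustrative:

```python
def flow_features(origin_feats, dest_feats, origin_area, dest_area, distance_km):
    """Build the input vector x(l_i, l_j): origin and destination feature
    lists normalized by location area (as described in the text), plus the
    unnormalized origin-destination distance appended at the end."""
    norm_origin = [f / origin_area for f in origin_feats]
    norm_dest = [f / dest_area for f in dest_feats]
    return norm_origin + norm_dest + [distance_km]

origin = list(range(19))   # 18 geographic features + population (toy values)
dest = list(range(19))
x = flow_features(origin, dest, origin_area=2.0, dest_area=4.0, distance_km=10.0)
# len(x) == 19 + 19 + 1 == 39
```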
We also consider a light version of Deep Gravity in which we just count a location's total number of POIs without distinguishing among the categories (5 features per flow in total), and a heavy version of it in which we include the average of the geographic features of the k nearest locations to a flow's origin and destination (e.g., 77 features per flow in total for k = 2). The performance of these two models is comparable to, or worse than, the performance of Deep Gravity (see Supplementary Note 2 and Supplementary Figures 4 and 5).
The loss function of Deep Gravity is the cross-entropy:

H = − Σ_i Σ_j [ y(l_i, l_j) / O_i ] ln p_{i,j},

where y(l_i, l_j)/O_i is the fraction of observed flows from l_i that go to l_j and p_{i,j} is the model's probability of a unit flow from l_i to l_j. Note that the sum over i of the cross-entropies of different origin locations follows from the assumption that flows from different locations are independent events, which allows us to apply the additive property of the cross-entropy for independent random variables. The network is trained for 20 epochs with the RMSprop optimizer with momentum 0.9 and learning rate 5 · 10^{−6}, using batches of 64 origin locations.

Our experiments aim to assess the effectiveness of the models in generating mobility flows within the regions of interest belonging to the test set. Given the formal similarity between Deep Gravity (DG) and the gravity model (G), we use the latter as a baseline to assess Deep Gravity's improved predictive performance. Indeed, the gravity model is the state-of-the-art model for flow generation and is thus preferred to a null model in which flows are evenly distributed at random across the edges of the mobility network 15,46,47 . Additionally, we define two hybrid models to understand the performance gain obtained by adding either multiple non-linear hidden layers or complex geographical features to the gravity model:
• the Nonlinear Gravity model (NG) uses a feed-forward neural network with the same structure as Deep Gravity, but, like the gravity model, its input features are only population and distance;
• the Multi-Feature Gravity model (MFG) has the same multiple input features as Deep Gravity, including the various geographical variables extracted from OpenStreetMap, but, like the gravity model, these features are processed by a single-layer linear neural network.

Table 1 compares the performance of the models.
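The cross-entropy loss above can be written directly from the formula; the dictionary-based interface is an illustrative choice:

```python
import math

def cross_entropy(flows, outflows, probs):
    """Cross-entropy H = -sum_i sum_j (y(l_i, l_j) / O_i) * ln p_ij,
    given observed flows, total outflows per origin, and the model's
    probabilities for each (origin, destination) pair."""
    return -sum(trips / outflows[i] * math.log(probs[(i, j)])
                for (i, j), trips in flows.items())

flows = {("a", "b"): 50, ("a", "c"): 50}
probs = {("a", "b"): 0.5, ("a", "c"): 0.5}
H = cross_entropy(flows, {"a": 100}, probs)
# with a uniform split over two destinations, H = ln 2 ≈ 0.693
```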
For England, DG has CPC = 0.32, an improvement of 39% over MFG (CPC = 0.23), 166% over NG (CPC = 0.12), and 190% over G (CPC = 0.11) (see Table 1). We obtain similar results for Italy and New York State (Table 1): DG performs significantly better than the other models, with an improvement in terms of global CPC over G of 66% (Italy) and 1076% (New York State). The improvement of DG over G is again spread across all areas of the two countries (Figure 4d-i). The differences in the performance of DG among countries may be due to several factors, such as differences in the shapes and sizes of the spatial units, the sparsity of flows, and the mobility data sources.
To investigate the performance of the model in high and low populated regions, we split each country's regions of interest into ten equal-sized groups, i.e., deciles, based on their population, where decile 1 includes the regions of interest with the smallest population and decile 10 those with the largest population, and we analyze the performance of the four models in each decile (Figure 3 and Table 1). In England and Italy, all models degrade (i.e., CPC decreases) as the population decile increases, denoting that they are more accurate in sparsely populated regions of interest (Figure 3a, c). This is not the case for New York State, where the models' performance increases slightly as the population decile increases (Figure 3e).
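The decile split described above can be sketched with simple integer arithmetic over population-ranked regions; the data and function name are illustrative:

```python
def population_deciles(populations):
    """Assign each region id to a decile 1..10 by population: decile 1 is
    the least populated tenth, decile 10 the most populated (sketch; ties
    are broken by sort order)."""
    ranked = sorted(populations, key=populations.get)
    n = len(ranked)
    return {region: (idx * 10) // n + 1 for idx, region in enumerate(ranked)}

pops = {f"r{i}": i * 100 for i in range(1, 21)}   # 20 toy regions of interest
dec = population_deciles(pops)
# r1 and r2 fall in decile 1; r19 and r20 in decile 10
```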
Nevertheless, in all three countries, the relative improvement of DG with respect to G increases as the population increases ( Figure 3b, d, f). In other words, the performance of DG degrades less as population increases. This is a remarkable outcome because in highly populated regions of interest there are many relevant locations, and hence predicting the correct destinations of trips is harder.
DG improves especially where current models are unrealistic.
The introduction of the geographic features (MFG) and of non-linearity and hidden layers (DG) leads to a significant improvement of the overall performance. In England, the relative improvement of MFG and DG with respect to G is significant, with values of about 139% and 246%, respectively, in the last decile of population (see Figure 3a,b and Table 1). Even in the first decile of the population, we find a relative improvement of DG with respect to G of 3%. Similarly, in the other two countries we have an improvement of MFG and DG over G of 14% and 66% (Italy) and of 351% and 1076% (New York State), respectively. Note that DG's improvement on G is consistent, as DG improves on G in all the regions of interest of all countries (see Figure 3). We find that the performances of NG and MFG are country specific. In England, MFG outperforms NG, as opposed to Italy and New York State, where NG outperforms MFG (Figure 3).
Despite these country-specific differences, we observe a clear pattern that is valid for all countries: G always has the worst performance and DG always the best, and the performance of DG originates from the interplay between a richer set of geographic features (present in MFG but not in NG) and the model's nonlinearity (present in NG but not in MFG).
The performance of all models does not change significantly if we use regions of interest of size 10km (instead of 25km). In particular, all models have a CPC 25km around 0.03 higher than CPC 10km (see Supplementary Figure 3, Supplementary Table 2, and Supplementary Note 3).
However, the relative improvement of DG over G on the last decile is slightly smaller with a region of interest size of 10km (Supplementary Table 2): for example, in England it is about 220%, i.e., about 26% less than the improvement on the same decile for a region of interest size of 25km.
Geographic transferability. Neural networks trained on spatial data may suffer from low generalization capabilities when applied to different geographical regions than the ones used for training.
In the previous experimental settings, it may happen that, for a large city covered by multiple regions of interest (e.g., London, Manchester), some of its locations are used during the training phase, hence leading to a good performance when the model is applied to test locations of the same city. To investigate the model's generalization capability, we design specific training and testing datasets so that a city is never seen during the training phase. This setting allows us to discover whether we can generate flows for a city where no flows have been used to train the model, a peculiarity that we cannot fully investigate if the model partially sees a city (e.g., uses some of the city's flows during the training phase).
Given the nine major England cities, i.e., the so-called Core Cities 76 and London, the training dataset contains the locations and the information of eight cities and the test set contains information on the city excluded from the training. In particular, we select 15 regions of interest corresponding to London, eight to Leeds, seven to Sheffield, five to Birmingham, four each to Bristol, Liverpool, Manchester and Newcastle, and three to Nottingham. In this way, we can test whether DG is able to generalize by analyzing its performance with a leave-one-city-out validation mechanism, i.e., generating flows for a city whose regions of interest never appear in the training set.
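The leave-one-city-out protocol can be sketched generically; `train_fn` and `eval_fn` below are hypothetical stand-ins for model training and CPC evaluation:

```python
def leave_one_city_out(cities, train_fn, eval_fn):
    """Leave-one-city-out validation sketch: for each city, train on the
    regions of interest of all other cities and evaluate on the held-out
    city's regions. `cities` maps city names to lists of region ids."""
    results = {}
    for held_out in cities:
        train_regions = [r for c in cities if c != held_out for r in cities[c]]
        model = train_fn(train_regions)
        results[held_out] = eval_fn(model, cities[held_out])
    return results

# Toy check that held-out regions never appear in the training set:
cities = {"London": ["L1", "L2"], "Leeds": ["E1"], "Bristol": ["B1"]}
train_fn = lambda regions: set(regions)                 # "model" = its training regions
eval_fn = lambda model, test: all(r not in model for r in test)
disjoint = leave_one_city_out(cities, train_fn, eval_fn)
# every held-out city's regions are disjoint from its training set
```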
We denote this implementation with Leave-one-city-out-DG (LDG).
LDG produces average CPCs that are remarkably close to DG's (see Supplementary

Discussion
The comparison of the performance of Deep Gravity with models that do not use non-linearity or do not include the geographic information reveals several key results.
All models are generally more accurate on scarcely populated regions, which have fewer locations, so trips' destinations are easier to predict. More importantly, in highly populated regions, where there are many relevant locations and predicting the correct destinations of trips is harder, the improvement of Deep Gravity with respect to its competitors becomes much higher, suggesting that our model improves especially where current models fail. We observe that Deep Gravity still outperforms all its competitors even when using a smaller region of interest size.
The addition of the geographic features is crucial to boost the realism of our approach.

Data Availability
The commuting data for England are freely available at https://census.ukdataservice.
The flows data in New York State are freely available at github.com/GeoDS/COVID19USFlows and are described in Kang et al. 75 .

Code Availability
The Python code of Deep Gravity is available at github.com/scikit-mobility/DeepGravity.
The version of the code used to make the experiments in this paper is available at 83 .

Acknowledgments
(Displaced supplementary caption fragments: DG-sum has a lower performance than DG regardless of the population decile. MSOAs generally contain 5,000 to 15,000 residents and 2,000 to 6,000 households; in the scatter plots, a higher concentration of points close to the main diagonal indicates a higher correlation and a higher CPC, and since MSOAs are aggregations of OAs, the overall CPC for MSOAs is higher than that for OAs.)
Overall, DG achieves a significant improvement in the realism of the generated flows with respect to both the gravity model and models that do not use non-linearity or geographic information.
Supplementary Note 4: leave-one-city-out. In the leave-one-city-out validation we train DG in turn on all the regions of interest of eight cities and test it on the remaining one. In other words, we test the ability of DG to generate flows for a city that it has "never seen", i.e., a city for which no region of interest is used during the training phase. Supplementary Figure 8