Limits of Predictability in Commuting Flows in the Absence of Data for Calibration

The estimation of commuting flows at different spatial scales is a fundamental problem for different areas of study. Many current methods rely on parameters requiring calibration from empirical trip volumes. Their values are often not generalizable to cases without calibration data. To solve this problem we develop a statistical expression to calculate commuting trips with a quantitative functional form to estimate the model parameter when empirical trip data is not available. We calculate commuting trip volumes at scales from within a city to an entire country, introducing a scaling parameter α to the recently proposed parameter free radiation model. The model requires only widely available population and facility density distributions. The parameter can be interpreted as the influence of the region scale and the degree of heterogeneity in the facility distribution. We explore in detail the scaling limitations of this problem, namely under which conditions the proposed model can be applied without trip data for calibration. On the other hand, when empirical trip data is available, we show that the proposed model's estimation accuracy is as good as other existing models. We validated the model in different regions in the U.S., then successfully applied it in three different countries.


Census data
The LEHD Origin-Destination Employment Statistics (LODES) datasets [4] used by OnTheMap version 6 were reported using 2010 census blocks. Data files are state-based and organized into three types: Origin-Destination (OD), Residence Area Characteristics (RAC), and Workplace Area Characteristics (WAC), all at census block geographic detail. Data is available for most states for the years 2002-2010. The sources of data include:  Unemployment Insurance (UI) Wage Records reported by employers and maintained by each state.  The Office of Personnel Management (OPM) provides information on employees and jobs for most Federal employees.  The Quarterly Census for Employment and Wages (QCEW) provides information on firm structure and establishment location.
What we used in this study is the Origin-Destination (OD) data. The structure of the OD files is in Table S1. We use row one to three in this study.

Bay area cell phone data
The Bay Area cell phone data are collected by a US cell phone operator and contain about half a million customers. Each time a person uses a phone (call/text message/web browsing) the time and the cell phone tower providing the service is recorded. This altogether generates 374 million location records in the three week observational period. A Voronoi tessellation is used to estimate the service area of a cell phone tower. It provides the rough region where a cell phone user can be located by his/her phone usage. Among these half a million users, we select 189,621 most frequent users to study the commuting flows of the Bay Area [16]. For each user, the most frequently connected tower during day time (6am to 6pm) is assigned as the tower of the working location while the most frequently connected tower during night (6 pm to 6 am) is assigned as the home tower location.

Rwanda, Lisbon and Santo Domingo cell phone data
The Rwanda cell phone data are collected by a phone company and contain more than 1 million users. Each time a person calls the time and the cell phone tower providing the service is recorded. There are around 215 million records over a period of 40 days. The entire Rwanda is covered by 196 towers while the capital city Kigali is covered by 47. We select 410,309 most frequent users for this study. The cell phone data from Portugal and Dominican Republic are of similar format. Lisbon has 62,790 frequent users while Santo Domingo has 52,125 frequent users.

K-means clustering of blocks
The 2010 Census LEHD Origin-Destination Employment Statistics (LODES) datasets contain home and work location counts at block level. San Francisco has 7,348 blocks while in transportation planning a city is often divided into a much less number of regions [12,11]. To make the estimation results at different scales comparable, here we adopted k-means clustering [7,3] to divide the study region into 100 locations. The blocks are clustered according to their geographical locations. The procedure is performed in the following way: Randomly pick 100 ( , ) coordinate pairs in the study region to represent the centers of the clusters. They are denoted as , = 1, … ,100. Each block's center location is denoted as a vector , = 1, … , . The goal is to find an assignment of to clusters, as well as a set of vectors { }, such that the sum of the squares of the distances of each data point to its closest vector , is a minimum. Use 1-of-K coding scheme to represent which cluster each data point should belong to. For each data point , we introduce a corresponding set of binary indicator variables ∈ {0,1}, = 1, … , 100, describing which of the 100 clusters the data point is assigned to. If data point is assigned to cluster k then = 1, and =0 ≠ . The objective function, , is to minimize the sum of the squares of the distances of each data point to its assigned vector : Here the distance measure ‖ − ‖ 2 is the distance of the two coordinate pairs on earth. To find the values for and , iteratively perform: 1. Keep fixed, find the values to minimize . This is simply to find the closest to each data point .

Keep
fixed, find the values to minimize . is a quadratic function of . Take the derivative and with respect to and set it to zero shows that = ∑ ∑ Iteratively perform these two steps until converge. Fig. S1 (a,b) shows the comparison of San Francisco's blocks before and after clustering.

IPF procedure for OD expansion
We use the Bay area as an example to show that cell phone data could provide a good commuting OD seed matrix. In part 1 we have deduced home and work locations for each user.
Here a location is a cell phone tower. There are 892 towers in the Bay area while in previous methods we divided San Francisco into 100 locations. In order to match these two different types of divisions we mapped the 892 cell phone towers to the previously defined 100 block clusters to form the 100 × 100 commuting OD matrix for the cell phone users. We should notice that the cell phone users we chose are like a sample from the whole population and the sampling rates in different block clusters may differ. In order to get the commuting OD matrix for the whole population from the cell phone user commuting OD matrix, we need to reweight or perform seed matrix expansion on the cell phone user commuting OD matrix. The iterative proportional fitting method is adopted [5].
Iterative proportional fitting is a procedure for adjusting a table of data cells such that they add up to selected totals. Unadjusted data cells may be referred to as "seed", and the selected totals may be referred to as "marginal". In our two dimensional case, the "seed" is the cell phone user commuting OD matrix denoted as , is the home location while is the work location. We've shown that population and POI are good representations of trip generation and attraction. We use them to represent the "marginals". The column marginal is the trip attraction of each location and the row marginal is the trip generation of each location. are normalized to have the same sum as . The numerical solution is: Some may doubt that the close fit of the expanded Bay Area cell phone user seed matrix to the actual census data is because we used quite accurate marginal (in this case the population density and the density of POIs), so that the seed matrix do not have much influence. We test this assumption by doing the following comparison: compare the travelling distance ( ) distribution of: 1) the census commuting OD data; 2) the cell phone user seed OD matrix without IPF expansion; 3) the IPF expanded cell phone user seed matrix; 4) the IPF expanded random seed matrix. The result is shown in Fig. S2. Among all others, only the IPF expended cell phone user seed matrix gives close fit to the census data. As for the IPF expanded random seed matrix, even though it has accurate marginal, it still deviates from the actual ( ) distribution. In this way the value of both the IPF method and the cell phone user seed matrix are shown.

Comparison of the unconstrained with the doubly constrained gravity model
In this section we compare the estimation results of the unconstrained gravity model and the doubly constrained gravity model on cell phone user commuting OD at city level. Here we use the cell phone records because the data is available at different countries so that we can perform a cross culture comparison. We choose 9 cities from the Bay area, Rwanda, Portugal and Dominican Republic: San Francisco, Oakland, San Jose, San Rafael, Lisbon, Kigali, La Romata, Santo Domingo, and Santiago. For each cell phone user we can estimate his/ her home and work location. Aggregating such results gives us the cell phone users' commuting OD matrix. Use the margins (cell phone user commuting trip production and attraction number for each tower) as inputs for the following models.
The unconstrained gravity model takes the form: is the flow between location and . Each location is a tower. is the number of cell phone users whose home location is tower , is the number of cell phone users whose working location is tower .
is the distance between them and is the distance decay function. and are parameters to be fitted from data. We adopt the power distance decay function: The model turns into:

=
The parameters , , and could be estimated using least square linear regression [6] after a simple transformation: The inputs of the regression model are , , and , the outputs are estimation results of , , and .
The , , and regression results for the 9 cities are in Table S2.
In some other studies [15,1] a similar regression method is applied. The difference is that trips are divided into short and long trips and the parameters are estimated separately. In [14]  The doubly constrained gravity model takes the form: = and are total trip production and attraction volumes at location and . For a study region with locations, there are 2 parameters of and , and one parameter of . Unlike the unconstrained gravity model, though it has 2 + 1 parameters, only one parameter needs to be predetermined. and can be estimated even without knowing by iterating: Let's use a very simple example to illustrate the algorithm.
Suppose there is an area with 4 zones. Their distance matrix and , are in Table S3.
Initially and are all set to 1 and are updated as:  Table S4.
Here we compare the results from: the unconstrained gravity model with parameters estimated in this study, the unconstrained gravity model with parameters estimated in previous study [14], the doubly constrained gravity model with parameters estimated in this study. For each model we compare the model estimation results with the cell phone user commuting OD matrix and calculate the correlation between them. Fig. S3 shows how the correlation changes from city to city and from model to model. In all cities the doubly constrained gravity model outperforms the unconstrained gravity model. It has correlation more than 0.8 in all the cities except in Kigali, the capital city of Rwanda. We've mentioned that the commuting flow in Rwanda is special because it's more agglomerated: a few OD pairs have very large flows and these OD pairs are not necessarily close to each other. This makes it hard for gravity model prediction. The comparison of the doubly constrained gravity model and the no gravity model with parameters estimated in this study are in Fig. S4-S5. Again the doubly constrained gravity model prevails at each measurement.

5) A statistical measurement of the commuting distance
As is proposed in some previous studies [2,8,9,10], there may exist some simple scaling for a given region of total area and population . The length scale of a region is represented by √ . The expected scaling of the total distance travelled by all the population should be of the form:

√~
In one limiting cases, if every individual is going to the nearest neighbor (with a typical distance 1 √ while = / is the average density of the city), = 1/2. In another case, if everyone goes randomly, = 1. The empirical cases show that the value is usually around 0.6.
We did the same measurement for the 1000 different regions in the US and the cell phone users in Rwanda, Santo Domingo, and Lisbon. The results are shown in Fig. S5. The corresponding value is 0.75. Since we are only counting the commuting distance, the larger value shows that people are willing to travel longer for working than doing other activities.

6) Correlation between supply and demand at different scales
We've used San Francisco, the Bay area, and the west coast of US to show that at large scales population density can represent both commuting trip generation and attraction while at small scales such as within a city the attraction is better represented by distribution of job opportunities, in this case the POI density. Then there remains the question: to which scale can population density represent both trip generation and attraction?
To generalize and quantify this result, we change the scale gradually and sample multiple regions at each scale to observe the change in the correlation of: populationcommuting generation, populationcommuting attraction, POIcommuting generation, POIcommuting attraction. The result is in Fig. S7. The figure shows clearly that at large scales the four distributions are close to each other, while at smaller scales commuting trip generation is better represented by population density while commuting trip attraction is better represented by POI density.

7) The relation between the number of opportunities and distance
The main difference between the radiation model (also the intervening opportunity model) and the gravity model is the latter measures distance directly while the former represent 'distance' as the number of opportunities in between. Of course we'd expect the number of opportunities and the distance should generally have a positive correlation. But how high this correlation is? Fig.  S8 shows the scatter plot of number of opportunities versus distance for each OD pair observed in the census in San Francisco, the Bay area, and the west coast. Although the positive correlation is clear, but given a certain distance the range of number of opportunities is large which is caused by the heterogeneity of the density of opportunities.
To calculate the fractal dimension of the POI distribution, we regress on by applying = . is the density of POIs and is fractal dimension. The for the three regions are respectively 1.69, 1.17, 1.60.

8) The form of the distribution
In the derivation of the extended radiation model, the chosen ( ) distribution is the exponential distribution, while it is very reasonable to expect people could have either a scale-free or a well defined scale λ value distribution, so we did further explorations on this. In practice, what we cannot observe is each person's actual value while what we can observe is the number of opportunities each person has considered before choosing the job destination, which is approximated by the number of point of interests. So the shape of the > ( ) distribution can inform us which distribution to use.
If the distribution is in a power law form, such as a Pareto distribution: Then the probability of not accepting the closest opportunities is: The Γ function is the upper incomplete gamma function, defined as: For example, set = 3 and = 1, the > ( ) plot then becomes the curve shown in Fig. S9.
Comparing with Fig. 3(a) in the main text shows that the shape of this > ( ) distribution is different from the observed > ( ) distribution in the Bay area, West U.S., Portland-Seattle region, and L.A.-Las Vegas region. So we'd expect in regions larger than a city such a Pareto distribution is not suitable. But Fig. S9 shows a similar shape to the observed > ( ) distribution in San Francisco. As we can observe from Fig. 3(b) in the paper, at the city scale such as San Francisco, it is harder to find a close fit to the empirical data. At this scale the vs. relationship shown in Fig. 3(b) is a raw approximation. This might partly be because in such a compact and dense city like San Francisco, an exponential distribution is not most suitable, and this leads to an open question that what forms of distribution should be used in dense and compact regions or how to better model such cases.
In general, people's value distribution should not be changed by how we choose the region scale and zone size. But since in transportation and urban planning, practitioners divide a region into zones; trips within zones or outside the region boundary are not considered. Thus this filters out part of the trips, which causes the distribution to change when the region scale and zone granularity change.
On the other hand, people's distribution may have a well-defined scale. In this case if we assume that people's values are the same, then: Again, it differs from the observations in regions larger than a city, but might be suitable for some city scale regions; this is the origin of the introduction of the α parameter to account for the differences of the zones that constitute trip origins and destinations.
To sum up, at city scale we could explore different forms of distribution to find out how different characteristics of a city (scale, population density, etc) influence the best distribution; or to develop data-driven models to justify the form of the resulting distribution within cities.

9) Further comparison between the extended radiation model and reference models
As is shown by Equation (16) in the main text of the paper, in the original radiation model, the flow between two regions and is proportional to 2 , is the number of opportunities in region while is the number of opportunities between and . The extended radiation model is more flexible in that the flow is proportional to , ∈ (1, +∞). This is because: In the main text equation (16) is: The derivation process is: The Taylor expansion of (1 + ) is: This holds when ∈ (−1,1).
α → 0 occurs in dense and compact regions such as San Francisco, in such regions opportunities are more homogeneously distributed, so n j is at least order smaller than , the first term of the expansion will dominate the value because the expansion term decays as ( ) , so: Therefore in equation (16): Take this back to equation (15): The α value is cancelled out when applying the result of equation (15) to equation (11). In equation (11) the α in the denominator and the numerator will cancel out.
When the parameter → 0, → 1. So the minimum value of is 1, while in some cases (like small and dense cities) the required value is smaller than 1. This is the reason that we conclude under such scales it is hard to capture the accurate flows with general expressions for the distribution (as opposed to fitting actual trips). More detailed characteristics of the region such as landuse and road networks, as well as a more detailed model, may need to be considered at intra city scales (regions within the daily scale).
The limitations of previous models are further illustrated in Fig. S10. Each black square represents a study region while each blue circle means a populated zone with the same number of population and opportunities. The rest of the region is assumed to be un-populated.
The limitation of the original radiation model is shown by comparing Fig. S10 (a) with Fig. S10 (b). Since the model is not taking distance into account, it will give the same flow estimation from zone 1 to zone 3 in both sub-figures. In reality because the distance between zone 1 and zone 3 is longer in sub-figure (b), we would expect less people commuting between zone 1 and zone 3.
The limitation of the no constraint gravity model is shown in Fig. S10 (c). Because the distance between zone 1 and zone 3 is the same as the distance between zone 3 and zone 5, the model will give the same flow estimation from zone 1 to zone 3 and from zone 5 to zone 3. But since there is a highly populated region zone 7 between zone 1 and zone 3, we would expect some people originating from zone 1 being attracted to zone 7, so that less people will travel from zone 1 to zone 3. The gravity model cannot handle situations like this.
The extended radiation aims to solve both limitations indicated above. Because the number of opportunities is taken into account directly, it solves the limitation shown in Fig. S10(c). Also the scaling parameter partly solves the limitation shown in Fig. S10 (a) and (b). In (b) the value will be larger than the value in (a), causing more people originating from zone 1 to travel to zone 2 and less people to travel to zone 3.
The above analysis shows that both the borders/interface effect and the distribution of population/opportunities have influences on the applicability of the models. The extended radiation model partly solves the limitations, but there are still some open questions such as: 1. How to better quantitatively measure the homogeneity of the population/opportunity distribution and incorporate a distance function.
2. How to quantitatively measure the influence of the shape of the region.    S2. Comparison of the travelling distance ( ) distributions. The census commuting OD data is in black. The cell phone user seed OD matrix without IPF expansion is in green. The IPF expanded cell phone user seed matrix is in purple. The IPF expanded random seed matrix is in red. Only the IPF expended cell phone user seed matrix gives close fit to the census data. As for the IPF expanded random seed matrix, even though it has accurate marginal, it still deviates from the actual ( ) distribution.