Understanding congested travel in urban areas

Rapid urbanization and increasing demand for transportation burden urban road infrastructures. The interplay between the number of vehicles and the available road capacity on their routes determines the level of congestion. Although approaches to modify demand and capacity exist, the possible limits of congestion alleviation by only modifying route choices have not been systematically studied. Here we couple the road networks of five diverse cities with the travel demand profiles in the morning peak hour obtained from billions of mobile phone traces to comprehensively analyse urban traffic. We show that a dimensionless ratio of the road supply to the travel demand explains the percentage of time lost in congestion. Finally, we examine congestion relief under a centralized routing scheme with varying levels of awareness of social good and quantify the benefits to show that moderate levels are enough to achieve significant collective travel time savings.


Supplementary Figure: A typical depiction of rows of CDR data in Boston. User 12345678 makes a call from location A (Davis Square), then goes on to make two calls from location B (Boston City Hall), then makes one call from location C (MIT) at noon and another later, and makes one final call from location A at 8pm.
with the carriers that provide the data also vary. Supplementary Table 1 compiles descriptive statistics for these data sources for each city we explore in this paper.
Each individual call detail record consists of a hash string identifying the mobile phone user, a timestamp marking the time of the activity, and the described spatial information regarding the activity. CDR data inherently contains noise, as expected in any similar dataset. One reason for the noise is the set of algorithms that mobile phone carriers use for tower-to-tower call balancing to improve service. This operation creates discontinuities in the data that do not represent actual movement. To remove this noise and correct for other similar discrepancies, we apply a procedure generally used for GPS traces, referred to as a stay-point algorithm.
Jiang et al. provide a thorough review of these techniques in [3], and we adapt the stay-point algorithm originally described by Zheng et al. in [4]. In summary, the stay-point algorithm simplifies a sequence of calls within a specified spatiotemporal area: calls within a certain radius and timeframe are bundled together, pass-by points are removed, and stays remain. Each bundle is mapped to a representative point, the medoid of its calls. For all cities here, except Boston where the data is triangulated, this algorithm is applied in a modified way. A tower-based CDR dataset only roughly describes the region from which the call was made; that is, a user's position is only known up to the Voronoi cell of that tower. For this reason, the series of calls is simplified by serializing the calls made from towers within a certain distance.
For the temporal dimension, these calls are labeled as stays only if the user is known to be in that location for at least 10 minutes.
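The bundling step can be illustrated with a minimal Python sketch for triangulated (latitude, longitude) records. The 10-minute threshold comes from the text; the 300 m radius, the function names and the rule of comparing each record to the first record of its group are assumptions made for illustration, not the procedure used in this work (the tower-based variant is not shown).

```python
from math import radians, sin, cos, asin, sqrt


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000.0 * asin(sqrt(a))


def medoid(points):
    """Member of the group with the smallest total distance to all members."""
    return min(points, key=lambda p: sum(haversine_m(p, q) for q in points))


def stay_points(records, radius_m=300.0, min_stay_s=600):
    """Bundle time-sorted (timestamp_s, lat, lon) records into stays.

    radius_m is a hypothetical value; min_stay_s encodes the 10-minute rule.
    Returns a list of (arrival_s, departure_s, (lat, lon)) tuples.
    """
    stays = []
    i = 0
    while i < len(records):
        j = i + 1
        # Grow the group while later records lie within radius_m of its first record.
        while j < len(records) and haversine_m(records[i][1:], records[j][1:]) <= radius_m:
            j += 1
        group = records[i:j]
        if group[-1][0] - group[0][0] >= min_stay_s:
            stays.append((group[0][0], group[-1][0], medoid([r[1:] for r in group])))
        # Shorter groups are treated as pass-by points and dropped.
        i = j
    return stays
```

With this sketch, three calls near one location spanning 15 minutes yield one stay, while an isolated call or a pair of calls 100 seconds apart is discarded as a pass-by.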
One key point worth noting is that CDRs are passive in nature: except for a very small portion of the data, a mobile phone user's location is only visible in the data when they interact with their phone.
It is therefore possible for a user to have stayed at a location that the data classify as a pass-by, or to have visited other locations that cannot be distinguished due to the lack of phone interaction. This issue and other similar shortcomings resulting from the nature of the data are discussed in detail in previous work [5,6,7].

Supplementary Note 2
At the census tract (or equivalent) scale, we obtain the population and the vehicle usage rate of residents in that area. For US cities, the American Community Survey provides this data at the level of census tracts (each containing roughly 5000 people). Census data is obtained for Brazil through IBGE (Instituto Brasileiro de Geografia e Estatística).
Here $P_{\text{drive alone}}(i)$ and $P_{\text{carpool}}(i)$ are the probabilities that residents in zone $i$ drive alone or share a car, respectively, and $S = 2.5$ is the estimated average carpool size [8].
Conversely, Boston and the Bay Area have the highest vehicle usage rates, whereas in Rio de Janeiro people are less car-oriented. To assess how similar our five cities are in terms of CDR data sampling, we compare their expansion factors, defined as the ratio of the number of people living in a tract to the number of mobile phone users assigned that tract as a home location. All cities have a mean below 100, although outliers exist.
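As an illustration of the two quantities defined in this note, the sketch below assumes that the vehicle usage rate combines the two probabilities as $P_{\text{drive alone}}(i) + P_{\text{carpool}}(i)/S$, so that each carpool of $S$ people counts as one vehicle. This form is an assumption consistent with the definitions given here, and the function names are hypothetical.

```python
def vehicle_usage_rate(p_drive_alone, p_carpool, carpool_size=2.5):
    """Vehicles per resident in a zone (assumed form: P_alone + P_carpool / S)."""
    return p_drive_alone + p_carpool / carpool_size


def expansion_factor(tract_population, users_with_home_in_tract):
    """Residents represented by each phone user whose home is in the tract."""
    return tract_population / users_with_home_in_tract
```

For example, a tract where half of the residents drive alone and a quarter carpool would have a rate of 0.6 vehicles per resident under this assumption, and a tract of 5000 people with 100 assigned users would have an expansion factor of 50.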

Supplementary Note 3
Origin-destination (OD) information is traditionally modeled with data obtained from travel surveys, land use information and census data. First, estimates of trip production and attraction for zones are produced. These trips are then distributed among possible destinations across the city using calibrated gravity, radiation or similar models. Information from the survey is combined with mode choice models to split trips among travel alternatives. CDRs do not provide demographic and contextual information about travel patterns and behavior in as much detail as household travel surveys do. Mobile phones offer good but imperfect measurements of geographic position, due to the uncertainty of the location estimates and the nonuniform sampling frequency. However, millions of high resolution data points over a far longer observation period make CDRs a data source with high potential. Methods developed to incorporate CDRs therefore aim to find a balance between a small but complete dataset, namely household travel surveys, and a large but incomplete dataset, namely CDRs.
In incorporating CDRs into such methods, Alexander et al. and Colak et al. [6,7] outline a general framework.
Location frequencies are used to estimate each location's function for a user and to classify it as home, work or other. Consequently, the trips between these locations are assigned a trip purpose: home-based-work (commuting), home-based-other or non-home-based. Morning peak commuting and total trips are estimated from filtered users by analyzing consecutive observations at different stay points during the morning peak period (6am-10am). These trips are then normalized to represent the actual daily number of trips by measuring how often a user uses their phone, their average number of trips, and the number of days on which they were observed.
Finally, the number of trips is expanded by the ratio of the population of the source tract to the number of cell phone users in that tract. To consider trips made only by vehicles, we weight the obtained person trips by the vehicle usage rates in the home census tract of users. To estimate the peak hour traffic volume, the morning period of 6am-10am is weighted in accordance with the trip departure time distributions obtained in [7]. Peak hour demand occurs between 7:30am and 8:30am, and the average morning hour demand is multiplied by 1.5 to reflect the peak as per the departure time distributions. Another issue relating to the accuracy of findings is the choice of the administrative boundaries: due to the spatial precision of the data, certain aggregation levels work better than others. This problem is analyzed in detail in previous work, where pseudocode to generate OD matrices and comparisons to the outputs of traditional models can also be found [5,6,7].
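The chain of scalings described above can be written out as a short Python sketch for a single OD pair. It assumes that the "average morning hour demand" is the 6am-10am total divided by four hours; the function and parameter names are hypothetical.

```python
def peak_hour_vehicle_trips(morning_person_trips, expansion, vehicle_rate,
                            period_hours=4, peak_multiplier=1.5):
    """Peak hour vehicle trips for one OD pair.

    morning_person_trips: trips observed among phone users in the 6am-10am window
    expansion: tract population divided by phone users living in the tract
    vehicle_rate: vehicle usage rate of the users' home tract
    """
    expanded_person_trips = morning_person_trips * expansion
    vehicle_trips = expanded_person_trips * vehicle_rate
    # Assumption: the average hourly demand is the period total spread evenly over 4 hours.
    average_hourly = vehicle_trips / period_hours
    return average_hourly * peak_multiplier
```

For instance, 40 observed trips with an expansion factor of 50 and a vehicle usage rate of 0.6 correspond to 1200 vehicle trips over the morning period, 300 per average hour, and 450 in the peak hour.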

Supplementary Note 4
While road networks supplied by local municipalities in the form of shapefiles can often be useful, we have implemented a parser to construct routable road networks from OpenStreetMap (OSM) data due to its global availability. Nodes in OSM data represent points of interest, tags or intersections, and ways group references to nodes. Ways may also contain attributes such as the number of lanes or the speed limit, although many roads have this information missing. What all roads have in common, though, is the road classification, which varies between motorway, trunk, primary, secondary, tertiary and residential roads, as well as some other irrelevant categories. For our purposes, we filter out roads in the irrelevant categories, as well as residential roads, as they are not central to the congestion problem yet tend to increase computation time significantly. To ease computation, we also simplify the network by collapsing roads with only one incoming and one outgoing road if they are in the same road classification. To infer the missing data, we assign every road a speed limit, a number of lanes and a corresponding capacity based on its category and information in [9].
Motorways are generally major highways and have a speed limit of 60 mph with 3 lanes in each direction, whereas primary roads are 40 mph with 2 lanes. We assume the free travel time on a segment $i$ is $t_{f,i} = 1.3\, L_i / v_i$, with $L_i$ the road segment length and $v_i$ the speed limit. To estimate the capacity of a road segment, we utilize the following relationship [9] between the speed (mph) and the number of lanes:
$$\text{capacity (veh/hr)} = \begin{cases} (1500 + 30 \cdot \text{speed}) \times \text{lanes}, & 40 \le \text{speed} < 60 \text{ mph}, \\ (1700 + 10 \cdot \text{speed}) \times \text{lanes}, & \text{speed} \ge 60 \text{ mph}. \end{cases}$$
More information about the road networks can be found in Supplementary Table 1.
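The free travel time and capacity rules above can be sketched in Python as follows. The sketch assumes speeds in mph and lengths in miles, so that the free travel time comes out in hours; the function names are hypothetical, and because no capacity rule is stated for speeds below 40 mph, that case is rejected rather than guessed.

```python
def free_flow_time_hours(length_miles, speed_mph):
    """Free travel time t_f = 1.3 * L / v (hours, for miles and mph)."""
    return 1.3 * length_miles / speed_mph


def capacity_veh_per_hour(speed_mph, lanes):
    """Segment capacity from speed limit and lane count (speed assumed in mph)."""
    if speed_mph >= 60:
        per_lane = 1700 + 10 * speed_mph
    elif speed_mph >= 40:
        per_lane = 1500 + 30 * speed_mph
    else:
        # No rule is given for slower roads, so this sketch does not invent one.
        raise ValueError("no capacity rule available below 40 mph")
    return per_lane * lanes
```

Under these assumptions a 3-lane motorway at 60 mph has a capacity of 6900 vehicles per hour, and a 2-lane primary road at 40 mph has 5400 vehicles per hour.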
Road network modeling is considerably more complex than the simple extraction of the topology: realistic estimation of road capacities, lengths and travel times is essential. We present our findings in Supplementary Figure 3.
The road lengths and free travel times seem to follow a power law: free travel times range from ten seconds to as much as 20 minutes, with a similarly wide range for road lengths. Capacities are a direct result of road classes in OSM data: highways, trunks, primary, secondary and tertiary roads are all modeled to have different capacities and numbers of lanes. To assess overall supply more accurately, we also look at the product of the capacity and the length of the road networks. Our findings suggest that the Bay Area, in accordance with its size, has a comparably larger supply.

Supplementary Note 5
A long-standing problem in highway engineering has been the characterization of the relationship between the number of vehicles on a road segment, i.e. its volume, and the observed travel time on that road segment. Over the years, a number of different characterizations have been developed, ranging from conical volume-delay functions to more complex approaches [10,11,12]. One of the simplest and most common metrics used in determining the travel time associated with a specific flow level is the ratio of the volume of vehicles on the road to its maximum flow capacity, also referred to as volume-over-capacity or VoC. At low VoCs, drivers enjoy large spaces between cars and can safely travel at free-flow speeds. As roads become congested and VoC increases, drivers are forced to slow down. Based on the guidelines set by the Bureau of Public Roads [13], the VoC of each road segment is used to estimate the travel time according to Eq. 1:
$$t = t_f \left(1 + \alpha\, (\mathrm{VoC})^{\beta}\right) \qquad (1)$$
where $t_f$ refers to the travel time under free flow conditions, and $\alpha = 0.6$ and $\beta = 4$ are calibration parameters.
The relationship is depicted in Supplementary Figure 4. As a second calibration step, once the path-level travel times are obtained, we adjust the travel times using a city-specific parameter $k$, where $k_{bos} = k_{rio} = k_{bay} = -0.1$, $k_{lis} = 0$ and $k_{por} = 0.1$.
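For concreteness, the volume-delay relationship can be sketched in Python using the standard Bureau of Public Roads form, $t = t_f (1 + \alpha\,\mathrm{VoC}^{\beta})$, with the calibration values quoted in this note. The function name is hypothetical, and the second, city-specific adjustment is not included because its functional form is not given here.

```python
def bpr_travel_time(free_flow_time, volume, capacity, alpha=0.6, beta=4):
    """Travel time on a segment under the BPR volume-delay function."""
    voc = volume / capacity  # volume-over-capacity ratio
    return free_flow_time * (1 + alpha * voc ** beta)
```

With these parameters, a segment at capacity (VoC = 1) takes 1.6 times its free flow time, and a segment carrying twice its capacity takes 10.6 times its free flow time, which illustrates how steeply delay grows once demand exceeds supply.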

Supplementary Note 6
Traffic assignment is a very mature domain that has been studied extensively by urban and transportation planners. Static non-equilibrium approaches treat all users as homogeneous agents who make route choices prior to departure based on some heuristic related to current traffic conditions (e.g. the path that minimizes travel time). Incremental Traffic Assignment (ITA) is a variant of these static non-equilibrium assignment models that assigns batches of trips serially and updates costs between increments, as an improvement over the simplest all-or-nothing assignment methods. However, these methods result in solutions far from the Wardrop principles [14], under which no driver in the resulting system should have an incentive to deviate from their route choice. Many methods to compute the equilibrium have been proposed in the literature [15], the simplest being Frank-Wolfe (FW) based solutions. FW based algorithms are quick to implement but slow to converge to the optimal solution. They also provide no information about which OD pairs contribute what amount of flow to which road segments. Path based algorithms take a step towards path enumeration, but in large networks with a high number of origin-destination pairs and alternative paths, the memory and computational requirements grow very quickly [16,1,17]. A more efficient approach is the use of origin based algorithms, which are computationally feasible, have a fast convergence rate and do store path flows [18,19]. More complex assignment models aim to take into account the variability in travel times by adding stochasticity to link travel times [20].
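To make the incremental idea concrete, the following Python sketch applies ITA to a toy case: one OD pair served by parallel routes, so that the "shortest path" in each increment is simply the currently fastest route. The 40/30/20/10 percent batch split is a commonly used convention assumed here, the BPR parameters are those of Supplementary Note 5, and all names are hypothetical. It is not the assignment method used in this work.

```python
def incremental_assignment(demand, routes, fractions=(0.4, 0.3, 0.2, 0.1),
                           alpha=0.6, beta=4):
    """Assign demand in batches to parallel routes.

    routes: list of (free_flow_time, capacity) tuples, one per route.
    Returns the volume assigned to each route.
    """
    volumes = [0.0] * len(routes)
    for fraction in fractions:
        # Update route costs from the volumes assigned so far.
        times = [t0 * (1 + alpha * (v / c) ** beta)
                 for (t0, c), v in zip(routes, volumes)]
        # All-or-nothing step: the whole batch takes the currently fastest route.
        fastest = min(range(len(routes)), key=times.__getitem__)
        volumes[fastest] += fraction * demand
    return volumes
```

For 1000 trips over two routes with free flow times of 10 and 12 minutes and capacities of 500 vehicles per hour each, the batches alternate between the routes and end at 600 and 400 trips. The resulting travel times are not equal, which is the gap to the Wardrop conditions that equilibrium algorithms close.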
The process with which people choose routes is also of great interest to researchers, under the umbrella of route choice models. Prato (2009) presents a good overview of the wide literature on this subject, ranging from logit models to path set generation algorithms [21]. For the scope and the aggregate nature of our work, we opt to implement a static assignment model.
In this work, we follow Algorithm B, proposed in [1], along with modifications and improvements outlined in [2]. It is an origin based algorithm that focuses on the equilibration of a graph structure referred to as a bush: a directed acyclic graph (DAG) emanating from every origin node, which is introduced to the graph as the centroid of the origin tract. These structures are used with the reasonable assumption that in the equilibrium flows no directed cycles should exist, as no driver has an incentive to increase their travel time. The computational efficiency of this algorithm stems from the fact that DAGs can be traversed in linear time. The algorithm used in this work is outlined in Supplementary Figure 5.
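The linear-time property can be illustrated with a short Python sketch that computes shortest travel times from an origin over a DAG by relaxing edges in topological order. This is only the traversal building block, not Algorithm B itself; the adjacency format and function name are assumptions made for illustration.

```python
from collections import deque


def dag_shortest_times(edges, origin):
    """Shortest travel times from origin in a DAG.

    edges: dict mapping node -> list of (successor, travel_time).
    Runs in O(nodes + edges): each edge is relaxed exactly once.
    """
    nodes = set(edges)
    for successors in edges.values():
        nodes.update(n for n, _ in successors)

    indegree = {n: 0 for n in nodes}
    for successors in edges.values():
        for n, _ in successors:
            indegree[n] += 1

    # Kahn's algorithm: repeatedly remove nodes with no remaining incoming edges.
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v, _ in edges.get(u, []):
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph contains a directed cycle, so it is not a bush")

    best = {n: float("inf") for n in nodes}
    best[origin] = 0.0
    for u in order:  # every predecessor of a node is finalized before the node itself
        for v, t in edges.get(u, []):
            if best[u] + t < best[v]:
                best[v] = best[u] + t
    return best
```

Unlike Dijkstra's algorithm on a general graph, no priority queue is needed here, which is why restricting each origin's flows to an acyclic bush keeps the per-origin computations cheap.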
In these algorithms, the objective is to minimize the distance between the current solution and the optimal solution. In this work, the relative gap is used as the measure of convergence:
$$r_g = 1 - \frac{\sum_{od} d_{od}\, t_{od}}{\sum_{e} t_e\, v_e}$$
where $d_{od}$ and $t_{od}$ represent the demand and the travel time between an origin and a destination, and $t_e$ and $v_e$ represent the travel time and the volume on a road segment $e$. The numerator and the denominator essentially measure the same thing: the total travel time in the system. Theoretically, $r_g$ is supposed to be equal to zero.
This ensures that all drivers in the system are in fact taking the shortest possible routes, and the optimization problem is fully solved. Traffic assignment algorithms aim to bring r g as close to zero as possible.
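A minimal Python sketch of this convergence measure is given below. It assumes the common form $r_g = 1 - \sum_{od} d_{od} t_{od} / \sum_e t_e v_e$, with $t_{od}$ taken as the shortest-path travel time under the current link times; the dictionary-based inputs and the function name are hypothetical.

```python
def relative_gap(od_demand, od_shortest_time, link_time, link_volume):
    """Relative gap between current and shortest-path total travel time.

    od_demand, od_shortest_time: dicts keyed by (origin, destination)
    link_time, link_volume: dicts keyed by road segment id
    """
    # Total time if every trip used its current shortest path.
    shortest_total = sum(od_demand[od] * od_shortest_time[od] for od in od_demand)
    # Total time actually experienced on the loaded network.
    current_total = sum(link_time[e] * link_volume[e] for e in link_time)
    return 1.0 - shortest_total / current_total
```

As a toy check, take 1000 trips split evenly over two parallel links with travel times of 10 and 15 minutes: the shortest-path total is 10000, the experienced total is 12500, and the gap is 0.2. When both used links have the same travel time, the two totals coincide and the gap is zero.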
A critical design element of the implementation of origin based algorithms is the modeling of tract centroids, representing an aggregation of all the actual origins and destinations within the area, and the connectors, the hypothetical segments representing driver movement within the tract before joining the modeled road network [22]. Supplementary Figure 6 depicts the implementation of connectors in this work, where tract centroids are connected to the four nearest intersections.
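The connector construction can be sketched in Python as a nearest-neighbour lookup. The sketch assumes planar (projected) coordinates and straight-line distance, which may differ from the distance measure used in this work; the function name and input format are hypothetical.

```python
import heapq
from math import hypot


def nearest_intersections(centroid, intersections, k=4):
    """Ids of the k intersections closest to a tract centroid.

    centroid: (x, y) in projected coordinates
    intersections: dict mapping node id -> (x, y)
    Returns node ids ordered from nearest to farthest.
    """
    def distance(node_id):
        x, y = intersections[node_id]
        return hypot(x - centroid[0], y - centroid[1])

    return heapq.nsmallest(k, intersections, key=distance)
```

Each returned node would then be joined to the centroid by a hypothetical connector segment, so that trips originating in the tract can enter the modeled network at several points rather than loading a single intersection.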