Pervasive infrastructures, such as cell phone networks, make it possible to capture large amounts of human behavioral data, but they also provide information about the structure of cities and their dynamical properties. In this article, we focus on these last aspects by studying phone data recorded during 55 days in 31 Spanish cities. We first define an urban dilatation index which measures how the average distance between individuals evolves during the day, allowing us to highlight different types of city structure. We then focus on hotspots, the most crowded places in the city. We propose a parameter-free method to detect them and to test the robustness of our results. The number of these hotspots scales sublinearly with the population size, a result in agreement with previous theoretical arguments and measurements on employment datasets. We study the lifetime of these hotspots and show in particular that the hierarchy of permanent ones, which constitute the ‘heart’ of the city, is very stable whatever the size of the city. The spatial structure of these hotspots is also of interest and allows us to distinguish different categories of cities, from monocentric and “segregated” cities, where the spatial distribution is very dependent on land use, to polycentric cities, where the spatial mixing between land uses is much more important. These results point towards the possibility of a new, quantitative classification of cities using high resolution spatio-temporal data.
Pervasive, geolocalized data generated by individuals have recently triggered a renewed interest in the study of cities and urban dynamics, and in particular of individual mobility patterns1. Various data sources have been used, such as car GPS2, RFIDs for collective transportation3, and data from social networking applications such as Twitter4 or Foursquare5. A recent and very important source of data is individual mobile phone data6,7. These data have made it possible to study individual mobility patterns with a high spatial and temporal resolution8,9,10, to detect urban land uses automatically11, and to detect communities based on human interactions12.
Morphological aspects, such as the quantitative characterization and comparison of cities through their density landscape, their space consumption, their degree of polycentrism, or the clustering degree of their activity centers, have meanwhile been studied for a long time in quantitative geography and spatial economy13,14,15,16,17,18,19,20,21. Until the late 2000s, these quantitative comparisons of urban forms were based on census data, transport surveys or remote sensing data, all of which estimate the density of individuals and land uses in the city at a fine spatial granularity but at a much coarser temporal one. We note here that early studies in quantitative urban geography22,23 estimated the density of individuals at various hours of the day in city centers using transport surveys and handmade cord counts and could follow the morphological and socio-economic evolution of cities during a typical weekday. Additionally, many traffic surveys in cities worldwide have long provided a general knowledge of the timing of urban mobility. However, given their temporal resolution and the lack of adequate data, these studies could not investigate some interesting questions related to the dynamical properties of the spatial structure of cities: how much does the city shape change through the course of the day? Where are the city's hotspots located at different hours of the day? How are these hotspots spatially organized? Are the hierarchy and the spatial organization of hotspots robust through time? Is there some kind of typical distance(s) characterizing the permanent core, or ‘backbone’, of each city? Mobile phone data contain spatial information about individuals and how it evolves during the day. These datasets thus give us the opportunity to answer such questions and to characterize quantitatively the spatial structure of cities24.
In this article, we address some of these questions using mobile phone data for a set of 31 Spanish cities shown in Figure 1. We focus on the spatio-temporal properties of cities and, by defining new metrics, study their structural properties and reveal interesting patterns of urban systems.
Our analysis is based on aggregated and anonymized mobile phone data and concerns 31 Spanish urban areas studied during weekdays. These urban areas are very diverse in terms of geographical location, area, population size and density, as illustrated in Figure 2. In particular, the wide range of population sizes will allow us to test some scaling relations and also to identify various behaviors. We will first describe the dataset and then present the results obtained about several aspects of cities.
Our analysis is based on a mobile phone dataset provided by a Spanish telecommunications operator. The aggregated records represent the number of unique individuals using a given antenna for each hour of the day. No individual information or records were available for this study. These data provide snapshots of the spatial distribution of people in the city at successive points in time. We have this information for the 31 Spanish urban areas of more than 200,000 inhabitants, and for 55 days. The number of users (per hour) represents on average 2% of the total population and at most 5% of the total population. These percentages are almost the same for all the urban areas. Given the irregularity of the spatial distribution of the antennas in each city and from one city to another, we spatially aggregated the numbers (or densities) of users recorded each hour at each mobile phone antenna on a regular square grid of varying cell size a, in order to simplify comparisons of indicators between cities, as shown in Figure 3. The choice of the spatial scale of data aggregation is known to be an important source of bias in spatial analysis25, hence we tested the robustness of our results on three different sizes of grid cells (see section Methods for details).
In order to get a preliminary grasp of the data, we plot the time evolution of the number of users over the day and check whether it follows the same pattern in every city. Figure 4 shows the average number of mobile phone users per hour according to the day of the week for six cities. Globally, the number of phone users is significantly higher during weekdays than during weekends, except at night. From 11pm to 8am the number of users is relatively low; it reaches a minimum at 5am during weekdays and at 7am during the weekend. For all cities we observe two activity peaks, one at 12pm (noon) during weekdays (1pm during the weekend) and another at 6pm during weekdays (8pm during the weekend).
In order to compare the values obtained for different cities, we rescale them by the total number of users for an average weekday. We show the results in Figure 5. The rescaled plot suggests the existence of a single ‘urban rhythm’ common to all cities. The data collapse is very good in the morning, while in the afternoon we observe a little more variability from one city to another. It is interesting to note that in four cities located in the western part of Andalusia (Sevilla, Granada, Cordoba and Jerez de la Frontera) the activity restarts later in the afternoon, around 5pm, one hour later than in the other cities.
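The rescaling described above can be sketched in a few lines of Python. This is only an illustration: the function name and the hourly counts are hypothetical and are not taken from the dataset, and the day's total is assumed to be the sum of the 24 hourly counts.

```python
def rescale_profile(hourly_users):
    """Divide each hourly count by the day's total so that cities of
    different sizes can be compared on the same scale."""
    total = sum(hourly_users)
    if total == 0:
        raise ValueError("profile has no users")
    return [n / total for n in hourly_users]


# Hypothetical hourly counts (24 values) for two cities of different sizes
# that share the same daily rhythm.
small_city = [10, 8, 6, 5, 4, 4, 6, 12, 20, 28, 32, 35,
              36, 33, 30, 31, 33, 35, 36, 32, 26, 20, 15, 12]
large_city = [10 * n for n in small_city]

small_profile = rescale_profile(small_city)
large_profile = rescale_profile(large_city)
```

With these inputs the two rescaled profiles coincide, which is the kind of data collapse the rescaled plot is meant to reveal.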
Global weighted indicators versus hotspots analysis
Essentially, the mobile phone data give access to the local density ρ(i, t) of users at a location i and at a time t. The difficulty is then to study this complex object, which varies in both time and space. We will consider here two main directions to tackle this problem. The first one is to define global indicators that consider all points and weight them by the user density. The second approach consists in identifying local maxima of the function ρ(i, t), or in other words, the hotspots. Each method has pros and cons. Looking at hotspots is convenient since it provides a clear picture of the important locations in the city, but their determination involves some arbitrariness. Working with weighted indices, on the other hand, does not require identifying hotspots, but produces results that are more difficult to interpret. These two approaches can however be seen as complementary since they highlight different properties of the city: weighted indices inform us about the global properties of a given city, while the hotspots give us a more local look and allow us to concentrate on the ‘heart’ of the city. This is why in the following we will successively apply the two methods.
Urban dilatation index
The average weighted distance DV (t) between individuals in the city (see section Methods for the precise definition) and its evolution during the course of an average weekday provides a first interesting indicator about the organization of the city. Figure 6 (a) shows the evolution of this normalized average, weighted distance during a typical weekday. We can essentially distinguish two broad categories according to the spatial organization of residences and activities:
In the simple picture of a typical monocentric city with a predominant Central Business District (CBD), the city collapses in the morning when people living in the suburbs commute to their workplaces, and expands in the evening when they return home. We then expect a large variation (at the city scale) of the average distance DV. In this case, activity and residential places are spatially “segregated”.
For more polycentric cities, where residential and work places are spatially less separated, we expect a smaller variation of DV than the one observed for monocentric cities. Here activity places and residential areas are more “mixed”.
For all cities we observe the same typical pattern with two peaks (see Figure 6(a)). The first occurs around 7am, when people switch on their mobile phones, probably at home or at the entry points of the transportation system. We then see a decrease of the distance (the city ‘collapses’), displaying a spatial concentration of individuals during the middle of the day, mainly corresponding to the activity period for most individuals (workers/students). During the afternoon we see a second, smaller peak spread over 4–5pm, when people start going back home. This afternoon peak is less pronounced, suggesting a higher variety of mobility behaviors at the end of the day. The interesting feature of these curves is the amplitude of the variation, which informs us about the importance of this collapse phenomenon. Despite the fact that several factors such as phone use or behavioral factors affect these variations, we observe a common pattern: a pronounced peak at the beginning of the day and a minimum usually observed in the middle of the day. From this curve it is then natural to calculate for each city a ‘dilatation coefficient’ defined as $\mu = \max_t D_V(t) / \min_t D_V(t)$. We show in Figure 6(b) the rank plot of this dilatation index obtained for the 31 cities, where we can distinguish roughly three groups of cities. For the first group, with a value of µ around one, the average distance stays approximately constant throughout the day. This means that whatever the hour of the day, the spatial spread of the high density locations does not change significantly. High density locations correspond to different activities depending on the moment of the day, and a small value of the dilatation coefficient implies that daytime activity places (work places, schools, leisure places) are not more spatially concentrated than residences. Homes and activity places are more entangled, supporting the picture of more ‘mixed’ cities, such as Madrid for example.
In the opposite case of large values, the spatial organization of the different high-density locations changes significantly over the day. A typical example would be a monocentric city where individuals are localized in the CBD during the day and where residences are spread all around the center. In our set, Zaragoza for example is representative of this type of city. For the intermediate group the cities display a less marked behavior, probably resulting from a combination of monocentric and polycentric features.
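As an illustration, the dilatation coefficient can be computed from an hourly series of average weighted distances as sketched below. The sketch assumes that µ is the ratio between the maximum and the minimum of DV(t) over the day, which is consistent with µ being close to one when the distance stays approximately constant; the function name and the sample values are hypothetical.

```python
def dilatation_coefficient(distances):
    """Ratio max/min of the average weighted distance D_V(t) over the day
    (assumed definition of the dilatation coefficient)."""
    lowest = min(distances)
    if lowest <= 0:
        raise ValueError("distances must be strictly positive")
    return max(distances) / lowest


# Hypothetical D_V values (in km) sampled at a few hours of the day.
mixed_city = [5.0, 5.1, 5.2, 5.0, 4.9, 5.0]        # nearly constant
monocentric_city = [6.0, 6.5, 5.0, 4.0, 4.2, 5.5]  # strong midday collapse

mu_mixed = dilatation_coefficient(mixed_city)
mu_monocentric = dilatation_coefficient(monocentric_city)
```

In this toy example the first series yields a value close to one, while the second yields a clearly larger one, mirroring the distinction between ‘mixed’ and monocentric cities.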
Identifying the hotspots
Identifying hotspots amounts to identifying local maxima of the density surface of users. A simple method is to choose a threshold δ and to consider that every point i whose density exceeds this threshold, ρ(i, t) > δ, is a hotspot at time t. Most methods proposed so far rely on this simple argument, but there is obviously some arbitrariness in the choice of δ. In contrast, we discuss here two extreme choices for the threshold value (all technical details can be found in the Methods section). The lower threshold δmin corresponds to the average value of the density, which is indeed a reasonable, minimal requirement for a local maximum. Based on considerations about the Lorenz curve of the density, we are also able to determine another value δmax which can be considered as the maximal reasonable value for δ. In the following we will distinguish the ‘Average’ method from the ‘LouBar’ method, which correspond to the two values δmin and δmax, respectively. The most important point here is that once the lower and upper bounds for the threshold have been determined, we can test whether a result is robust with respect to the choice of δ. In other words, if a given result is qualitatively the same when considering the lower and upper bounds for δ, the result can safely be considered as an intrinsic feature of the system.
Number of hotspots
We first focus on the number of hotspots. Using both methods, ‘Average’ and ‘LouBar’, for each city we count the number of hotspots at each hour of the day, compute the average over the day and see how this average number scales with the population size of the city. This measure is motivated by recent theoretical and empirical work29 that has highlighted a clear sublinear relation between the population size of cities and their number of activity centers (defined as employment hotspots). For the U.S., it has been shown that the number of activity centers Na (determined from employment data) scales with the population size P as $N_a \sim P^{\beta}$ with β ≈ 0.64. Figure 7 displays the number H of hotspots versus the population for the set of the 31 biggest Spanish cities considered here. The power law fit confirms the result obtained in ref. 29 that there is a sublinear relation and, remarkably enough, that the value of the exponent is of the same order. We note here that this result is robust against the thresholding criteria used to define hotspots (see also section Methods for aggregation grids with different cell sizes). We also note that recent empirical work30 has highlighted the sensitivity of the values of scaling law exponents to the choice of city boundaries. This result underlines the crucial role of city definition when attempting to identify patterns of behavior across cities, and the need for consistency in defining the spatial boundaries of cities for such comparisons26. This is why we rely on the spatial delimitations of urban areas, which are harmonized delimitations based on the ratio of home-work commuting journeys (see Methods for details).
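A scaling exponent of this kind can be estimated, for example, by an ordinary least-squares fit in log-log space. The sketch below is an assumption about one possible fitting procedure (the text does not specify the one actually used), and the populations and counts are synthetic values generated from an exact power law, not data from the study.

```python
import math


def fit_power_law(populations, counts):
    """Least-squares fit of log(counts) = log(c) + beta * log(populations).
    Returns the prefactor c and the exponent beta."""
    xs = [math.log(p) for p in populations]
    ys = [math.log(h) for h in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    prefactor = math.exp(mean_y - beta * mean_x)
    return prefactor, beta


# Synthetic example: counts follow H = 0.01 * P**0.6 exactly.
populations = [2e5, 5e5, 1e6, 3e6, 6e6]
counts = [0.01 * p ** 0.6 for p in populations]

prefactor, beta = fit_power_law(populations, counts)
```

A fitted exponent below one corresponds to the sublinear behavior discussed above; on real data the fit would of course carry an uncertainty that this sketch does not estimate.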
Stability of the hotspots hierarchy
Another interesting feature to inspect in cities is the stability of their hotspots and the evolution of their relative importance according to the hour of the day, which is related to the evolution of the hierarchy of places in the city. In order to capture the behavior of cities in these respects, we plot various indicators. We start with the histogram of the persistence of hotspots: for each city we count the number of one-hour time bins during which each cell has been a hotspot. We then plot the distribution of the hotspots' ‘lifetime’ (measured in number of one-hour bins), as shown in Figure 8 for the eight largest Spanish cities. Figure 8 highlights the importance of ‘permanent’ hotspots, i.e. locations which are hotspots during the whole day. Each city has its own number of such important locations, which form the ‘heart’ of the city. In addition to the permanent hotspots we also observe two other main groups: a set of intermediate hotspots (with a lifetime of the order of half a day) and ‘intermittent’ hotspots that are present only a few hours per day. We note that these groups are robust with respect to the hotspot definition, that is, when defined with the ‘Average’ criterion (top line of each histogram) and with the ‘LouBar’ criterion (bottom line).
The permanent hotspots are the most important locations in the city in terms of density of individuals. An interesting question is whether their rank (according to the density) is constant or changes during the day. In order to test the stability in time of the hierarchy of permanent hotspots, we compute the Kendall tau value τ(t) of the set of permanent hotspots (see the Methods section for definition and for the plots). Our results show that the heart of the cities is indeed very stable both in space and in time, whatever their size.
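The lifetime count can be sketched as follows. The cut-offs follow those given later in the text (intermittent hotspots last six hours at most, intermediate ones between 7 and 23 hours, permanent ones the whole day); the function names, the cell labels and the toy day are hypothetical.

```python
from collections import Counter


def hotspot_lifetimes(hourly_hotspots):
    """Count, for each cell, the number of one-hour bins in which it is a hotspot.
    `hourly_hotspots` is a sequence of sets of cell identifiers, one per hour."""
    lifetimes = Counter()
    for cells in hourly_hotspots:
        lifetimes.update(set(cells))
    return lifetimes


def classify(lifetime, hours_per_day=24):
    """Assign a hotspot to one of the three groups from its lifetime in hours."""
    if lifetime >= hours_per_day:
        return "permanent"
    if lifetime >= 7:
        return "intermediate"
    return "intermittent"


# Toy day: cell "a" is always a hotspot, "b" from 8h to 19h, "c" from 12h to 14h.
hourly_hotspots = []
for hour in range(24):
    cells = {"a"}
    if 8 <= hour < 20:
        cells.add("b")
    if 12 <= hour < 15:
        cells.add("c")
    hourly_hotspots.append(cells)

lifetimes = hotspot_lifetimes(hourly_hotspots)
groups = {cell: classify(count) for cell, count in lifetimes.items()}
```

The histogram of the values in `lifetimes` is the toy analogue of the lifetime distribution plotted for each city.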
Spatial structure of the hotspots
Another important question about hotspots concerns their spatial organization. We start with the specific group formed by the permanent hotspots, as defined by our more restrictive criterion ‘LouBar’ (see Methods section). We compute how distant they are from each other, compared to the typical size of the city given by $\sqrt{A}$, where A is the city's area. We show in Figure 9 the rank-plot of our ‘compacity coefficient’, defined as the ratio $\langle D_{per}(i) \rangle / \sqrt{A_i}$, where 〈Dper(i)〉 is the average distance between permanent, weekday hotspots in city i, and Ai is the area of the city i. This indicator tells us how widely the permanent hotspots are spread over the city's space, and it is thus a measure of the compacity of the city core: for cities with values around 0, the permanent hotspots are very close to one another when compared to the spatial extension of the urban area. On the contrary, a value close to one indicates that these always-crowded places are spread over the whole city space (see Figure 9). It is interesting to note in Figure 9(b) that the compacity coefficient of a city seems to increase with the population size. At least for a large subset of cities, we indeed observe this trend, which is consistent with the idea that the larger the city, the more spread out its hotspots are (and the more polycentric it tends to be).
For each city, once we have determined the hotspots and have classified them into permanent, intermediary and intermittent, we measure the average distance between hotspots within each group. For example we can look at 〈Dper hotspots〉/〈Dint hotspots〉, the ratio between the typical distance separating permanent hotspots and the typical distance separating intermittent hotspots. Since the intermittent hotspots are those with a lifespan of six hours at most, they are more inclined to capture the residential locations, while the permanent hotspots represent the dominant places of the city, that is, zones that are very dense both during daytime and nighttime. In Figure 10(a) we plot the histogram of this ratio for all cities, for the two hotspot delimitation criteria (see section Methods for these plots with different sizes of the aggregation grid). We can see in this plot that the distribution is centered around 0.6 (with similar results for the more restrictive LouBar criterion). We also computed the ratio of the average distance between intermittent hotspots and the average distance between intermediary hotspots (i.e. those that are neither intermittent nor permanent, and are thus present between 7 and 23 hours per day). We plot the histogram of this ratio for all 31 cities in Figure 10(b). The distribution is peaked around 0.95-1, with lower values of standard deviation, which means that intermittent and intermediate hotspots are, on average, equally dispersed and that the significant differences lie in the spatial organization of permanent vs. non-permanent hotspots.
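A minimal sketch of this computation is given below, assuming that the compacity coefficient is the mean pairwise distance between permanent hotspots divided by the square root of the city's area. Hotspots are represented by hypothetical planar coordinates in kilometres, and the function names are illustrative only.

```python
import math
from itertools import combinations


def mean_pairwise_distance(points):
    """Average Euclidean distance over all pairs of points (x_km, y_km)."""
    pairs = list(combinations(points, 2))
    if not pairs:
        raise ValueError("at least two points are required")
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


def compacity(permanent_hotspots, area_km2):
    """Mean distance between permanent hotspots, normalized by sqrt(area)."""
    return mean_pairwise_distance(permanent_hotspots) / math.sqrt(area_km2)


# Hypothetical city of 144 km^2 with three permanent hotspots.
permanent_hotspots = [(0.0, 0.0), (3.0, 4.0), (0.0, 8.0)]
coefficient = compacity(permanent_hotspots, area_km2=144.0)
```

Here the three pairwise distances are 5, 8 and 5 km, so the mean is 6 km and the coefficient is 0.5 for a typical city size of 12 km.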
We have shown in this study that it is possible to extract relevant information from mobile phone data, not only about the mobility behavior of individuals, but also about the structure of the city itself. We have defined various indices that allow us to characterize some dynamical morphological properties of cities, including the evolving average distance between individuals in the city through the course of the day. Such dynamical properties can serve as a basis to propose new classifications of cities. We have also presented a generic method to determine the dominant centers, the hotspots, and we have confirmed recent results (obtained on completely different data) showing that the number of activity centers in cities scales sublinearly with the population size of the city. We have also highlighted some properties of hotspots in Spanish cities, such as the strong stability of the hierarchy of the hotspots over the day, whatever the city size. These results constitute a step towards a quantitative typology of cities and their spatial structure, an important ingredient in the construction of a science of cities.
They also raise questions that could be addressed in future studies. In particular, we could ask if these morphological patterns are universal, and to what extent they are specific to Spanish cities. More generally, they might be specific to European cities, whose urbanization history is older than in other continents, resulting in urban systems with specific morphological properties14,26. Also, it would be interesting to investigate if the time dynamics observed here are similar in cities of recently urbanized and fast growing regions. In this respect, repeating the measures proposed in this paper on cities worldwide where mobile phone datasets are available would bring invaluable information on the spatial organization of urban systems.
Finally, an inevitable direction for further studies will be to bridge the existing knowledge about centrality patterns in cities with those revealed by new sources of geolocalized data. This could for example include the comparison of recent results based on pervasive geolocalized data with morphological properties of cities extracted from mobility surveys and remote sensing data (see for example refs 17,21 for recent international comparisons). The centrality extracted from the road network structure has also been shown recently to be correlated with economic activity27,28 and it would be interesting to understand how these network properties compare with patterns extracted from pervasive geolocalized data.
Spatial delimitation of cities
Comparing the spatial structure of cities of very different population sizes and areas requires a harmonized definition of cities that goes beyond the arbitrariness of administrative boundaries26,31. To this end we have chosen to rely on the urban areas defined by the AUDES initiative (Areas Urbanas De ESpaña)36, which provide coherent delimitations of cities with respect to the home-work commuting patterns of individuals living in the core city of the metropolitan areas and in their surrounding municipalities. These delimitations are built upon statistical criteria based on the proportion of residents of surrounding municipalities that commute to the main city to work.
Average distance between individuals and dilatation index
We started with the Venables index16, defined as $V(t) = \sum_{i<j} s_i(t)\, s_j(t)\, d_{ij}$, with si(t) = ni(t)/N(t) the share of individuals present in cell i at time t, and dij the distance between i and j. When all activity is concentrated in one spatial unit only, the minimum value zero of V is reached. An important point of this dilatation index is that one does not need to determine hotspots to compute it. By normalizing V by the densities, we can compute a weighted average distance, the ‘Venables distance’ $D_V(t) = \sum_{i<j} s_i(t)\, s_j(t)\, d_{ij} \,/\, \sum_{i<j} s_i(t)\, s_j(t)$. In order to compare the value of DV across cities, we compute $D_V(t)/\sqrt{A}$, with A the area of the city. By considering all pairs of cells and weighting their distance by the densities of individuals in each of them, DV(t) signals how distant the important places of the city at time t are from each other.
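The weighted average distance can be sketched as follows. The sketch assumes that the Venables distance is the sum over pairs of cells of the product of their shares times their distance, divided by the sum over pairs of the product of shares, and it uses Euclidean distances between hypothetical cell centers; names and values are illustrative only.

```python
import math
from itertools import combinations


def venables_distance(counts, centers):
    """Weighted average distance between individuals.
    counts[i]  : number of users in cell i at a given hour
    centers[i] : (x_km, y_km) center of cell i
    Returns 0.0 when all users sit in a single cell."""
    total = sum(counts)
    shares = [n / total for n in counts]
    numerator = 0.0
    denominator = 0.0
    for i, j in combinations(range(len(counts)), 2):
        weight = shares[i] * shares[j]
        numerator += weight * math.dist(centers[i], centers[j])
        denominator += weight
    if denominator == 0.0:
        return 0.0
    return numerator / denominator


# Three hypothetical cells aligned on a line, 1 km apart.
centers = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
spread_out = venables_distance([1, 1, 1], centers)      # users evenly spread
concentrated = venables_distance([10, 0, 0], centers)   # users in one cell
```

Dividing the result by the square root of the city's area then gives the normalized quantity used to compare cities.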
Identification of the hotspots
The data give access to the spatial density ρ(i, t) of users at different moments. The full density is a complex object from which we have to extract relevant and useful information. The locations that display a density much larger than the others, the hotspots, give a good picture of the city by showing where most of the people are. The hotspots thus contain important information about points of interest and activities in the city.
The determination of centres and subcentres is a problem which has been broadly tackled in urban economics32,33,34. Starting from a spatial distribution of densities, we have to identify the local maxima. This is in principle a simple problem solved by the choice of a threshold δ for the density ρ: a cell i is a hotspot at time t if the instantaneous density of users ρ(i, t) > δ. This is for example what was done in ref. 32 to determine employment centres in Los Angeles. It is however clear that this method introduces some arbitrariness due to the choice of δ, and also requires prior knowledge of the city to which it is applied in order to choose a relevant value of δ. Nonparametric methods have also been applied to determine the number of centres, some based on the regression of the natural logarithm of employment density on distance from the centre33, some on the exponent of the negative exponential fit of the density distribution35. A limitation of these methods is that they return a single number of centres, which could be biased when the actual density distribution is not properly fitted by an exponential law. Here we will propose an alternative method that allows us to control the impact of this choice.
A first simple criterion is to choose as threshold the average m of the density distribution at time t: all the cells whose density is larger than m are hotspots. This is indeed a weak definition of what can be considered as a hotspot, and we propose here to use it as a ‘lower’ bound δmin = m.
In order to understand how the various properties of hotspots depend on this definition, we introduce a more restrictive definition which will be considered as an upper bound of what can be considered as a hotspot. In the following we discuss how to find this upper bound. In order to characterize the disparity of the activity in the city and to isolate the dominant places, we first plot the Lorenz curve of the density distribution in the city at each hour. The Lorenz curve, a standard object in economics, is a graphical representation of the cumulative distribution function of an empirical probability distribution. For a given hour, we have the distribution of densities ρ(i, t), which we sort in increasing order and denote by ρ(1, t) < ρ(2, t) < … < ρ(n, t), where n is the number of cells. The Lorenz curve is constructed by plotting on the x-axis the proportion of cells F = i/n and on the y-axis the corresponding proportion of users density L, with $L(i, t) = \sum_{j=1}^{i} \rho(j, t) \,/\, \sum_{j=1}^{n} \rho(j, t)$. If all the densities were equal, the Lorenz curve would be the diagonal from (0, 0) to (1, 1). In general we observe a convex curve lying below the diagonal, with a more or less strong curvature, and the area between the diagonal and the actual curve is related to the Gini coefficient, an important indicator of inequality used in economics.
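The construction of the Lorenz curve described above can be sketched as follows. The function name and the sample densities are hypothetical; the point (0, 0) is added at the start of the curve for convenience.

```python
def lorenz_curve(densities):
    """Return the points (F, L) of the Lorenz curve of a list of densities.
    F is the proportion of cells (sorted by increasing density) and
    L the corresponding cumulative proportion of the total density."""
    ordered = sorted(densities)
    total = sum(ordered)
    n = len(ordered)
    fractions = [0.0]
    cumulative = [0.0]
    running = 0.0
    for rank, rho in enumerate(ordered, start=1):
        running += rho
        fractions.append(rank / n)
        cumulative.append(running / total)
    return fractions, cumulative


# Hypothetical densities for four cells at a given hour.
fractions, cumulative = lorenz_curve([4, 1, 2, 1])
```

For these four cells the curve passes through L = 0.125, 0.25 and 0.5 before reaching 1, and it stays below the diagonal because one cell holds half of the total density.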
In the Lorenz curve, the stronger the curvature the stronger the inequality and, intuitively, the smaller the number of hotspots. This remark allows us to construct a new criterion by relating the number of dominant hotspots (i.e. those that have a very high value compared to the other cells) to the slope of the Lorenz curve at point F = 1: the larger the slope, the smaller the number of dominant individuals in the statistical distribution. The natural way to identify the typical scale of the number of hotspots is to take the intersection point F* between the tangent of L(F) at point F = 1 and the horizontal axis L = 0 (see Figure 11). This method is inspired by the classical scale determination for an exponential decay: if the decay from F = 1 were an exponential of the form $\exp[-(1 - F)/a]$, where a is the typical scale we want to extract, this method would give 1 − F* = a. We note here that the average criterion corresponds to the point of the Lorenz curve with slope equal to 1. Indeed, the general expression of the Lorenz curve for the set of densities ρ(i, t) whose cumulative function is F(ρ) is $L(F) = \frac{1}{m} \int_0^F \rho(F')\, dF'$, where ρ(F) is the inverse function of the cumulative and m is the average density. The point of slope 1 thus satisfies $\left.\frac{dL}{dF}\right|_{F_{Avg}} = \rho(F_{Avg})/m = 1$, which gives m = ρ(FAvg) or, in other words, the hotspots will be those with densities larger than the average. In contrast, our more restrictive criterion based on the slope at F = 1 gives $F^* = 1 - m/\rho_M$, where ρM is the maximum value of ρ(i, t) (for a given time t). We thus see that in general FAvg < F* and that this new, more restrictive criterion does not only depend on the average value of the density but also on the dispersion: as ρM increases, the value of F* increases and therefore the number of detected hotspots decreases.
All other possible and reasonable methods will then give a value in the interval [FAvg, F*] between the average criterion and our criterion (also denoted by ‘LouBar’). Instead of choosing a particular point, we will thus study most of the properties computed for hotspots with the two methods, giving us both a lower and an upper bound. In particular, we will be able to test the robustness of our results against the arbitrariness of the hotspot identification method. Figure 12 shows the location of the hotspots selected according to the two criteria at different moments of the day in the metropolitan area of Barcelona. These maps can be regarded as the extremes of the hotspot maps that reasonable hotspot definition methods could produce (i.e. with a number of hotspots between those given by FAvg and F*).
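The two criteria can be sketched as follows. The sketch assumes that the tangent of the Lorenz curve at F = 1 has slope ρM/m and passes through (1, 1), so that it crosses the horizontal axis at F* = 1 − m/ρM; cells whose rank fraction exceeds F* are then kept as hotspots. Function names, cell labels and densities are hypothetical.

```python
def average_hotspots(densities):
    """'Average' criterion: cells whose density exceeds the mean density."""
    mean = sum(densities.values()) / len(densities)
    return {cell for cell, rho in densities.items() if rho > mean}


def loubar_hotspots(densities):
    """'LouBar' criterion (sketch): keep the cells ranked above F*, the point
    where the tangent of the Lorenz curve at F = 1 meets the axis L = 0."""
    n = len(densities)
    mean = sum(densities.values()) / n
    rho_max = max(densities.values())
    f_star = 1.0 - mean / rho_max
    ranked = sorted(densities, key=densities.get)  # increasing density
    return {cell for rank, cell in enumerate(ranked, start=1) if rank / n > f_star}


# Hypothetical skewed densities for 20 cells at a given hour.
values = [1] * 10 + [2] * 5 + [8, 10, 12, 20, 40]
densities = {f"c{k}": rho for k, rho in enumerate(values)}

lower_bound_set = average_hotspots(densities)  # weaker criterion
upper_bound_set = loubar_hotspots(densities)   # more restrictive criterion
```

In this skewed example the mean density is 5.5 and F* = 0.8625, so the ‘Average’ criterion keeps five cells while the ‘LouBar’ criterion keeps only the three densest ones. For weakly dispersed distributions the ordering of the two criteria is not guaranteed, which is why it is stated to hold only in general.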
Influence of the spatial scale of aggregation
In the hotspot identification process, the size of the grid cells on which we aggregate the numbers (or densities) of users is another arbitrary parameter. Since we do not want to determine this value separately for each city, we consider that several sizes should be tested for each city and that it is reasonable for this cell size a to vary from 500 metres to 2 kilometres. Figure 13 gives an idea of how much the proportion of hotspots changes from one cell size to another. The cell size a should primarily be chosen based on what is considered a reasonable size for an urban hotspot. From the pedestrian point of view, every size between 500 metres and 2 kilometres seems a priori acceptable. Below 500 m, it would clearly be necessary to aggregate contiguous hotspots: for example, for a = 100 m (10⁻² km² cells), two contiguous hotspots could not easily be distinguished as two different places from a pedestrian point of view. In contrast, a size of 2000 m can be considered as an upper bound for the same reasons: if two contiguous cells are classified as hotspots, it is reasonable to identify them as two distinct neighbourhoods. It is however a question of perception and should be discussed carefully. For a = 1000 m (1 km² cells), we chose to consider that two adjacent hotspots are two different hotspots. For reasonable grid sizes, the values of the indicators should be robust to a change of the cell size. We therefore tested the sensitivity of our results with respect to different resolutions.
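To illustrate how the cell size enters the analysis, the sketch below aggregates antenna counts on square grids of different sizes. It is a deliberate simplification and not the aggregation procedure of the study: each antenna is treated as a point and assigned entirely to the cell that contains it, and the coordinates and counts are hypothetical.

```python
import math
from collections import defaultdict


def aggregate_on_grid(antennas, cell_size_km):
    """Sum the users of all antennas falling in each square cell.
    antennas: iterable of (x_km, y_km, users)
    Returns a dict mapping (column, row) cell indices to user counts."""
    grid = defaultdict(float)
    for x, y, users in antennas:
        key = (math.floor(x / cell_size_km), math.floor(y / cell_size_km))
        grid[key] += users
    return dict(grid)


# Hypothetical antennas: position in km and number of users in a given hour.
antennas = [(0.2, 0.3, 10), (0.7, 0.4, 5), (1.2, 0.1, 8), (1.9, 1.8, 2)]

fine = aggregate_on_grid(antennas, cell_size_km=0.5)
medium = aggregate_on_grid(antennas, cell_size_km=1.0)
coarse = aggregate_on_grid(antennas, cell_size_km=2.0)
```

Running the hotspot detection on each of these grids and comparing the resulting indicators is one simple way to check their sensitivity to the resolution.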
Number of hotspots
In Figure 14 we show the scaling relation between the number of hotspots with the population and the effect of the grid size. Here we see that the scaling results and the value of the exponent are robust against a change in (i) the threshold used for identifying the hotspots and (ii) the size of the grid cells.
The Kendall rank coefficient is used as a test statistic to establish whether two lists of random variables may be regarded as statistically dependent. To each cell i we associate its rank ri(t) in the ordered density distribution at time t. Kendall's τ value indicates how much the hierarchy changed between t − 1 and t. For a set of pairs (i, j), it is equal to the difference between the number of converging pairs (i.e. ρi was larger (resp. smaller) than ρj at (t − 1) and is still larger (resp. smaller) at t) and the number of diverging pairs (ρi was smaller (resp. larger) than ρj at (t − 1) and is larger (resp. smaller) at t), normalized by the total number of pairs. The Kendall values τ(t) are plotted in Figure 15.
Under the null hypothesis of independence of two lists of n elements, the distribution of τ has an expected value of zero and, for larger samples, the variance is given by $\sigma_\tau^2 = \frac{2(2n+5)}{9n(n-1)}$. Any value of τ well above what this null distribution allows signals the existence of relevant correlations.
We show in Figure 15 the evolution of Kendall τ values calculated for the set of permanent hotspots during daytime in an average weekday, for 31 Spanish urban areas with more than 200,000 inhabitants. The curves are ordered by decreasing population size (the biggest city in the top left corner, the smallest in the bottom right). The red curves correspond to the daytime evolution of the Kendall τ for the hotspots selected with the more restrictive ‘LouBar’ criterion, the blue ones to the Kendall τ of the hotspots selected with the ‘Average’ criterion. These results indicate that the hierarchy of permanent hotspots is indeed very stable in time.
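A minimal sketch of this computation is given below. It assumes the standard normalized form of Kendall's τ (difference between converging and diverging pairs divided by the total number of pairs, with tied pairs counted as neither) and the standard large-sample null variance 2(2n + 5)/(9n(n − 1)); the cell labels and densities are hypothetical.

```python
from itertools import combinations


def kendall_tau(previous, current):
    """Kendall's tau between the densities of the same cells at t - 1 and t.
    Both arguments map cell identifiers to densities."""
    cells = list(previous)
    concordant = 0
    discordant = 0
    for i, j in combinations(cells, 2):
        product = (previous[i] - previous[j]) * (current[i] - current[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(cells) * (len(cells) - 1) // 2
    return (concordant - discordant) / n_pairs


def null_variance(n):
    """Large-sample variance of tau for two independent lists of n elements."""
    return 2 * (2 * n + 5) / (9 * n * (n - 1))


# Hypothetical densities of four permanent hotspots at two successive hours.
previous = {"a": 10, "b": 8, "c": 5, "d": 3}
unchanged = {"a": 12, "b": 9, "c": 6, "d": 2}   # same ranking
one_swap = {"a": 8, "b": 10, "c": 5, "d": 3}    # "a" and "b" exchange ranks

tau_unchanged = kendall_tau(previous, unchanged)
tau_one_swap = kendall_tau(previous, one_swap)
```

A perfectly preserved hierarchy gives τ = 1, while a single exchange among four hotspots lowers it to 4/6; values of this kind can then be compared with the spread expected under independence, given by the square root of `null_variance(n)`.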
The authors acknowledge funding from the EU commission through project EUNOIA (FP7-DG.Connect-318367).
About this article
Scientific Reports (2015)