## Introduction

Scientific studies of human location data have a long history, dating back to 1885, when empirical laws of migration were derived from population data1. From then until 2000, the collection of human location data was limited to questionnaires or population surveys over long time periods. In the twentieth century, the flow of cars was extensively analyzed because human flow could not yet be tracked2. The situation changed drastically at the beginning of this century with modern information technology3,4,5,6. For example, mobile phones with GPS provide detailed location information for an enormous number of people simultaneously. By analyzing such detailed observational data, the study of human mobility has become much more precise and intensive. Recent studies of human mobility can be roughly categorized into two groups: one focuses on the statistical properties of individual trajectories7,8,9, and the other on global migration between cities10,11. Microscopically, trajectories of human locations may look random, but they are in fact very different from the Brownian motion of fine particles. Human trajectories have been found to be approximated by the Lévy flight model, in which jump lengths follow a power law distribution12. Individual trajectories can be classified into several social activity classes13,14,15, and the predictability of each trajectory has been discussed16,17. Besides daily activities, researchers have also analyzed specific important phenomena such as panic behavior right after severe earthquakes18,19 and resilient features of traffic congestion20. Macroscopically, in the field of human migration, the so-called gravity law has been widely applied21,22,23. Besides the gravity model, the intervening opportunity class of models, where the flow amount is proportional to the opportunity at the destination and inversely proportional to the intervening opportunities between the origin and the destination, has been widely studied24,25,26,27,28.
Also, probabilistic human mobility prediction is widely performed for congestion and advertisement optimization29,30,31,32. Recently, the potential within big cities has been estimated using the vector field generated from the Origin-Destination matrix, which contains the number of people traveling between all pairs of spots33. However, collective human flow (as a vector field) at the scale of cities, which we call here the mesoscopic scale, has rarely been studied.

In this paper, we introduce and develop a framework to perform a mesoscopic analysis of collective human mobility within urban areas. We analyze GPS location data of mobile phones, which include velocity and location information, and observe the temporal evolution of collective flow patterns of human mobility within big cities. In our framework we treat human flow like water flow and observe temporal changes of drainage basin structures within and around large cities, applying the concept of power laws.

## Results

### Discretized human flow patterns in a city

In order to observe collective human flow patterns, we first divide the urban map into a square lattice with cells of size $$500\times 500\hbox { m}^2$$ and calculate, for each square, the mean velocity vector averaged over moving people (those with non-zero velocity) in each 30 min snapshot, as schematically shown in Fig. 1a-left. The 30 min time intervals are applied from 5:00 to 25:00 ($$=1$$ a.m. on the next day). Then, for each square we calculate the projected components of the mean velocity vector in the 4 directions {north, east, south, west} and choose the direction with the largest component as the discretized representative direction of the square, as shown in Fig. 1a-right. Figure 1c,d show examples of detailed flow pattern maps for a small western part of central Tokyo (the purple square in the wider map, Fig. 1b), in the morning (7:30–8:00) and afternoon (13:30–14:00) of a weekday, respectively. In Fig. 1c a typical morning rush-hour flow pattern is observed. Many arrows point to the right or to the bottom, toward the center of Tokyo, which is located at the lower-right corner of the map. The arrows of the squares that include railways are also seen to point predominantly towards the city center. In Fig. 1d a typical afternoon flow pattern is shown, with red arrows indicating flow directions that differ from the morning pattern in Fig. 1c. This suggests that the arrows in the afternoon do not point in any particular direction and are closer to random.
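As a minimal illustration of the discretization rule above, the following Python sketch picks the representative direction for one square. The function name `discretize_direction`, the convention that `vx` points east and `vy` points north, and the tie-breaking order are assumptions made for this example and are not specified in the text.

```python
def discretize_direction(vx, vy):
    """Return 'N', 'E', 'S' or 'W' for a mean velocity vector.

    Assumed convention (not from the paper): vx is the eastward
    component and vy is the northward component.
    """
    if vx == 0 and vy == 0:
        # No net motion: no representative direction is assigned.
        return None
    # Projection of the mean velocity onto each of the 4 directions.
    components = {"N": vy, "E": vx, "S": -vy, "W": -vx}
    # Largest projected component wins; ties fall back to the
    # dictionary order N, E, S, W (an arbitrary choice for this sketch).
    return max(components, key=components.get)


if __name__ == "__main__":
    print(discretize_direction(3.0, 1.0))    # mostly eastward
    print(discretize_direction(-0.5, -2.0))  # mostly southward
```

Applying this function to every square of one 30 min snapshot would give an arrow map of the kind shown in Fig. 1c,d.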

### Drainage basin structures and the population of moving people

To characterize the flow patterns quantitatively, we introduce and develop the concept of a drainage basin, which has been useful in the study of river flow patterns34. We define basins for a discretized flow pattern in the following way. As shown in the upper part of Fig. 2a, we simply consider that the people in a square flow mainly into the neighboring square in the direction of the arrow, and we regard these two squares as belonging to the same drainage basin. By applying this rule to all squares we can uniquely define drainage basin clusters, as schematically shown in the lower part of Fig. 2a. The clusters are shown in different colors, where the darkness is proportional to the number of upstream squares, implying that more people move in the darker areas. In Fig. 2b,c, the top 15 basins of the Tokyo area are shown for the morning rush hour and the afternoon (see Fig. S5 in Supplementary Information for evening flow patterns). In the morning flow pattern (Fig. 2b), we can see several huge drainage basins pointing towards the center of Tokyo. In the afternoon pattern (Fig. 2c), the sizes of the basins are significantly smaller and there seems to be no specific flow direction.
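The basin rule described above (a square and the neighbor its arrow points to belong to the same basin) can be sketched with a union-find structure. This is an illustrative sketch only: the name `find_basins`, the `(i, j)` indexing with `i` increasing eastward and `j` increasing northward, and the choice to ignore target squares that have no arrow are assumptions of this example.

```python
# Lattice step for each discretized direction (assumed axis convention).
STEP = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}


def find_basins(arrows):
    """Group squares into drainage basins.

    arrows: dict mapping (i, j) -> 'N' | 'E' | 'S' | 'W'.
    Returns a list of sets of squares, largest basin first.
    """
    parent = {cell: cell for cell in arrows}

    def find(cell):
        # Follow parent links to the root, compressing the path.
        while parent[cell] != cell:
            parent[cell] = parent[parent[cell]]
            cell = parent[cell]
        return cell

    for (i, j), direction in arrows.items():
        di, dj = STEP[direction]
        target = (i + di, j + dj)
        # Assumption: squares without an arrow are not merged.
        if target in parent:
            root_a, root_b = find((i, j)), find(target)
            if root_a != root_b:
                parent[root_a] = root_b

    basins = {}
    for cell in arrows:
        basins.setdefault(find(cell), set()).add(cell)
    return sorted(basins.values(), key=len, reverse=True)


if __name__ == "__main__":
    demo = {(0, 0): "E", (1, 0): "E", (2, 0): "N", (2, 1): "S", (5, 5): "W"}
    for basin in find_basins(demo):
        print(len(basin), sorted(basin))
```

With the list sorted by size, the largest basins (for example the top 15) can be taken by slicing the result.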

We now consider the large drainage basin areas and test their similarity across different months. We first prepare 12 monthly averaged flow patterns for the year 2015 in the morning and the afternoon, just like Fig. 2b,c. Next we calculate the Jaccard index, which is defined as the ratio of the overlapping area to the area of the union, where the area is that of the top 15 drainage basins for each monthly flow pattern. Figure 2d shows that the values of the Jaccard index between the morning basin patterns (blue) are much larger than those between the afternoon basin patterns (red), implying that the morning basin patterns are similar throughout the year, while the afternoon basin patterns change monthly. We also apply the same analysis to artificially generated flow patterns in which the arrows are randomly shuffled (dashed line), and find that the values are close to the results for the afternoon. This random nature of the afternoon patterns is confirmed more directly by observing the mean velocity correlation between squares at distance r, which decays to 0 at small r (see Supplementary Information, Fig. S4). We apply the above analysis to 8 other large cities in Japan and confirm that the above properties are very similar.
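For concreteness, the Jaccard index between two monthly patterns can be computed on sets of lattice squares, as in the sketch below. The representation of each month's top basins as a set of `(i, j)` squares and the function name `jaccard_index` are assumptions of this example.

```python
def jaccard_index(cells_a, cells_b):
    """Overlap area divided by union area for two sets of squares."""
    a, b = set(cells_a), set(cells_b)
    union = a | b
    if not union:
        # Two empty patterns: define the index as 0 for this sketch.
        return 0.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    january = {(0, 0), (0, 1), (1, 1)}
    february = {(0, 1), (1, 1), (2, 2)}
    print(jaccard_index(january, february))
```

Because all squares have the same area, counting squares is equivalent to comparing areas.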

Next we study the basin size distribution for each of the 9 largest cities during the morning rush hour (Fig. 2e). Here, the y-axis shows the cumulative distribution, that is, the probability that a randomly chosen basin’s size is larger than the value on the x-axis, which is normalized by the mean basin size of each distribution. Applying the Kolmogorov-Smirnov test, we find that the distributions are well approximated by power laws with best estimated exponent values around $$2.4 \pm 0.2$$ (see the “Materials and methods”)35. The basin size distributions for the afternoon are shown in Fig. 2f in a semi-log plot. We find that the basin size distributions in the afternoon are nearly linear in this plot for all cities, implying that the size distributions can be roughly approximated by exponential functions. The basin size distributions for randomly shuffled flow patterns also follow exponential-like distributions with a similar slope.
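The cumulative distribution described above can be computed as in the following sketch, which normalizes basin sizes by their mean and returns, for each distinct value, the fraction of basins that are strictly larger. The function name `normalized_ccdf` and the strict inequality for ties are assumptions of this example.

```python
from bisect import bisect_right


def normalized_ccdf(sizes):
    """Return [(x, P(X > x))] for basin sizes divided by their mean."""
    mean = sum(sizes) / len(sizes)
    xs = sorted(s / mean for s in sizes)
    n = len(xs)
    # bisect_right gives the number of values <= x in the sorted list,
    # so n minus that count is the number of values strictly above x.
    return [(x, (n - bisect_right(xs, x)) / n) for x in sorted(set(xs))]


if __name__ == "__main__":
    for x, prob in normalized_ccdf([1, 1, 2, 4]):
        print(x, prob)
```

Plotting these pairs on log-log axes corresponds to the presentation of Fig. 2e, and on semi-log axes to that of Fig. 2f.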

Next, we focus on the population of moving people in each drainage basin in the morning rush hour. In Fig. 2g, we find the CDFs are approximated by power laws by applying the Kolmogorov-Smirnov test with the best estimated exponent values around $$1.2 \pm 0.2$$ (see the “Materials and methods”)35. This difference between the power law exponents of the distributions for drainage basin area S, and the population of moving people p is surprising, since it means that population is not proportional to the drainage area. We suggest here that indeed these two quantities fulfill a non-trivial nonlinear relation as:

\begin{aligned} p \propto S^{2}, \end{aligned}
(1)

which is supported by Fig. 3a. Note that the exponent in this formula has an uncertainty of about $$\pm 0.2$$, obtained by error propagation from the power law exponents of the distributions for drainage basin area and the population of moving people. To further test this surprising nonlinear scaling, we plot in Fig. 2h the population of moving people in each basin as a function of the basin diameter L, which is defined as the maximum distance between two points in the basin (see Fig. 3c), and find a novel cubic law:

\begin{aligned} p \propto L^{3}, \end{aligned}
(2)

which contradicts the natural intuition of $$p \propto L^{2}$$. This result suggests that human flow in cities does not simply gather people uniformly into the drain, as in the case of water flow; instead, the flow intensity is enhanced as if in an extra dimension ($$L^{3}$$, 3 dimensions) as the basin grows, causing extremely high density. From Eqs. (1) and (2) another non-trivial scaling relation is expected,

\begin{aligned} S \propto L^{1.5}, \end{aligned}
(3)

which means that the geometry of the basins, i.e., the area of main traffic, is characterized by a fractal structure with dimension 1.5.
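The step from the two distribution exponents to Eq. (1), and from Eqs. (1) and (2) to Eq. (3), can be written out explicitly. The sketch below assumes an idealized deterministic, monotonic relation $$p \propto S^{\beta }$$ between basin population and basin area; the symbols a, b and $$\beta$$ are introduced here only for illustration.

```latex
\begin{aligned}
P(>S) &\propto S^{-a}, \quad a \approx 2.4, \\
P(>p) &\propto p^{-b}, \quad b \approx 1.2, \\
p \propto S^{\beta } \;&\Rightarrow \; P(>p) = P\bigl(>S(p)\bigr) \propto p^{-a/\beta }
\;\Rightarrow \; \beta = a/b \approx 2, \\
p \propto S^{2},\; p \propto L^{3} \;&\Rightarrow \; S \propto L^{3/2}.
\end{aligned}
```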

### The fractal structure of drainage basins

Figure 2 suggests that the population of moving people in a drainage basin is proportional to the square of the basin size (Eq. 1), and that the drainage basin has a fractal structure (Eq. 3). We directly support these relations in Fig. 3a, which shows the relation between the population in a basin and the basin size. Also, the number of moving people is given by:

\begin{aligned} p_{b} \propto S_{b}^{2.0} \propto {\overline{\rho }}_{b}(S_{b})S_{b}, \end{aligned}
(4)

where $${\overline{\rho }}_{b}(S_{b})$$ is the averaged population density of moving people in the b-th basin of size $$S_{b}$$. Here, the density $${\overline{\rho }}_{b}(S_{b})$$ is defined as the total number of moving people in the basin $$p_{b}(S_{b})$$ divided by its area size $$S_{b}$$. As seen in Fig. 3b, $${\overline{\rho }}_{b}(S_{b})$$ and $$S_{b}$$ have a linear relation. To better understand this unexpected scaling relation, we define a diameter (see Fig. 3c) and distance (see Fig. 3d) for each basin. The non-trivial 3-dimensional relation, Eq. (2), is equivalent to the following relation for each b-th basin:

\begin{aligned} p_{b}(L_{b}) \propto L_{b}^{3.0}, \end{aligned}
(5)

where $$L_{b}$$ is the diameter of the b-th basin. In Fig. 3e, the basin size is found to scale with a power 1.5 of the diameter $$L_{b}$$. Thus, we identify a new scaling relation:

\begin{aligned} S_{b}(L_{b}) \propto L_{b}^{1.5}. \end{aligned}
(6)

This implies that the fractal dimension of the drainage basins is $$D=1.5$$. Since the population density is proportional to the basin size, the population density in basins is given by:

\begin{aligned} {\overline{\rho }}_{b}(L_{b}) \propto L_{b}^{1.5}, \end{aligned}
(7)

which is derived from $$p_{b}(L_{b})$$ divided by $$S_{b}$$.

To better understand the above finding, we assume that the population density of moving people in a basin, $$\rho _{b}(r)$$, is characterized by the distance r from the most densely populated square (which we call the center of the drainage basin). The total number of moving people in a basin, $$p_{b}(S_{b})$$, is then given as:

\begin{aligned} p_{b} = \int ^{L_{b}}_{1} \rho _{b}(r) \Delta S_{b}(r)dr, \end{aligned}
(8)

where $$\Delta S_{b}(r)dr$$ is the area of the drainage basin at distances between r and $$r+dr$$ from the center, so that the area $$S_{b}$$ is given as:

\begin{aligned} S_{b}=\int ^{L_{b}}_{1} \Delta S_{b}(r)dr. \end{aligned}
(9)

From the fractal property, $$S_{b}(L_{b}) \propto L_{b}^{1.5}$$, we expect $$\Delta S_{b}(r) \propto r^{0.5}$$. We also assume the following power law functional form for the population density of moving people:

\begin{aligned} \rho _{b}(r) = \rho _{b,max}(L_{b})r^{-\alpha }, \end{aligned}
(10)

where $$\rho _{b,max}(L_{b})$$ is the maximum value of the population density of moving people in a drainage basin of diameter $$L_{b}$$. In Fig. 3f, we find that $$\rho _{b,max}(L_{b})$$ follows the power law:

\begin{aligned} \rho _{b,max}(L_{b}) \propto L_{b}^{2.0}. \end{aligned}
(11)

Fig. 3g indicates that the population density of moving people decreases with the distance from the most densely populated square in each basin as $$\rho _{b}(r) \propto r^{-0.5}$$, i.e., $$\alpha = 0.5$$. Therefore, the population density of moving people is given as:

\begin{aligned} \rho _{b}(L_{b},r) = \rho _{b,max}(L_{b})r^{-\alpha } \propto L_{b}^{2.0}r^{-0.5}. \end{aligned}
(12)

Finally, the population of moving people in a basin is calculated as:

\begin{aligned}
p_{b}(S_{b}) &= \int ^{L_{b}}_{1} \rho _{b}(r) \Delta S_{b}(r)dr \\
&\propto \int ^{L_{b}}_{1} L_{b}^{2.0} r^{-0.5} r^{0.5} dr \\
&\propto L_{b}^{2.0} L_{b}^{1.0} \\
&\propto L_{b}^{3.0} \\
&\propto S_{b}^{2.0}.
\end{aligned}
(13)
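The integral in Eq. (13) can be checked numerically. The sketch below evaluates it with a simple midpoint rule and estimates the local scaling exponent from two diameters. The function names, the default parameter names and the number of integration steps are choices made for this example, and r is taken in the same units as in Eq. (8), with lower limit 1.

```python
import math


def basin_population(L, alpha=0.5, rho_exp=2.0, area_exp=0.5, steps=10000):
    """Midpoint-rule value of L^rho_exp * integral_1^L r^-alpha r^area_exp dr."""
    h = (L - 1.0) / steps
    total = 0.0
    for k in range(steps):
        r = 1.0 + (k + 0.5) * h
        # Density decay r^-alpha times area element r^area_exp.
        total += r ** (-alpha) * r ** area_exp
    return L ** rho_exp * total * h


def local_exponent(L, factor=2.0):
    """Slope of log p versus log L between L and factor * L."""
    ratio = basin_population(factor * L) / basin_population(L)
    return math.log(ratio) / math.log(factor)


if __name__ == "__main__":
    for L in (10.0, 100.0, 1000.0):
        print(L, local_exponent(L))
```

With the exponents used above the integrand is constant, so the integral equals $$L_{b}-1$$ and the estimated exponent approaches 3 as $$L_{b}$$ grows, which is the sense in which the third line of Eq. (13) holds.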

### Other scaling relations characterizing the city traffic

To further and independently support these unexpected scaling relations, we analyze the office floor area and daytime worker population using two official government census data sets36,37, which include the worker population and the office floor area for the 23 wards of Tokyo. We regard the Imperial Palace as the city center of Tokyo, and the distance r for each ward is defined as the straight-line distance from the city center to the ward office. In Fig. 4a, we show the floor area as a function of the distance from the city center to quantify the effect of skyscrapers. The gross floor area of offices and shops in the ward is found to be inversely proportional to the $$0.8\pm {0.3}$$ power of the distance r. Also, in Fig. 4b, the density of the floor area in the ward $$f_{A}$$, which is defined as the gross floor area divided by the whole area of the ward, scales as the $$-1.4\pm {0.3}$$ power of the distance r. Next, we show in Fig. 4c,d that the daytime worker population in the city, P, and the daytime worker population density in the city, $$\rho _{A}$$, which is defined as the daytime worker population divided by the whole area of the ward, show scaling relations quite similar to those of the office floor area and the density of floor area in the ward, respectively. That is, the office floor area per person $$\rho _{F}$$ is roughly constant for any ward, as naturally expected.

As seen from these results, the city structure in terms of human capacity is quite non-uniform, and the density of working people tends to increase towards the city center. These relations describe cumulative quantities for the whole city, which can be regarded as a potential driving force of the strong human flow from the suburbs to the city center. The typical non-trivial scaling relation, Eq. (2), is a snapshot property of the resulting strongly non-uniform human flow observed in each drainage basin at the most congested morning rush hour; see also the further discussion around Eqs. (4) to (13) and Fig. 3.

## Discussion

In this paper we analyzed GPS data of the location and velocity of more than 200,000 users in Japan. As shown in Supplementary Information 1.1 and 1.2, individual users’ location changes have been observed from early morning to midnight using this data. Here, we did not pay attention to such individual traces but focused on the collective motion of people around big cities. In order to characterize macroscopic human flow patterns, we introduced a coarse-graining method, explained in detail in “Velocity discretization” in the “Materials and methods” section, and we naturally defined drainage basins as schematically shown in Fig. 2a. This velocity discretization procedure is a rough simplification; however, we believe that the characteristics of macroscopic flow patterns are captured with this method.

The basin area distributions in the afternoon can be approximated by exponential functions, which is consistent with the assumption that during the afternoon most of the moving directions are uncorrelated, indicating that people move independently. On the other hand, in the morning rush hour strong human flows toward the city center appear, causing huge drainage basins. The cumulative basin size distributions at the peak rush hour are approximated by a power law with a non-trivial exponent of about 2.4, which is approximately the same for all 9 big cities.

Relating to this morning rush hour property, we found additional scaling relations characterized by power laws such as the non-trivial three dimensional relation, Eq. (2), that is, the population of moving people in a drainage basin of diameter L is proportional to $$L^{3}$$. As discussed in the section, “The fractal structure of drainage basins”, this cubic law is shown to be consistent with the fractal geometry of basin structure with the fractal dimension 1.5 (Fig. 3e), and the power law decay of population density in each basin, $$r^{-0.5}$$ (Fig. 3g).

This power law decay of population density is expected to be deeply related to the structure of cities: skyscrapers with huge human capacity are located near the city center, and many train lines also converge towards the city center. Figure 4 shows examples of non-trivial scaling relations for daytime population and office floor area as a function of the distance from the city center, confirmed for Tokyo. We expect that similar relations hold for any city.

We conjecture that our new view of macroscopic human flow patterns in metropolitan areas is applicable to all cities around the globe and reveals universal flow patterns within urban areas.

## Materials and methods

### The data

Our GPS data is provided by the Japanese private company Agoop, which operates smartphone applications. The GPS data consists of the user ID, date, time, longitude, latitude, velocity in longitude and velocity in latitude, where the velocities are estimated from the Doppler effect on the frequency of the electromagnetic waves. The number of users is about 260,000 in Japan, and for each user the GPS data is collected every day except from 1 a.m. to 5 a.m., at intervals of about 30 min. For protection of privacy, the user IDs are randomized every day. The observation period is the whole year of 2015. The total data size is about 1 TB. We applied data trimming to the original data set in advance (see Fig. S3 in Supplementary Information).

### Velocity discretization

We divide the map into a square lattice with cells of size $$500 \times 500 \,{\hbox {m}^2}$$ (based on Japanese Industrial Standards) and calculate the mean velocity in each square by averaging the velocities of individuals with non-zero speed within the square in a time interval of 30 min. Since the update interval of the GPS position data is not constant but depends on the user, we introduce a weighted average for the case in which signals are transmitted more than 2 times in an interval of 30 min. First, using data identified by user ID and time, we calculate $$n_{{ID,T,k}}$$, which is the transmission frequency of the user with ID number ID during the k-th time interval of 30 min on date T. We define $$n_{{ID,T,i,j,k}}$$ as the transmission frequency in the square located at the i-th position in the east-west direction and the j-th position in the south-north direction (see Fig. 5).

The existence probability that a user with a given ID is in the square located at (i, j) during the k-th time interval on the T-th day is defined as:

\begin{aligned} w_{{ID,T,i,j,k}}=\dfrac{n_{{ID,T,i,j,k}}}{n_{{ID,T,k}}}. \end{aligned}
(14)

Also, the population in the square (i, j) at the k-th time interval is given as:

\begin{aligned} {p}_{{T,i,j,k}}= & {} \sum _{ID}{w_{{ID,T,i,j,k}}}, \end{aligned}
(15)

where the summation is taken over all user IDs. Next, we define each ID’s average velocity in the square (i, j) during the k-th time interval:

\begin{aligned} \overline{\varvec{v}}_{{ID,T,i,j,k}}=\dfrac{\sum _{(i,j,k)\ \in A_{i,j,k}}\varvec{v}_{{ID,t}}}{n_{{ID,T,i,j,k}}}, \end{aligned}
(16)

where $$\varvec{v}_{{ID,t}}$$ is the value uniquely determined from the ID and time t, and we do not define this value in the case that the denominator is zero. The averaged velocity of the square (i, j) in the time interval k is defined as follows:

\begin{aligned} \overline{\varvec{v}}_{{T,i,j,k}}= & {} \dfrac{\sum _{ID} w_{{ID,T,i,j,k}}\overline{\varvec{v}}_{{ID,T,i,j,k}} }{{p}_{{T,i,j,k}}} \quad \hbox {if}\ \ {p}_{{T,i,j,k}}\ne 0, \nonumber \\= & {} \hbox {No Data} \quad \hbox {if}\ \ {p}_{{T,i,j,k}}=0. \end{aligned}
(17)
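Eqs. (14) to (17) can be sketched in code as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `cell_mean_velocity` and the record layout `(user_id, i, j, vx, vy)` are hypothetical, and the input is assumed to be already restricted to one date T, one 30 min interval k, and observations with non-zero speed.

```python
from collections import defaultdict


def cell_mean_velocity(records):
    """Weighted mean velocity and population per square.

    records: iterable of (user_id, i, j, vx, vy) for a single date and
    a single 30 min interval (hypothetical layout for this sketch).
    Returns {(i, j): (mean_vx, mean_vy, population)}; squares without
    any record are absent, which corresponds to "No Data" in Eq. (17).
    """
    n_user = defaultdict(int)                  # n_{ID,T,k}
    n_user_cell = defaultdict(int)             # n_{ID,T,i,j,k}
    v_sum = defaultdict(lambda: [0.0, 0.0])    # sum of velocities per (ID, square)
    for user, i, j, vx, vy in records:
        n_user[user] += 1
        n_user_cell[(user, i, j)] += 1
        v_sum[(user, i, j)][0] += vx
        v_sum[(user, i, j)][1] += vy

    population = defaultdict(float)            # p_{T,i,j,k}, Eq. (15)
    weighted = defaultdict(lambda: [0.0, 0.0])
    for (user, i, j), n_cell in n_user_cell.items():
        w = n_cell / n_user[user]              # existence probability, Eq. (14)
        vbar_x = v_sum[(user, i, j)][0] / n_cell   # per-user mean, Eq. (16)
        vbar_y = v_sum[(user, i, j)][1] / n_cell
        population[(i, j)] += w
        weighted[(i, j)][0] += w * vbar_x
        weighted[(i, j)][1] += w * vbar_y

    # Eq. (17): weighted average over users, defined only where p > 0.
    return {
        cell: (weighted[cell][0] / p, weighted[cell][1] / p, p)
        for cell, p in population.items()
    }


if __name__ == "__main__":
    demo = [
        ("a", 0, 0, 2.0, 0.0),
        ("a", 0, 0, 4.0, 0.0),
        ("a", 1, 0, 1.0, 1.0),
        ("b", 0, 0, 0.0, 3.0),
    ]
    print(cell_mean_velocity(demo))
```

In the demo, user "a" contributes a weight of 2/3 to square (0, 0) and 1/3 to square (1, 0), so a user who transmits several times in one interval still counts as one person in total.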

### Kolmogorov–Smirnov (KS) test for the power law distribution hypothesis

We conduct a statistical test according to the following procedure20,35.

1. We define the null hypothesis $$\hbox {H}_{0}$$ that the data follow a power law, and the alternative hypothesis $$\hbox {H}_{1}$$ that the data do not follow a power law.

2. The power law exponent in the following equation is estimated by the maximum likelihood estimation method. The estimated probability density function f(x) is

\begin{aligned} f(x)=Cx^{-(\alpha +1)},\,\,\, x\in [x_{min},\infty ), \end{aligned}
(18)

where the normalization constant is $$C=\dfrac{\alpha }{x_{min}^{-\alpha }}$$. The log-likelihood function is defined as:

\begin{aligned} \ln {L(\alpha )}=n\ln {\alpha }-n\ln {x_{min}}-(\alpha +1)\sum _{i=1}^{n} \ln {\frac{x_{i}}{x_{min}}}, \end{aligned}
(19)

where n indicates the number of data values used for the maximum likelihood estimation. Differentiating the above equation with respect to $$\alpha$$ and setting the derivative to zero, the estimated $$\alpha$$ is given by:

\begin{aligned} \alpha =n\left[ \sum _{i=1}^{n}\ln {\frac{x_{i}}{x_{min}}}\right] ^{-1}. \end{aligned}
(20)

To estimate the distribution, $$x_{min}$$ must also be determined. The difference between the data and the estimated distribution is given by:

\begin{aligned} D_{x_{min}}=\sup |F_{data}(x)-F_{model}(x)|, \end{aligned}
(21)

where $$F_{data}(x)$$ and $$F_{model}(x)$$ are the cumulative distribution functions of the real data ($$x\in [x_{min},\infty )$$) and the estimated model (exponent $$\alpha$$), respectively. The value of $$x_{min}$$ that minimizes $$D_{x_{min}}$$ is taken as the optimal one.

3. The KS statistic D is defined as:

\begin{aligned} D=\sup |F_{data}(x)-F_{model}(x)|. \end{aligned}
(22)

That is, D is the maximum over x of the absolute difference between the model and the data.

4. Next, 10,000 synthetic data sets, each composed of n random numbers obeying the power law with exponent $$\alpha$$, are created. The KS statistic $$D^{*}$$ for each synthetic data set, with cumulative distribution $$F_{syn}(x)$$, is given as:

\begin{aligned} D^{*}=\sup |F_{syn}(x)-F_{model}(x)|. \end{aligned}
(23)

We count the number of synthetic data sets that fulfill $$D<D^{*}$$, and the p-value is defined as this number divided by the total number of synthetic data sets. As summarized in Tables 1 and 2, the basin size distributions and the distributions of moving people in basins can be regarded as power laws for all 9 cities.
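The four steps above can be sketched in plain Python as follows. This is a minimal illustration and not the authors' code: all function names are hypothetical, the candidate values for $$x_{min}$$ must be supplied by the caller, and the default number of synthetic data sets is reduced from 10,000 to keep the example fast. As in step 4, the synthetic data are compared with the model of exponent $$\alpha$$ without re-fitting.

```python
import math
import random


def fit_alpha(data, x_min):
    """Maximum likelihood estimate of alpha, Eq. (20), on x >= x_min."""
    tail = [x for x in data if x >= x_min]
    # Raises ZeroDivisionError if every tail value equals x_min.
    return len(tail) / sum(math.log(x / x_min) for x in tail)


def ks_distance(data, x_min, alpha):
    """sup |F_data - F_model| over the tail x >= x_min, Eqs. (21)-(22)."""
    tail = sorted(x for x in data if x >= x_min)
    n = len(tail)
    d = 0.0
    for k, x in enumerate(tail):
        model = 1.0 - (x / x_min) ** (-alpha)
        # The empirical CDF jumps from k/n to (k+1)/n at x.
        d = max(d, abs((k + 1) / n - model), abs(k / n - model))
    return d


def choose_x_min(data, candidates):
    """Pick the candidate x_min with the smallest KS distance (step 2)."""
    return min(candidates,
               key=lambda xm: ks_distance(data, xm, fit_alpha(data, xm)))


def sample_power_law(n, x_min, alpha, rng):
    """Inverse-transform samples with P(X > x) = (x / x_min)^(-alpha)."""
    # 1 - rng.random() lies in (0, 1], which avoids a zero base.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n)]


def power_law_p_value(data, x_min, n_synthetic=1000, seed=0):
    """Return (alpha, D, p-value) following steps 2 to 4."""
    rng = random.Random(seed)
    alpha = fit_alpha(data, x_min)
    d_obs = ks_distance(data, x_min, alpha)
    n = sum(1 for x in data if x >= x_min)
    exceed = 0
    for _ in range(n_synthetic):
        synthetic = sample_power_law(n, x_min, alpha, rng)
        if ks_distance(synthetic, x_min, alpha) > d_obs:
            exceed += 1
    return alpha, d_obs, exceed / n_synthetic


if __name__ == "__main__":
    demo = sample_power_law(500, 1.0, 1.5, random.Random(1))
    print(power_law_p_value(demo, 1.0, n_synthetic=200))
```

A large p-value means that many synthetic power law samples deviate from the model more than the real data do, so the power law hypothesis is not rejected.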