Position tracking using cellular phones can provide fine-grained traveling data between and within cities on hourly and daily scales, giving us a feasible way to explore human mobility. However, such fine-grained data are traditionally owned by private companies and is extremely rare to be publicly available even for one city. Here, we present, to the best of our knowledge, the largest inter-city movement dataset using cellular phone logs. Specifically, our data set captures 3-million cellular devices and includes 70 million movements. These movements are measured at hourly intervals and span a week-long duration. Our measurements are from the southeast Sangliao Basin, Northeast China, which span three cities and one country with a collective population of 8 million people. The dynamic, weighted and directed mobility network of inter-urban divisions is released in simple formats, as well as divisions’ GPS coordinates to motivate studies of human interactions within and between cities.
|Design Type(s)||time series design • source-based data analysis objective • behavioral data analysis objective|
|Measurement Type(s)||movement quality|
|Technology Type(s)||digital curation|
|Factor Type(s)||geographic location • temporal_interval|
|Sample Characteristic(s)||Homo sapiens • China • populated place|
Machine-accessible metadata file describing the reported data (ISA-Tab format)
Background & Summary
Popular use of cellular phones enables measurements of large-scale human mobility traces, which have become readily available and served as proxy for human mobility. The underlying interactions of meta-populations within and between cities have been extensively studied both in applied work (e.g., inter-urban mobility1, urban activities2, urban evolution3, heterogeneous responses during extreme events4), and epidemiology studies of mobility networks5,6.
To study human movements, especially among cities, the analytic framework of mobility networks provides a useful way to characterize interactions among people in different sites. Although transportation and interaction patterns between locations change at hourly and daily scales, many studies of human mobility assume they are static7,8,9, neglecting the nature of mobility dynamics. This is, arguably, due to the lack of fine-grained public datasets that could describe the mobility dynamics between cities. There are some open access datasets covering small geographical locations taking into account the time ordering of interactions, such as networks of wifi hotspots within a city10 and networks of students in a university campus11. However, fine-grained movement datasets covering large geographical regions including multiple cities with large populations are still missing from the open-access datasets.
In this paper, we curate and amass a fine-grained dataset of mobility to study inter-urban interactions. We capture cellular position tracking of millions mobile phone users from an open-data program in Changchun city. Each location in our dataset represents a group of cellular stations in an official administrative division.
We assume that individual stays at a location if her location is the same at least for half an hour in an hour time interval. Directed movement of each individual from a source location O to a destination location D denotes a change of location for the corresponding individual. We record the time of the directed movement as the time of arriving D in our dataset. The overall directed mobility network of locations is finally compiled by sequentially processing the directed movements for all individuals. In the network, a node represents a location. A weighted edge represents the total number of users’ movements between a pair of locations in each hour.
The dataset contains movements of near 3-million anonymized cellular phone users among 167 divisions (henceforth locations), covering 4 geographically adjacent areas (Changchun City, Dehui City, Yushu City, and Nong’an County) for a one-week period starting on August 7, 2017. This total geographic area, located in the southeast Songliao Basin in the center of the Northeast China Plain, Northeast China, covers more than 20 square kilometers and, in 2017, had a population of nearly 8 million.
To facilitate the use of the open data, we process the above raw dataset to extract a dynamic and directed mobility network of locations. We make these networks available through files in CSV file format, separated by commas. There are 2 files released in 2 folders. The first file denotes the mobility network with four columns ordered by origin location, destination location, their weight and time. For spatial analysis applications, we also provide a geospatial file denoting the GPS information for each location, containing three columns ordered by location associated with its latitude and longitude.
Although this described dataset is a major step towards enabling research about human mobility, it has several limitations. First, despite the fact that the dataset covers a cohort of millions of movements, it is only for a one-week period in summer time. Depending on the application, longer periods of time intervals might be needed. Second, we define a user has movement only when s/he stays in a new location at least half an hour. This may also induce bias as it ignores quick movements. Third, the individual’s destination position is the last known recorded location of the individual. This recording might cause bias. The individual might actually already be in D during the whole period of t and t − 1. Fourth, the individual’s original position might have been unrecorded at an earlier time (e.g., an hour or a day) than her/his recorded arrival time t, since it depends on the last time that the user used her/his phone. We caveat the researchers to be careful about their conclusions when using these data.
Original data sources
Our data consist of location records of millions of anonymized cellular phone users for one week starting from August 7, 2017. These locations include 4 geographically neighboring areas (i.e., Changchun City, Dehui City, Yushu City, and Nong’an County). A cellular phone is assumed to be located at the location of the closest cellular base station that it interacts through sending or receiving signals. In the raw movement data each base station is a unique unit. Note that a set of cellular base stations can serve a metapopulation to provide services together.
There are over 12,000 cellular stations with their exact GPS location information. Using the input of GPS positions associated with cellular stations, we can get their official administrative division codes in 2017 version using the Amap APIs (https://lbs.amap.com/api/webservice/guide/api/georegeo), as well as the GPS information of each administrative division. In total, these cellular stations located in 167 divisions, with 100 in Changchun, 27 in Yushu, 18 in Dehui, and 22 in Nong’an. Each division includes 72 stations on average with a standard deviation of 65 stations. We group together a set of base stations as one location if they are within the same division.
There are nearly 3-million phone users in this study. Most of these users are active with enough credits left in their accounts. The accounts with no credit stop receiving signals automatically in a few days by the company’s system. For each user, we aggregate the corresponding location records into hourly movements. Specifically, we assume an individual stays in a location at least half an hour to be considered in that particular location. If a user is spending less than 30 minutes in a location, we assume s/he does not visit the corresponding location during the corresponding hour. Some trips may have large time intervals perhaps due to phones being out of battery power. As such, we do not consider trips whose duration more than 12 hours (less than 0.3% of the total trips). And accordingly, each individual movement from an original location O to a destination location D at time t denotes that an individual moves from O and arrive D at time t. Each location is identified by an anonymized identification code. The demographic information associated with these divisions can be found in the coming official 2018 statistical yearbook (http://cyfd.cnki.com.cn/N2018050240.htm). The telecommunications operator kindly agreed to grant us the rights of sharing these anonymized movement traces and licensing this derived dataset in the framework of the mobility network as Open Data under Attribution 4.0 International (CC BY 4.0) license. We release the raw files of hourly mobility networks in the figshare website12, as well as initial matlab code (Supplemental File 1) used in this paper.
Defining the mobility network
Considering each place (a city or a country) as multiple metapopulations in different locations, we construct the directed mobility network for each hour of the week. Each location is represented as a node in our network. Edges are directed, connecting nodes where users move from origins to destinations and weighted by the total number of users’ movements in each hour-location scenario. An individual directed movement from location i to location j at time t denotes that in a user’s movement, location j emerges after the previous location i at time t.
This dataset is released by 2 comma-separated values (CSV) files, each in a folder, including more than 70-million movements12. The first file includes the hourly mobility network with four columns ordered by origin location, destination location, edge weight, and arriving time. The weight is the number of movements per hour between the origin location and destination location. The second file includes the GPS information for each location, containing three columns ordered by location associated with its latitude and longitude.
Finally, two folders are used to to group these files12. The first folder (Week-Mobility-Network) includes (Mobility.txt), the file of the hourly-mobility network for the entire week. The second folder (GPS-Location) includes (GPS.txt) the file of latitude and longitude information for each location in the mobility network.
Mobility.txt In the mobility network, each row represents the total number of hourly movements by people from locations i to j in the corresponding day. There are four columns ordered by origin location, destination location, their weight, and time. The format for this file is the following.
Origin: numerical administrative division code for each origin location;
Destination: numerical identification for each destination location;
Weight: total number of movements between an origin location and a destination location in the corresponding hour ####-##-##T##+08, following ISO 8601 format (YYYY-MM-DDTHH+08);
Time: hourly time of arriving at this destination location in ISO 8601 format (YYYY-MM-DDTHH+08), denoting that W movements end in the range from T##-1 to T## in ####-##-##. For example, “2017-08-07T09+08’ denotes there are W movements arriving destination D between 8:00 to 9:00 on 2017-08-07 in the time zone of UTC+8.
GPS.txt The GPS information for each location. The format of this file is organized as three columns ordered by location identifier and the corresponding latitude and longitude information.
Location: numerical administrative division code for each location;
Latitude: numerical values for the latitude of the corresponding location;
Longitude: numerical values for the longitude of the corresponding location.
The reliability of location and time information of users’ movements in the network data largely depends on the reliability of the underlying source data. We verify the consistency via the geographic-explicit distribution of locations. We visualize 400 locations on a geographic map, as shown in Fig. 1.
In the mobility network, nodes are defined as locations, and edges weighted by the mobility flows between nodes. We verify the consistency of the mobility network with people’s daily life with the hourly movement flows over seven days of the week, as shown in Fig. 2a. A movement denotes an individual movement, whose origin node is different from its destination node. For each hour, we count the number of movements between locations as the hourly movement flow. The hourly movement flows of all working days show two traffic peaks (morning and evening). The morning period is starting at 9:00, and the evening is beginning at 17:00. Both are approximately 4 hours long, similar to the reported mobility flows in the literature for another Chinese city of Shanghai with the morning period starting at 9 am, and the evening period starting 4 pm2. As for weekends, traffic peaks are slightly lower and especially weak in the afternoon. Figure 2b shows the trip durations for 24 hours. The y-axis denote the proportion of trip number over all across trip durations. We can observe that trips with less 12 hours account for over 99.7% of the total trips.
The degree of a node denotes the total number of hourly movements passing through the corresponding node during the 168 hours of the week. Figure 2c shows the degree distribution as compared to the degree distribution of another mobility network for another Chinese city (i.e., Shenzen)7. We can observe that the part of the log degree distribution for high degree values follows a Gamma distribution with a mean value of 10.9387. In contrast, the reported log degree distribution of the mobility network for Shenzhen7 shows a quite different Gamma distribution with a mean value of 5.5516.
Network structure analysis
Additionally, we analyze the community structure of the mobility network using the Louvain community detection algorithm13. In each day, the inter-urban mobility network often consists of communities-groups of metapopulations in locations who are highly intra-connected, but only loosely interconnected14,15. Figure 3 shows the community structures for each day with colors denoting different detected communities. To explore the interactions of inter-urban mobility, we consider the inter-urban community, which represents nodes in this community belong to different locations. We consider three community-based measures. Specifically, R is defined as the percentage of nodes in the community that indicate inter-urban movement. High R denotes the strong movement between locations, resulting in multiple inter-urban locations ending up in the same community. M denotes the mean size of nodes in a community. High M denotes the high average size of locations in a local affiliation. N represents the number of communities with more than 10 nodes. High N denotes the high variability in mobility with more local affiliations. We can observe Sunday is special with the highest R and the lowest N, bridging weekday and weekend inter-urban mobility patterns and connect otherwise disconnected inter-urban locations.
Matlab code for data analysis of location correction and mobility network construction can be obtained freely from Supplemental File 1 with no restrictions to access.
Liang, X., Zhao, J., Dong, L. & Xu, K. Unraveling the origin of exponential law in intra-urban human mobility. Scientific Reports 3, 2983 (2013).
Du, Z., Yang, B. & Liu, J. Understanding the spatial and temporal activity patterns of subway mobility flows. Preprint at, https://arxiv.org/abs/1702.02456 (2017).
Lee, M., Barbosa, H., Youn, H., Holme, P. & Ghoshal, G. Morphology of travel routes and the organization of cities. Nature Communications 8, 2229 (2017).
Gao, C. & Liu, J. Network-based modeling for characterizing human collective behaviors during extreme events. IEEE Transactions on Systems, Man, and Cybernetics: Systems 47, 171–183 (2017).
Bai, Y. et al. Optimizing sentinel surveillance in temporal network epidemiology. Scientific Reports 7, 4804 (2017).
Gao, C. & Liu, J. Modeling and restraining mobile virus propagation. IEEE Transactions on Mobile Computing 12, 529–541 (2013).
Yan, X.-Y., Zhao, C., Fan, Y., Di, Z. & Wang, W.-X. Universal predictability of mobility patterns in cities. Journal of The Royal Society Interface 11, 20140834 (2014).
Simini, F., González, M. C., Maritan, A. & Barabási, A.-L. A universal model for mobility and migration patterns. Nature 484, 96 (2012).
Gonzalez, M. C., Hidalgo, C. A. & Barabasi, A.-L. Understanding individual human mobility patterns. Nature 453, 779 (2008).
Lenczner, M. & Hoen, A. G. CRAWDAD dataset ilesansfil/wifidog (v. 2015-11-06). https://crawdad.org/ilesansfil/wifidog/20151106 (2015).
Madan, A., Cebrian, M., Moturu, S. & Farrahi, K. et al. Sensing the “health state” of a community. Pervasive Computing 11, 36–45 (2012).
Du, Z.-W. et al. Bai Inter-urban interactions of mobility via cellular position tracking in the southeast Songliao Basin, Northeast China. figshare, https://doi.org/10.6084/m9.figshare.c.4226183.v4 (2018).
Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008, P10008 (2008).
Girvan, M. & Newman, M. E. Community structure in social and biological networks. Proceedings of the National Academy of Sciences of the United States of America 99, 7821–7826 (2002).
Clauset, A., Newman, M. E. J. & Moore, C. Finding community structure in very large networks. Physical Review E 70, 066111 (2004).
National Bureau of Statistics of the People’s Republic of China Complete Administrative Division Codes of 2017. http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2017 (2018).
This research is supported by the National Natural Science Foundation of China (NSFC) (Grant Nos 61772230, 61702215 and 61702138), Changchun Science and Technology Bureau (18DY005), CQ CSTC (No. cstc2018jcyjAX0274), National Science and Technology Support Program (2015BAK41B04), Key Projects of Science and Technology Development Plan of Jilin Province (Grant No. 20160204021GX), Provincial Special Projects for Industrial Innovation of Jilin Province (Grant No. 2017C032-1). Natural Science Foundation of Jiin Provincial Science and Technology Department (Grant No. 20180101332JC), and National Society Science Foundation of China (Grant No. 16BGL180). ZW would like to acknowledge funding from the Models of Infectious Disease Agent Study (MIDAS) program grant number U01 GM087719.
The authors declare no competing interests.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
ISA-Tab metadata file
About this article
Cite this article
Du, Z., Yang, Y., Ertem, Z. et al. Inter-urban mobility via cellular position tracking in the southeast Songliao Basin, Northeast China. Sci Data 6, 71 (2019). https://doi.org/10.1038/s41597-019-0070-1