Background & Summary

Urban green spaces have been shown to provide social1,2, economic3,4, environmental5,6, physical7,8,9,10,11,12 and mental13,14,15,16,17 health benefits. During epidemics in particular, such as the COVID-19 pandemic, parks and green spaces have garnered increased attention for their essential and irreplaceable roles18,19,20,21,22. While park visitation constitutes only a small fraction of daily human mobility, there is a growing demand for park visitation data with high temporal and spatial resolution. This need stems from the call for green city planning, as well as the desire to gain detailed insights into the processes and patterns of green space visitation and their influencing factors in urban environments. Such insights are essential to a deeper exploration of the social impacts of parks in terms of accessibility, diversity, effectiveness, equity, and sustainability. As the average green space exposure in cities of the Global South (e.g., in China, India, and the Middle East) is only 14.39%23, there is an urgent need to improve health and well-being through green space planning. This reinforces the need for real data to support effective management and planning strategies for establishing healthier urban green spaces.

Earlier research, now considered outdated, is grounded in geographic parameters and assumes that visitors come from nearby communities or a defined proximity range24. Whether defined through buffer, proximity, distance, or gravity models, these studies neglect more complex mobility behaviors25. Incorporating real human behaviors would provide a more convincing and comprehensive analysis of how people actually experience and benefit from parks. However, park visitation data can often be obtained only indirectly, because parks uphold longstanding principles of openness and accessibility and most parkland remains freely accessible. During the COVID-19 pandemic, the Chinese government introduced a unique active reporting mechanism: mandatory QR code scans to track entry and exit at key locations, including parks, which required users to log both the start and end of their visits. While such active reporting theoretically enabled precise visitation records, its discontinuation post-pandemic and strict privacy protocols now prevent public access to the data. An agent-based biased diffusion process has been used to simulate citizens’ movement26, but it still assumes that citizens always head for the nearest Public Green Area (PGA) from their position, which is not entirely realistic. Recently, data-driven methodologies have gained significant traction. In particular, the availability of large-scale human movement and behavior data collected from mobile devices has made it possible to explore complex, real-world park visitation patterns. An increasing number of studies have begun to use mobile phone data to investigate factors influencing the accessibility27,28,29, catchment areas30, attractiveness31, segregation32, and equity25,33,34,35 of parks. Some studies have also explored urban park accessibility and its fairness through the lens of online review data36 and social media data37. Constructing park visitation networks from real, fine-grained individual mobility data in urban settings has therefore become important for understanding patterns of park visitation.

While the use of mobile data for park research appears to encounter few obstacles, very few publicly available datasets derived from mobility data relate specifically to parks. Previous studies using mobile datasets, whether publicly available or not, have been constrained by several key limitations, and none has overcome all of them. Each faces one or more of the following issues: (1) small-scale user coverage, such as 22.3 million mobile devices38, 1.5 million cell phone IDs30, or 330,160 residents39; (2) limited temporal scope, with observation periods as brief as 24 hours38 or one month31, which are insufficient to capture seasonal or long-term patterns; (3) spatial bias, as they focus only on sampled parks30,31,38,40; and (4) attribute isolation, where a bipartite weighted framework has been established to explore park exposure and demand32, but it lacks built-in environmental information and does not account for temporal granularity in network construction. Moreover, most of the visitation data derived from mobile data have not been made publicly available. Although a public dataset for Toronto, Canada35 provides insights into the fairness of park accessibility from the perspective of parks and includes socioeconomic attributes at the census tract level, it lacks connections between residential areas and parks, thus missing the perspective of residential areas and failing to complete the “demand-supply” loop. To date, no publicly available dataset comprehensively captures both park demand and attractiveness across an entire urban park visitation network at multiple temporal and attribute scales.

To protect privacy, individual-level park visitation records cannot be released. Our GreenMove dataset is therefore constructed as a bipartite dynamic mobility network between residential polygons and parks (Fig. 1). It is based on mobile phone data collected from multiple telecom operators in Shanghai, encompassing 10 million anonymized users over the period from January to April 2014. All mobile data were anonymized prior to receipt, in strict accordance with data protection regulations. We identify individual stays from raw mobile records via spatial clustering and categorize them as home, work, or other activities. We then identify park visitations as the other locations that geographically overlap with parks. Shanghai has a total of 226 towns, which are the finest granularity of administrative division in China’s national population census. However, the coarse granularity of these administrative regions makes it difficult to provide a comprehensive and consistent description of park visitation behaviors within such an extensive area. We therefore partition the Shanghai metropolitan area into Voronoi grids based on cell towers and use them as residential polygons. Daily park visitation behaviors of residents whose home locations fall within each Voronoi grid are then aggregated, quantifying park demand and attractiveness across the network. The dataset covers 38,055 residential polygons and 394 parks in Shanghai in 2014.

Fig. 1

A schematic overview of the study. The upper row illustrates the process of identifying park visitations from mobile phone data, while the lower row shows the city subdivided into Voronoi grids, each enriched with various attribute indicators. These are then combined with dynamic weather conditions to construct a dynamic bipartite network between the park and polygon nodes, which is also subject to temporal variation as daily visitation patterns fluctuate.

GreenMove not only offers a network science perspective on residential polygons, parks, and their connectivity, but also integrates extensive socioeconomic data (e.g., housing prices, POI density), weather attributes (e.g., daily temperature, precipitation), and the presence of parks within a specific radius. Researchers can use these data to understand more accurately the factors and drivers influencing park visitation patterns, offering actionable insights for equity-focused studies. Furthermore, because our network is scaled to real population levels, it can assist policymakers in making informed decisions on future urban park layout planning in real-world scenarios. This contributes to creating healthier urban green spaces, enhancing residents’ well-being and improving people-oriented urban quality, with the ultimate goal of promoting ecological and social welfare. In addition, the evolving patterns of urban parks over time mirror trends in urban growth, highlighting shifts in the structure and scale of cities. GreenMove can therefore also help bridge the gap in knowledge of green space visitation patterns during urban morphological evolution and help envision the future appearance of urban green spaces.

Methods

Geography, built environments, and socioeconomic factors are all drivers in explaining the availability of urban green spaces41. Our study therefore integrates a substantial amount of data from various sources; the comprehensive workflow is shown in Fig. 1. First, the mobile phone data were collected by the telecom operators when users interacted with cell towers through phone calls, text messages, and any internet data access. These anonymized records, formatted as [anonymized user ID, longitude, latitude, timestamp], were systematically archived by the Shanghai Big Data Center. With research access granted through its data platform, we use the geographical locations of the cell towers and the times of users’ activities to locate users. Although the spatial accuracy is at the cell tower level, the extensive user coverage and strong temporal continuity make cell tower localization a highly effective method of gaining insights into human mobility behaviors in urban areas. Such mobile phone data have been shown to delineate urban dynamics42 and reconstruct human activities43. Additionally, the cell tower locations derived from these mobile phone data enable us to achieve a more granular segmentation of the city. The mobile phone data employed in this study span a four-month period and cover 10 million anonymized users in Shanghai.

The income level serves as a crucial criterion for identifying disparities among social groups. To further delineate social groups within urban environments, we employed the 2020 housing price data sourced from Anjuke44, a prominent real estate service platform in China. This data enables the computation of the average housing price in residential polygons, thereby serving as an indicator of both economic vitality and the purchasing power of residents. The advantage of utilizing residential housing prices lies in their ability to be easily aggregated into any granular regions based on geographical locations.

In addition, abundant points of interest (POIs) are integrated through POI kernel density estimation (KDE), which characterizes the spatial distribution of POIs within a city or region. This metric can be used to assess factors such as quality of life, commercial activity, and the vibrancy of an area. The POI data used in our study are derived from Baidu Maps45. Given that weather factors such as temperature and rainfall affect individuals’ travel decisions, we use publicly available daily weather data sourced from Open-Meteo46.

Furthermore, census data represent the most reliable and foundational source for understanding population distribution. In China, the census is conducted every ten years and is made publicly available with geographic precision down to the town level. We used the 2020 Seventh National Census of China to validate the sampling of mobile phone users and to scale them to the level of the real population.

The Shanghai Big Data Center provided hourly visit flow data of four parks in the first half of 2024, serving as direct validation for park visitation identification, assuming, of course, that the four parks did also exist in 2014.

Extract stays and location types

Mobile phone data contain a wealth of time-stamped human behavioral dynamics, reflecting users’ continuous trajectories as they travel between multiple locations over time. We therefore employ the steps outlined below to extract stays and location types47. The first step distinguishes between stays and pass-bys within continuous trajectories in both the temporal and spatial dimensions, where pass-bys serve as transitions between stays. Stays, which are of primary interest, represent various user activities, including residence, work, shopping, and park visitations. In this study, we employ thresholds of 10 minutes (temporal) and 300 meters (spatial) to define stays: continuous records exceeding 10 minutes within a 300-meter spatial range are identified as a stay point (Fig. 1). These thresholds are informed by the urban mobility model TimeGeo47, which processes mobile phone data closely comparable to the data employed in our research and has been widely applied in studies, including those focused on deploying public charging stations48, delineating urban dynamics42, exploring park exposure32, analyzing exposure to different street context types49, and planning for sustainable buildings50. The spatial threshold of 300 meters is based on the resolution of mobile signal data, with an accuracy ranging from 200 to 300 meters. The 10-minute duration threshold is chosen to separate stays from short stops, such as transient pass-by behaviors in transit. We also conducted an empirical comparison of alternative threshold values using a subsample of one million users, as shown in Fig. 2a,b. The most significant variation in stay identification was observed between the 5-minute and 10-minute thresholds, with the curves showing the greatest divergence (Fig. 2a). The 5-minute threshold curve was also more sensitive to changes in the spatial threshold. These results support concerns that shorter time thresholds may capture transient behaviors rather than true stays. In contrast, the identification of stays was less sensitive to variations in spatial thresholds, as evidenced in Fig. 2b, with the greatest sensitivity to spatial threshold changes observed between 300 m and 350 m. These findings collectively reinforce the robustness of the selected thresholds for stay detection in our study.
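As an illustration of the thresholding step, the following minimal sketch groups consecutive records that remain within 300 m of the first record of the group and keeps groups lasting more than 10 minutes. It is not the TimeGeo implementation: the record layout, the function name `detect_stays`, the use of planar coordinates already projected to meters, and the rule of anchoring each group to its first record are all assumptions made for illustration.

```python
from math import hypot

def detect_stays(records, max_dist_m=300.0, min_duration_s=600):
    """Hypothetical stay detector for one user.

    records: list of (timestamp_s, x_m, y_m), sorted by time, with planar
    coordinates in meters (an assumption; raw data are lon/lat).
    Returns a list of (start_ts, end_ts, x_m, y_m) stay points.
    """
    stays = []
    i = 0
    while i < len(records):
        t0, x0, y0 = records[i]
        j = i
        # Extend the group while the next record stays within max_dist_m
        # of the first record of the group.
        while (j + 1 < len(records)
               and hypot(records[j + 1][1] - x0, records[j + 1][2] - y0) <= max_dist_m):
            j += 1
        # Keep the group only if it lasts longer than the temporal threshold.
        if records[j][0] - t0 > min_duration_s:
            stays.append((t0, records[j][0], x0, y0))
        i = j + 1
    return stays
```

For example, three records spread over 900 seconds within 150 m of each other, followed by two distant records, yield a single stay covering the first three records; the distant records are treated as pass-bys.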

Fig. 2

Analysis of the process for identifying effective park visitation: (a) and (b) show sensitivity analyses of the spatial and temporal thresholds for defining stays. (c) Analysis of activities lasting longer than 12 hours. (d) Analysis of overnight activities. (e) Spatial representation of the relative change in visitors across different regions after highlighting stable visitation. (f) Diverse stable visitation patterns across various parks located in different regions.

It is important to note that the recorded locations are inherently offset, as they are derived from the positions of cell towers in the raw dataset, and that multiple nearby stay points may actually refer to the same location because trajectories vary over the four months. Consequently, the second step clusters these disparate stay points into stay regions, with a maximum size threshold of 300 meters for each stay region. To achieve this, we use a grid-based clustering approach, which involves the following steps: partitioning the area into rectangular cells, mapping stay points to their respective cells, and iteratively merging the cell with the highest number of stay points, along with its unlabeled neighboring cells, into a new stay region, as exemplified in Fig. 1. Following this process, we designate the stay region most frequently visited during weekday nights (from 19:00 to 8:00 the following morning) and weekends as home, so that each mobile phone user is assigned a unique home location. After excluding home regions, workplaces are the next target for identification. We define a region as a workplace if it satisfies the following criteria: it is visited during weekday daytime hours (from 8:00 to 19:00), has the highest product of visitation frequency and distance from home, is visited at least three times, and is located no less than 500 meters from home. Users engaged in such work activities are also assigned the “commuter” label. The remaining stay regions are categorized as other and include a wealth of park visitation behaviors.
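The home and workplace rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `pick_home` and `pick_work`, the region identifiers, and the planar region centers in meters are hypothetical, and the time-window filtering (weekday nights and weekends for home, weekday 8:00 to 19:00 for work) is assumed to have been applied before the visit lists are passed in.

```python
from collections import Counter
from math import hypot

def pick_home(night_weekend_visits):
    """Most frequently visited region among weekday-night and weekend stays."""
    return Counter(night_weekend_visits).most_common(1)[0][0]

def pick_work(weekday_day_visits, centers, home, min_visits=3, min_dist_m=500.0):
    """Region with the highest (visit count x distance from home), subject to
    at least min_visits visits and at least min_dist_m from home.
    Returns None when no region qualifies (the user is then not a commuter)."""
    counts = Counter(r for r in weekday_day_visits if r != home)
    hx, hy = centers[home]
    best, best_score = None, 0.0
    for region, n in counts.items():
        d = hypot(centers[region][0] - hx, centers[region][1] - hy)
        if n >= min_visits and d >= min_dist_m and n * d > best_score:
            best, best_score = region, n * d
    return best
```

In this sketch a frequently visited region only 300 m from home is rejected by the distance rule, and a distant region visited twice is rejected by the frequency rule, so the workplace falls to the remaining candidate with the highest product.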

Effective park visitation identification

In Shanghai, parks tend to be smaller and more densely distributed in the urban center, while larger and sparser parks are found in the outskirts. This pattern is consistent with the distribution of cell tower density. Small parks (minimum area 5068.65 m2) may be too small to encompass the surrounding cell towers. Conversely, for larger parks in suburban areas, the sparse distribution of cell towers can also lead to overlooked park visitations. Adding a 50-meter buffer around all parks helps address this issue, minimizing the potential omission of park visitation behaviors.

When matching stay regions with park boundaries, consecutive stay regions are merged to determine the duration of park visitations, retaining only the entry and exit times. To eliminate pseudo-park visitations due to overlaps with commercial areas or roads, records of individuals who spent more than 12 hours in a park, which account for approximately 3.23% of the total 27,817,961 activity records, are filtered out, as recreational visitations typically do not exceed this duration. As shown in Fig. 2c, our analysis identified a distinct pattern in the start times of these long-duration activities, with bimodal peaks in the early morning and late evening. These activities, likely related to work or commuting rather than park visits, tend to end in the early morning, further suggesting they are departures from home. We also assume no overnight stays in parks and therefore exclude visitations spanning multiple days, accounting for approximately 3.68% of the activities already filtered by the 12-hour criterion. These visits typically ended in the pre-dawn hours or around 8:00 (Fig. 2d), suggesting irregular park usage. Their removal helps ensure our dataset focuses on typical leisure and recreational park visits.
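The two duration filters described above can be expressed as a single predicate. The sketch below is illustrative only: the function name `is_plausible_visit` is hypothetical, and treating a visit of exactly 12 hours as valid is an assumption, since the text only states that visits of more than 12 hours are removed.

```python
from datetime import datetime, timedelta

def is_plausible_visit(entry, exit_, max_duration=timedelta(hours=12)):
    """Keep a visit only if it lasts at most max_duration and
    starts and ends on the same calendar day (no overnight stays)."""
    return exit_ - entry <= max_duration and entry.date() == exit_.date()
```

For instance, a visit from 09:00 to 10:30 on the same day is kept, whereas a visit from 23:00 to 01:00 the next day, or one from 06:00 to 19:00, is discarded.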

Moreover, when constructing the network, it is meaningless to establish an edge between the residential polygon and the park for individuals who visit the park only once within the four-month period, as such behavior is considered incidental. Consequently, the analysis is narrowed to include only those who make at least 8 visitations to the park during the four-month period (criteria sensitivity analysis shown in Technical Validation section), resulting in a total of 14,540,972 stable park visitation records from 826,332 valid park visitors (referred to as visitors), reflecting an approximately 84.75% decrease. The percentage changes across different regions, as observed in Fig. 2e, all exceed 80% and show little deviation from the average, suggesting that the filtering process is generally consistent across regions. Some regions of Chongming Island experienced a nearly 100% reduction in visitors, which is likely due to the small residential population and its geographic isolation. This approach helps retain visits to parks with consistent, regular usage, focusing on the group of users whose park visits are less affected by seasonal fluctuations or other external factors. Such visitations are likely to remain relatively stable over time, unless significant urban transformations or policy changes occur. Parks from different regions also displayed diverse patterns (Fig. 2f), further reflecting the richness and diversity within our dataset.

Residential polygons partitioning and labeling

Voronoi residential polygons division based on cell towers

In the context of administrative divisions in China, the smallest publicly accessible unit for regional classification is typically at the town level. According to the data from the Seventh National Census conducted in 2020, Shanghai comprises 226 towns, each with a substantial coverage area, indicating a significant diversity in both the residents and regional characteristics within these areas. To enhance the granularity of residential area partitioning, we delineate Shanghai into Voronoi grids based on the spatial distribution of cell towers, which serve as residential polygons. It can be characterized as a partition with the finest resolution, resulting in a total of 38,055 residential polygons, which aligns well with the mobile phone data derived from interactions with cell towers.
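Because a point lies in the Voronoi cell of its nearest generating site, assigning a home location to a residential polygon does not require constructing the polygon geometry explicitly. The following minimal sketch illustrates this equivalence; the function name `voronoi_cell`, the tower identifiers, and the use of planar coordinates in meters are assumptions for illustration.

```python
from math import hypot

def voronoi_cell(point, towers):
    """Return the id of the tower whose Voronoi cell contains `point`,
    i.e. the nearest tower under Euclidean distance.

    towers: dict mapping tower id -> (x_m, y_m) planar coordinates.
    """
    px, py = point
    return min(towers, key=lambda tid: hypot(towers[tid][0] - px, towers[tid][1] - py))
```

With towers at (0, 0), (1000, 0), and (0, 1000), a home at (200, 700) is assigned to the third tower, which is about 360 m away compared with about 728 m for the first.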

POI in polygons

Using fine-grained Voronoi residential polygons enhances the consistency of POIs within them, allowing for a more accurate reflection of local socioeconomic characteristics. We applied KDE to data from Baidu Maps45 to incorporate POI properties within the polygons. The POIs are categorized into 12 types, listed in rows 2 to 13 of Table 1.

Table 1 Properties of polygon node.

Other socioeconomic indicators in polygons

Leveraging housing price data from Anjuke44, the average housing price of all communities within each polygon is calculated to reflect the income and consumption levels of residents within the area. In cases where there are no residential communities within the polygon, or if the existing communities lack price information, the price of the nearest community is employed as a proxy.

Another indicator of land value is the distance to the Central Business District (CBD), which to some extent reflects the completeness and convenience of transportation. Proximity to the CBD generally implies shorter commuting times and lower transportation costs. In the case of Shanghai, Lujiazui is taken as the city’s CBD, and the straight-line distance from each polygon to it is calculated.

Parks around polygons

Considering the fact that parks have their service radius38, and that urban services frequently compete with each other51, we incorporate the areas of and the distances to the five nearest parks to each polygon as its park-related properties. Additionally, to account for the influence of polygon size and the uneven distribution of surrounding parks, particularly in terms of distance, we also include the number of parks within 2 km, 2-5 km, and 5-10 km radius in the polygon properties.
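A minimal sketch of how these park-related properties could be derived for one polygon is given below. The function name `park_features`, the dictionary keys, and the treatment of ring boundaries (each upper bound inclusive) are assumptions for illustration; the actual property names are those listed in Table 1.

```python
def park_features(parks, k=5):
    """parks: list of (distance_km, area_m2) pairs for one polygon.

    Returns the k nearest parks (distance and area) and the number of
    parks in the 0-2 km, 2-5 km, and 5-10 km distance rings.
    """
    nearest = sorted(parks)[:k]  # sort by distance, then area
    rings = {
        "within_2km": sum(d <= 2 for d, _ in parks),
        "2_to_5km": sum(2 < d <= 5 for d, _ in parks),
        "5_to_10km": sum(5 < d <= 10 for d, _ in parks),
    }
    return nearest, rings
```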

Scale up to the real population

To ensure GreenMove reflects city-wide park visitation patterns rather than just sampled active mobile phone users, we expand the users with valid park visitations to match the level of the real population. Using town-level population data from China’s Seventh National Census, we implement a two-stage expansion: (1) At the polygon level, we first geolocate park visitors’ residences to corresponding polygons and determine the ratio of park visitors to mobile phone users within each polygon \(\left(\frac{\mathrm{visitors}_{\mathrm{polygon}}}{\mathrm{users}_{\mathrm{polygon}}}\right)\). (2) At the town level, we calculate the penetration of mobile phone users by comparing the users with the actual population derived from the census \(\left(\frac{\mathrm{users}_{\mathrm{town}}}{\mathrm{Census}_{\mathrm{town}}}\right)\). The final expansion ratio for each polygon is derived as follows:

$$\mathrm{expansion\ ratio}=\frac{\mathrm{users}_{\mathrm{town}}}{\mathrm{Census}_{\mathrm{town}}}\times \frac{\mathrm{visitors}_{\mathrm{polygon}}}{\mathrm{users}_{\mathrm{polygon}}}.$$

Finally, we divide the observed visitor count in each polygon by its corresponding expansion ratio to obtain population-adjusted metrics, ensuring our network reflects true demographic-scale mobility patterns.
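The arithmetic of the expansion can be traced with a small worked example that applies the formula exactly as stated. All numbers below are invented for illustration, and the function names `expansion_ratio` and `scale_count` are hypothetical.

```python
def expansion_ratio(users_town, census_town, visitors_polygon, users_polygon):
    """Product of the town-level penetration and the polygon-level visitor share."""
    penetration = users_town / census_town            # users_town / Census_town
    visitor_share = visitors_polygon / users_polygon  # visitors_polygon / users_polygon
    return penetration * visitor_share

def scale_count(observed_count, ratio):
    """Divide an observed visitor count by the polygon's expansion ratio."""
    return observed_count / ratio

# Invented example: 20,000 users in a town of 50,000 residents (penetration 0.4),
# and 30 park visitors among 300 users in the polygon (visitor share 0.1).
ratio = expansion_ratio(users_town=20_000, census_town=50_000,
                        visitors_polygon=30, users_polygon=300)
scaled = scale_count(12, ratio)
```

Here the expansion ratio is 0.4 × 0.1 = 0.04, so an observed count of 12 is scaled to approximately 300.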

Bipartite networks

We establish bipartite networks between residential polygons and parks, comprising a daily dynamic network \({G}_{d}({V}_{d},{E}_{d})\), as well as a more granular daily time segmented network \({G}_{d,s}({V}_{d,s},{E}_{d,s})\), sustained over four months. On a given day d, or during a given time segment (d, s), an edge is established between a residential polygon node and a park node if visitation from the polygon to the park exists.

Daily networks:

  • \({V}_{d}^{r}\) represents the active residential polygon nodes on day d, and \({V}_{d}^{p}\) represents the active park nodes on day d. Here, “active” refers to nodes that have edges on day d.

  • \({E}_{d}(r,p)\), the set of edges, connects the residential polygons and parks between which visitation occurs on day d, with the total flow between them on day d stored as an edge attribute. Edge attributes also include the proportion of commuters within the total flow, as well as the distance between nodes.

Daily time segmented network:

The periods of active park visitation are categorized into four time segments: morning (6:00-11:00), noon (11:00-13:00), afternoon (13:00-18:00), and evening (18:00-22:00).

  • \({V}_{d,s}^{r}\) represents the active residential polygon nodes during segment s on day d, and \({V}_{d,s}^{p}\) represents the active park nodes during segment s on day d. Here, “active” refers to nodes with edges present during segment s on day d.

  • \({E}_{d,s}(r,p)\) narrows the scope to individual time segments within each day, still incorporating total flow, commuter ratio, and distance information, consistent with the edges \({E}_{d}(r,p)\) in the daily network.
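The four time segments can be encoded as a simple lookup, sketched below. The function name `time_segment` is hypothetical, and treating each segment as a half-open interval (start hour inclusive, end hour exclusive) is an assumption, since the text does not state how boundary times such as 11:00 are assigned.

```python
# Segment boundaries taken from the text: (name, start_hour, end_hour).
SEGMENTS = (
    ("morning", 6, 11),
    ("noon", 11, 13),
    ("afternoon", 13, 18),
    ("evening", 18, 22),
)

def time_segment(hour):
    """Map an hour of day (0-23) to its segment name, or None outside 6:00-22:00."""
    for name, start, end in SEGMENTS:
        if start <= hour < end:
            return name
    return None
```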

Tables 1, 2, and 3 provide a detailed explanation of the properties of polygon nodes, park nodes, and edges, respectively. In the bipartite networks, the edge weights consist of flow, flow ratio, commuter ratio, and distance, with their precise definitions provided in Table 3. The flow is derived by aggregating the visitation counts between polygons and parks, with the raw counts scaled to the real population. The flow ratio is defined in two distinct ways: flow_ratio_polygon is the proportion of visitors to a specific park relative to the total outflow from the polygon’s population, and flow_ratio_park is the share of visitations from a particular polygon within the overall inflow to the park. The commuter ratio is derived from the commuter labels assigned in the “Extract stays and location types” section. With these labels, we can directly quantify the extent to which visitations on an edge are attributable to commuters, thus constructing the commuter ratio for each edge. It can be used to infer a park’s function from the attributes of its visitors. The distance is the geographic separation between a polygon and a park, calculated with the Haversine formula from the coordinates of the polygon’s base station and the park’s geographical location.
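The two flow ratios and the Haversine distance can be sketched as follows. The attribute names flow_ratio_polygon and flow_ratio_park come from the text; the function names, the input layout, and the mean Earth radius of 6371 km are assumptions for illustration.

```python
from collections import defaultdict
from math import radians, sin, cos, asin, sqrt

def haversine_km(lon1, lat1, lon2, lat2, radius_km=6371.0):
    """Great-circle distance in km between two lon/lat points (degrees)."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

def flow_ratios(flows):
    """flows: dict mapping (polygon_id, park_id) -> flow on that edge.

    Returns, per edge, the share of the polygon's total outflow and the
    share of the park's total inflow carried by that edge.
    """
    out_total = defaultdict(float)  # total outflow per polygon
    in_total = defaultdict(float)   # total inflow per park
    for (poly, park), f in flows.items():
        out_total[poly] += f
        in_total[park] += f
    return {
        (poly, park): {
            "flow_ratio_polygon": f / out_total[poly],
            "flow_ratio_park": f / in_total[park],
        }
        for (poly, park), f in flows.items()
    }
```

For example, with flows of 30 and 10 from polygon r1 to parks p1 and p2, and 20 from polygon r2 to p1, the edge (r1, p1) has flow_ratio_polygon 0.75 and flow_ratio_park 0.6.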

Table 2 Properties of park node.
Table 3 Properties of edge.

Figure 3a,b present an aggregated view of the comprehensive super-network over the four-month period. In Fig. 3b, both types of nodes exhibit characteristics of a scale-free network, with polygon nodes displaying a power-law distribution. Figure 3c quantifies the temporal evolution of edge attributes within the daily network. The period from January 31 to February 6 corresponds to the Chinese Spring Festival, during which the net outflow of Shanghai residents, coupled with the tradition of family reunions, contributed to a reduction in both the number of edges and the flow within the network. Notably, the number of edges and their associated flow tend to fluctuate in synchrony. The network’s average commuter ratio and the average distance from polygons to parks display significant weekly oscillations, with the vertical dashed lines in the figure indicating the start of each week (Monday). The average travel distance remains around two kilometers, reflecting a preference for nearby parks; however, this does not imply that individuals exclusively visit parks within this proximity. As shown in Fig. 3e, we observed a slight increase in flow in March and April compared to January and February. Because weather data are also integrated, we are able to capture the impact of temperature and precipitation on park visits, as illustrated in Fig. 3d, where flow on rainy days is generally lower than on dry days. After excluding rainy days, we also observe higher flow at warmer temperatures.

Fig. 3

Statistical features of the GreenMove network: (a) Comprehensive representation of the super-network. The link represents the total park visitation flow during the four months, with the color gradient indicating the flow intensity. Lighter shades correspond to higher visitation volumes. (b) Node degree distribution in the super-network. (c) Statistical properties of the edges in the daily networks. (d) Variation in network flow due to weather conditions (precipitation and temperature). (e) Monthly variation in network properties.

Figure 4 presents the analysis of daily time segmented network, where monthly variations are comparatively stable (Fig. 4a), and shifts between time segments follow a more regular pattern (Fig. 4b). The diagram also differentiates between the network behaviors on weekdays and non-working days (referred to as “holidays” in the following context). Please note that the plots for the Mean Commuter Ratio and Mean Distance in Fig. 4 are not identical. Specifically, for weekdays in January, the approximate percentages for the mean commuter ratio in the four time periods are 22.195%, 25.115%, 25.962%, and 26.728%, while the corresponding percentages for the mean distance are 25.081%, 25.338%, 24.837%, and 27.744%. The differences are subtle but present. Additionally, we observe that the commuter ratio follows consistent trends across different times of the day (morning, noon, afternoon, and evening) over the four-month period in Fig. 4b. On weekdays, the commuter ratio typically peaks in the evening, reflecting the behavior of commuters who visit parks after work, constrained by their professional schedules. Conversely, holiday patterns display flattened temporal distributions, reflecting spontaneous leisure usage unbound by work schedules, thus disentangling temporally constrained functional park usage from leisure-driven visitation patterns.

Fig. 4

Statistical features of daily time segmented network: (a) Daily change of four main statistical features during the four months on weekdays and holidays, respectively. (b) Monthly change of the values of the four features.

Data Records

The dataset has been uploaded to figshare52. GreenMove comprises two principal types of datasets, daily and intra-day segmented, both structured as bipartite networks between Voronoi residential polygons and parks in Shanghai. The datasets cover the period from January 1, 2014, to April 29, 2014. Notably, raw mobile phone data for six days in February are largely missing, amounting to fewer than a few hundred mobile user records. Consequently, data for these days (February 3 and February 12-16, 2014) were omitted from GreenMove. In total, the dataset encompasses 113 complete days, constituting 113 independent networks for the daily network and 452 independent networks for the daily time segmented network. They facilitate the delineation of park visitation network dynamics at different granularities.
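The network counts follow directly from the stated date range and omitted days, as the short check below illustrates. The variable names are hypothetical; the dates are those given in the text.

```python
from datetime import date, timedelta

start, end = date(2014, 1, 1), date(2014, 4, 29)
# Days omitted from the dataset: February 3 and February 12-16, 2014.
missing = {date(2014, 2, 3)} | {date(2014, 2, d) for d in range(12, 17)}

days = [start + timedelta(days=n) for n in range((end - start).days + 1)]
kept = [d for d in days if d not in missing]

n_daily = len(kept)          # one daily network per retained day
n_segmented = 4 * len(kept)  # four time segments per retained day
```

The range spans 119 calendar days; removing the six omitted days leaves 113 daily networks and 452 time segmented networks.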

The complete structure of each network object and its corresponding weather object are serialized and preserved in one pickle (.pkl) format file, with weather information sourced from Open-Meteo46. Daily network is housed within the “daily_network_exp_8-4_geometry” folder, each network distinctly named according to its corresponding date. Daily time segmented network is located within the “daily_segment_network_exp_8-4_geometry” directory, which includes subdirectories labeled by date, each containing network files designated as morning, noon, afternoon, and evening. Additionally, the dataset used for GBM training and testing is stored separately in CSV format, under the file name “train_daily_pairflow_exp_8-4.csv”.

In the bipartite network, nodes fall into two categories. These two types of nodes are respectively labeled as “poly” and “park” for differentiation. The relevant information pertaining to the nodes and edges is affixed directly to the entities of the respective nodes or edges.

Due to restrictions on node activity within the network, the number of nodes is not uniform across all data files. Furthermore, we constructed a comprehensive super-network spanning four months, compiling all interactions from residential polygons to parks throughout this interval. This dataset is catalogued under the filename “4month_network_exp_8-4_geometry.pkl”.

Although the networks are constructed independently, the information contained within the network objects is uniform. Tables 1, 2, and 3 detail the information included in each network, and Table 4 provides an overview of the information attached to the weather object.

Table 4 Properties of weather object.

Technical Validation

Our validation process is carried out in a sequential and systematic manner: (1) The first step involves an examination of the home locations of the raw mobile phone users, comparing them with actual census demographics, and evaluating the feasibility of scaling up to the real population at the same time. (2) The second step assesses the robustness of the threshold for identifying stable park visitation mode. (3) The third step provides a more direct evaluation of park visitation by contrasting with real park visitation data. (4) The last step verifies the predictability of the pairwise flow between nodes.

Comparison between mobile phone data and census demographics

By aggregating the home locations of mobile phone users to the corresponding towns available in the census data, we evaluated the correlation between mobile phone users and the actual census population. As illustrated in Fig. 5b, the Pearson correlation coefficient between the mobile phone user data and the census data is 0.98. This supports the reliability of the mobile phone data in representing real population mobility, as well as the identification of home locations and the scaling of visitors up to the real population.
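For reference, the Pearson coefficient used in this comparison can be computed as follows. This is a minimal sketch in plain Python; the town-level densities in the example are made-up placeholders, not values from the dataset.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder town-level densities (per km2), for illustration only.
phone_density = [120.0, 950.0, 4300.0, 15800.0]
census_density = [300.0, 2100.0, 9800.0, 33000.0]
r = pearson_r(phone_density, census_density)
```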

Fig. 5

Correlation analysis of Seventh National Census population density with mobile phone user density. (a) Distribution of population density at the town level for both mobile phone users and the actual census demographics (per km2). (b) Pearson correlation coefficient between the two.

Sensitivity analysis of stable park visitation mode

Since a single park visit over a four-month period is considered non-recurring, this step identifies and verifies consistent visitation patterns to ensure network reliability. Given the significant disparities in visitation patterns between weekdays and holidays, we examine park visitation behaviors on weekdays and holidays separately. The 24-hour park flow variation curves are presented in Fig. 6b, where the visit flow is aggregated by park entry time in one-hour increments. By progressively raising the threshold on the frequency of park visits over the four-month period, we observe how the flow changes after each threshold selection. In Fig. 6b, regardless of the threshold, the weekday visit flow typically exhibits a morning peak and a secondary evening peak, corresponding to morning exercise routines and the end of the workday. A midday peak likely reflects leisure activities during office workers’ lunch breaks, aligning with common expectations of park visitation behavior. In contrast, park visits on holidays tend to occur later in the day than on weekdays.
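The hourly aggregation by entry time can be sketched as below. The function name and the input format (a list of `datetime` entry times) are assumptions for illustration; the raw visit records are not part of the released networks.

```python
from collections import Counter
from datetime import datetime


def hourly_entry_profile(entry_times):
    """Count visits per hour of day (0-23) based on park entry time."""
    counts = Counter(t.hour for t in entry_times)
    return [counts.get(hour, 0) for hour in range(24)]


# Hypothetical entry times for three visits.
entries = [
    datetime(2014, 1, 6, 6, 15),
    datetime(2014, 1, 6, 6, 59),
    datetime(2014, 1, 6, 18, 0),
]
profile = hourly_entry_profile(entries)
```

Computing one such profile per threshold, separately for weekdays and holidays, would give curves comparable to those described for Fig. 6b.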

Fig. 6

Exploration and validation of stable park visitation patterns using the KS statistic, distinguishing between weekdays and holidays. (a) To evaluate the sensitivity of threshold selection, each point represents the KS statistic between data subsets filtered by two consecutive thresholds (e.g., 4 and 6). A low KS value indicates similar distributions, suggesting the threshold is robust for filtering; (b) illustrates the visit flow under each threshold.

Subsequently, the Kolmogorov-Smirnov (KS) statistic is employed to quantify the maximum deviation between the flow distributions of two samples filtered through adjacent thresholds. This metric is derived by assessing the empirical cumulative distribution function (ECDF) of the flow samples. The calculation of the KS statistic is as follows:

$$KS={\max }_{x}\,| {F}_{X}(x)-{F}_{Y}(x)| ,$$

Here, F_X(x) and F_Y(x) denote the ECDFs of the flows selected under two different thresholds, each giving the proportion of observations in sample X or Y that do not exceed x. As the threshold increases, the KS statistic tends to decrease. Starting from a visit frequency of 8, the KS statistics between adjacent thresholds are lower than those observed previously, and the rate of decline in subsequent values also slows (Fig. 6a). We therefore consider a visit frequency of 8 the most reasonable screening threshold.
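The KS statistic defined above can be computed directly from the two empirical distribution functions. The sketch below is a plain-Python illustration, not the authors' code, and the sample values are placeholders. SciPy's `scipy.stats.ks_2samp` returns the same statistic and could be used instead.

```python
from bisect import bisect_right


def ks_statistic(sample_x, sample_y):
    """Two-sample KS statistic: max over x of |F_X(x) - F_Y(x)|."""
    xs, ys = sorted(sample_x), sorted(sample_y)
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # Both ECDFs are step functions, so their difference can only change
    # at observed values; checking those points is sufficient.
    for value in set(xs) | set(ys):
        f_x = bisect_right(xs, value) / len(xs)  # share of X not exceeding value
        f_y = bisect_right(ys, value) / len(ys)  # share of Y not exceeding value
        largest_gap = max(largest_gap, abs(f_x - f_y))
    return largest_gap


# Placeholder flow samples under two adjacent thresholds.
flows_threshold_a = [1, 2, 3, 4]
flows_threshold_b = [3, 4, 5, 6]
ks = ks_statistic(flows_threshold_a, flows_threshold_b)  # 0.5
```

A value near 0 indicates that raising the threshold barely changes the flow distribution, which is the robustness criterion used for Fig. 6a.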

Validation of real visit flow in specific parks

To validate our park visit identification, we obtained, with support from the Shanghai Big Data Center, real flow data for eight parks: Daning Park, Peace Park, Guyi Garden, Century Park, Ganquan Park, Yu Garden, Penglai Park, and North Sichuan Road Park. These parks are located in various districts of Shanghai, including Jiading, Hongkou, Jing’an, Pudong, Putuo, and Huangpu. They range in size from 1.5 hectares to 150 hectares and include both small urban community parks and large suburban recreational parks. This variation in location, size, and purpose makes the eight parks broadly representative, further supporting the reliability of our identification results.

This dataset, sourced from the first half of 2024, tracks the number of visitors within the parks on an hourly basis. After aligning our four-month flow data for these eight parks to a 24-hour profile, we found that, as shown in Fig. 7, the fit with the real flow data was close, with Pearson correlation coefficients mostly exceeding 0.9, supporting the robustness of our dataset. This finding underscores the stability of structural patterns in urban park visitation, which tend to remain relatively consistent over time unless influenced by significant urban transformations or policy shifts. While our dataset does not aim to capture the precise dynamics of current visitation behavior, it remains highly valuable in revealing long-term patterns in urban park usage.

Fig. 7

Validation against real flow data from Daning Park (A), Peace Park (B), Guyi Garden (C), Century Park (D), Ganquan Park (E), Yu Garden (F), Penglai Park (G), and North Sichuan Road Park (H), shown as time-series flow curves with Pearson’s correlation coefficients: (a) weekdays; (b) holidays.

Furthermore, the observed discrepancy in weekday visitation patterns at noon in Century Park (D) can be attributed to the rapid urban development surrounding the park between 2014 and 2024. The expansion of residential and commercial zones around Century Park has led to a significant rise in morning exercise visitors, while midday visitation appears relatively low by comparison. This highlights the importance of historical benchmarks, such as our 2014 dataset, in understanding and interpreting the evolution of human-park interaction in the context of ongoing urbanization.

Predictability of the pairwise flow in network

To assess the predictability of the flow in Ed(rp), that is, the pairwise flow between polygons and parks, a tree-based model, the Gradient Boosting Machine (GBM), was employed for predictive modeling. Out-of-sample forecasting performance was then evaluated using three metrics: the coefficient of determination (R2), root mean square error (RMSE), and mean absolute error (MAE) (Fig. 8a). The R2 value of 0.79 on the test set, achieved with a relatively simple machine learning model, provides strong evidence of the pairflow’s predictability. The remaining 21% of the variance left unexplained by the model could be attributed to factors such as individual preferences, mood fluctuations, air quality, transportation conditions, and park-specific amenities. Personal preferences and intentions, being inherently subjective and difficult to quantify, lie beyond the model’s ability to capture.
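The three evaluation metrics have simple closed forms, sketched below in plain Python. The function name and the toy values are ours. With R2 defined as one minus the ratio of residual to total sum of squares, an R2 of 0.79 corresponds to 21% of the variance left unexplained.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return R2, RMSE, and MAE for paired observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    if ss_tot == 0:
        raise ValueError("R2 is undefined when y_true is constant")
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


# Toy example with placeholder flows.
metrics = regression_metrics([1, 2, 3, 4], [2, 2, 3, 3])
```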

Fig. 8

Validation of the predictability of our data: (a) model performance, reflecting the predictability of the pairwise flow; (b) analysis of feature contributions and underlying mechanisms using SHAP and PDP in this forecasting task. Specifically, PDPs are presented for several features, including the total population within the polygons, the distance between residential polygons and the CBD, and the area of the parks.

Furthermore, model transparency and feature interpretability were enhanced using SHapley Additive exPlanations (SHAP)53, an explainability tool based on Shapley values from game theory that quantifies the contribution of each feature to individual predictions (Fig. 8b). The features are ranked in descending order of their impact on the prediction outcome. We then used Partial Dependence Plots (PDP) from the scikit-learn library54 to examine the marginal effects of several key features (Fig. 8b). The population within the polygon exerts a positive influence on pairflow, and the area of the park likewise exhibits a positive association with it. In contrast, the distance between the residential polygon and the CBD has a sharp negative impact on pairflow beyond 20 kilometers, which aligns with the significant drop in residential population density from inside to outside the Shanghai Outer Ring Road.
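To make the idea behind a partial dependence curve concrete, the sketch below implements the brute-force averaging by hand: one feature is fixed at each grid value for every row, and the model predictions are averaged. This is a hand-rolled illustration rather than the scikit-learn call used for Fig. 8b, and the toy model and feature names (`population`, `dist_cbd_km`) are hypothetical.

```python
def partial_dependence_1d(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each value in `grid`.

    predict: callable taking one row (a dict of feature values), returning a number.
    rows:    list of dicts, the data over which predictions are averaged.
    """
    if not rows:
        raise ValueError("rows must be non-empty")
    curve = []
    for value in grid:
        # Override one feature, keep all other features as observed.
        predictions = [predict({**row, feature: value}) for row in rows]
        curve.append(sum(predictions) / len(predictions))
    return curve


# Toy stand-in for a fitted model (not the GBM from the paper).
def toy_model(row):
    return 2 * row["population"] - row["dist_cbd_km"]


data = [
    {"population": 1, "dist_cbd_km": 10},
    {"population": 3, "dist_cbd_km": 20},
]
pd_population = partial_dependence_1d(toy_model, data, "population", [0, 1])
```

For this toy model the curve rises by 2 per unit of population, matching the model's coefficient, which is the kind of marginal effect a PDP is meant to expose.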

Usage Notes

The GreenMove format is easily accessible: the data can be loaded from the pickle files with the pickle module in Python. We also provide a network-loading example on GitHub (https://github.com/yuki-Feng0307/GreenMove). Researchers using the daily GreenMove networks should note the distinction between weekdays and holidays: because China has a make-up workday policy, not all weekends are non-working days.
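One way to handle the make-up workday policy is to classify each date with an explicit calendar rather than by day of week alone. The sketch below assumes the user supplies the official public holidays and make-up workdays. The function name is ours, and the two calendar entries in the example are illustrative and should be verified against the official 2014 holiday notice.

```python
from datetime import date


def day_type(day, public_holidays, makeup_workdays):
    """Classify a date as 'holiday' or 'weekday' under a make-up workday policy."""
    if day in public_holidays:
        return "holiday"   # official holiday, even if it falls on Monday-Friday
    if day in makeup_workdays:
        return "weekday"   # weekend day that is worked to offset a holiday
    return "holiday" if day.weekday() >= 5 else "weekday"  # 5 = Saturday, 6 = Sunday


# Illustrative entries only; supply the full official calendar in practice.
public_holidays = {date(2014, 1, 1)}   # New Year's Day (a Wednesday)
makeup_workdays = {date(2014, 1, 26)}  # a Sunday treated here as a working day

labels = {
    d: day_type(d, public_holidays, makeup_workdays)
    for d in (date(2014, 1, 1), date(2014, 1, 4), date(2014, 1, 6), date(2014, 1, 26))
}
```

Whether weekends are grouped with public holidays, as assumed here, should be checked against the day labels used in the dataset itself.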

Despite our efforts to comprehensively incorporate socioeconomic attributes within the networks, some omissions remain. Notably, we were unable to acquire detailed data on the internal amenities and services of parks. In addition, our dataset only covers visitation from January to April, so behaviors associated with summer, such as a preference for evening park visits to cool off or stroll, are not captured. We have sought to compensate for this by focusing on stable visitation patterns, attempting to isolate the demand for park usage that is not influenced by seasonal variations. Additionally, mobile phone usage varies across geographic regions, genders, and age groups55, a characteristic inherent to mobile phone data that may introduce further biases. For instance, elderly individuals who do not carry phones to parks may lead to an underestimation of park visitation. The limited precision of cell tower signals in suburban areas may also introduce biases. Parks in suburban Shanghai are generally large, and we have used a buffer zone to mitigate the impact of sparse tower distribution, which may cause stays to fall inaccurately outside park boundaries. Nevertheless, other biases persist, such as misclassifying short pass-bys as stays because of the extended connection time caused by the wide coverage of sparse towers, or missing stay events because of weak signals or obstructions from surrounding buildings in suburban areas. However, the overall impact of these biases on dataset quality appears to be minimal. This is evidenced by the validation against real park visit flow data, which demonstrates that, despite these challenges, the mobile phone data remain sufficiently reliable for identifying park visits.

Furthermore, Shanghai’s commitment to expanding its city park initiative is evident in the substantial increase in the number of parks. By the end of 2020, the total number of parks in Shanghai had reached 406, and a significant surge in park development occurred in 2023, culminating in 162 new urban and rural parks. This expansion has increased the total to 832 parks, with 60% of urban parks now offering 24-hour access, which enhances the feasibility of activities such as morning and evening runs or walks, yet simultaneously raises concerns regarding nighttime safety and light pollution. The evolution of urban parks over time reflects broader urban development trends, revealing changes in city scale and structure. Our dataset captures 2014 as a critical juncture, providing a baseline of park usage patterns before this rapid expansion and valuable insights for urban studies at both temporal and spatial scales. Because the current dataset covers only a single year, its applicability to analyzing temporal dynamics or changes in visitation patterns over time is limited. Ideally, when integrated with other datasets to form a multi-year dataset for longitudinal research, these combined benchmarks could reveal how the tempo of urbanization reshapes human-park interaction paradigms over decades.