Background & Summary

Applications in human services and health1,2, disaster assessment3,4, global change5, infrastructure construction and urban planning6, and coupled human-environment systems7, among others, rely heavily on spatial population data. The most authoritative population data come from official censuses, but census data have several practical limitations: they are difficult to convert between scales, they are updated infrequently, and they cannot describe how the population is distributed geographically within administrative divisions8. Because census data lack defined spatial references and units consistent with environmental data, they are difficult to overlay with such data, which limits interdisciplinary research on human-environment systems9.

Early research used population density models10,11,12,13 and various mathematical interpolation techniques14,15,16,17,18 to model the population distribution inside a census unit. Advances in remote sensing and geographic information system (GIS) technology have opened up new possibilities for calculating spatial population distribution weights19. To produce gridded population data with higher accuracy and resolution20,21,22,23, several studies incorporated multi-source data and spatial variables such as land use and land cover24,25,26, residential units27,28, transportation networks29, and night lights30,31,32. Many researchers now combine GIS with computing technology to build intelligent models, such as random forests, genetic algorithms, multi-agent systems, and cellular automata33,34,35, which makes the model structure more flexible and the application scale finer.

Building on these technological advances, widely used datasets have been created internationally, such as the Gridded Population of the World (GPW)36, the Global Rural Urban Mapping Project (GRUMP)37, the Global Human Settlement Population Grid datasets (GHS-POP)38, WorldPop39, and LandScan40. In addition, a 1 km gridded population dataset is available specifically for China41. According to our review of these data, most of the datasets have long update periods, such as 5-year intervals42. Only a few, including WorldPop and LandScan, provide population data updated annually. Some years within these intervals, such as 2018, therefore lack widely available population datasets.

However, existing datasets have two limitations. First, remote sensing-aided data, as a medium for refined population distribution, are not a direct indication of population distribution or of the intensity of human activity34, and refined population maps based on a direct correlation between individual behaviors and refined global data are lacking. Second, the input population data of current products are extrapolated from China’s 2010 county population census to the target years using county growth rates38,43. China conducts a population census every ten years. Over such a period, both the total population and its growth rate change dramatically, so using census data to project the population in intermediate years introduces substantial errors.

To remedy these gaps, we present POP201844, a gridded ambient population dataset for mainland China in 2018 with 0.01° resolution. Owing to the rapid growth of mobile location-based services (LBS), large volumes of geospatial big data, such as mobile call data45 and traffic trajectories46, are now used to estimate and simulate the geographical distribution of the population. Big data can help to improve social sensing and multiscale understanding of population distribution47,48,49. Some scholars have used big data provided by Tencent, an internet company, as a social indicator in studies of population distribution and mobility50,51,52. As illustrated in Fig. 1, we used a web crawler to capture the real-time number of user geo-location requests recorded by Tencent’s LBS and calculated the yearly average LBS data for 2018. Each grid value is therefore a temporally averaged measure of population, which makes POP2018 an ambient population dataset in the sense of Dobson et al.23. For the population totals, we used the 2018 sample survey permanent population data for mainland China from the National Bureau of Statistics of China, which are the most reliable demographic data in non-census years. A log-linear geographically weighted regression model was used to establish the relationship between the two datasets, and the population corresponding to the annual average LBS data in each grid was then estimated.

Fig. 1 The research and production framework of the population spatial distribution map.

Methods

The population data

Residential population statistics for mainland China were obtained from the National Bureau of Statistics 2018 national sample survey of the permanent population, which covers 2851 county-level units (equivalent to level 3 of the global administrative unit layer) and whose sample accounts for about 1‰ of the country’s total population. The statistics include the number of permanent residents, the names of the province, city, and county, and the county’s administrative code. The permanent population refers to those who have lived in the county for more than six months and reflects the population’s real distribution. Population sample survey results are the most reliable permanent population data available in non-census years. We also gathered town-level permanent population data from the Dongguan Bureau of Statistics. County-level permanent population data were used to build the regression model, while town-level data were used to test the accuracy of the population data products, as recommended by Gaughan et al.53.

County administrative boundaries

The boundaries of China’s administrative divisions were downloaded from the National Catalogue Service for Geographic Information (www.webmap.cn). To create the 2018 county-based permanent population distribution map (Fig. 2a), we assigned the sample survey population data to administrative divisions based on county names and administrative codes.

Fig. 2 County-level permanent population (a) and Tencent positioning data (b).

User location big data

Tencent, an internet company, provides location services (https://cloud.tencent.com/solution/lbs) that record the number of user location signals every 5 minutes in grids with a spatial resolution of 0.01° and the spatial reference GCS WGS84. Comparable to Facebook and WhatsApp in the international market, Tencent is one of the most popular internet service providers in China, and its products (including WeChat, QQ, and online maps) have over 1 billion users across 200 countries. More than 90% of its users in 2018 were located in China49, covering people from all walks of life, age groups, and regions. We used a web crawler to access Tencent’s positioning service in real time every 5 minutes, summed the positioning data for each day, and generated spatial distribution maps of daily positioning counts. In total, this yielded more than 100 thousand maps of positioning data in 2018, covering about 800 million online users, with attributes such as time, longitude, latitude, and positioning count. We saved the map data in GeoTIFF format with LZW compression for the analysis.
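The daily aggregation step can be illustrated with a minimal sketch. The record layout used here (a timestamp string, longitude, latitude, and count) and the function name `daily_totals` are assumptions made for illustration; the actual format of the crawled records is not described in this paper.

```python
from collections import defaultdict

def daily_totals(records):
    """Sum 5-minute positioning counts into per-day, per-cell totals.

    records: iterable of (timestamp, lon, lat, count), where timestamp is a
    'YYYY-MM-DD HH:MM' string (hypothetical layout, not the real format).
    Returns {day: {(col, row): total}}, with cells indexed in 0.01 degree steps.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for timestamp, lon, lat, count in records:
        day = timestamp[:10]  # keep the 'YYYY-MM-DD' part
        # Snap coordinates to integer indices of the 0.01 degree grid.
        cell = (int(round(lon * 100)), int(round(lat * 100)))
        totals[day][cell] += count
    return {day: dict(cells) for day, cells in totals.items()}

# Hypothetical 5-minute snapshots for two cells
records = [
    ("2018-03-01 00:00", 113.75, 23.04, 12),
    ("2018-03-01 00:05", 113.75, 23.04, 15),
    ("2018-03-01 00:05", 113.76, 23.04, 3),
    ("2018-03-02 00:00", 113.75, 23.04, 9),
]
daily = daily_totals(records)
```

Indexing cells by integers rather than by floating-point coordinates avoids key mismatches caused by rounding.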

We used an arithmetic mean to obtain the average daily user location data from March to June and from September to December 2018, excluding the Spring Festival travel season, students’ winter and summer vacations, and holiday travel, when large numbers of people move around China (Fig. 2b). The average is given by Eq. (1):

$$\mathit{Tencent}=\frac{1}{n}\sum_{i=1}^{n}\mathit{Count\_d}_{i}$$
(1)

where \(\mathit{Tencent}\) is the average positioning count of Tencent big data in 2018, \(\mathit{Count\_d}_{i}\) is the daily positioning count of Tencent big data on day i, and n is the total number of non-holiday days from March to June and from September to December.
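As an illustration of Eq. (1), the following sketch averages the daily counts of one grid cell over the retained months. The function name, the example counts, and the holiday set are hypothetical; the exact list of excluded holiday dates is not given in this paper.

```python
from datetime import date

# Months kept for averaging: March to June and September to December
KEPT_MONTHS = {3, 4, 5, 6, 9, 10, 11, 12}

def mean_daily_count(daily_counts, holidays=frozenset()):
    """Arithmetic mean of daily counts over non-holiday days in kept months.

    daily_counts: {datetime.date: count} for one grid cell.
    holidays: set of dates to exclude (assumed list, supplied by the user).
    """
    kept = [count for day, count in daily_counts.items()
            if day.month in KEPT_MONTHS and day not in holidays]
    if not kept:
        raise ValueError("no days left after filtering")
    return sum(kept) / len(kept)

# Hypothetical daily counts for one cell
cell_counts = {
    date(2018, 2, 16): 500,  # February: outside the kept months
    date(2018, 3, 1): 100,
    date(2018, 3, 2): 140,
    date(2018, 10, 1): 900,  # treated as a holiday in this example
    date(2018, 10, 8): 120,
}
holidays = {date(2018, 10, 1)}
tencent_mean = mean_daily_count(cell_counts, holidays)  # mean of 100, 140, 120
```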

Construction of a grid-scale population spatialization model

The statistical regression models we considered and compared include multiple linear regression54, polynomial regression55, and the log-linear model56, which were used to fit the functional relationship between social sensing data and census data57,58. We calculated the total Tencent user location count in each county and analyzed its correlation with the permanent population. The Pearson correlation coefficient between the LBS big data and the permanent population is 0.82 (Fig. 3a). In the linear fit (Fig. 3a), the points are concentrated at low values, where the scatter is large. After log-transforming both the LBS count and the permanent population, the correlation coefficient between them rises to 0.90 (Fig. 3b).
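The effect of the log transformation on the correlation can be sketched with synthetic data. The values below follow an exact power law chosen only for illustration and are not the county data used in this paper; `pearson` is a plain implementation of the Pearson correlation coefficient.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Synthetic example: population follows a power law of the LBS count
lbs = [10.0, 100.0, 1000.0, 10000.0, 100000.0]
pop = [x ** 0.8 for x in lbs]

r_raw = pearson(lbs, pop)
r_log = pearson([math.log(x) for x in lbs], [math.log(y) for y in pop])
```

Because a power law is linear in log-log space, `r_log` is essentially 1 here, while `r_raw` is lower. Real county data contain noise, so the improvement is smaller, as in the reported change from 0.82 to 0.90.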

Fig. 3 Kernel density plots of county-level statistical population against Tencent location number (a) and of their logarithms (b).

Considering the spatial correlation of population density, we constructed a logarithmic geographically weighted regression (GWR) model. The R2 of the GWR model is 0.91 (p < 0.05), which is higher than that of ordinary least squares (OLS; R2 = 0.81), and the residual sum of squares (RSS) of GWR (RSS = 201.78) is 224.50 lower than that of OLS (RSS = 426.28). A model with locally varying parameters can better capture the geographically heterogeneous relationship between population distribution and Tencent positioning data. Therefore, a geographically weighted regression with local parameters can more accurately portray the smooth local variation in population.

The log-linear GWR model, Eq. (2), is fitted to the demographic and Tencent data from 2851 county-level units in China. It expresses the relationship between the total number of Tencent positioning counts at the county level and the permanent population at the end of the year:

$$\ln \mathit{County}_{i}=a_{i}\times \ln \mathit{Tencent}_{i}+b_{i}+\varepsilon_{i}.$$
(2)

where \(\mathit{Tencent}_{i}\) is the total number of daily positioning visits of Tencent big data in the i-th county-level region, and \(\mathit{County}_{i}\) is the permanent population at the end of 2018 in the i-th county. \(a_{i}\) is the local coefficient describing the superlinear relationship between the number of residents at the end of the year and the total number of daily positioning visits of Tencent big data, and \(b_{i}\) is the local scale ratio (the intercept). \(\varepsilon_{i}\) is the residual, with \({\varepsilon }_{i} \sim N(0,{\sigma }^{2}),\;Cov({\varepsilon }_{i},{\varepsilon }_{j})=0\;(i\ne j)\). We assume that the grid cells in each county share the same parameters. In 1745 counties, accounting for 61.2%, the error between the estimated value and the actual value lies between −0.1 and 0.3, while only 12.3% of the counties in the central area have high residuals (larger than 0.3 or less than −0.6) (Fig. 4a). The local R2 is larger than 0.6 in 2674 counties, accounting for 93.8% of all counties, which demonstrates that the GWR model fits well locally.
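The local fitting in Eq. (2) can be sketched as a weighted least-squares regression repeated at every county location. This is a minimal illustration, not the software used in this paper: the Gaussian kernel, the fixed bandwidth, the planar coordinates, and the function name `gwr_fit` are all assumptions, since the kernel and bandwidth selection are not specified here.

```python
import math

def gwr_fit(coords, x, y, bandwidth):
    """Fit y_i = a_i * x_i + b_i at each location with Gaussian distance weights.

    coords: list of (u, v) planar coordinates, one per county (assumed units).
    x, y: ln(Tencent) and ln(County) values per county.
    Returns a list of (a_i, b_i), one pair per location.
    """
    params = []
    for u0, v0 in coords:
        # Gaussian kernel: nearby counties get weights close to 1
        w = [math.exp(-0.5 * ((u - u0) ** 2 + (v - v0) ** 2) / bandwidth ** 2)
             for u, v in coords]
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swy = sum(wi * yi for wi, yi in zip(w, y))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
        # Closed-form weighted least squares for one predictor
        a = (sw * swxy - swx * swy) / (sw * swxx - swx ** 2)
        b = (swy - a * swx) / sw
        params.append((a, b))
    return params

# Synthetic example: two distant regions with different relationships
coords = [(0, 0), (0, 1), (1, 0), (100, 0), (100, 1), (101, 0)]
ln_tencent = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
ln_county = [2.8, 3.6, 4.4, 2.0, 3.0, 4.0]  # west: 0.8x + 2, east: 1.0x + 1
params = gwr_fit(coords, ln_tencent, ln_county, bandwidth=5.0)
```

In this synthetic case the two regions are far apart relative to the bandwidth, so each local fit recovers the parameters of its own region, which a single global OLS fit could not do.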

Fig. 4 Geographically weighted regression fit residuals in counties (a) and local R² (b).

Population mapping

We used the fitted GWR model to estimate grid values by substituting the Tencent positioning data at 0.01° resolution into Eq. (2). The population of each county is then redistributed among its grids using the grid estimates as weights, as shown in Eq. (3):

$$\mathit{pop2018}_{ij}=\frac{\mathit{County}_{i}}{\sum_{j=1}^{n}\mathit{weight}_{ij}}\times \mathit{weight}_{ij}$$
(3)

where \(\mathit{pop2018}_{ij}\) is the final population of the j-th grid in the i-th county, \(\mathit{weight}_{ij}\) is the value estimated by the GWR model for the j-th grid in the i-th county, \(\mathit{County}_{i}\) is the statistical population of the i-th county, and n is the total number of 0.01° grids in the i-th county. This yields the fine-scale population spatial dataset POP2018.
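The redistribution in Eq. (3) can be sketched as follows. Two points are assumptions made for this illustration rather than details stated in the text: the weight is taken as the back-transformed estimate exp(a ln Tencent + b), and grids with a zero count receive zero weight. The function name `redistribute` and the example values are hypothetical.

```python
import math

def redistribute(county_population, grid_tencent, a, b):
    """Allocate a county total to its grids in proportion to model estimates.

    grid_tencent: average Tencent counts of the grids in one county.
    a, b: local GWR parameters of that county (shared by all its grids).
    """
    # Assumed weight: back-transformed estimate; zero counts get zero weight
    weights = [math.exp(a * math.log(t) + b) if t > 0 else 0.0
               for t in grid_tencent]
    total = sum(weights)
    if total == 0:
        raise ValueError("county has no positive weights")
    return [county_population * w / total for w in weights]

# Hypothetical county of 1000 people with four grids
cells = redistribute(1000.0, [10.0, 30.0, 0.0, 60.0], a=1.0, b=2.0)
```

Under this assumption the intercept b cancels in the normalization, so within a county the allocation depends only on the slope a, and the grid values always sum to the county total.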

Accuracy assessment

We compared our result with the WorldPop59 and LandScan60 datasets, both of which have a resolution of 30 arc-seconds. The unconstrained gridded population data of WorldPop are used in this research. LandScan uses sub-national census counts provided by the International Program Center, Bureau of Census, whereas WorldPop uses county totals based on China’s 2010 county population census data. Both WorldPop and POP2018 adopt the “top-down” approach to population spatialization. Comparing POP2018 with the widely recognized WorldPop data helps to show how the spatial distributions of the ambient and residential populations differ. We collected the permanent population of 33 towns from the Dongguan Bureau of Statistics, defined as the population that has lived in each town for more than half a year, and used these town-level figures as validation data to test the accuracy of POP2018. Following Ye et al.34, we compared the mean square error and goodness of fit. We also selected cities with populations of less than 5 million (Huangshi), 5 to 10 million (Xi’an), and more than 10 million (Shanghai), and compared how the three datasets depict the details of the population distribution in these cities.
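The town-level comparison can be sketched with three simple metrics: the mean square error, the slope of a linear fit, and R2. The regression direction (estimated population regressed on actual population), the function name, and the example values are assumptions for illustration; they are not the Dongguan data.

```python
def validation_metrics(actual, estimated):
    """Return (MSE, slope, R2) for estimated versus actual town populations.

    The slope and R2 come from an ordinary least-squares line of the
    estimated values on the actual values (assumed direction).
    """
    n = len(actual)
    mse = sum((e - a) ** 2 for a, e in zip(actual, estimated)) / n
    mean_a = sum(actual) / n
    mean_e = sum(estimated) / n
    sxy = sum((a - mean_a) * (e - mean_e) for a, e in zip(actual, estimated))
    sxx = sum((a - mean_a) ** 2 for a in actual)
    syy = sum((e - mean_e) ** 2 for e in estimated)
    slope = sxy / sxx
    r2 = sxy ** 2 / (sxx * syy)
    return mse, slope, r2

# Hypothetical town populations (arbitrary units)
actual = [10.0, 20.0, 30.0, 40.0]
estimated = [12.0, 18.0, 33.0, 37.0]
mse, slope, r2 = validation_metrics(actual, estimated)
```

A slope close to 1 together with a high R2 and a low MSE indicates that the gridded product reproduces the town totals without systematic over- or under-allocation.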

Data Records

Table 1 lists the data used in this article. The 0.01° gridded population data of mainland China in 2018 can be accessed freely at the figshare repository44 (https://doi.org/10.6084/m9.figshare.20400717.v1). The data collection contains one .rar file, labelled China_POP_0.01deg_2018.rar. It contains two GeoTIFFs and a polygon feature package: the annual average Tencent LBS data for 2018, the 0.01° gridded population data of mainland China in 2018, and the county-level boundary map joined with the 2018 statistical population. All data were mapped using the Albers equal-area projection. The original LBS data are stored as JSON text files. Because of their fine temporal resolution (5 minutes), the original data are too large to upload and can be requested from the corresponding authors.

Table 1 Categories of data used to fit the model and evaluate the accuracy of the new population density map.

The county-level population sample survey and the Tencent positioning big data were used to create a high-resolution gridded population distribution dataset for China (2018). The grid value in Fig. 5 reflects the individuals who have been physically distributed in the grid for more than half a year, and the unit is persons. The dataset, provided in GeoTIFF format with the spatial reference GCS WGS84, closely portrays the geographical distribution pattern of people in China in 2018. The population shows a clustered distribution pattern with multiple population hotspots (red dots). Larger hotspots are located in urban agglomerations with a high level of modernization and urbanization, such as the Yangtze River Delta and the Pearl River Delta, as well as in large cities such as Beijing, Tianjin, Chengdu, and Chongqing. The area of a hotspot indicates the population scale, revealing the hierarchical distribution of the population among towns. The North China Plain, the Sichuan Basin, and the middle and lower reaches of the Yangtze River all contain relatively dense small and medium-sized hotspots, indicating a relatively dense urban spatial system in plain areas. In the suburbs and between cities, the population is mainly distributed along traffic lines and forms a network pattern, reflecting the fact that human activity along the traffic network is more intense than on farmland in the outer suburbs. POP2018 not only reflects the dispersed population distribution caused by mountainous and hilly terrain in southeastern China, but also outlines the main populated areas in northwestern China, which are confined by large areas of plateau, desert, and high mountains, such as the Gansu Corridor, the southern piedmont of the Tianshan Mountains, the oases around the Taklimakan Desert, and the southern Tibet Valley.

Fig. 5 Spatial population data at 0.01° resolution for mainland China in 2018 (POP2018 dataset).

We zoom in on cities with a population greater than 5 million; more details can be observed in Fig. 6. The population is mostly concentrated in the central region of each city, indicating a development pattern that extends from the core to the periphery. Large cities in central and western China with a large agricultural population, such as Chengdu, Chongqing, and Changsha, have high-density central urban areas surrounded by more extensive fringe areas in a transitional stage of population density. By contrast, some large cities in the eastern coastal areas, such as Beijing and Shanghai, have a high-density population core in stark contrast with sparsely populated suburbs, which might bring problems such as congestion in the central city and high housing pressure. In cities in northeastern China, such as Harbin, Shenyang, and Dalian, the population is concentrated in the city center, and connections with surrounding towns are less apparent. The population distribution map provides a more accurate basis for understanding the current state of urban development and for urban system planning.

Fig. 6 Estimated population spatial distribution in cities with a population of more than 5 million.

Technical Validation

POP2018 has the smallest error and the highest accuracy when the population allocated to each town is compared with the statistical permanent population (Fig. 7). The estimation errors of LandScan and WorldPop are small in sparsely populated towns but increase with population (Fig. 7b,c). The way POP2018 allocates population to agglomerated urban centers shows the advantage of positioning big data as auxiliary data. The slope between the population allocated to each town by POP2018 and the actual permanent population of each town is 0.97, which is close to 1, while WorldPop and LandScan have slopes of 1.14 and 0.79. Based on the slope, WorldPop underestimated the population of most towns, which would cause it to over-allocate the population to one or two towns located in the center of the city. Conversely, LandScan overestimated the population of most towns. Both POP2018 and WorldPop fit the actual permanent population well (\({{\rm{R}}}_{POP2018}^{2}=0.91\), \({{\rm{R}}}_{WorldPop}^{2}=0.89\)). The mean square error (MSE) of POP2018 is the smallest at 22.48, indicating the smallest deviation between the estimated and actual values. The advantage of POP2018 is most evident in towns with larger populations, where its errors are smaller than those of the other two datasets. For towns with a statistical population of less than 400,000, the estimation errors of both POP2018 and WorldPop are small.

Fig. 7 Scatter plots of the population estimated by POP2018 (a), LandScan (b), and WorldPop (c) against the Dongguan town-level statistical population.

We compare how the three datasets characterize cities of different population sizes (Fig. 8). The centers of the high-population clusters estimated by the three datasets are roughly the same, and all three are sensitive and consistent in identifying urban areas with high population density. The orange and red areas are smaller than the blue areas but have populations an order of magnitude higher, reflecting the large difference in population density between urban and rural areas. The datasets differ in how they distribute the urban population: WorldPop assigns the highest population to the central urban area, and its number of red grids in Xi’an and Shanghai is significantly larger than in the other datasets. In comparison, LandScan underestimates the urban population, since its number of grids with a value between 20,000 and 40,000 is considerably smaller than in the other datasets. POP2018 lies between WorldPop and LandScan in the city center, assigning a moderate number of grids with unusually high population.

Fig. 8 Population distribution of POP2018 (a), WorldPop (b) and LandScan (c) in the three cities of Huangshi, Xi’an and Shanghai.

For sparsely populated outer suburbs and rural areas, POP2018 and LandScan identify population settlements in similar ways, whereas for buffer zones with intermediate population values, the distribution characteristics of POP2018 and WorldPop are similar. In rural areas, grids with a population of 1,000 to 10,000 appear as scattered settlements in both POP2018 and LandScan (Fig. 8a,c), while this discrete spatial pattern is less evident in WorldPop. Areas with a population of 1,000 to 10,000 around the city center can be regarded as suburban areas, which are the transition zones from towns to villages. Both POP2018 and WorldPop show a ring structure that spreads outward, with similar boundaries (Fig. 8a,b). POP2018 thus combines the characteristics of WorldPop and LandScan in depicting the spatial distribution of the population: it captures the scattered distribution of rural settlements and is also consistent with WorldPop’s boundary of high-density population in cities.

Usage Notes

This paper provides a dataset and a production method for the ambient population, which is defined as the time-averaged population, taking into account activities such as work, shopping, eating, and traveling61, and which can reflect the characteristics of population distribution better than residence-based population data. For example, central business districts have a higher concentration of human activity than residential neighborhoods, despite being less inhabited. The production and application of ambient population data is a future direction for research on the spatiotemporal distribution of population.

POP2018 can be used in overlay analysis with natural environment data such as land use, vegetation indices, night lights, and digital elevation models (DEM), facilitating interdisciplinary studies of nature and the humanities. In such applications, collinearity between the population data and other spatial data is also effectively avoided, because the weights of POP2018 are calculated from Tencent’s user location data, independently of other environmental data.

WorldPop and LandScan are highly influential reference datasets that provide global gridded population data and fill gaps in spatial population information for countries or regions with missing statistical data. The comparison with WorldPop and LandScan shows that the data and methods provided in this paper are more accurate and precise in estimating the population distribution in China, especially below the county level, illustrating the advantages that local scholars and institutions have in spatializing the local population distribution. Local scholars and researchers have a better understanding of their national conditions and can obtain more suitable methods and more reliable input data, which helps to guarantee the quality of data products at the source and shortens the production cycle of a global population distribution map. We believe it makes sense to establish a data platform composed of local data produced by local research institutions.

The provided data fill the gap in fine-scale population distribution data between census years. Using sample survey statistics yields smaller errors than population estimates based on growth rates and is more in line with the actual situation in China, which makes it possible to update population data annually. Because the fitting performance of the GWR model varies spatially and our data validation has not been completed nationwide, we recommend that users assess how accuracy varies across regions when using the data. The merging of demographic data and big data is an active area of investigation in population spatialization. Unlike the commonly used indirect covariates, big data are created directly by people, so they represent the actual population distribution more precisely and open up new possibilities for fine-scale population spatialization48.