## Background & Summary

Since the First Industrial Revolution in Britain in the 1760s, manufacturing has been one of the major drivers of the world’s economic development. The center of gravity of the global manufacturing industry has shifted many times due to changes in the world’s economic layout1. In the early 1980s, labor-intensive, low-technology, and high-energy industries became less prevalent in the United States, the United Kingdom, Japan, and other developed countries, as well as the “Four Tigers” of East Asia (South Korea, Singapore, Hong Kong, Taiwan) and other emerging industrial countries. China, the largest developing country in the world, began to play a key role in the international division of labor and the global manufacturing value chain2. Over the next 40 years, the implementation of the reform and opening-up policy promoted the rapid growth of the manufacturing industry, industrial transformation and upgrading, and the globalization of the economy3. The added value of China’s manufacturing sector has reached 26.59 trillion yuan, and the sector has ranked first worldwide for 12 consecutive years, accounting for nearly 30% of the global manufacturing sector and generating more than 100 million jobs.

Regional manufacturing expansion is characterized by both temporal and spatial changes. As the quality and quantity of the manufacturing industry grow over time4, the manufacturing industry eventually becomes concentrated in a few regions5. This industry is the driving force behind migration, industrial upgrading, segregation, and many other social phenomena6,7. China’s coastal areas are densely populated and thus provide an ideal region for the agglomeration of the manufacturing industry; these regions have made major contributions to China’s economic growth8,9,10. However, owing to the rapid increase in land, labor, and other costs, coupled with the implementation of regional development strategies aimed at reducing regional differences, such as the development of the western region and the rise of the central region, the manufacturing industry, especially labor-intensive industries, has begun to move to the central and western regions on a large scale11,12.

Few empirical studies of Chinese manufacturing firms have been conducted at the firm level, as detailed geographic data on various industrial firms are often not readily available (Table 1). Most research on the distribution of industry has been conducted at the macro provincial and municipal levels13,14; changes in the spatial pattern of the manufacturing industry often need to be analyzed using more micro-scale data. However, obtaining official micro-scale data is often difficult either because access to some data is prohibited by certain policies or because the data have never been collected and processed. Table 1 shows a comparison of existing data collection methods for studying industrial patterns. As these data collection methods often lack classification systems or accurate geographic data, they have various limitations, which results in poor quality data that are lacking in temporal and spatial resolution. The most accessible databases are from the Industrial Enterprises Database (IED)6,15; however, these data have only been taken at large scales and at a firm’s registered address, which can often differ from the address at which the operations of the firm are carried out. For example, records of small-scale branch factories are lacking6. More detailed quantitative data would greatly improve the robustness of industry-related research.

The distribution of classified manufacturing firms could be mapped by combining the strengths of the manufacturing classification of the IED with the strengths of map POI data. The IED is the most promising source of data for mapping the distribution of classified manufacturing firms, as it provides accurate firm names and manufacturing classifications (Table 1). However, these data are only sample data and lack latitude and longitude coordinates for the firms. The statistical objects are large and medium-sized manufacturing firms in China with an annual turnover of more than 20 million yuan, which means that the original IED does not include manufacturing firms with an annual turnover of less than 20 million yuan16,17. Map POI data have precise spatial information and have been widely used in various fields18. One of the greatest challenges in manufacturing research is the absence of a clear classification system for manufacturing firms, yet a robust manufacturing classification system is essential if map POI data are used to study the distribution of manufacturing firms. Manually classifying such data, however, is time-consuming and labor-intensive14,19,20,21,22,23. Machine learning, coupled with small-sample classification data such as the “Firm name – Manufacturing classification” fields in the IED, can be used to identify manufacturing types in the map POI data and develop base maps for manufacturing research.

## Methods

The number of POIs is nearly equal to the actual number of facilities in cities18. Therefore, large-scale manufacturing patterns can be studied at high spatial resolution using POI classification. In this paper, gridded manufacturing datasets (GMDs) for seven types of Chinese manufacturing were generated by using IED data as accurate training samples to classify map POIs through machine learning. The technical process includes four steps:

• Building a learning sample library for manufacturing types from the IED.

• Collecting and pre-processing map POI data24.

• Classifying manufacturing based on machine learning.

• Drawing the high spatial resolution grid map of China’s manufacturing industry in 2015 and 2019 (Fig. 1).

After the data were produced, their quality was tested.

### Construction of the “name-manufacturing type” machine learning sample database using IED

The name-manufacturing type learning sample database was constructed using the IED. According to the naming method of Chinese firms and Fa Li’s (2018) approach, there is a link between the name of a facility and the type of manufacturing industry to which it belongs15. Thus, a name-manufacturing sample database or keyword dataset can be constructed for machine learning to study the manufacturing classification. While reviewing the relevant literature, we found that the IED can be used as the learning sample database. China’s IED, established by the National Bureau of Statistics, covers all state-owned industrial firms and non-state-owned secondary sector firms above a designated size, with manufacturing accounting for more than 90% of the statistics. The IED data sources are official, but they only sample large and medium-sized manufacturing firms in China with an annual turnover of more than 20 million yuan and lack geographic coordinates and branch factory addresses. Data from the IED for China in 2013 include some large firms (ca. 344,875 firms); these have been strictly classified and marked, and their industry names are in strict accordance with the national economic industry classification standards in GB/T 4754-2017. Therefore, supervised machine learning trained on the two IED fields (firm name and manufacturing type) can achieve better and less subjective results than manual labeling. To reduce the redundancy of classification types, industries with similar types were merged into larger types following the method of Shen (2021)6,18. First, we manually labeled 27,689 samples from non-manufacturing sectors (e.g., service industry, agriculture, and supply industry) as the non-manufacturing class of the classification sample.
Next, the names of leading industry types in the development zones in various regions in the 2018 edition of “China Development Zone Review Announcement Catalogue”25 were summarized, and the names of different industry types with high frequency were extracted; industry types with similar names were merged into a larger category. The names of the manufacturing industry categories corresponding to the firm names in the IED were then summarized. Finally, the corresponding manufacturing industry was divided into seven categories by combining the development zone and the IED summary classification, which provided the learning sample database classification standard. In total, the “Name-Manufacturing Type” machine learning sample database contains 372,564 firm names and their corresponding classifications. The study samples are provided in the Figshare repository (https://doi.org/10.6084/m9.figshare.19808407)26.

### Collection and preprocessing of map POI data

Nearly every company, enterprise, firm, site, or facility in China (even those not registered with the Industry and Commerce Department) has a specific location on government-approved web maps. Our data were obtained from the Amap website (https://www.amap.com/) through web scraping. Because factories or firms that have gone bankrupt or changed addresses no longer appear on the map, whereas remote branch factories do, these data have higher precision than questionnaire and statistical data. We collected the POI names and locations of firms across China in batches, using web crawlers and the services provided by Amap’s API, from January 2nd to 20th, 2015, and from December 20th to 30th, 2019 (refs. 27,28). To reduce the potential interference from the original data, preprocessing was carried out as follows:

• “Scenery,” “shopping,” “catering,” “road name,” and other types of built-in classification from the map unrelated to firms were filtered out, and only the firm and its corresponding latitude and longitude were retained to increase the classification speed.

• By identifying and matching word items, points with the same place name but different marks within 1 km around the POI points (e.g., East Entrance, West Gate) were deleted to solve the problem associated with counting the same POI.

The final map POI dataset contained firm names and spatial information (approximately 5.24 million in 2015 and 7.35 million in 2019).
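The two preprocessing rules above can be sketched as follows. This is a minimal illustration and not the authors' code: the record layout (`name`, `type`, `lon`, `lat`), the category labels, and the list of entrance markers are assumptions made for the example, and the sample records are invented.

```python
import math
import re

# Map categories treated as unrelated to firms (labels are hypothetical).
NON_FIRM_TYPES = {"scenery", "shopping", "catering", "road name"}

# Trailing entrance/gate markers ignored when comparing names (hypothetical list).
MARKER = re.compile(
    r"\s*[(（]?\s*(?:East Entrance|West Gate|东门|西门|南门|北门)\s*[)）]?\s*$"
)


def base_name(name):
    """Strip a trailing entrance/gate marker so variants share one name."""
    return MARKER.sub("", name).strip()


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def preprocess(pois, radius_km=1.0):
    """Drop non-firm categories, then drop same-name points within radius_km."""
    kept = []
    for poi in pois:
        if poi["type"] in NON_FIRM_TYPES:
            continue
        name = base_name(poi["name"])
        duplicate = any(
            k["name"] == name
            and haversine_km(poi["lon"], poi["lat"], k["lon"], k["lat"]) <= radius_km
            for k in kept
        )
        if not duplicate:
            kept.append({**poi, "name": name})
    return kept


# Invented sample records for illustration only.
pois = [
    {"name": "Hongda Steel Works", "type": "company", "lon": 116.400, "lat": 39.900},
    {"name": "Hongda Steel Works (East Entrance)", "type": "company", "lon": 116.403, "lat": 39.901},
    {"name": "Hongda Steel Works", "type": "company", "lon": 121.470, "lat": 31.230},
    {"name": "Lakeside Park", "type": "scenery", "lon": 116.410, "lat": 39.910},
]
cleaned = preprocess(pois)
```

The pairwise comparison is quadratic in the number of points, so a dataset with millions of POIs would need a spatial index or grid bucketing; that optimization is omitted here for brevity.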

### Map POI name-manufacturing type classification algorithm based on the Naive Bayes algorithm for machine learning

To enable the computer to learn and classify the manufacturing industry samples, Chinese word segmentation was performed on the classified samples and POI names. To transform text into a data structure that a computer can process, the text needs to be sliced into semantic units. In the first step of our machine learning, we used the jieba module (a Chinese word segmentation module for Python) to segment the names of Chinese firms29,30. The Chinese Thesaurus was used to perform forward maximum matching on the POI name field, which was segmented into several words for keyword recognition, manual tagging, or machine learning training. Next, fields that carry no information for classification and special symbols that would affect the classification results (such as punctuation, spaces, and BOM characters) were removed, and only Chinese and English characters were retained.

After word segmentation was complete, a machine learning classifier was built. The Naive Bayes algorithm was used to learn the name-classification samples of the IED after word segmentation. As proposed by F. Sebastiani, text classification can be understood as the acquisition of a function, where S = {s1, s2, …, sn} is the set of documents or strings to be classified, and C = {c1, c2, …, cm} is the set of categories in the predefined classification system. The goal of text classification is to find an evaluable mapping f: S → C.

The classification space is an m-dimensional Euclidean space, and the value of each dimension is in [0,1]; this represents the probability of the input s in each dimension after f mapping, which can be determined by calculating the probability distribution of the category to which s belongs. The formal mathematical definition of text classification is shown in Eq. 1:

$${C}_{ij}=\left\{\begin{array}{l}1\;String\;{S}_{i}\;{\rm{belongs}}\;{\rm{to}}\;{\rm{category}}\;j\\ 0\;String\;{S}_{i}\;{\rm{does}}\;{\rm{not}}\;{\rm{belong}}\;{\rm{to}}\;{\rm{category}}\;j\end{array}\right.$$
(1)

The Naive Bayes classifier was used in this study. First, the probability of each manufacturing class was learned from the segmented firm names S in the IED.

According to Bayes’ theorem, the probability that an input s belongs to ci is:

$$p({c}_{i}| s)=\frac{p(s| {c}_{i})p({c}_{i})}{p(s)}$$
(2)

Equation (2) gives the posterior probability used by the Bayesian classifier. Maximizing it over ci gives the class to which the input s belongs:

$$\bar{c}=\arg\max_{i}\;p({c}_{i}| s)$$
(3)

That is, $$\bar{c}$$ is the category that makes the conditional probability p(ci|s) take the maximum value among all categories C = {c1,c2… cm}.

In Eq. (2), p(ci) is the prior probability of category ci, and p(s|ci) is the likelihood of s given that category. Because p(s) is the same for every category, maximizing the posterior only requires p(s|ci)p(ci), so the likelihood p(s|ci) needs to be calculated for each category.

If we denote the segmented words of s as s = {w1, w2, …, wk} and assume that the occurrences of the words in s are independent of each other given the category, then:

$$p\left(s| {c}_{i}\right)=p\left({w}_{1}| {c}_{i}\right)p\left({w}_{2}| {c}_{i}\right)\ldots p\left({w}_{k}| {c}_{i}\right)={\prod }_{j=1}^{k}p\left({w}_{j}| {c}_{i}\right)$$
(4)

p(wj|ci) represents the probability that the word wj appears in the category ci, and it can be estimated by the following equation:

$$\widehat{P}({w}_{j}| {c}_{i})=\frac{count({w}_{j},{c}_{i})}{{\sum }_{w\in V}count(w,{c}_{i})}$$
(5)

where count(wj, ci) is the number of occurrences of wj in the training names of category ci, and V is the vocabulary of all words in the training samples.

From Eqs. (2–5), the category to which s belongs can be determined.
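A compact sketch of Eqs. (2) to (5) in pure Python is given below. It is an illustration under stated assumptions, not the deposited model code: the labels and training names are invented, probabilities are summed in log space to avoid underflow, and add-one (Laplace) smoothing is added so that a word unseen in a category does not force the product in Eq. (4) to zero. The smoothing term is an assumption of this sketch and is not part of Eq. (5).

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """Count class frequencies and word frequencies from (tokens, label) pairs."""
    class_docs = Counter()             # number of names per category, for p(c_i)
    word_counts = defaultdict(Counter)  # count(w, c_i)
    vocab = set()
    for tokens, label in samples:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def classify(tokens, model, alpha=1.0):
    """Return the category maximizing log p(c_i) + sum_j log p(w_j | c_i)."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n in class_docs.items():
        total = sum(word_counts[label].values())
        score = math.log(n / n_docs)  # log prior
        for w in tokens:
            # Eq. (5) with add-alpha smoothing (assumption of this sketch).
            p = (word_counts[label][w] + alpha) / (total + alpha * len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented, already segmented training names with hypothetical labels.
samples = [
    (["宏达", "钢铁", "有限公司"], "metal"),
    (["华新", "钢铁", "制造"], "metal"),
    (["锦绣", "纺织", "有限公司"], "textile"),
    (["东方", "纺织", "厂"], "textile"),
    (["阳光", "餐饮", "管理"], "non-manufacturing"),
]
model = train(samples)
print(classify(["永兴", "钢铁", "厂"], model))  # metal
```

The denominator p(s) in Eq. (2) is omitted because it is identical for all categories and does not change the argmax.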

We implemented the above word segmentation, learning, and classification process in Python. Using the Naive Bayes text classification algorithm, the IED sample database described above served as the training sample, and the preprocessed map POI data served as the data to be classified; the trained model was applied to classify the names of manufacturing firms. The classification results for POI firm names were then imported into the geographic information database together with the original geographic coordinates of the POIs. The model code is provided in the Figshare repository (https://doi.org/10.6084/m9.figshare.19808407)26.

### Patterns in China’s manufacturing industry in 2015 and 2019

Scale can substantially affect the results of analyses of spatial economic patterns. If an analysis is conducted at an excessively large scale, small-scale patterns can be overlooked; by contrast, if the scale of the analysis is too small, general patterns are often not detectable18. To fit commonly used LandScan population data (https://landscan.ornl.gov/) and GDP grid data, we divided China into a grid of 0.01° latitude by 0.01° longitude. After classification, the distribution data for the seven types of manufacturing were projected onto the grid, and the number of points of each manufacturing category in each grid cell was counted. A greater number of points in a cell indicates a greater number of firms. After data processing, there were 4.56 million (2015) and 6.19 million (2019) firm points.
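Counting classified points per 0.01° cell can be sketched as below. The sketch assumes that cell edges fall on exact multiples of 0.01° and does not show alignment with the actual LandScan grid; the points and category labels are invented.

```python
from collections import Counter


def cell_index(lon, lat):
    """Return the integer (column, row) of the 0.01 degree cell containing a point.

    Coordinates are first rounded to whole microdegrees so that
    floating-point noise cannot push a point across a cell edge.
    """
    col = round(lon * 1_000_000) // 10_000
    row = round(lat * 1_000_000) // 10_000
    return col, row


def grid_counts(points):
    """Count points per (cell, category); points are (lon, lat, category)."""
    counts = Counter()
    for lon, lat, category in points:
        counts[(cell_index(lon, lat), category)] += 1
    return counts


# Invented classified points for illustration.
points = [
    (116.4049, 39.9031, "metal"),
    (116.4001, 39.9099, "metal"),
    (116.4151, 39.9031, "textile"),
]
counts = grid_counts(points)
print(counts[((11640, 3990), "metal")])  # 2
```

Integer floor division also handles negative coordinates correctly, which a plain `int()` truncation would not.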

For this dataset, we first constructed a list of cities with the full names and abbreviations of prefecture-level and above cities in China (including municipalities directly under the Central Government, prefecture-level cities, regions, leagues, autonomous prefectures, Hong Kong, Macao, and Taiwan). More information on the administrative divisions in China is provided in ref. 31. In the attribute table, each grid cell corresponds to a record whose fields indicate the province and city to which the cell belongs and the quantity of each of the seven types of manufacturing industries in the cell in 2015 and 2019. All coordinates are based on the WGS84 geographic coordinate system, and the grid was divided according to LandScan population data (0.01° latitude by 0.01° longitude).

We found that the distribution of grid values in 2015 and 2019 fit the probability density function well (Fig. 2), which is consistent with the power-law distribution observed for most of the socio-economic characteristics of the large-scale data32,33. The geographical changes in China’s manufacturing industry can also be observed in the map (Fig. 3a–c), and most economic activities were concentrated in the eastern coastal areas, especially in the Beijing-Tianjin-Hebei urban agglomeration (BTH), the Yangtze River Delta (YRD), the Pearl River Delta (PRD), and the Chengdu-Chongqing City Group (CC)11. Unlike the yearbook data, the changes in manufacturing firm types and their spatial distribution among these four urban agglomerations and inner cities can be clearly observed (Fig. 3 and Fig. 4).

## Data Records

Our data have been deposited in the Figshare repository (https://doi.org/10.6084/m9.figshare.19808407)26. The database contains an SHP file (data 2015–2019.shp) with field names as shown in Table 3. Each line of the file represents a grid cell record.

For ease of use, we also included county, city, and province information for each set of cell coordinates. A summary of the basic statistics of the manufacturing industry at the provincial level is shown in Table 4. The number of manufacturing firms is highest in Guangdong, Jiangsu, Shandong, Zhejiang, and Shanghai; thus, these regions have the most developed manufacturing industries.

## Technical Validation

As quantitative data of the manufacturing industry have not yet been classified, we determined whether the data and their classification were accurate to evaluate the reliability of the data. Technical verification of the data was carried out using three approaches: classification data accuracy verification, grid data verification, and social and economic data verification. Because the official manufacturing distribution data have not been published, and the above data do not represent the actual distribution of manufacturing firms, we provide the classification accuracy for reference after each verification.

### Classification data verification

The purpose of classification data verification is to verify the classification accuracy of the classifier. The results can be reproduced by running the accompanying model code. The precision index of the training samples was AF = 0.77, MC = 0.92, ME = 0.96, OM = 0.90, PM = 0.82, PP = 0.91, TC = 1.00, and WF = 0.96; the precision of the non-manufacturing class was 0.93. We manually checked 73,500 (≈1% of the total) firm names in 2019, and the total accuracy was 92%. The precision index for each class was AF = 0.86, MC = 0.94, ME = 0.93, OM = 1.00, PM = 0.98, PP = 0.86, TC = 0.97, and WF = 0.99; the precision index for the non-manufacturing class was 0.95.
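For readers who want to repeat this kind of check on their own sample, per-class precision and total accuracy can be computed from pairs of predicted and manually checked labels as sketched below. The pairs are invented and only borrow two of the class abbreviations for illustration; the function names are hypothetical.

```python
from collections import Counter


def precision_by_class(pairs):
    """Precision per class: correct predictions / all predictions of that class."""
    predicted, correct = Counter(), Counter()
    for pred, truth in pairs:
        predicted[pred] += 1
        if pred == truth:
            correct[pred] += 1
    return {label: correct[label] / predicted[label] for label in predicted}


def total_accuracy(pairs):
    """Share of all checked names whose predicted label matches the manual label."""
    return sum(pred == truth for pred, truth in pairs) / len(pairs)


# Invented (predicted, manually checked) label pairs.
pairs = [
    ("ME", "ME"), ("ME", "ME"), ("ME", "TC"),
    ("TC", "TC"),
    ("non", "non"), ("non", "ME"),
]
precision = precision_by_class(pairs)
accuracy = total_accuracy(pairs)
```

Precision is computed over predicted labels, so a class that the classifier rarely predicts can show a high precision even if many of its true members are missed.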

### Verification with the published gridded data

The purpose of the validation with the published gridded establishment dataset (GED) was to determine whether the general patterns in the distribution of the manufacturing firms were accurate. To verify their accuracy, the GED, the only gridded data measuring economic activities in mainland China, was obtained from ref. 6. We matched our grid data with secondary industry data from the 2015 GED and then ran a simple regression to estimate the correlation between the two datasets. The R2 was 0.620, indicating that the two datasets were consistent (Fig. 5). Some deviation might be explained by the fact that the GED comprises data on all the secondary industries in mainland China, whereas our data cover only the manufacturing industries across China.
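The simple regression used in this and the following validations can be sketched with ordinary least squares on two matched lists of values. The function name is hypothetical and the numbers are invented; they are not values from either dataset.

```python
def linear_fit_r2(x, y):
    """Fit y = slope * x + intercept by least squares and return R2 as well."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a one-predictor linear fit, R2 equals the squared Pearson correlation.
    r2 = sxy ** 2 / (sxx * syy)
    return slope, intercept, r2


# Invented matched cell values from two hypothetical grids.
ours = [1, 2, 3, 4]
reference = [1, 3, 2, 4]
slope, intercept, r2 = linear_fit_r2(ours, reference)
print(round(r2, 2))  # 0.64
```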

### Verification with social and economic data

The purpose of the validation with social and economic data was to determine (1) whether the number of firms in the GMD was consistent with industry registration data, (2) whether the data describing macroeconomic industries in the yearbook were relevant to the GMD, and (3) whether the relative proportions of the different manufacturing firms were similar to samples from the IED.

Although the above data do not represent the actual distribution of different manufacturing industries, we provided the classification accuracy for reference after each verification.

1) Industry registration data validation

To verify whether the number of our manufacturing firms is similar to the firm registration data, we aggregated the number of manufacturing firms registered on the Chinese mainland in 2019 from the Qichacha website (https://www.qcc.com/, screened until 2019.12.30), and we compared these data with our 2019 GMD industrial enterprise database. The results of the fitted model are shown in Fig. 6. In general, the total number of manufacturing firms in each city in our data was highly consistent with the number of registered firms in Qichacha (R2 of 0.935). The possible reasons for the error include 1) the actual address of the firm is not at the place of registration, and 2) some firms only have registration information but no actual factory address.

2) Yearbook data verification

We summarized the number of manufacturing firms at the city level and compared them with the social and economic indicators of secondary GDP and manufacturing employment at the city level. Secondary GDP was derived from the City Statistical Yearbook, and manufacturing employment was derived from the CEIC (China Entrepreneur Investment Club: China Economic Database). Some cities were excluded due to a lack of statistical data. In Figs. 7 and 8, we show the results of the two models: the linear regression of the total number of manufacturing industries with secondary GDP and manufacturing employment. Overall, our data perform well in estimating these socioeconomic variables, with R2 values exceeding 0.72 in all cases.

3) Manufacturing type classification verification

To verify whether the proportions of different manufacturing types in our classification results were consistent with those in the sampled data from the IED, we aggregated the number of firms into seven different types of manufacturing industries at the city level in 2015 and compared them with the IED. We present the results for the seven models in Fig. 9. In general, the proportions of manufacturing types in each city in our data were highly consistent with those sampled from the IED, with R2 values ranging from 0.64 to 0.82. However, we found that a non-linear correlation might also appear (Fig. 9a,g). This might be explained by the fact that the number of firms sampled in the IED is low in some large cities; alternatively, the manufacturing firms in these cities might have broken into multiple branches since the IED data were collected6.

## Usage Notes

The GMD can be used in geographic information systems such as ArcGIS and QGIS. In GIS software, datasets can be imported as vector layers. To match with other geographic datasets (industrial park boundaries or water, air quality monitoring records34), users can apply spatial join capabilities in GIS software to link attributes from the GMD to other data based on spatial relationships35. Resampling methods can also be used if the resolution of the GMD is inconsistent with other data sources36.

We used 2015 and 2019 to represent the end of China’s 12th and 13th Five-Year Plan periods18. Data from 2020 onward are not representative because of the COVID-19 pandemic; consequently, data up to the end of 2019 were used, and the datasets from these two years are highly representative. If more granular dynamic data are needed, for example updates every one to two years for a specific region (such as a province) or for years after 2019, our classifier can be used to process new data for these specific needs.

In addition, the names of the counties, cities, and provinces provided in this dataset are based on the administrative boundaries in 2019. The administrative division of each grid cell was determined according to the centroid of the cell. If the GMD is matched with other statistics by county, city, and province name, the effect of name changes at various scales needs to be considered. The grid cells of the GMD are the same size and location as those in the LandScan data. As the data are based on WGS84 coordinates, the spherical area of the grid cells varies among regions (the east-west side length of a grid cell is approximately 1.1 km at the equator and approximately 0.85 km at the latitude of Beijing).
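The side lengths quoted above can be checked with a short calculation. The sketch assumes a spherical Earth with a mean radius of 6371.0088 km and takes 39.9° N as the latitude of Beijing; both values are assumptions of this example rather than parameters stated in the dataset.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation (assumption)


def cell_sides_km(lat_deg, cell_deg=0.01):
    """Return (east-west, north-south) side lengths in km of a grid cell."""
    north_south = math.radians(cell_deg) * EARTH_RADIUS_KM
    # Meridians converge toward the poles, so the east-west side shrinks by cos(latitude).
    east_west = north_south * math.cos(math.radians(lat_deg))
    return east_west, north_south


equator_ew, _ = cell_sides_km(0.0)
beijing_ew, beijing_ns = cell_sides_km(39.9)
print(f"{equator_ew:.2f} km at the equator, {beijing_ew:.2f} km at 39.9° N")
```

Only the east-west side shrinks with latitude; the north-south side stays close to 1.1 km everywhere, so cell areas should be computed per cell when densities are compared across regions.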