Background & Summary

Since the 1990s, the global chemical industry has undergone extensive restructuring and relocation1. China has been the world’s largest chemical market since 2011, with sales reaching around $2.04 trillion in 2021, and has contributed half of the growth in the global chemical market over the past two decades2. Over the same period, various factors have prompted chemical enterprises to cluster together3,4, forming Chemical Industry Parks (CIPs) with well-established infrastructure, strict management systems, and clear geographical boundaries5. However, CIPs are also high-risk areas: their large-scale operations and high-density storage of chemicals make them prone to hazards and major accidents6,7. Fires, explosions, or chemical leaks in these areas can trigger catastrophic domino effects8,9,10,11, resulting in significant property damage, casualties, environmental pollution, and ethical concerns12,13. For instance, in November 2005, an accidental explosion at the Jilin Petrochemical Group in China led to nitrobenzene contamination of the Songhua River, polluting the entire basin and triggering an international dispute between China and Russia14. In March 2019, a major explosion in the Ecological CIP in Xiangshui County, Jiangsu Province, resulted in 78 deaths, 76 severe injuries, 640 hospitalizations, and direct economic losses of 1.986 billion yuan15. As a result, the public is highly aware of, and sensitive to, the risks posed by CIPs.

Furthermore, factors such as water availability, convenient transportation, and favourable conditions for discharging emissions have led chemical enterprises to build factories along rivers and coastlines16. As China’s main inland waterway, the Yangtze River offers low shipping costs and convenient transport of raw materials and products, which has made its vicinity one of the world’s chemical industry hubs. Statistics indicate that there are over 400,000 chemical enterprises in the Yangtze River Basin, with chemical output accounting for approximately 46% of China’s total17. These enterprises are concentrated primarily in Jiangsu, Zhejiang, Shanghai, Hubei, Sichuan, and Chongqing, where high population density and numerous risk sources create a challenging environmental situation. In recent years, as China has intensified its efforts to promote green development, protection of the Yangtze River has reached unprecedented levels. Governments at all levels along the river are actively transitioning CIPs, and shoreline protection is increasingly prioritized18. The Yangtze River Protection Law of the People’s Republic of China, promulgated in 2020, aims to strengthen ecological protection and restoration in the Yangtze River Basin. Effective from March 1, 2021, the law explicitly prohibits the construction or expansion of CIPs within a 1-km range along the Yangtze River (https://www.mee.gov.cn/ywgz/fgbz/fl/202012/t20201227_814985.shtml).

However, in reviewing the released data on the distribution of CIPs in China, we could obtain only official statistical information; spatial distribution data that could provide more detailed insights are lacking. Additionally, large-scale field surveys are costly and time-consuming, making it difficult to capture the distribution of CIPs accurately and in a timely manner. Recognizing the importance of the distribution of CIPs along the Yangtze River for protection efforts, we aim to provide a publicly available dataset mapping the CIPs along the river19. This dataset provides a scientific basis for future refined governance of chemical industries along the Yangtze River. It also offers guidance for addressing river pollution by chemical industries, both along the Yangtze River and along other rivers worldwide.

At present, Deep Learning (DL) is increasingly used to classify land features in high-resolution remote sensing imagery20,21. However, training usually requires a large number of labeled samples; otherwise, the model may easily overfit the limited training samples and perform poorly on new, unseen data22. Currently, there is a lack of sample datasets covering the general features of CIPs. Most studies rely on labeled samples of storage tanks for training, which cannot be applied to the entire park area in practice23. Furthermore, research tends to focus on smaller areas such as urban centers, which limits large-scale, timely, and accurate mapping. In contrast, Machine Learning (ML) applications in remote sensing are more mature and classify fine details of land features with increasing clarity24,25. In this context, applying the Random Forest (RF) ML method to Sentinel-2 imagery on Google Earth Engine (GEE) offers a viable way to classify CIPs. By building training samples region by region and iterating with active learning, it is possible to predict CIPs within the study area. This approach enables timely and accurate recognition of CIPs in remote sensing imagery on a large scale26.

The CIPs map19 provided by this study covers a 5-km range along the Yangtze River at a resolution of 10 m for 2021, filling a gap in the comprehensive detection of CIPs along the river. The openly shared dataset can reveal the spatial patterns and density of CIPs along the river, as well as their proximity to urban areas, water bodies, and environmentally sensitive areas. It will support the formulation and implementation of relevant policies for the protection of the Yangtze River.

Methods

Study area

The Yangtze River (Fig. 1), the world’s third-longest river with a length of 6363 km, has an annual average runoff of approximately 9600 billion m3. As one of the most densely populated and industrialized regions globally, the Yangtze River Basin has shaped a world-class economic belt27. Encompassing 11 provinces and municipalities, the basin covers an area of around 2.0523 million km2, accounting for 21.4% of China’s total land area. In 2022, the region’s GDP totalled 55.98 trillion yuan, or 46.5% of China’s total, and its population was 609 million, or 43.1% of the national total. The level of urbanization in the Yangtze River Economic Belt has also risen rapidly compared with other regions in China, with the urban share of the population increasing from 49.25% in 2010 to 62.96% in 2021.

Fig. 1

(a) The study area is the core region of the Yangtze River Basin. (b) The upper reaches of the Yangtze River mainstream begin in Yibin City, Sichuan Province (104.62°E, 28.77°N), with the boundary between the upper and middle reaches at Yichang City, Hubei Province (110.17°E, 30.75°N), and the boundary between the middle and lower reaches at Hukou County, Jiangxi Province (116.20°E, 29.73°N). The lengths of the upper, middle, and lower reaches are 1040 km, 955 km, and 938 km, respectively.

The Yangtze River plays a crucial role as a lifeline for human existence, a catalyst for economic growth, and a protector of China’s natural heritage. Therefore, this study focuses on the core area within a 5-kilometer range along the main stream of the Yangtze River, identifying the locations of chemical industrial parks.

Main input data

Sentinel-2 imagery data

Sentinel-2 provides global multispectral remote sensing data at a resolution of 10 m. Compared with NASA’s Landsat data, its higher resolution reduces the mixed-pixel problem, making it a better choice for precise identification of CIPs28. Sentinel-2 imagery along the Yangtze River was extracted using Google Earth Engine (GEE) for the autumn window from October to December, which, given the climate of the Yangtze River region, avoids the heavier interference from clouds, fog, and vegetation in other seasons.

Feature extraction is crucial for classification based on remote sensing imagery. A series of spectral and texture features (Table S1) are computed from Sentinel-2’s multispectral data to construct a multidimensional feature space and enhance inter-class differences. Specifically, spectral features such as the Normalized Difference Vegetation Index (NDVI), Soil Adjusted Vegetation Index (SAVI), Modified Normalized Difference Water Index (MNDWI), and Normalized Difference Built-up Index (NDBI) are used. NDVI enhances the differentiation between vegetation and CIPs, while SAVI further corrects NDVI values influenced by soil brightness in areas with low vegetation cover. MNDWI, a widely used water index, enhances the separability between CIPs and water bodies. NDBI helps separate CIPs, whose high-density hard surfaces yield higher index values, from ordinary residential built-up areas, which have fewer buildings and more green space.
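
The four spectral indices named above can be sketched for a single pixel as follows. This is a minimal illustration using the commonly cited definitions, not necessarily the exact formulas in Table S1. The band assignment (green = B3, red = B4, NIR = B8, SWIR1 = B11), the soil factor L = 0.5 for SAVI, and the reflectance values are all assumptions made for the example.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when the denominator is zero."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0


def spectral_indices(green, red, nir, swir1, soil_factor=0.5):
    """Compute four indices for one pixel of surface reflectance (0-1).

    Assumed Sentinel-2 bands: green=B3, red=B4, nir=B8, swir1=B11.
    """
    ndvi = normalized_difference(nir, red)
    # SAVI adds a soil-brightness correction factor L to the NDVI denominator.
    savi = (1 + soil_factor) * (nir - red) / (nir + red + soil_factor)
    mndwi = normalized_difference(green, swir1)
    ndbi = normalized_difference(swir1, nir)
    return {"NDVI": ndvi, "SAVI": savi, "MNDWI": mndwi, "NDBI": ndbi}


# Illustrative reflectances only (not taken from the dataset).
vegetation = spectral_indices(green=0.08, red=0.05, nir=0.45, swir1=0.20)
built_up = spectral_indices(green=0.18, red=0.20, nir=0.25, swir1=0.30)
```

With these illustrative inputs, the vegetated pixel has a high NDVI and a negative NDBI, while the built-up pixel has a low NDVI and a positive NDBI, which is the kind of contrast the feature space is meant to capture.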

Moreover, the texture features of CIPs’ factory facilities in remote sensing imagery are significantly different from those of other artificial land covers. In this study, texture features are computed using the popular Gray-Level Co-occurrence Matrix (GLCM), including mean, variance, contrast, dissimilarity, entropy, correlation, inverse difference moment, and angular second moment. Generally, larger window sizes provide coarser texture information, and the contribution of texture features to classification accuracy depends on both texture scale and object scale. Considering the characteristic scale of storage, manufacturing, and transportation equipment within CIPs, a GLCM window size of 3 × 3, corresponding to an actual object area of 900 m2, is selected. After extracting spectral and texture features, a multidimensional attribute feature space is constructed through concatenation, and classification is performed on this feature space based on RF and active learning strategies.
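
To make the GLCM measures concrete, the sketch below builds a symmetric, normalised co-occurrence matrix for a single 3 × 3 window and derives five of the listed features. It is a plain-Python illustration under stated assumptions (4 gray levels, a horizontal neighbour offset, hand-made windows) and does not reproduce the GEE implementation used in the study.

```python
import math


def glcm(window, levels, offset=(0, 1)):
    """Symmetric, normalised gray-level co-occurrence matrix of a 2-D window.

    `window` holds integer gray levels in [0, levels); `offset` is (d_row, d_col).
    """
    counts = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(window), len(window[0])
    d_row, d_col = offset
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = window[r][c], window[r2][c2]
                counts[i][j] += 1  # count each pair in both directions
                counts[j][i] += 1
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]


def texture_features(p):
    """A subset of GLCM texture measures from a normalised matrix p."""
    cells = [(i, j, v) for i, row in enumerate(p) for j, v in enumerate(row)]
    return {
        "contrast": sum(v * (i - j) ** 2 for i, j, v in cells),
        "dissimilarity": sum(v * abs(i - j) for i, j, v in cells),
        "idm": sum(v / (1 + (i - j) ** 2) for i, j, v in cells),
        "asm": sum(v * v for _, _, v in cells),
        "entropy": -sum(v * math.log(v) for _, _, v in cells if v > 0),
    }


# Two illustrative 3 x 3 windows quantised to 4 gray levels.
smooth = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
striped = [[0, 3, 0], [0, 3, 0], [0, 3, 0]]
smooth_tex = texture_features(glcm(smooth, levels=4))
striped_tex = texture_features(glcm(striped, levels=4))
```

The uniform window gives zero contrast and an angular second moment of 1, whereas the striped window gives a contrast of 9 and an angular second moment of 0.5, showing how regular man-made structure separates from homogeneous surfaces.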

Chemical industrial park POI data

Open social data has been widely used in remote sensing research, and point-of-interest (POI) data, which carries geospatial and classification information, plays a crucial role29,30. POIs are typically sourced from online maps, geographic information platforms, and similar services, and serve as coordinate points in high-resolution electronic maps linked to human activity information31. In this study, we used POI data related to “chemical industry” along the river, obtained from the Baidu Big Data platform (https://map.baidu.com/), as the basis for identifying CIPs.

Chemical industrial parks in remote sensing imagery are often large, complex targets containing various features with strong semantic information, including chemical storage tanks, production plants, and water treatment facilities. The scale difference between large and small chemical facilities is also apparent, with dense storage tanks being a typical feature. The chemical industrial park POIs are therefore first used to guide manual annotation of park boundaries. Combining these annotations with an RF classifier applied partition by partition then enables the rapid detection of CIPs on a large scale.

China administrative boundary data

The boundaries of China’s administrative divisions can be downloaded from the National Geographical Information Catalog Service (www.webmap.cn). This data will be used to conduct statistical analysis of the distribution of CIPs within city and county units along the Yangtze River. Along the main stream of the Yangtze River, there are a total of 27 cities and 142 county-level units. Table 1 shows the main information and sources of all data used in this study.

Table 1 Categories of data used for classifying and analysing CIPs along the Yangtze River.

Modelling

Random forest classifier

The RF classifier was chosen to identify CIPs from the multidimensional feature space mainly because of its effectiveness in modelling nonlinear features32. RF has also been widely applied in remote sensing tasks such as vegetation mapping, water body extraction, and urban facility recognition33,34,35.

RF is an ensemble classifier consisting of a large number of decision trees, in which the final classification result is determined by a vote among all the trees36. RF involves two random sampling processes. The first is bootstrapping, an inherent step of the RF method: the initial training samples are resampled with replacement to train each base classifier. The second is the random selection of the features used to train each decision tree. Together, these random processes make RF robust to noise and outliers. Compared with other classifiers such as support vector machines, RF is much simpler to parameterize. Only a few parameters need to be adjusted: the number of decision trees, the maximum number of splits for each tree, and the number of features used to train each tree. In this study, the optimal values of these three parameters, determined through grid search, were 100, 10, and 5, respectively.
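
The two random processes and the final vote can be sketched in a few lines of plain Python. This is a deliberately simplified stand-in, not the GEE classifier used in the study: each "tree" is a one-split decision stump rather than a tree with up to 10 splits, and the helper names, toy samples, and the settings of 25 trees and 1 feature per tree are assumptions chosen to keep the example small.

```python
import random
from collections import Counter


def fit_stump(samples, labels, feature_ids):
    """Pick the (feature, threshold, polarity) with the fewest training errors."""
    best = None
    for f in feature_ids:
        for threshold in sorted({s[f] for s in samples}):
            for above in (0, 1):  # label predicted when value > threshold
                errors = sum(
                    (above if s[f] > threshold else 1 - above) != y
                    for s, y in zip(samples, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, above)
    return best[1:]


def predict_stump(stump, sample):
    f, threshold, above = stump
    return above if sample[f] > threshold else 1 - above


def fit_forest(samples, labels, n_trees=100, n_features=1, seed=0):
    rng = random.Random(seed)
    n, dims = len(samples), len(samples[0])
    forest = []
    for _ in range(n_trees):
        # Random process 1: bootstrap, i.e. resample the training set with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        # Random process 2: each tree only sees a random subset of the features.
        feature_ids = rng.sample(range(dims), n_features)
        forest.append(fit_stump([samples[i] for i in idx],
                                [labels[i] for i in idx], feature_ids))
    return forest


def predict_forest(forest, sample):
    """Final class = majority vote over all trees."""
    votes = Counter(predict_stump(tree, sample) for tree in forest)
    return votes.most_common(1)[0][0]


# Toy (NDBI, NDVI) samples, illustrative only: 1 = CIP-like, 0 = vegetation-like.
X = [(0.30, 0.10), (0.25, 0.15), (0.35, 0.05), (0.28, 0.12),
     (-0.20, 0.70), (-0.10, 0.60), (-0.30, 0.80), (-0.15, 0.65)]
y = [1, 1, 1, 1, 0, 0, 0, 0]
forest = fit_forest(X, y, n_trees=25, n_features=1)
```

Because every tree is trained on a different resample and a different feature subset, an individual tree can be poor, yet the majority vote remains stable; this is the source of the robustness to noise and outliers described above.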

Training details

We used the GEE API to build the RF classifier. Because the study area spans the entire length of the Yangtze River, CIPs in different sections differ visually owing to the surrounding environment and terrain. In particular, the upper reaches are mountainous, with steep, rugged terrain and dense vegetation, whereas the lower reaches are predominantly plains surrounded by farmland. This produces large and uneven intra-class variation, making it difficult to train a single high-precision RF classifier for the whole study area. By modelling each region separately, we aimed to reduce intra-class differences and obtain more accurate local predictions, thereby improving overall classification performance. Considering the importance of the model’s spatial generalization, we chose the administrative units of the 27 cities along the Yangtze River as the modelling units.
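
The idea of modelling by region can be illustrated with a toy example in which one model is fitted per modelling unit and predictions are dispatched by region. The one-dimensional threshold "classifier", the region names, and the sample values are hypothetical placeholders for the per-city RF models described above.

```python
from collections import defaultdict


def fit_threshold(samples):
    """Toy stand-in for a classifier: midpoint between the two class means."""
    pos = [x for x, y in samples if y == 1]
    neg = [x for x, y in samples if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def fit_per_region(samples):
    """Train one model per modelling unit (e.g. one per city)."""
    by_region = defaultdict(list)
    for region, x, y in samples:
        by_region[region].append((x, y))
    return {region: fit_threshold(rows) for region, rows in by_region.items()}


def predict(models, region, x):
    """Classify a feature value with the model of its own region."""
    return 1 if x > models[region] else 0


# Hypothetical (region, feature value, label) triples: the same feature value
# can indicate a CIP upstream but background downstream.
samples = [
    ("upstream_city", 0.10, 0), ("upstream_city", 0.20, 0),
    ("upstream_city", 0.40, 1), ("upstream_city", 0.50, 1),
    ("downstream_city", 0.40, 0), ("downstream_city", 0.50, 0),
    ("downstream_city", 0.70, 1), ("downstream_city", 0.80, 1),
]
models = fit_per_region(samples)
```

In this toy setting a feature value of 0.45 is classified as CIP by the upstream model but as non-CIP by the downstream model, which a single global threshold could not do.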

For sample selection, we first obtained chemical industry POIs along the Yangtze River for 2021 through web crawling, and then selected training samples strictly through visual annotation. To ensure the correctness and representativeness of the collected CIP samples, we iteratively checked them against the remote sensing images. The selection of non-CIP samples was equally important, as it helps reduce false positives in the classification (i.e., predicting non-CIPs as CIPs). Because non-CIP areas comprise multiple land cover types, we collected negative samples from refined categories such as forests, grasslands, farmlands, water bodies, and bare land, as well as from impervious-surface categories such as steel plants, cement plants, manufacturing plants, and conventional residential areas. After labelling, we had gathered 3000 CIP samples and an equal number of non-CIP samples.

Active learning

During the mapping process, it is challenging to generate an accurately classified map in a single attempt. We therefore further improved the CIPs classification through an iterative, coarse-to-fine process based on active learning, which is a preferable choice for remote sensing classification when labelled samples are limited37. First, the classifier is trained with the initial labelled data and used to predict each partition of the dataset. The incorrectly predicted and unlabelled data are then returned to experts for further annotation. Finally, the classifier is retrained using both the initial and the newly labelled data to reduce prediction errors. This process is repeated until satisfactory classification results are obtained.
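
The train, predict, annotate, and retrain loop described above can be sketched as follows. This is a schematic illustration only: the one-dimensional threshold classifier stands in for the RF model, the `oracle` function stands in for expert annotation, and the stopping criterion, pool, and sample values are assumptions.

```python
def fit_threshold(labelled):
    """Toy classifier: midpoint between the class means of a 1-D feature."""
    pos = [x for x, y in labelled if y == 1]
    neg = [x for x, y in labelled if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def active_learning(initial, pool, oracle, target_accuracy=0.95, max_rounds=10):
    """Retrain on expert-corrected mistakes until the target accuracy is met."""
    labelled = list(initial)
    history = []
    for _ in range(max_rounds):
        threshold = fit_threshold(labelled)
        predictions = {x: int(x > threshold) for x in pool}
        # Experts review the predicted map and flag omissions and false alarms.
        mistakes = [x for x in pool if predictions[x] != oracle(x)]
        accuracy = 1 - len(mistakes) / len(pool)
        history.append(accuracy)
        if accuracy >= target_accuracy or not mistakes:
            break
        # Newly annotated samples join the initial ones for the next round.
        labelled += [(x, oracle(x)) for x in mistakes]
    return threshold, history


def oracle(x):
    """Hypothetical expert: the true class is 1 when the feature is >= 0.5."""
    return int(x >= 0.5)


initial = [(0.0, 0), (0.1, 0), (0.6, 1), (0.7, 1)]
pool = [0.2, 0.3, 0.4, 0.45, 0.55, 0.8]
threshold, history = active_learning(initial, pool, oracle)
```

In this toy run the accuracy on the pool rises over three rounds as the two initially misclassified samples are re-labelled and added to the training set. In practice the experts inspect only the predicted map rather than labelling an entire pool, so the annotation cost per round stays small.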

In this study, we combined the active learning strategy with the RF classifier to refine the CIPs classification results. Misclassified pixels were carefully selected and re-labelled to retrain the RF model. We focused on two types of error, missed CIP pixels and false positives, to make the classifier more robust against these typical errors. This process was iterated until the required CIPs classification accuracy was met. Active learning improves classification accuracy at only a small additional annotation cost, making it well suited to the large-scale, high-precision CIPs mapping in this study. Figure 2 provides an overview of our workflow.

Fig. 2

Overall workflow of this study. Based on the GEE and Google Earth, a comprehensive approach was employed, including POI data annotation, spectral index extraction, random forest classification, active learning iteration, etc., for partition prediction of CIPs. The overall accuracy of the model results was validated, and regional statistical analysis of the distribution and area of CIPs along the Yangtze River was conducted using ArcGIS.

Data Records

This study provides a 10-m resolution raster map of CIPs along the Yangtze River within a 5-km range for the year 2021. It is the first publicly released dataset of high-resolution chemical industrial parks data in the Yangtze River region. The data format is GeoTIFF, and the spatial reference system is WGS-84. The map contains two values, where 1 represents CIPs and 0 represents non-CIPs area. The CIPs map data can be loaded into GIS software such as ArcGIS and QGIS for data visualization and spatial analysis. This dataset will be freely available to all users on the figshare repository19 (https://figshare.com/articles/dataset/The_Yangtse_River_CIPs_10m_2021_rar/25566132).

In addition, we have zoomed in on nine typical CIPs along the Yangtze River for detailed analysis, as shown in Fig. 3. It can be observed that the distribution of CIPs is relatively concentrated, often formed by the aggregation of multiple small-scale chemical factories. Most of them exhibit characteristics of extending along the Yangtze River. This dataset provides a more accurate basis for understanding the specific locations and planning governance of chemical industries along the Yangtze River.

Fig. 3

Map of CIPs along the Yangtze River. The red 1-km boundary lines delineate the zone in which the construction of chemical plants is prohibited in the future. The 5-km range shows the distribution of the CIPs dataset.

Technical Validation

This section describes the technical and accuracy validation of the CIPs map. First, we constructed a comprehensive test dataset covering the entire Yangtze River to evaluate the accuracy of the mapping results. We then present several detailed CIPs classification maps to assess the dataset qualitatively. These steps are intended to ensure the reliability and precision of the CIPs map, and thereby its utility and credibility for applications after release.

Specifically, we randomly selected samples of both CIPs and non-CIPs along the Yangtze River based on the CIPs dataset. We divided the data into 70% for training and 30% for testing, resulting in nearly 2000 test samples, including approximately 1000 positive samples (i.e., CIPs) and 1000 negative samples (non-CIPs). It’s important to note that all these test samples and training samples are mutually independent and have no spatial intersections.

We assessed the classification performance by computing the confusion matrix, overall accuracy, and Kappa coefficient for both the test samples (Table 2) and the training samples (Table S2). The overall accuracy on the test dataset is approximately 79.37%. Producer’s accuracy (PA) is the proportion of reference samples of a category that the model classifies correctly. For category CIP, the PA is approximately 80.20%; for category Non-CIP, the PA is approximately 78.54%. User’s accuracy (UA) is the proportion of samples predicted as a category that actually belong to it. For category CIP, the UA is approximately 79.06%; for category Non-CIP, the UA is approximately 79.70%. The Kappa coefficient measures the agreement between the model’s predictions and the reference labels after correcting for the agreement expected by chance. Its value ranges from −1 to 1, where 1 indicates complete agreement, 0 indicates agreement no better than chance, and −1 indicates complete disagreement. Here, the Kappa coefficient is approximately 0.587, indicating that the model’s predictions agree with the reference labels clearly better than chance. The accuracy on the training set is very high, with an overall accuracy of 99.73% and a PA of 99.33% for category CIP. Although the accuracy on the test set is lower than on the training set, the test dataset is distributed along the entire Yangtze River, so these results give us substantial confidence in the identified distribution of CIPs.
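
For readers who wish to recompute these measures from a confusion matrix, the sketch below shows the standard definitions. The function name and the counts are hypothetical and do not reproduce Table 2; rows are assumed to be reference classes and columns predicted classes.

```python
def accuracy_metrics(matrix):
    """Overall accuracy, PA, UA and Kappa from a square confusion matrix.

    Rows are reference classes, columns are predicted classes.
    """
    n = sum(map(sum, matrix))
    k = len(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    overall = sum(matrix[i][i] for i in range(k)) / n
    # PA: correct / reference total; UA: correct / predicted total.
    producers = [matrix[i][i] / row_totals[i] for i in range(k)]
    users = [matrix[i][i] / col_totals[i] for i in range(k)]
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    kappa = (overall - expected) / (1 - expected)
    return overall, producers, users, kappa


# Hypothetical counts (not Table 2): rows = reference (CIP, non-CIP).
overall, producers, users, kappa = accuracy_metrics([[80, 20], [10, 90]])
```

As a rough consistency check, when both the reference classes and the predictions are close to balanced, chance agreement is about 0.5 and Kappa is approximately 2 × OA − 1; for an overall accuracy of 79.37% this gives about 0.587, in line with the reported value.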

Table 2 Confusion matrix derived from testing samples.

To better illustrate the classification results, we collected several detailed CIPs identification results for Suzhou, Wuxi, Changzhou, Nanjing, Wuhan, Jingzhou, and Chongqing, which cover major cities and different terrains along the upper, middle, and lower reaches of the Yangtze River (Fig. 4). The CIPs in these examples are accurately identified in the dataset.

Fig. 4

Detailed examples of chemical industrial park identification results.

Furthermore, based on the CIPs dataset19 provided by this study along the Yangtze River region and relevant Sentinel-2 imagery, a large number of image and vector mask samples can be obtained. In the future, further development of DL-based semantic segmentation models for CIPs in high-resolution remote sensing images will be pursued.

Usage Notes

Through this study, we have released the high-resolution map of CIPs19 along the Yangtze River in 2021. It is important to note that each pixel in the published CIPs map represents an area of 100 m2 (10 m × 10 m). We filtered out speckle noise smaller than 3 × 3 pixels (i.e., 900 m2), ensuring that the smallest identifiable chemical industrial park is 900 m2, which is much smaller than the actual statistical area of CIPs.
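
The speckle filtering and the pixel-to-area conversion can be sketched on a small binary mask as follows. The sketch assumes that "smaller than 3 × 3 pixels" means a connected patch of fewer than 9 pixels and that patches are 4-connected; both are assumptions, as are the function names and the toy mask.

```python
from collections import deque


def remove_small_patches(mask, min_pixels=9):
    """Zero out 4-connected patches of 1s with fewer than min_pixels pixels."""
    rows, cols = len(mask), len(mask[0])
    cleaned = [row[:] for row in mask]
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            patch, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:  # breadth-first flood fill of one patch
                y, x = queue.popleft()
                patch.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(patch) < min_pixels:
                for y, x in patch:
                    cleaned[y][x] = 0
    return cleaned


def area_km2(mask, pixel_size_m=10):
    """Total area of pixels valued 1, in square kilometres."""
    return sum(map(sum, mask)) * pixel_size_m ** 2 / 1e6


# Toy mask: one 3 x 3 patch (kept) and one isolated pixel (removed).
mask = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
cleaned = remove_small_patches(mask)
```

After filtering, the toy mask retains 9 pixels, which at 100 m2 per pixel corresponds to 900 m2 (0.0009 km2), the smallest patch size retained in the published map.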

Rapid urban expansion in cities along the river, together with industrialization without proper planning, has led CIPs to emerge along the river, which is one of the risks arising from China’s early rapid urbanization38. With the provided CIPs dataset19, the locations and distribution of these parks can be easily identified. The Supplementary file (Figs. S1, S2, S3) provides detailed statistical information on the chemical industrial parks.

We have also plotted a scatter map showing the distribution and area of CIPs along the Yangtze River by position along the river axis (Fig. 5a). Dividing the river into upper, middle, and lower reaches, the CIPs in the upper reaches cover 24.85 km2 and are sparsely distributed, with the smallest average park area (27 hm2), mainly in the city of Chongqing. In the middle reaches, CIPs cover 65.62 km2, with an average park area of 41 hm2, mainly in Wuhan, Yichang, and Jingzhou. The lower reaches have the largest and most densely distributed CIPs area, covering 133.87 km2 and accounting for 61.20% of the total area of CIPs along the Yangtze River. The average park area there is also the largest, at 44 hm2, mainly in contiguous areas such as Nanjing and Suzhou-Wuxi-Changzhou. Additionally, there is a segment in the upper reaches where no CIPs were identified. This segment corresponds to the Three Gorges Dam (111.05°E, 30.84°N) and the Gezhou Dam (111.29°E, 30.73°N), where strict protection under central government policies39 and steep terrain on both sides have prevented the construction of CIPs.

Fig. 5

(a) Scatter plot showing the location and area of CIPs along the Yangtze River from upstream to downstream, with a marked blank area for CIPs at the Three Gorges Dam. (b) Schematic diagram of the city axis along the Yangtze River, divided into three levels of cities based on administrative levels and economic population scale. (c) Distribution status and probability density of CIPs along the upper, middle, and lower reaches of the Yangtze River.

Given the precise distribution of CIPs along the Yangtze River, the flood risk they face under future climate change deserves attention40. CIPs are highly concentrated in the flat terrain of the middle and lower reaches, where they are directly exposed to flooding, and the resulting hazards cannot be ignored41. Furthermore, secondary disasters triggered by floods, such as collapses and landslides, occur mainly in the mountainous upper reaches and will exacerbate safety risks around CIPs.

The pattern of human settlement along riverbanks and coasts overlaps significantly with the distribution of CIPs, which rely heavily on water transportation for import and export. Chemical industrial pollution is a prevalent challenge in developing countries, particularly in regions with large rivers42,43. Since 2021, Chinese law has explicitly prohibited the construction or expansion of CIPs within 1 km of the Yangtze River, underscoring the urgent need to understand the current distribution of chemical industries along the river to support policy implementation. To realize the vision of the Yangtze River as a truly world-class green golden waterway, it is imperative to shut down non-compliant chemical facilities, promoting green development in the Yangtze River Economic Belt44 and the Beautiful China initiative. To sum up, we comprehensively detected CIPs within a 5-km range along key areas of the Yangtze River and produced a high-resolution distribution map. The map provides reliable data for protecting the Yangtze River and a valuable data source for studies on planning, land use, and ecological conservation along the river. Our research offers a paradigm for the precise identification and regulation of chemical pollution hotspots along waterways globally, which may benefit other countries facing similar challenges.