Introduction

Accompanying the world’s fastest-growing industrialization and the consequent large amount of vehicle exhaust, China’s increasing occurrences of haze, especially PM 2.5 (particulate matter smaller than 2.5 μm), have been linked to decreased visibility, negative effects on human health, and influence on global climate. Air pollution has been one of the world’s most important eco-environmental problems. In 2012, a new ambient air quality standard (GB 3095–2012) was set by the Chinese Environmental Protection Agency (EPA), which adds PM 2.5 into the existing list of regularly monitored species. PM 2.5 originates from many sources, such as road dust, vehicle exhaust, biomass burning, industrial emission and agriculture activities, as well as from regionally transported aerosols.

Regionally transported aerosols are an important factor for PM 2.5 pollution1,2,3,4,5,6. There are a number of studies on regional transport for PM 2.5. In2, it was found that the air quality of Shanghai is largely influenced by the air masses from the north, east and west directions, accounting for 44.8%, 30.4%, and 24.8% of all the air masses respectively. In3, the contribution of regional transport to PM 2.5 was estimated in Lingcheng on the North China Plain. The PM 2.5 from regional transport contributed 31.6% of the PM 2.5 concentrations, with only 15.4% from the local emissions.

It is debatable how far PM 2.5 can spread. A number of research works have studied PM 2.5 transport in local, regional, or long-range scale1,2,3,4,5,6,7,8. In the existing works on PM 2.5 transmission, it’s unclear how “local,” “regional,” and “long-range” transport are defined and distinguished. Relative geographic distances are indispensable factors for determining the pollution level. In this paper, if cities are separated by more than a certain distance, we assume their PM 2.5 has no influence on each other.

In addition, the pollution level is highly influenced by meteorological conditions such as wind speed and wind direction9,10,11,12,13,14, which dramatically influence the diffusion, accumulation, and transport of air pollutants15,16. Generally, greater wind speed leads to stronger turbulence, resulting in more favorable dispersion conditions for pollutants17. Wind direction significantly affects PM 2.5 transport because of the spatial distribution of pollution sources and air pollutants’ transportation18.

Mountains between cities are also a major factor influencing PM 2.5 concentration. Where mountains exist, air does not flow between the cities. As depicted in19, Beijing is surrounded by mountains in three directions, and polluted air cannot be easily expelled in that special geographical environment. Chongqing lies in a mountainous area of China. Influenced by its specific topographic conditions, Chongqing is in the region of lowest wind speed over China. In this paper, we consider thirteen major mountains in China to build a city-network in which the PM 2.5 of any two cities has no reciprocal influence if there is a mountain between them.

PM 2.5 has significant spatial and temporal characteristics in China5,20. Regionally, PM 2.5 concentrations are generally higher in northern regions than in southern regions and tend to be higher in inland regions than in the coastal regions. Seasonally, the level of PM 2.5 is highest in winter and lowest in summer. In wintertime, in addition to emissions from fossil fuel combustion and biomass burning, meteorological conditions largely contribute to the high concentrations of PM 2.5. More frequent occurrences of stagnant weather, less rainfall, and low temperature hinder pollution dispersion. Therefore, we choose January of 2016 for this research.

Presently, most of the methods for studying PM 2.5 can be divided into two groups: deterministic and statistical approaches. Deterministic methods21,22 mainly focus on the formation mechanism of PM 2.5 from the perspective of meteorological chemistry. In comparison, the statistical approaches, such as linear regression models23,24, neural networks25, and nonlinear regression models26,27, aim to detect certain correlated patterns between air quality data and various selected predictors, thereby predicting the pollutant concentrations in the future. Each approach addresses problems from different perspectives.

Network analysis is an important and global method to study relationships between objects that can be organized into a graph28,29. In graph theory, objects are presented as nodes and relationships between two nodes are presented as edges. Network analysis can group nodes into clusters whose members have certain common characteristics. In general, there are more connections between the nodes within a cluster than between the nodes in different clusters. Yang et al.20 have applied the network tool in studying PM 2.5. In20, the correlation between two PM 2.5 emission profiles is investigated, and then network analysis is applied to cluster cities in China. The network structure in their work is depicted at the level of individual nodes and edges, which are considered to be lower-order connectivity patterns of complex networks.

Using higher-order organizations of complex networks as the basic building blocks can help us understand the fundamental structures of complex systems. The most common higher-order organization of complex networks is network motifs30,31. In particular, three-node motifs (Fig. 1) appear frequently in networks. In air traffic patterns, M 8M 13 are fundamental units of the network. M 7M 7 are structural hubs in the brain. A generalized framework32 has been developed for clustering networks based on higher-order connectivity patterns. In32, different network motifs can result in different higher-order clusters. For example, motifs M 5, M 6, and M 8 depict differing hierarchical flows between species in the Florida Bay ecosystem food web.

Figure 1

Triangular motifs.

In this paper, we apply motif-based higher-order organization of complex networks to study PM 2.5 transmission and analyze structures in each city-cluster by motif analysis. Specifically, this paper aims to cluster 189 cities in China and identify major potential PM 2.5 contributors and regional transport pathways in each cluster. We first build an adjacency matrix of the complex network, combining geographic distance, wind speed, wind direction, mountains, and PM 2.5 concentration. Then the cities are clustered by using the motif-based higher-order organization of complex networks. Then, we apply motif analysis to identify the structure in each cluster.

To the best of our knowledge, this is the first work to apply higher-order organization of complex networks to PM 2.5 transmission. Network analysis not only gives a global view to examine PM 2.5 transmission, but also reveals an internal structure of pollution between cities in China. This research can provide valuable information for the Chinese government to implement air pollution control.

Results

In Figs 3 and 5, a circle with its cluster number represents a city. The cities in a cluster are more densely connected with each other but sparsely connected with the cities in other clusters. In accordance with specific characteristics of PM 2.5 emissions in China, a cluster will usually consist of cities in the same province or close geographical proximity.

Figure 3

Nine clusters obtained by m 8-motif spectral clustering algorithm. Tableau Public 10.3 (https://public.tableau.com/) was used to create the map.

Figure 5

20 clusters obtained by m 9-motif spectral clustering algorithm. Tableau Public 10.3 (https://public.tableau.com/) was used to create the map.

Clustering 189 cities into groups and identifying major potential pollution contributors in each cluster by motif m 8

Motif m 8 is chosen to identify major potential pollution contributors in each cluster. When we apply the higher-order spectral clustering algorithm, the m 8-motif adjacency matrix of the 189 cities is found to contain three connected components and some isolated points (see Supplementary Table S1). The largest connected component contains 170 cities, which form seven clusters. K = 7 is chosen as the number of clusters for this component because it gives a relatively small SSE, as can be seen from Fig. 2(a). SSE is defined at the end of this paper. The other two connected components compose cluster 8 (Changchun, Daqing, Jilin, and Mudanjiang) and cluster 9 (Jinchang, Lanzhou, Xining, and Yinchuan), respectively. Thus, nine clusters (see Fig. 3 and Supplementary Table S1) are obtained by the motif m 8-based spectral clustering algorithm. The remaining 11 isolated cities can be explained by their geographic characteristics. Kelamayi, Wulumuqi, and Kuerler are located in the Mongolia Autonomous Region, and Jiayuguan is near the Mongolia Autonomous Region. Lhasa is in a plateau area. Qiqihaer and Haerbin are in the most northerly province of China. Two representative clusters, cluster 2 (including Shanghai) and cluster 4 (including Beijing), are illustrated below.

Figure 2

SSE varies with the number of clusters (K).

In cluster 4, there are 31 cities, covering most of northern China. They are shown in Fig. 4(a) and Supplementary Table S1. In the spy plot, each position on the y-axis corresponds to a city, and a horizontal line with more dots indicates a city with more outgoing edges than other cities in the network subgraph of the cluster. Such cities include Anyang, Baoding, Jiaozuo, Xingtai, and the other cities labeled in the spy plot. They are the major potential PM 2.5 contributors of cluster 4. This is in agreement with the results of5,33, which concluded that the above cities are the heavily haze-affected cities in Beijing-Tianjin-Hebei, and that pollution from Shandong and Henan provinces by regional transport is also an important factor for the PM 2.5 of North China.

Figure 4

Spy plot of two representative clusters of Fig. 3 in January of 2016. The major potential PM 2.5 contributors in each cluster are marked in the plot. The number order in the spy plot is the ID in Supplementary Table S1.

In cluster 2, there are 51 cities, including all the cities from the Yangtze River delta and some of Shandong’s coastal cities, as shown in Fig. 4(b) and Supplementary Table S1. Jining, Xuzhou, Yancheng, Zhangjiagang, and some cities that are labeled in the spy plot are the potential PM 2.5 contributors of cluster 2. Most of the potential PM 2.5 contributors are in the north part of the cluster; this agrees with the spatial characteristics of PM 2.5 5,20. Note that although Shanghai is a metropolis, it is not a major source of pollution in the cluster. Our conclusion accords with the results of2.

Clustering 189 cities into groups and identifying transport pathways in each cluster by motif m 9

We choose motif m 9 to identify transport pathways in each cluster. We can see, from Fig. 5, that only 166 cities are shown and they are clustered into 20 groups. The remaining 23 cities are isolated and these isolated cities can also be explained by their geographic characteristics. Some of them are from the Mongolia Autonomous Region, the plateau area, the most northerly part of China, or from Gansu and Ningxia. All the clustering results are listed in Supplementary Table S2.

In Fig. 2, SSE is smaller when the number of total clusters K = 10. However, clustering with K = 10 leads to more cities in each cluster. It’s difficult to see the transport pathways clearly when more cities appear in each cluster. Therefore, we choose K = 20 to cluster cities and identify transport pathways in each cluster through motif analysis.

Cluster 4 (including Shanghai), cluster 14 (including Beijing), and cluster 16 (including Tianjin) are shown in Fig. 6. For cluster 4, the PM 2.5 transport pathway runs from Nantong and Huzhou in the northwest to Wenzhou, Taizhou(s), Ningbo and Zhoushan in the southeast. Shanghai is also generally downwind of the most developed and polluted YRD region in special meteorological conditions, which accords with2. For cluster 14, the main PM 2.5 transport pathway is from Shijiazhuang to the northeast of the cluster, and the detailed PM 2.5 transport pathway is shown in Fig. 6(b). Shijiazhuang is a key controlling point because of its relatively high PM 2.5 concentration and its location upwind of other cities in the cluster. Beijing’s pollution is partly from Shijiazhuang, as described in5. Cluster 16 includes Tianjin and cities of the Liaotung peninsula and the Shandong peninsula. All of these cities are around Bohai; therefore, the wind affecting them varies frequently, and wind directions are not all the same in different cities at the same time. One possible PM 2.5 transport pathway originates from Tianjin, passes by way of Huludao, Yingkou, Wafangdian and Dalian in the Liaotung peninsula, and arrives at Yantai, Weihai and some other cities in the Shandong peninsula. Another possible PM 2.5 transport pathway is from Zhaoyuan to Yantai, Weihai and some other cities in the Shandong peninsula, or to Huludao, Yingkou and some other cities in the Liaotung peninsula.

Figure 6

m 9-motif analysis for three representative clusters obtained by motif spectral clustering algorithm based on January of 2016. Tableau Public 10.3 (https://public.tableau.com/) was used to create these maps.

Discussion

In this paper, higher-order organization of complex networks and spectral clustering methods are used to group cities in China. We obtain two major results: the major potential PM 2.5 contributors and the PM 2.5 transport pathways in each cluster. Clustering of complex networks often provides a global view of the underlying networks. Through the new clustering method, we present a new framework to investigate the transmission of PM 2.5 among major cities in China.

In general, statistical methods tend to use data collected over a long period. The complex network we use in this paper is instead intended to analyze the relationships between nodes in a certain state and to reveal the essential structure of a complex system. We use the clustering method to identify the city-network, to aggregate cities, and to identify major potential PM 2.5 contributors and transport pathways. As a result, in this study we collect data only for a short period. Specifically, only January data are used; this is justified for several reasons. The PM 2.5 concentration shows an apparent seasonal pattern. High-frequency and high-concentration PM 2.5 days usually occur in winter9,34. This is mainly due to meteorological conditions. Over a short period, some meteorological conditions that affect PM 2.5 can be regarded as relatively stable, which helps simplify the model. This is the main reason we choose data from one month for our study. As a result, less important factors such as temperature and atmospheric pressure can be ignored; thus, more important factors can be considered in a relatively simple model. As in1,9, we ignore atmospheric pressure and temperature and consider the major meteorological factors that influence PM 2.5 concentration, namely wind speed and wind direction. Wind speed and wind direction vary constantly in each city, and they drive air pollution transport between cities. In constructing the adjacency matrix for the network, we choose the monthly prevailing wind direction and the monthly average wind speed. This approach allows us to summarize the frequently changing wind speed and direction in the present study. We believe the data from January suffice for identifying major potential pollution contributors and pollution transport pathways.

We assume that PM 2.5 in city i has no influence on city j if the straight-line geographical distance between the two cities is more than 500 kilometers, because the PM 2.5 flow dissipates during propagation. In addition, when the straight-line geographical distance between the two cities is more than 200 kilometers, we assume that PM 2.5 in city i has influence on city j only if city i’s PM 2.5 concentration is higher than city j’s by a certain margin. We use “500 kilometers” and “200 kilometers” as the dividing values, mainly inspired by9, which concluded that aerosol nucleation and growth processes occur on the regional (several hundred kilometers) to urban (less than 100 kilometers) scales. Although there are many research works on regional transport of PM 2.5, it is an open question as to how far PM 2.5 can travel. In addition, we believe that many physical, biological and social models, for example35,36,37,38,39,40,41,42,43,44, could be used for estimating or predicting the long-range transport of PM 2.5.

Because meteorological conditions are complex, additional factors affecting PM 2.5 should be considered in future studies. In addition, PM 2.5 in city i has influence on other cities, and this influence should be inversely proportional to the geographic distance. Therefore, weighted complex networks should be considered in future work.

In this paper, we consider some meteorological conditions and geographical data to cluster cities in China and identify the inner structure of each cluster. However, PM 2.5 transmission between cities is a very complex issue. Economy, population, in-vehicle commuting, and many others are also indispensable factors that influence PM 2.5 transmission. More economic factors and social factors will be considered in our future work.

Data and Methods

Data

In this paper, we focus on the top 189 pollution-monitoring cities in China’s mainland, which cover all 34 provincial-level regions of China. The most polluted and the major cities are all included, such as Beijing, Shanghai and Guangzhou.

Data from January 2016 are used in this work to identify major PM 2.5 pollution contributors and transport pathways in each cluster. The data that we collect in this paper are as follows: (1) The PM 2.5 monthly average concentration is calculated based on ground air quality monitoring data from China’s National Environmental Monitoring. (2) The geo-location information, in the form of latitude and longitude, of the 189 cities is from Google Earth. (3) Thirteen major mountains with high altitudes in China (see Supplementary Table S3) are included in this paper. (4) Wind speed and wind direction data are from the China Meteorological Administration. Wind directions are classified into eight directions (N, E, S, W, W-S, E-S, W-N, E-N). We use the monthly prevailing wind direction of each city in January. The scaling of wind speed is based on the Jenks Natural Breaks Classification method10. Wind speed (ws) is divided into eight levels: \(ws\le 0.7\) m/s (Level-1), \(0.7 < ws\le 1.1\) m/s (Level-2), \(1.1 < ws\le 1.6\) m/s (Level-3), \(1.6 < ws\le 2.1\) m/s (Level-4), \(2.1 < ws\le 2.7\) m/s (Level-5), \(2.7 < ws\le 3.4\) m/s (Level-6), \(3.4 < ws\le 4.4\) m/s (Level-7) and \(ws > 4.4\) m/s (Level-8). We use the monthly average wind speed. Here “monthly average” means the arithmetic average of the mean concentration levels or mean wind speed of each day in a calendar month.
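As a minimal sketch of the wind speed scaling above, the eight levels can be looked up from a sorted list of the upper bounds. The function name `wind_level` is hypothetical; the break values are the ones listed in the text.

```python
from bisect import bisect_left

# Upper bounds (m/s) of Level-1 ... Level-7, as listed in the text.
# Any speed above 4.4 m/s falls into Level-8.
UPPER_BOUNDS = [0.7, 1.1, 1.6, 2.1, 2.7, 3.4, 4.4]


def wind_level(ws):
    """Return the level (1-8) of a monthly average wind speed given in m/s."""
    if ws < 0:
        raise ValueError("wind speed must be non-negative")
    # bisect_left keeps a speed equal to a bound in the lower level (ws <= bound).
    return bisect_left(UPPER_BOUNDS, ws) + 1
```

For instance, a monthly average of 1.1 m/s falls on the upper edge of Level-2, while 1.2 m/s belongs to Level-3.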

Motif-based higher-order spectral clustering algorithm

The motif-based higher-order spectral clustering algorithm in the supplementary materials of32 unifies motif analysis30 and k-means spectral clustering45 to reveal new organizational patterns and modules in complex systems. We use the method to cluster 189 cities in China and identify major potential PM 2.5 contributors and PM 2.5 transport pathways in clusters. The major steps are listed below.

(1) Building the adjacency matrix A of the network and choosing the motif M of interest. Specifically, in this paper, matrix A is built as described in the section “Building an Adjacency Matrix”, and \({m}_{8}\) and \({m}_{9}\) are chosen as the building modules to reveal the essential structures of the complex network; \({m}_{8}\) reflects the relationship between source and victims; \({m}_{9}\) reflects the transmission route.

(2) Computing the motif adjacency matrix \({W}_{M}\), whose entry \({W}_{M}(i,j)\) equals the number of instances of motif M that contain both node i and node j.

(3) Clustering 189 cities by the spectral clustering algorithm through the motif adjacency matrix \({W}_{M}\).

    a) Computing the normalized motif Laplacian \({L}_{M}=I-{D}_{M}^{-1/2}{W}_{M}{D}_{M}^{-1/2}\), where \({D}_{M}\) is the diagonal matrix with \({({D}_{M})}_{ii}={\sum }_{j}{({W}_{M})}_{ij}\).

    b) Forming the matrix \(X=[{x}_{1},{x}_{2},\cdots ,{x}_{k}]\), where \({x}_{1},{x}_{2},\cdots ,{x}_{k}\) are the eigenvectors of \({L}_{M}\) corresponding to its k smallest eigenvalues (equivalently, to the k largest eigenvalues of \({D}_{M}^{-1/2}{W}_{M}{D}_{M}^{-1/2}\)).

    c) Calculating matrix Y by normalizing each row of X to unit length, so that its entries are \({Y}_{ij}={X}_{ij}/{({\sum }_{j}{X}_{ij}^{2})}^{1/2}\).

    d) Taking each row of Y as a point in \({R}^{k}\) and clustering the points into k clusters via the k-means method45. In this paper, the optimal number of clusters (K) is chosen by the SSE criterion described in the section “Selecting K”, which is inspired by46.

    e) City j is assigned to cluster i if and only if row j of matrix Y is assigned to cluster i.

(4) Analyzing every cluster using the motifs of (1). This paper applies motif \({m}_{8}\) to analyze major potential contributors. In a social network graph, the node with the largest number of edges is commonly considered the source, from which information begins to disperse47. After motif \({m}_{8}\) is used to cluster the 189 cities in the complex network, a city with more outgoing edges has a relatively high PM 2.5 concentration and a high influence on the other cities of the cluster. Such a city can be regarded as a major PM 2.5 pollution contributor in the cluster. This can be seen through the spy plot, which illustrates the network structure of the cluster. Motif \({m}_{9}\) helps find the PM 2.5 transport pathway in every cluster. In Fig. 1, \({m}_{9}\) corresponds to the PM 2.5 flow from city a to city b, then from city b to city c.
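As a rough illustration of step (2), the sketch below counts instances of a two-hop path motif a → b → c by brute force and accumulates them into a motif adjacency matrix. It rests on assumptions that go beyond the text: the motif is treated as an open path (no edge in either direction between a and c) whose two edges are unidirectional, which may differ in detail from the definition of \({m}_{9}\) in Fig. 1. The function name `path_motif_adjacency` is hypothetical.

```python
from itertools import permutations


def path_motif_adjacency(A):
    """Motif adjacency matrix for open two-hop paths a -> b -> c.

    A is a square nested list in which a nonzero A[i][j] means an edge i -> j.
    W[i][j] is the number of motif instances that contain both i and j.
    """
    n = len(A)
    W = [[0] * n for _ in range(n)]
    for a, b, c in permutations(range(n), 3):
        # Both hops must exist and must not be reciprocated (assumption).
        is_path = A[a][b] and A[b][c] and not A[b][a] and not A[c][b]
        # The end nodes must not be linked to each other (assumption: open motif).
        is_open = not (A[a][c] or A[c][a])
        if is_path and is_open:
            for u, v in ((a, b), (a, c), (b, c)):
                W[u][v] += 1
                W[v][u] += 1
    return W
```

Each path is visited exactly once, because the ordered triple (a, b, c) fixes the start, middle and end node. The triple loop costs O(n³), which is still manageable for 189 cities.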

Building an Adjacency Matrix

A network can be represented as a matrix, which is called the sociomatrix44 or adjacency matrix. Suppose the number of nodes is n. Let V and E be the sets of nodes and edges in the network, respectively. Then the adjacency matrix of the network can be expressed by a matrix \(A\in {\{0,1\}}^{n\times n}\). An entry \({A}_{ij}\in \{0,1\}\) denotes whether there is a link between node \({v}_{i}\) and node \({v}_{j}\). If node \({v}_{i}\) and node \({v}_{j}\) are adjacent, then \({A}_{ij}=1\). Otherwise, \({A}_{ij}=0\). If the network is undirected, the adjacency matrix A is symmetric. However, in some situations, interactions between two different individuals are directional. In Twitter, for example, one user x follows another user y, but user y does not necessarily follow user x. In this case, the follower-followee network is directed and its adjacency matrix is asymmetric.

Based on PM 2.5 monthly-average concentration, geographic distance between cities, monthly prevailing wind direction, monthly average wind speed, and mountains between cities (189 cities in China in January 2016), the detailed procedure for building the adjacency matrix is as follows:

(1) Adjacency matrix based on distance (A1): Based on the latitudes and longitudes of 189 cities, the relative geographic distances are calculated. The entry \({A}_{1}(i,j)=0\), if the relative geographic distance is more than 500 kilometers. Otherwise, \({A}_{1}(i,j)=1\). The assumption is plausible because PM 2.5 of each city has no effect on another city, if they are distant from each other.

We choose 500 kilometers as an empirical value obtained through numerical simulation. This is in agreement with9, which found that aerosol nucleation and growth processes occur on the regional (several hundred kilometers) to urban (less than 100 kilometers) scales.
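The distance rule can be sketched as follows. The text does not say how the distance is computed from latitude and longitude, so the haversine great-circle formula, the mean Earth radius, and the zero diagonal are assumptions of this sketch; `haversine_km` and `distance_adjacency` are hypothetical names.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption of this sketch)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def distance_adjacency(coords, max_km=500.0):
    """A1[i][j] = 1 if cities i and j are at most max_km apart, else 0.

    coords is a list of (latitude, longitude) pairs in degrees.
    The diagonal is left at 0, i.e. no self-loops (assumption).
    """
    n = len(coords)
    A1 = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and haversine_km(*coords[i], *coords[j]) <= max_km:
                A1[i][j] = 1
    return A1
```

Because distance is symmetric, A1 is symmetric; the direction of influence only enters through the wind and concentration matrices.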

(2) Adjacency matrix based on mountain (A2): In the planimetric map, a major mountain can be expressed by a line segment through its latitudes and longitudes, which is called a mountain-segment. Therefore, the 13 major mountains we consider are depicted by 13 different line segments. Meanwhile, there is a line segment between any two cities, which is called a cities-segment. If the cities-segment between city i and city j intersects any of the 13 mountain-segments, the entry \({A}_{2}(i,j)=0\). Otherwise, \({A}_{2}(i,j)=1\).
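One way to test whether a cities-segment crosses a mountain-segment is the standard orientation test for two planar line segments, sketched below. Treating (longitude, latitude) pairs as planar coordinates is an assumption that follows the planimetric-map description; the function names are hypothetical.

```python
def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1, -1, or 0 if collinear."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)


def _on_segment(p, q, r):
    """True if point r, known to be collinear with p and q, lies on segment pq."""
    return (min(p[0], q[0]) <= r[0] <= max(p[0], q[0])
            and min(p[1], q[1]) <= r[1] <= max(p[1], q[1]))


def segments_cross(p1, p2, q1, q2):
    """True if segment p1p2 and segment q1q2 share at least one point."""
    o1, o2 = _orient(p1, p2, q1), _orient(p1, p2, q2)
    o3, o4 = _orient(q1, q2, p1), _orient(q1, q2, p2)
    if o1 != o2 and o3 != o4:
        return True
    # Collinear cases where an endpoint lies on the other segment.
    return ((o1 == 0 and _on_segment(p1, p2, q1))
            or (o2 == 0 and _on_segment(p1, p2, q2))
            or (o3 == 0 and _on_segment(q1, q2, p1))
            or (o4 == 0 and _on_segment(q1, q2, p2)))


def mountain_entry(city_i, city_j, mountain_segments):
    """A2(i, j): 0 if the cities-segment crosses any mountain-segment, else 1."""
    blocked = any(segments_cross(city_i, city_j, m1, m2)
                  for m1, m2 in mountain_segments)
    return 0 if blocked else 1
```

Here `mountain_segments` is a list of endpoint pairs, one pair per mountain.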

(3) Adjacency matrix based on wind (A3): Wind speed and wind direction jointly affect the propagation of PM 2.5 flow. Due to the wind direction, the effect from wind on PM 2.5 transmission is directional. Specifically, PM 2.5 of city i may flow to city j. But it is possible that PM 2.5 of city j may not be blown to city i.

We assume city i’s PM 2.5 has no effect on any other city if its wind level is not more than 2 (wind speed of at most 1.1 m/s). When the wind level is more than 2 (more than 1.1 m/s), the wind direction is the key factor that determines whether PM 2.5 flowing from city i affects city j. Specifically, in the planimetric map, there is a directional line segment from city i to city j. If the angle \({\theta }_{ij}\) between city i’s wind direction and the directional line segment from city i to city j is less than 90 degrees, we assume city i’s wind can flow to city j.

The overall effect of wind from city i to city j is calculated by \({A}_{3}(i,j)={w}_{i}cos({\theta }_{ij})\), where \({w}_{i}\) is the wind speed of city i.
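The wind rule can be sketched as below, under several assumptions that the text leaves open: a direction label is read as the direction the wind blows toward, city positions are planar (east, north) coordinates, levels 1 and 2 are both treated as having no effect, and the entry is set to 0 whenever the angle is 90 degrees or more. The names `DIRECTIONS` and `wind_entry` are hypothetical.

```python
import math

_D = math.sqrt(0.5)
# Unit vectors (east, north) for the eight direction labels used in the text.
# Assumption: a label names the direction the wind blows TOWARD.
DIRECTIONS = {
    "N": (0.0, 1.0), "E": (1.0, 0.0), "S": (0.0, -1.0), "W": (-1.0, 0.0),
    "E-N": (_D, _D), "E-S": (_D, -_D), "W-N": (-_D, _D), "W-S": (-_D, -_D),
}


def wind_entry(speed, level, direction, pos_i, pos_j):
    """A3(i, j) = w_i * cos(theta_ij), or 0 if the wind cannot reach city j."""
    if level <= 2:  # assumption: levels 1 and 2 (<= 1.1 m/s) have no effect
        return 0.0
    dx, dy = pos_j[0] - pos_i[0], pos_j[1] - pos_i[1]
    norm = math.hypot(dx, dy)
    if norm == 0.0:  # same position: no directional segment exists
        return 0.0
    wx, wy = DIRECTIONS[direction]
    cos_theta = (wx * dx + wy * dy) / norm
    # theta_ij < 90 degrees is equivalent to cos(theta_ij) > 0.
    return speed * cos_theta if cos_theta > 0 else 0.0
```

The resulting matrix is generally asymmetric, which is what makes the city-network directed.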

(4) Adjacency matrix based on PM 2.5 (A4): The paper aims to study the major pollution contributors and the pollution transport pathways, and we are particularly interested in how a city with high PM 2.5 concentration affects a city with low concentration. Specifically, two situations are considered below.

Situation One: When the geographic distance is less than 200 kilometers, the PM 2.5 of city i has an effect on city j as long as its PM 2.5 concentration is higher than city j’s; in that case \({A}_{4}(i,j)=1\). Otherwise, \({A}_{4}(i,j)=0\).

Situation Two: PM 2.5 flow will dissipate during the propagation. Therefore, when the geographic distance is more than 200 kilometers, \({A}_{4}(i,j)=1\) if and only if city i’s PM 2.5 concentration is higher than city j’s by \(\alpha \times {d}_{ij}\). Otherwise, \({A}_{4}(i,j)=0\). Here \({d}_{ij}\) is the geographic distance between city i and city j, and \(\alpha =0.01\) is an empirical threshold value obtained through numerical simulation. A better α should be considered in future work according to the meteorological conditions. \(\alpha \times {d}_{ij}\) is a required difference in PM 2.5 concentration that increases with the geographic distance \({d}_{ij}\).
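The two situations can be sketched as a single function. The text leaves two details open, so both are assumptions here: a distance of exactly 200 kilometers is handled by the second rule, and the required gap is read as "at least α × d". The name `pm_entry` is hypothetical.

```python
def pm_entry(c_i, c_j, d_ij, alpha=0.01, near_km=200.0):
    """A4(i, j): 1 if city i's PM 2.5 level is high enough to affect city j, else 0.

    c_i, c_j: monthly average PM 2.5 concentrations of city i and city j.
    d_ij: geographic distance between the two cities in kilometers.
    """
    if d_ij < near_km:
        # Situation One: any positive concentration gap is enough.
        return 1 if c_i > c_j else 0
    # Situation Two: the gap must reach alpha * d_ij (">=" is an assumption).
    return 1 if c_i - c_j >= alpha * d_ij else 0
```

With α = 0.01, two cities 300 kilometers apart need a concentration gap of 3 units before an edge is drawn. Since an edge from i to j requires city i to have the higher concentration, A4 never contains edges in both directions between the same pair of cities.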

Clustering and motif analysis are based on the adjacency matrix \(A={A}_{1}\circ {A}_{2}\circ {A}_{3}\circ {A}_{4}\), where “\(\circ \)” is the Hadamard (entry-wise) product. Namely, PM 2.5’s propagation is the combined effects of geographic distance, mountain, wind and PM 2.5 concentration.
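A minimal sketch of the entry-wise combination, using nested lists, is given below. The text does not say whether the real-valued wind entries are binarized before or after the product, so the sketch simply multiplies the entries as they are. The name `hadamard` is hypothetical.

```python
def hadamard(*matrices):
    """Entry-wise product of equally sized matrices given as nested lists."""
    n_rows, n_cols = len(matrices[0]), len(matrices[0][0])
    result = [[1] * n_cols for _ in range(n_rows)]
    for M in matrices:
        if len(M) != n_rows or any(len(row) != n_cols for row in M):
            raise ValueError("all matrices must have the same shape")
        for i in range(n_rows):
            for j in range(n_cols):
                result[i][j] *= M[i][j]
    return result
```

An entry of the product is nonzero only if all four factors allow the link, so a single blocking condition (distance, mountain, wind, or concentration) removes the edge.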

Selecting K

As in46, the sum of the squared distance between each member of a cluster and its cluster centroid (SSE) is defined as

$$SSE=\sum_{i=1}^{K}\sum_{x\in {C}_{i}}\mathrm{dist}{({c}_{i},x)}^{2},$$

where x is a city; \({c}_{i}\) is the centroid of cluster \({C}_{i}\); \({C}_{i}\) is the i-th cluster (cluster i); and dist is the standard Euclidean distance between two points in Euclidean space. K is the number of clusters, and the optimal value is chosen from 2 to 50 as the one that makes SSE smallest. A value of K larger than 50 has little meaning for clustering 189 cities, because it leads to overly detailed clustering.
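The SSE of a given clustering can be computed as sketched below. It assumes each city is represented by a point, for example a row of the matrix Y from the spectral clustering step, and that the cluster labels are integers from 0 to K − 1. The name `sse` is hypothetical.

```python
def sse(points, labels, k):
    """Sum of squared Euclidean distances from each point to its cluster centroid.

    points: list of equal-length coordinate tuples, one per city.
    labels: cluster index (0 .. k-1) of each point.
    """
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:  # an empty cluster contributes nothing
            continue
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total
```

Evaluating this function for each candidate K after running k-means gives a curve of the kind shown in Fig. 2.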